Reconcile stack: bring Curation (#10) + Data Browser (#11) to main by mrjunos · Pull Request #12 · mrjunos/almendra

mrjunos · 2026-05-26T15:40:02Z

Why

The Step 1–3 PRs were stacked (#9 ← #10 ← #11). #9 merged to main, but #10 and #11 were merged into their stacked base branches (data-catalog, data-curation) rather than main. So main currently has only the catalog (#9); the curation (#10) and Data Browser (#11) work is stranded in data-curation.

This PR promotes data-curation → main, landing the missing work. It's purely additive over main (no catalog changes are touched):

Curation (Step 2): db curate (dedup, quality, lossy-label trust), enriched db audit, per-source datasheets, Settings UI status.
Data Browser (Step 3 UI): the Data page + db.queries + the E2E Data sanity step.
CI installs --extra catalog.

Everything here was already reviewed in #10 and #11.

After merging

main will be whole (catalog + curation + browser). Future work (multi-label model) branches off this. Going forward: avoid deep stacks — merge bottom-up, or use one branch.

🤖 Generated with Claude Code

…datasheets

Adds curation passes over the catalog (no files deleted; verdicts are written to the DB and are reversible):

- `almendra db curate` runs three passes:
  - dedup: perceptual-hash (pHash) near-duplicate detection, greedy by representative; flags later copies is_good=false (`duplicate of <id>`).
  - quality: flags too-small / near-blank (low pixel std-dev) crops.
  - lossy labels: lowers trust on documented lossy source mappings (roboflow Scorched→defect_unspecified→0.2, Empty→hull_husk→0.3).
- `--dry-run` reports without writing; reasons stored in Bean.notes.
- On the real data this flags 196 near-duplicates (1507→1311 good) and down-weights 222 labels; export drops the flagged beans.
- `db audit` enriched: not-good-by-reason breakdown + label-trust histogram.
- Per-source datasheets in docs/datasheets/ (full for the ingested roboflow_robusta_defects incl. curation findings; briefs for the rest); README index links them. kaggle_17defects documented as license-blocked.
- Settings UI: source status/license summary table + a banner flagging license-blocked sources.

Tests in tests/test_curate.py (flagging + export exclusion, idempotency, dry-run writes nothing).

Stacked on the Step-1 catalog branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
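For readers unfamiliar with "greedy by representative" dedup, here is a minimal sketch of the idea, not the code in this PR. It assumes each bean already carries a 64-bit pHash as an integer; the `Bean` dataclass, the `dedup` function, and the `max_distance` threshold are hypothetical stand-ins, and only `is_good`, `notes`, and the `duplicate of <id>` reason come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Bean:
    """Hypothetical in-memory stand-in for a catalog row."""
    id: int
    phash: int          # assumed: precomputed 64-bit perceptual hash
    is_good: bool = True
    notes: str = ""


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedup(beans, max_distance=4, dry_run=False):
    """Greedy pass: the first bean of each near-duplicate group becomes the
    representative; later beans within max_distance of a representative
    are flagged. Returns (flagged_id, representative_id) pairs."""
    representatives = []
    flagged = []
    for bean in sorted(beans, key=lambda b: b.id):
        if not bean.is_good:
            continue  # already flagged, so a rerun changes nothing
        match = next(
            (r for r in representatives
             if hamming(bean.phash, r.phash) <= max_distance),
            None,
        )
        if match is None:
            representatives.append(bean)
            continue
        flagged.append((bean.id, match.id))
        if not dry_run:
            # The verdict is a reversible flag; nothing is deleted.
            bean.is_good = False
            bean.notes = f"duplicate of {match.id}"
    return flagged


beans = [Bean(1, 0b0000), Bean(2, 0b0001), Bean(3, 0xFFFF), Bean(4, 0b0011)]
preview = dedup(beans, max_distance=2, dry_run=True)   # reports only
applied = dedup(beans, max_distance=2)                 # writes verdicts
rerun = dedup(beans, max_distance=2)                   # nothing new to flag
```

Skipping beans that are already not-good is one simple way to get the idempotency and dry-run behaviour that tests/test_curate.py is described as checking; the real implementation may achieve it differently.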

New Streamlit "Data" page to browse the catalog and manually check the data:

- Filters: source, split, primary defect, provenance, quality (good/not-good), and label-trust bucket.
- Thumbnail gallery (paginated) with class + ext_id captions; not-good beans flagged ⚠️.
- Per-bean detail: all views, every defect (class/primary/label_source/trust), and the lot provenance (species, variety, process, farm, altitude, dates…).

Read queries live in `almendra.db.queries` (Streamlit-free, unit-tested). The page degrades gracefully if the `catalog` extra or the DB file is absent.

Also: CI now installs `--extra catalog` so the catalog/curation/browse tests actually run (they were importorskip-skipped before).

Tests: tests/test_browse_queries.py (filters, pagination, detail) + browse added to the UI smoke set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
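As an illustration of why keeping read queries Streamlit-free makes them easy to unit-test, here is a rough sketch of a filter-plus-pagination helper. It is not the contents of `almendra.db.queries`: `BeanRow`, `list_beans`, `caption`, and the in-memory list standing in for the DB are all hypothetical, and only a few of the filters listed above are modelled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BeanRow:
    """Hypothetical row shape; the real schema is not shown in this PR."""
    ext_id: str
    source: str
    split: str = "train"
    primary_defect: str = "unknown"
    is_good: bool = True


def list_beans(
    rows: Sequence[BeanRow],
    *,
    source: Optional[str] = None,
    split: Optional[str] = None,
    primary_defect: Optional[str] = None,
    is_good: Optional[bool] = None,
    page: int = 0,
    page_size: int = 24,
) -> tuple[list[BeanRow], int]:
    """Apply every filter that is not None, then return one page of rows
    together with the total number of matches."""
    wanted = {
        "source": source,
        "split": split,
        "primary_defect": primary_defect,
        "is_good": is_good,
    }
    active = {name: value for name, value in wanted.items() if value is not None}
    matches = [
        row for row in rows
        if all(getattr(row, name) == value for name, value in active.items())
    ]
    start = page * page_size
    return matches[start:start + page_size], len(matches)


def caption(page_rows: Sequence[BeanRow], total: int) -> str:
    """Gallery caption in the 'Showing N of M' form."""
    return f"Showing {len(page_rows)} of {total}"


rows = [
    BeanRow(f"b{i}", "roboflow" if i % 2 == 0 else "other", is_good=(i != 4))
    for i in range(6)
]
first_page, total = list_beans(rows, source="roboflow", page_size=2)
second_page, _ = list_beans(rows, source="roboflow", page=1, page_size=2)
not_good, not_good_total = list_beans(rows, is_good=False)
```

Because the helper takes plain data and returns plain data, a page like the Data browser only has to render the result, and the filter and pagination logic can be tested without a UI.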

The E2E sandbox now builds its catalog from the fixture manifest (harness `build_catalog` runs `almendra db migrate`), and the flow navigates to the new Data page after Predict, asserting the browser renders with beans ("Data browser" + "Showing N of M"). Exercises `db migrate` end-to-end too. Passes in ~34s; recording covers the added step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Data Browser UI — visually inspect & spot-check the catalog

mrjunos and others added 4 commits May 26, 2026 09:47

Merge pull request #11 from mrjunos/data-step3

0739dd5

Data Browser UI — visually inspect & spot-check the catalog

mrjunos mentioned this pull request May 26, 2026

Multi-label defect classification (Step 3B) #13

Merged

mrjunos merged commit 0f84542 into main May 26, 2026
2 checks passed

mrjunos deleted the data-curation branch May 26, 2026 15:52

This was referenced May 26, 2026

CI: manual dataset-ingest workflow using ROBOFLOW_API_KEY secret #14

Merged

Step 3A: classification ingester + coffee-adik & arabica-washed adapters #15

Merged

mrjunos merged 4 commits into main from data-curation
