Step 3A: classification ingester + coffee-adik & arabica-washed adapters by mrjunos · Pull Request #15 · mrjunos/almendra

mrjunos · 2026-05-28T12:20:28Z

Branches off the now-complete main (#12, #13, #14 merged). Independent of any open work.

What

The last piece of Step 3 — the missing classification ingester plus adapters for the two new datasets you flagged. (The model went multi-label in #13; this brings the data side level.)

- New classification ingester in src/almendra/datasets/ingest.py:
  - Two layouts: split subdirs (Roboflow train/valid/test → canonical train/val/test) and flat <class>/*.png with a deterministic hash-based split (default 70/15/15).
  - class_map values can be a plain canonical name OR a dict carrying both defect: and morphology: (USK-COFFEE style, both axes in one map).
  - Source classes not in class_map and canonical names not in the taxonomy are skipped, never silently mislabelled.
- data/sources/coffee_adik_defects.yaml ⭐: 10 defect classes mapped to the SCA taxonomy where possible. green is genuinely ambiguous (immature vs the unroasted state) and goes to defect_unspecified with a note to verify on first ingest.
- data/sources/roboflow_arabica_washed.yaml: Arabica appearance baseline; uses the existing COCO instance-segmentation ingester. The dataset doesn't label per-bean defects, so each annotated bean maps to sound (low-trust on the defect axis; primary value is Arabica visual coverage).
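
The deterministic hash-based split for the flat layout could look roughly like the sketch below. This is an illustration only, not the code in this PR: `assign_split`, `SPLIT_ALIASES`, the choice of SHA-256, and keying on the relative image path are all assumptions. The point is that the split depends only on the key, so re-running the ingest puts every image in the same split.

```python
import hashlib

# Hypothetical mapping from source split directory names to canonical ones
# (Roboflow exports use "valid"; the canonical name is "val").
SPLIT_ALIASES = {"train": "train", "valid": "val", "val": "val", "test": "test"}


def assign_split(key: str, ratios=(0.70, 0.15, 0.15)) -> str:
    """Map a stable key (assumed: the image's relative path) to a split.

    The same key always lands in the same split, independent of
    iteration order, process, or machine.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the digest to one of 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    u = bucket / 10_000  # value in [0, 1)
    train, val, _ = ratios
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


if __name__ == "__main__":
    for name in ("broken/img_001.png", "sound/img_002.png"):
        print(name, "->", assign_split(name))
```

Unlike a seeded random shuffle, a hash-based assignment does not change for existing files when new images are added to the source.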

What's NOT in this PR (intentional)

- Actual downloads. Real ingest runs through the Ingest datasets workflow (workflow_dispatch) using your ROBOFLOW_API_KEY secret. The adapters carry "verify on first download" notes (exact class strings, licence, version) that we tighten after the first real run.
- detect_then_crop ingester. Deferred: no concrete dataset blocks on it (mendeley is reference_only and would need a bean detector model). Easy to add later.

Verification

uv run pytest -m "not e2e" → 88 passed (82 baseline + 6 new); ruff + format clean.
New tests/test_ingest.py: flat + split-subdir layouts, dict class_map / morphology, unmapped-class skip, taxonomy-miss skip, deterministic hash-split.
E2E happy-path passes (30s) — the full UI flow is unaffected.

Step 3 status after this merges

✅ 3A (datasets): this PR
✅ 3B (multi-label model): merged via #13 ("Multi-label defect classification (Step 3B)")
✅ UI Data Browser: merged via #11 ("Data Browser UI: visually inspect & spot-check the catalog") and #12 ("Reconcile stack: bring Curation (#10) + Data Browser (#11) to main")
✅ CI ingest workflow + ROBOFLOW_API_KEY: merged via #14 ("CI: manual dataset-ingest workflow using ROBOFLOW_API_KEY secret")
🟡 Remaining loose ends: pick a DVC remote (ADR-0006 deferral), and after the first real ingest, tighten the new adapters' class_map / licence / version and write full datasheets.

🤖 Generated with Claude Code

Adds the missing classification ingester (one bean per image, label = class directory) and two new dataset adapters; coffee-adik (object detection, ~10 near-SCA classes) and arabica-washed (instance segmentation, Arabica) use the existing COCO ingester.

- `_ingest_classification_source` in datasets/ingest.py:
  - Two layouts: split subdirs (train/valid/test → train/val/test) and flat (`<class>/*.png`) with a deterministic hash-based train/val/test split.
  - class_map values can be a plain canonical name OR a dict carrying both a `defect:` and a `morphology:` (USK-COFFEE style).
  - Unmapped source classes and canonical names not in the taxonomy are skipped, never silently mislabelled.
- data/sources/coffee_adik_defects.yaml: 10 defect classes mapped to the SCA taxonomy where possible; ambiguous ("green") flagged as defect_unspecified with a note to verify on first ingest.
- data/sources/roboflow_arabica_washed.yaml: Arabica appearance baseline; each annotated bean → `sound` (no per-bean defect labels in the dataset).

Real download/ingest is run via the manual `Ingest datasets` workflow (workflow_dispatch, ROBOFLOW_API_KEY secret). detect_then_crop is deferred until a bean detector model exists; no concrete dataset blocks on it.

Tests (tests/test_ingest.py): flat + split layouts, dict class_map / morphology, unmapped class skip, taxonomy-miss skip, deterministic hash-split. 88 passed overall, lint clean, E2E green (30s).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
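
To make the class_map rules in this commit concrete, here is a minimal sketch of how a source class might be resolved to canonical labels. It is not the body of `_ingest_classification_source`: the function name `resolve_label`, the taxonomy sets, and the returned dict shape are hypothetical. Only `sound` and `defect_unspecified` are names taken from this PR; the other taxonomy entries are placeholders.

```python
from typing import Optional, Union

# Hypothetical taxonomy subsets for illustration; the real taxonomy lives in the repo.
DEFECTS = {"sound", "defect_unspecified", "broken"}
MORPHOLOGIES = {"peaberry", "longberry"}

ClassMapValue = Union[str, dict]


def resolve_label(
    source_class: str,
    class_map: dict,
    defects: set = DEFECTS,
    morphologies: set = MORPHOLOGIES,
) -> Optional[dict]:
    """Return canonical labels for a source class, or None to skip the sample."""
    if source_class not in class_map:
        return None  # unmapped source class: skip rather than guess
    value: ClassMapValue = class_map[source_class]
    if isinstance(value, str):
        value = {"defect": value}  # plain canonical name means defect axis only
    defect = value.get("defect")
    morphology = value.get("morphology")
    if defect is None and morphology is None:
        return None  # nothing usable in the mapping
    if defect is not None and defect not in defects:
        return None  # canonical name missing from the taxonomy: skip
    if morphology is not None and morphology not in morphologies:
        return None
    return {"defect": defect, "morphology": morphology}


if __name__ == "__main__":
    class_map = {
        "green": "defect_unspecified",
        "peaberry": {"defect": "sound", "morphology": "peaberry"},
    }
    for name in ("green", "peaberry", "unknown_class"):
        print(name, "->", resolve_label(name, class_map))
```

Returning `None` for both failure modes keeps the caller simple: a skipped sample is never written with a fallback label.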

…, CI/CD

The previous README froze at "Phase 6 — local UI" and didn't mention the centralised catalog, multi-label model, Data browser, curation, ingest CI/CD, or the visual E2E gate. Rewritten as a full operational manual *and* a clear pitch: problem statement, the 5 flows (Capture / Ingest / Curate / Train + Quantize / Predict), engineering rigor (ADRs, datasheets, tests, CI/CD), updated roadmap (Phases 7–8), and a concrete deploy pick with numbers.

Visuals refreshed against the live UI on the real 1,507-bean catalog:
- docs/images/ui-home.png (dataset stats + health panel + recent runs)
- docs/images/ui-data-browser.png (NEW: filtered gallery on the catalog)
- docs/images/ui-train.png

Makefile: new `db-init / db-migrate / db-export / db-curate / db-audit` convenience targets so the README's commands are real. Tiny help-regex fix so `make e2e` shows up in `make help` too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

mrjunos and others added 2 commits May 28, 2026 07:20

mrjunos merged commit 823708b into main May 28, 2026
3 of 4 checks passed

mrjunos deleted the step3a-new-datasets branch May 28, 2026 13:13
