
Step 3A: classification ingester + coffee-adik & arabica-washed adapters#15

Merged: mrjunos merged 2 commits into main from step3a-new-datasets on May 28, 2026
@mrjunos mrjunos commented May 28, 2026

Owner

Branches off the now-complete main (#12, #13, #14 merged). Independent of any open work.

What

The last piece of Step 3 — the missing classification ingester plus adapters for the two new datasets you flagged. (The model went multi-label in #13; this brings the data side level.)

  • New classification ingester in src/almendra/datasets/ingest.py:

    • Two layouts: split subdirs (Roboflow train/valid/test → canonical train/val/test) and flat <class>/*.png with a deterministic hash-based split (default 70/15/15).
    • class_map values can be a plain canonical name OR a dict carrying both defect: and morphology: (USK-COFFEE style — both axes in one map).
    • Source classes not in class_map and canonical names not in the taxonomy are skipped, never silently mislabelled.
  • data/sources/coffee_adik_defects.yaml ⭐ — 10 defect classes mapped to the SCA taxonomy where possible. green is genuinely ambiguous (immature vs the unroasted state) and goes to defect_unspecified with a note to verify on first ingest.

  • data/sources/roboflow_arabica_washed.yaml — Arabica appearance baseline; uses the existing COCO instance-segmentation ingester. The dataset doesn't label per-bean defects, so each annotated bean maps to sound (low-trust on the defect axis; primary value is Arabica visual coverage).
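For reviewers who want a picture of the flat-layout split, here is a minimal sketch of a deterministic hash-based 70/15/15 assignment. It is not the code in src/almendra/datasets/ingest.py: the function name `assign_split`, the choice of SHA-256, and the use of the relative file path as the hash key are all assumptions made for illustration.

```python
import hashlib


def assign_split(key: str, ratios=(0.70, 0.15, 0.15)) -> str:
    """Map a stable key (assumed here: an image's relative path) to a split.

    Hypothetical helper. The same key always lands in the same split, so
    re-running an ingest never shuffles images between train/val/test.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to roughly [0, 1].
    u = int.from_bytes(digest[:8], "big") / 2**64
    train, val, _test = ratios
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


if __name__ == "__main__":
    for name in ("black/img_001.png", "sour/img_002.png"):
        print(name, "->", assign_split(name))
```

Because the split depends only on the key, adding new images later leaves every existing assignment untouched, which is the property the deterministic hash-split test relies on. The observed ratios only approach 70/15/15 as the number of images grows.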

What's NOT in this PR (intentional)

  • Actual downloads. Real ingest runs through the Ingest datasets workflow (workflow_dispatch) using your ROBOFLOW_API_KEY secret — adapters carry verify on first download notes (exact class strings, licence, version) that we tighten after the first real run.
  • detect_then_crop ingester. Deferred — no concrete dataset blocks on it (mendeley is reference_only and would need a bean detector model). Easy to add later.

Verification

  • uv run pytest -m "not e2e": 88 passed (82 baseline + 6 new); ruff + format clean.
  • New tests/test_ingest.py: flat + split-subdir layouts, dict class_map / morphology, unmapped-class skip, taxonomy-miss skip, deterministic hash-split.
  • E2E happy-path passes (30s) — the full UI flow is unaffected.

Step 3 status after this merges

🤖 Generated with Claude Code

mrjunos and others added 2 commits May 28, 2026 07:20
Adds the missing classification ingester (one bean per image, label = class
directory) and two new dataset adapters; coffee-adik (object detection, ~10
near-SCA classes) and arabica-washed (instance segmentation, Arabica) use the
existing COCO ingester.

- `_ingest_classification_source` in datasets/ingest.py:
  - Two layouts: split subdirs (train/valid/test → train/val/test) and flat
    (`<class>/*.png`) with a deterministic hash-based train/val/test split.
  - class_map values can be a plain canonical name OR a dict carrying both a
    `defect:` and a `morphology:` (USK-COFFEE style).
  - Unmapped source classes and canonical names not in the taxonomy are
    skipped, never silently mislabelled.
- data/sources/coffee_adik_defects.yaml: 10 defect classes mapped to the SCA
  taxonomy where possible; ambiguous ("green") flagged as defect_unspecified
  with a note to verify on first ingest.
- data/sources/roboflow_arabica_washed.yaml: Arabica appearance baseline —
  each annotated bean → `sound` (no per-bean defect labels in the dataset).
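The class_map rules above (plain name or defect/morphology dict, with unmapped and out-of-taxonomy names skipped) can be sketched roughly as follows. This is an illustration, not the body of `_ingest_classification_source`: `resolve_labels`, `Labels`, and the shape of the `taxonomy` argument are hypothetical, and treating a plain string as a defect-axis name is an assumption.

```python
from typing import NamedTuple, Optional


class Labels(NamedTuple):
    defect: Optional[str]
    morphology: Optional[str]


def resolve_labels(source_class: str, class_map: dict, taxonomy: dict) -> Optional[Labels]:
    """Return canonical labels for a source class, or None to skip the image.

    Hypothetical helper. `taxonomy` is assumed to map an axis name
    ("defect" / "morphology") to the set of canonical names on that axis.
    """
    entry = class_map.get(source_class)
    if entry is None:
        return None  # source class not in class_map: skip
    if isinstance(entry, str):
        # Assumption: a plain canonical name refers to the defect axis.
        entry = {"defect": entry}
    defect = entry.get("defect")
    morphology = entry.get("morphology")
    if defect is None and morphology is None:
        return None  # nothing usable in the mapping: skip
    for axis, name in (("defect", defect), ("morphology", morphology)):
        if name is not None and name not in taxonomy.get(axis, set()):
            return None  # canonical name missing from the taxonomy: skip
    return Labels(defect, morphology)


if __name__ == "__main__":
    # "peaberry" and "pb" are illustrative names, not taken from the adapters.
    taxonomy = {"defect": {"sound", "defect_unspecified"}, "morphology": {"peaberry"}}
    class_map = {
        "green": "defect_unspecified",
        "pb": {"defect": "sound", "morphology": "peaberry"},
    }
    print(resolve_labels("green", class_map, taxonomy))
    print(resolve_labels("pb", class_map, taxonomy))
    print(resolve_labels("unknown", class_map, taxonomy))
```

Returning `None` instead of falling back to a default label is what keeps a typo in an adapter from turning into a silently mislabelled bean.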

Real download/ingest is run via the manual `Ingest datasets` workflow
(workflow_dispatch, ROBOFLOW_API_KEY secret). detect_then_crop is deferred
until a bean detector model exists — no concrete dataset blocks on it.

Tests (tests/test_ingest.py): flat + split layouts, dict class_map / morphology,
unmapped class skip, taxonomy-miss skip, deterministic hash-split. 88 passed
overall, lint clean, E2E green (30s).
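As a rough picture of the two layouts the tests exercise, the sketch below walks a dataset root and yields one record per image. It is an assumption-laden illustration rather than the ingester itself: `iter_labelled_images`, `SPLIT_ALIASES`, and the accepted file suffixes are hypothetical, and the layout is guessed from the presence of a train/valid/test directory.

```python
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Roboflow-style split directory names mapped to canonical split names.
SPLIT_ALIASES = {"train": "train", "valid": "val", "test": "test"}
# Assumption: which file suffixes count as images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def _class_images(parent: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, class_name) for every <class>/<image> under parent."""
    for class_dir in sorted(p for p in parent.iterdir() if p.is_dir()):
        for img in sorted(class_dir.iterdir()):
            if img.is_file() and img.suffix.lower() in IMAGE_SUFFIXES:
                yield img, class_dir.name


def iter_labelled_images(root: Path) -> Iterator[Tuple[Path, str, Optional[str]]]:
    """Yield (image_path, source_class, split) for either layout.

    split is None in the flat layout; the caller would then assign one
    with a deterministic hash.
    """
    subdirs = {p.name for p in root.iterdir() if p.is_dir()}
    if subdirs.intersection(SPLIT_ALIASES):
        # Split-subdir layout: <root>/<split>/<class>/<image>
        for source_name, canonical in SPLIT_ALIASES.items():
            split_dir = root / source_name
            if split_dir.is_dir():
                for img, cls in _class_images(split_dir):
                    yield img, cls, canonical
    else:
        # Flat layout: <root>/<class>/<image>
        for img, cls in _class_images(root):
            yield img, cls, None
```

One caveat with this heuristic: a flat dataset that has a class literally named train, valid, or test would be misread as the split layout, so a real adapter would probably declare its layout explicitly.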

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, CI/CD

The previous README froze at "Phase 6 — local UI" and didn't mention the
centralised catalog, multi-label model, Data browser, curation, ingest CI/CD,
or the visual E2E gate. Rewritten as a full operational manual *and* a clear
pitch — problem statement, the 5 flows (Capture / Ingest / Curate / Train+
Quantize / Predict), engineering rigor (ADRs, datasheets, tests, CI/CD),
updated roadmap (Phases 7–8), and a concrete deploy pick with numbers.

Visuals refreshed against the live UI on the real 1,507-bean catalog:
- docs/images/ui-home.png (dataset stats + health panel + recent runs)
- docs/images/ui-data-browser.png (NEW — filtered gallery on the catalog)
- docs/images/ui-train.png

Makefile: new `db-init / db-migrate / db-export / db-curate / db-audit`
convenience targets so the README's commands are real. Tiny help-regex fix
so `make e2e` shows up in `make help` too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mrjunos mrjunos merged commit 823708b into main May 28, 2026
3 of 4 checks passed
@mrjunos mrjunos deleted the step3a-new-datasets branch May 28, 2026 13:13