Step 3A: classification ingester + coffee-adik & arabica-washed adapters #15
Merged
Adds the missing classification ingester (one bean per image, label = class
directory) and two new dataset adapters. Both adapters, coffee-adik (object
detection, ~10 near-SCA classes) and arabica-washed (instance segmentation,
Arabica), use the existing COCO ingester.
- `_ingest_classification_source` in datasets/ingest.py:
  - Two layouts: split subdirs (train/valid/test → train/val/test) and flat
    (`<class>/*.png`) with a deterministic hash-based train/val/test split.
  - class_map values can be a plain canonical name OR a dict carrying both a
    `defect:` and a `morphology:` (USK-COFFEE style).
  - Unmapped source classes and canonical names not in the taxonomy are
    skipped, never silently mislabelled.
- data/sources/coffee_adik_defects.yaml: 10 defect classes mapped to the SCA
taxonomy where possible; ambiguous ("green") flagged as defect_unspecified
with a note to verify on first ingest.
- data/sources/roboflow_arabica_washed.yaml: Arabica appearance baseline —
each annotated bean → `sound` (no per-bean defect labels in the dataset).
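
The class_map rules in the list above can be sketched as follows. This is a minimal illustration, not the code in `_ingest_classification_source`: the names `resolve_label`, `Label` and `TAXONOMY` are hypothetical, the taxonomy entries are placeholders, and validating only the defect axis against the taxonomy is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical subset of canonical defect names; the real taxonomy lives in the project.
TAXONOMY = {"sound", "black", "broken", "defect_unspecified"}


@dataclass(frozen=True)
class Label:
    defect: str
    morphology: str | None = None


def resolve_label(source_class, class_map, taxonomy=TAXONOMY):
    """Return a Label for a source class, or None when the sample should be skipped."""
    entry = class_map.get(source_class)
    if entry is None:
        return None  # unmapped source class: skip rather than guess
    if isinstance(entry, str):
        defect, morphology = entry, None  # plain canonical name
    else:
        # dict form carrying both axes (USK-COFFEE style)
        defect, morphology = entry.get("defect"), entry.get("morphology")
    if defect not in taxonomy:
        return None  # canonical name missing from the taxonomy: skip
    return Label(defect=defect, morphology=morphology)


# Usage with an illustrative class_map (the class strings are made up).
class_map = {
    "green": "defect_unspecified",
    "peaberry_black": {"defect": "black", "morphology": "peaberry"},
    "typo_class": "blakc",
}
for name in ("green", "peaberry_black", "typo_class", "not_in_map"):
    print(name, "->", resolve_label(name, class_map))
```

Returning `None` keeps the skip decision explicit at the call site, so an adapter typo drops samples instead of mislabelling them.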
Real download/ingest is run via the manual `Ingest datasets` workflow
(workflow_dispatch, ROBOFLOW_API_KEY secret). detect_then_crop is deferred
until a bean detector model exists — no concrete dataset blocks on it.
Tests (tests/test_ingest.py): flat + split layouts, dict class_map / morphology,
unmapped class skip, taxonomy-miss skip, deterministic hash-split. 88 passed
overall, lint clean, E2E green (30s).
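
For reference, a deterministic hash-based split of the kind tested above might look like the sketch below. It assumes SHA-256 over a stable key such as the image's relative path; the actual hash function and key used by the ingester are not stated here, and `assign_split` is a hypothetical name. The 70/15/15 default comes from the PR description.

```python
import hashlib


def assign_split(key: str, ratios=(0.70, 0.15, 0.15)) -> str:
    """Map a stable key (e.g. a relative image path) to train/val/test."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1]; same key, same value.
    u = int.from_bytes(digest[:8], "big") / 2**64
    train, val, _test = ratios
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


# The assignment depends only on the key, not on file order or a random seed.
keys = [f"black/img_{i:05d}.png" for i in range(10_000)]
first = [assign_split(k) for k in keys]
second = [assign_split(k) for k in reversed(keys)][::-1]
print(first == second)
print({s: first.count(s) for s in ("train", "val", "test")})
```

Because the split is a pure function of the key, re-ingesting the same files never moves a bean between splits.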
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, CI/CD

The previous README froze at "Phase 6 — local UI" and didn't mention the centralised catalog, multi-label model, Data browser, curation, ingest CI/CD, or the visual E2E gate. Rewritten as a full operational manual *and* a clear pitch: problem statement, the 5 flows (Capture / Ingest / Curate / Train+Quantize / Predict), engineering rigor (ADRs, datasheets, tests, CI/CD), updated roadmap (Phases 7–8), and a concrete deploy pick with numbers.

Visuals refreshed against the live UI on the real 1,507-bean catalog:
- docs/images/ui-home.png (dataset stats + health panel + recent runs)
- docs/images/ui-data-browser.png (NEW: filtered gallery on the catalog)
- docs/images/ui-train.png

Makefile: new `db-init / db-migrate / db-export / db-curate / db-audit` convenience targets so the README's commands are real. Tiny help-regex fix so `make e2e` shows up in `make help` too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
What
The last piece of Step 3 — the missing classification ingester plus adapters for the two new datasets you flagged. (The model went multi-label in #13; this brings the data side level.)
New
- `classification` ingester in `src/almendra/datasets/ingest.py`:
  - Two layouts: split subdirs (`train/valid/test` → canonical `train/val/test`) and flat `<class>/*.png` with a deterministic hash-based split (default 70/15/15).
  - `class_map` values can be a plain canonical name OR a dict carrying both `defect:` and `morphology:` (USK-COFFEE style, both axes in one map).
  - Source classes not in `class_map` and canonical names not in the taxonomy are skipped, never silently mislabelled.
- `data/sources/coffee_adik_defects.yaml` ⭐: 10 defect classes mapped to the SCA taxonomy where possible. `green` is genuinely ambiguous (immature vs the unroasted state) and goes to `defect_unspecified` with a note to verify on first ingest.
- `data/sources/roboflow_arabica_washed.yaml`: Arabica appearance baseline; uses the existing COCO instance-segmentation ingester. The dataset doesn't label per-bean defects, so each annotated bean maps to `sound` (low-trust on the defect axis; primary value is Arabica visual coverage).

What's NOT in this PR (intentional)
- Real download/ingest. Run via the manual `Ingest datasets` workflow (`workflow_dispatch`) using your `ROBOFLOW_API_KEY` secret. Adapters carry `verify on first download` notes (exact class strings, licence, version) that we tighten after the first real run.
- `detect_then_crop` ingester. Deferred: no concrete dataset blocks on it (mendeley is `reference_only` and would need a bean detector model). Easy to add later.

Verification
- `uv run pytest -m "not e2e"` → 88 passed (82 baseline + 6 new); ruff + format clean.
- `tests/test_ingest.py`: flat + split-subdir layouts, dict class_map / morphology, unmapped-class skip, taxonomy-miss skip, deterministic hash-split.

Step 3 status after this merges
🤖 Generated with Claude Code