mm exists because AI tools are getting better at following instructions, but humans aren't getting better at giving them. The bottleneck moved from "talking to AI well" to "knowing what you want before AI starts working." mm operationalizes the four disciplines of AI input — Prompt Craft, Context Engineering, Intent Engineering, Specification Engineering — with measurement.
We optimize for measurable improvement in AI output quality, not convenience features.
When these values conflict, resolve in this order:
-
Correctness of methodology — The prompts are research-backed. Don't simplify them for UX convenience. A 7-domain interview that produces a good CLAUDE.md beats a quick wizard that produces a mediocre one.
-
Measurability — Every artifact mm produces should be evaluable. If we can't measure whether a skill, spec, or intent doc actually improves output, it's decoration. A/B test with skill vs without — that's the north star.
-
Developer experience — CLI should be fast, clear, and Unix-idiomatic. But never sacrifice #1 or #2 for smoother UX. A correct tool that's slightly harder to use beats a polished tool that produces wrong artifacts.
-
Simplicity of implementation — ~2,500 lines total. 4 production deps. If a feature requires more than 100 lines to implement, question whether it belongs in v1.
- Output formatting (tables, colors, markdown)
- File naming conventions for generated artifacts
- Error message wording
- Test fixture content
- Adding a new dependency (must document why in commit message)
- Changing the interview engine's conversation flow
- Modifying eval scoring thresholds
- Modifying the prompt templates (they are carefully calibrated)
- Adding a new command beyond the 16 specified
- Changing the eval YAML format (downstream compatibility)
- Any feature that requires a hosted service or account
- Tests pass
- No regressions in existing commands
- Follows existing patterns in the codebase
- Tests pass with edge cases covered
- Manually verify against the original prompt templates
- Run the command end-to-end (not just unit tests)
- Check that generated artifacts match the expected format from the source prompt
A task is high-stakes if it touches src/engine/ or src/eval/ — these are the core engines that everything else depends on.
-
Over-engineering the interview engine — It's tempting to add NLP, branching logic, or complex state machines. The engine is simple: send system prompt to Claude, route stdin/stdout, collect artifact. Claude does the interviewing.
-
Diverging from the prompt templates — The templates should remain as-is. Paraphrasing loses the carefully calibrated question sequences and guardrails.
-
Making evals too complex — The eval engine is ~350 lines for a reason. It sends prompts, collects outputs, and uses Claude to judge. No custom ML, no vector databases, no embeddings.
-
Forgetting the A/B pattern — Every eval must compare WITH skill vs WITHOUT skill. An eval that only runs "with skill" tells you nothing.
-
Scope creep toward SaaS — v1 is a CLI tool. No hosted services, no accounts, no dashboards. Local files, git-controlled. Enterprise features are Phase 3+ of go-to-market.
Before finalizing a decision, verify: are we optimizing for developer convenience at the expense of methodology correctness?
Specifically:
- Does this change make the tool easier to use but produce worse artifacts?
- Are we simplifying an interview because it "takes too long" even though the full version produces better results?
- Are we adding a shortcut that bypasses the eval measurement step?
If yes to any: stop and reconsider.
- We don't generate SKILL.md content from codebase analysis (humans write skills)
- We don't host a skill marketplace (git repos are the distribution mechanism)
- We don't support non-Claude models in v1 (BYOK with Anthropic API key)
- We don't build a VS Code extension (CLI first, extension later via StdinIO abstraction)
- We don't customize interview templates in v1 (Enterprise feature)