mm CLI — Intent & Delegation Framework

Core Intent

mm exists because AI tools are getting better at following instructions, but humans aren't getting better at giving them. The bottleneck moved from "talking to AI well" to "knowing what you want before AI starts working." mm operationalizes the four disciplines of AI input — Prompt Craft, Context Engineering, Intent Engineering, Specification Engineering — with measurement.

We optimize for measurable improvement in AI output quality, not convenience features.

Priority Hierarchy

When these values conflict, resolve in this order:

1. Correctness of methodology: The prompts are research-backed. Don't simplify them for UX convenience. A 7-domain interview that produces a good CLAUDE.md beats a quick wizard that produces a mediocre one.
2. Measurability: Every artifact mm produces should be evaluable. If we can't measure whether a skill, spec, or intent doc actually improves output, it's decoration. The north star is an A/B test: with skill vs. without.
3. Developer experience: The CLI should be fast, clear, and Unix-idiomatic. But never sacrifice #1 or #2 for smoother UX. A correct tool that's slightly harder to use beats a polished tool that produces wrong artifacts.
4. Simplicity of implementation: ~2,500 lines total. 4 production deps. If a feature requires more than 100 lines to implement, question whether it belongs in v1.

Decision Authority Map

Decide Autonomously

Output formatting (tables, colors, markdown)
File naming conventions for generated artifacts
Error message wording
Test fixture content

Decide with Notification

Adding a new dependency (must document why in commit message)
Changing the interview engine's conversation flow
Modifying eval scoring thresholds

Escalate Before Acting

Modifying the prompt templates (they are carefully calibrated)
Adding a new command beyond the 16 specified
Changing the eval YAML format (downstream compatibility)
Any feature that requires a hosted service or account
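
For illustration only, the map above could be encoded as a lookup table so a contributor (or an agent) can check a decision's tier mechanically. This is a minimal sketch: the category keys, the `Authority` enum, and `authority_for` are hypothetical names, and defaulting unlisted decisions to the most conservative tier is an assumption of the sketch, not something this document specifies.

```python
from enum import Enum


class Authority(Enum):
    AUTONOMOUS = "decide autonomously"
    NOTIFY = "decide with notification"
    ESCALATE = "escalate before acting"


# Hypothetical category keys; each maps to the tier listed in the map above.
AUTHORITY_MAP = {
    "output_formatting": Authority.AUTONOMOUS,
    "artifact_file_naming": Authority.AUTONOMOUS,
    "error_message_wording": Authority.AUTONOMOUS,
    "test_fixture_content": Authority.AUTONOMOUS,
    "new_dependency": Authority.NOTIFY,
    "interview_conversation_flow": Authority.NOTIFY,
    "eval_scoring_thresholds": Authority.NOTIFY,
    "prompt_templates": Authority.ESCALATE,
    "new_command": Authority.ESCALATE,
    "eval_yaml_format": Authority.ESCALATE,
    "hosted_service_or_account": Authority.ESCALATE,
}


def authority_for(category: str) -> Authority:
    """Return the tier for a decision category.

    Assumption of this sketch: anything not listed escalates.
    """
    return AUTHORITY_MAP.get(category, Authority.ESCALATE)
```

For example, `authority_for("new_dependency")` returns `Authority.NOTIFY`, which matches the rule that a new dependency needs a documented reason in the commit message.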

Quality Thresholds

Routine Work (single commands, bug fixes)

Tests pass
No regressions in existing commands
Follows existing patterns in the codebase

High-Stakes Work (interview engine, eval engine, new commands)

Tests pass with edge cases covered
Manually verify against the original prompt templates
Run the command end-to-end (not just unit tests)
Check that generated artifacts match the expected format from the source prompt

The Boundary

A task is high-stakes if it touches src/engine/ or src/eval/ — these are the core engines that everything else depends on.
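
The path rule above can be sketched as a small check over the files a change touches. This is an illustrative helper, not part of mm: `is_high_stakes` and the example file names are hypothetical, and it covers only the directory rule (it does not detect a new command, which is also listed as high-stakes work).

```python
from pathlib import PurePosixPath

# The two core-engine directories named in this document.
HIGH_STAKES_DIRS = (PurePosixPath("src/engine"), PurePosixPath("src/eval"))


def is_high_stakes(changed_paths):
    """Return True if any changed file lives under src/engine/ or src/eval/."""
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # path.parents lists every ancestor directory, so this matches whole
        # path components: "src/engineering/x" does not count as "src/engine".
        if any(directory in path.parents for directory in HIGH_STAKES_DIRS):
            return True
    return False
```

A change touching only a hypothetical `src/commands/list.ts` would be routine under this check, while one that also touches `src/eval/judge.ts` would require the high-stakes checklist.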

Common Failure Modes

Over-engineering the interview engine — It's tempting to add NLP, branching logic, or complex state machines. The engine is simple: send system prompt to Claude, route stdin/stdout, collect artifact. Claude does the interviewing.
Diverging from the prompt templates — The templates should remain as-is. Paraphrasing loses the carefully calibrated question sequences and guardrails.
Making evals too complex — The eval engine is ~350 lines for a reason. It sends prompts, collects outputs, and uses Claude to judge. No custom ML, no vector databases, no embeddings.
Forgetting the A/B pattern — Every eval must compare WITH skill vs WITHOUT skill. An eval that only runs "with skill" tells you nothing.
Scope creep toward SaaS — v1 is a CLI tool. No hosted services, no accounts, no dashboards. Local files, git-controlled. Enterprise features are Phase 3+ of go-to-market.
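
The A/B pattern from the list above can be sketched in a few lines. This is a hedged illustration, not mm's eval engine: `run_ab_eval`, `ABResult`, and the `generate` and `judge` callables are hypothetical names. In practice both callables would wrap calls to Claude; here they are plain parameters so the shape of the comparison is visible without any network access.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ABResult:
    with_skill: float
    without_skill: float

    @property
    def delta(self) -> float:
        """Positive when the skill improved the judged score."""
        return self.with_skill - self.without_skill


def run_ab_eval(
    prompt: str,
    skill: str,
    generate: Callable[[str, Optional[str]], str],
    judge: Callable[[str, str], float],
) -> ABResult:
    """Run the same prompt twice, once without and once with the skill,
    and score both outputs with the same judge."""
    baseline_output = generate(prompt, None)   # WITHOUT skill
    treated_output = generate(prompt, skill)   # WITH skill
    return ABResult(
        with_skill=judge(prompt, treated_output),
        without_skill=judge(prompt, baseline_output),
    )
```

The point of the structure is that a result always carries both scores, so a "with skill only" run cannot be expressed by accident.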

The Rigor Test

Before finalizing a decision, verify: are we optimizing for developer convenience at the expense of methodology correctness?

Specifically:

Does this change make the tool easier to use but produce worse artifacts?
Are we simplifying an interview because it "takes too long" even though the full version produces better results?
Are we adding a shortcut that bypasses the eval measurement step?

If yes to any: stop and reconsider.

What We Explicitly Don't Do

We don't generate SKILL.md content from codebase analysis (humans write skills)
We don't host a skill marketplace (git repos are the distribution mechanism)
We don't support non-Claude models in v1 (BYOK with Anthropic API key)
We don't build a VS Code extension (CLI first, extension later via StdinIO abstraction)
We don't customize interview templates in v1 (Enterprise feature)
