mm-cli — the tool that manages skills, runs evals, and scaffolds ALL four disciplines of AI input. The core insight: skills without evals are just vibes, and evals without intent/spec engineering are just benchmarks.
The prompt kit contains 10 interview templates covering all 4 disciplines. This spec integrates ALL of them.
| Command | Template | Input | Output File | Interview? |
|---|---|---|---|---|
| `mm preflight` | Pre-flight Checklist | None (prints checklist) | stdout (optionally PREFLIGHT.md) | No |
| `mm diagnose` | Rapid Diagnostic | 5-question interview | CONTEXT.md (starter) | Yes (5 Qs) |
| `mm diagnose --deep` | Deep Diagnostic | 12-question interview, 6 groups | DIAGNOSTIC.md + ROADMAP.md | Yes (12 Qs) |
| `mm rewrite` | Problem Statement Rewriter | stdin/file with vague requests | stdout or REWRITE.md | Partial (clarifying Qs) |
| `mm context build` | Context Doc Builder | 7-domain deep interview | Smart: see below | Yes (7 domains) |
| `mm spec new [name]` | Specification Engineer | 3-phase interview | SPEC.md or `specs/<name>.md` | Yes (3 phases) |
| `mm intent init` | Intent & Delegation Framework | 3-phase interview | INTENT.md | Yes (3 phases) |
| `mm eval new <skill>` | Eval Harness Builder | 2-phase interview | `evals/<skill>/eval.yaml` | Yes (2 phases) |
| `mm eval new <skill> --quick` | Eval (auto mode) | Reads SKILL.md | `evals/<skill>/eval.yaml` | No |
| `mm eval run <skill>` | Eval engine | eval.yaml + Claude API | `evals/<skill>/results/<ts>.json` | No |
| `mm eval compare <skill>` | Multi-Axis A/B | Two result sets | Comparison table stdout | No |
| `mm constraint <task>` | Constraint Architecture | 3-phase interview | `constraints/<task>.md` | Yes (3 phases) |
| `mm skill new <name>` | Skill management | Interactive prompts | `.claude/skills/<name>/SKILL.md` + tile.json | Minimal |
| `mm skill list` | Skill management | Reads filesystem | stdout table | No |
| `mm skill validate [name]` | Skill management | Reads SKILL.md files | stdout report | No |
| `mm skill export --format cursor` | Multi-format export | Reads `.claude/skills/` | `.cursorrules` or equivalent | No |
16 commands. Zero gaps. Every prompt mapped.
CLAUDE.md is a routing table (target 30-60 lines, hard ceiling ~100). It points to skills; skills hold the actual knowledge. `mm context build` respects this:
| Scenario | Output | Why |
|---|---|---|
| No CLAUDE.md exists | Scaffold CLAUDE.md (~60 lines: overview, skills table, key commands, git rules) | Bootstrap a new project |
| CLAUDE.md exists, no business-context skill | `.claude/skills/business-context/SKILL.md` + add row to CLAUDE.md routing table | Don't bloat the router; add a skill |
| CLAUDE.md exists + business-context skill | Update `.claude/skills/business-context/SKILL.md` | Refresh existing skill |
The interview output (7-domain personal context) always becomes a skill in existing projects. The skill activates on: strategy, marketing, audience, business, goals, priorities, planning.
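The three scenarios above reduce to two existence checks. A minimal sketch of that decision, assuming a hypothetical `decideContextOutput` helper (the name and the `ContextAction` shape are illustrative, not part of this spec) that takes the checks as booleans so it can be tested without touching disk. It returns only the primary output; the extra scaffolding steps for a fresh CLAUDE.md are not modelled.

```typescript
// Hypothetical helper: names and shapes are illustrative, not defined by the spec.
type ContextAction =
  | { kind: "scaffold-claude-md"; path: string }
  | { kind: "create-skill"; path: string; addRoutingRow: true }
  | { kind: "update-skill"; path: string };

const SKILL_PATH = ".claude/skills/business-context/SKILL.md";

function decideContextOutput(
  hasClaudeMd: boolean,
  hasBusinessSkill: boolean,
): ContextAction {
  if (!hasClaudeMd) {
    // Fresh project: bootstrap the router file.
    return { kind: "scaffold-claude-md", path: "CLAUDE.md" };
  }
  if (!hasBusinessSkill) {
    // Router exists: add a skill and one routing row instead of growing CLAUDE.md.
    return { kind: "create-skill", path: SKILL_PATH, addRoutingRow: true };
  }
  // Both exist: refresh the skill in place.
  return { kind: "update-skill", path: SKILL_PATH };
}
```

The case "skill exists but CLAUDE.md does not" is not covered by the table; this sketch treats it as a fresh project.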
When scaffolding a fresh CLAUDE.md, the command also:
- Detects project type (package.json → Node, Cargo.toml → Rust, etc.)
- Suggests key commands based on detected tooling
- Creates the `.claude/skills/` directory
- Adds the business-context skill as the first entry in the routing table
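Project-type detection can be a plain lookup from marker file to type. The sketch below is one possible shape, not the actual implementation: `detectProjectType` and `PROJECT_TYPES` are hypothetical names, only the two markers named above are included, and the suggested commands are assumptions (the spec does not list them).

```typescript
interface ProjectType {
  marker: string;              // file whose presence identifies the project
  type: string;
  suggestedCommands: string[]; // ASSUMPTION: the spec does not say which commands are suggested
}

// Only the two mappings named in the spec; further rows would be added here.
const PROJECT_TYPES: ProjectType[] = [
  { marker: "package.json", type: "Node", suggestedCommands: ["npm install", "npm test"] },
  { marker: "Cargo.toml", type: "Rust", suggestedCommands: ["cargo build", "cargo test"] },
];

// `rootFiles` is the list of file names found in the project root.
function detectProjectType(rootFiles: string[]): ProjectType | undefined {
  const present = new Set(rootFiles);
  // First matching marker wins, so table order doubles as priority.
  return PROJECT_TYPES.find((p) => present.has(p.marker));
}
```

Passing the file list in, rather than reading the directory inside the function, keeps the lookup trivially testable.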
- Custom eval engine (NOT promptfoo): `@anthropic-ai/sdk` only, ~350 lines
- 4 production deps: `@anthropic-ai/sdk`, `commander`, `yaml`, `chalk`
- Claude-as-interviewer: send the prompt template as the system prompt and let Claude drive the conversation adaptively. The engine just routes between Claude and stdin.
- Default model: Sonnet (fast and cheap for interviews). The `--model` flag overrides it.
- OAuth-first auth: priority is `CLAUDE_CODE_OAUTH_TOKEN` > `ANTHROPIC_SETUP_TOKEN` > `ANTHROPIC_API_KEY`. OAuth tokens (`sk-ant-oat*`) use the `authToken` param + beta headers. API keys use the `apiKey` param.
```
mm-cli/
├── package.json  # bin: { mm: "./dist/cli.js" }
├── tsconfig.json
├── vitest.config.ts
├── CLAUDE.md  # Dogfooding
├── INTENT.md  # Dogfooding
├── SPEC.md  # This file
│
├── src/
│   ├── index.ts  # Commander.js entry point
│   │
│   ├── commands/
│   │   ├── preflight.ts  # mm preflight
│   │   ├── diagnose.ts  # mm diagnose [--deep]
│   │   ├── rewrite.ts  # mm rewrite
│   │   ├── context.ts  # mm context build
│   │   ├── spec.ts  # mm spec new [name]
│   │   ├── intent.ts  # mm intent init
│   │   ├── eval.ts  # mm eval new|run|compare
│   │   ├── constraint.ts  # mm constraint <task>
│   │   └── skill.ts  # mm skill new|list|validate|export
│   │
│   ├── engine/  # SHARED INTERVIEW ENGINE (core)
│   │   ├── interview.ts  # runInterview() orchestrator (~200 lines)
│   │   ├── interview-types.ts  # InterviewConfig, Phase, QuestionGroup
│   │   ├── interview-templates.ts  # ALL 8 prompt templates as InterviewConfig data (~500 lines)
│   │   ├── claude-client.ts  # @anthropic-ai/sdk wrapper (~80 lines)
│   │   ├── artifact-writer.ts  # Write SPEC.md, INTENT.md, etc. (~60 lines)
│   │   └── stdin-io.ts  # readline I/O for interviews (~50 lines)
│   │
│   ├── eval/  # EVAL ENGINE
│   │   ├── types.ts  # EvalSuite, EvalCase, ManifoldScore (~40 lines)
│   │   ├── runner.ts  # Execute evals with/without skill (~150 lines)
│   │   ├── scorer.ts  # Quality checks + Multi-Axis 5-dim (~100 lines)
│   │   └── comparator.ts  # A/B comparison table (~50 lines)
│   │
│   ├── skill/  # SKILL MANAGEMENT
│   │   ├── manager.ts  # CRUD for .claude/skills/
│   │   ├── validator.ts  # Check SKILL.md structure + tile.json
│   │   └── exporter.ts  # Convert to cursor/windsurf formats
│   │
│   ├── templates/  # Static scaffolds
│   │   ├── preflight.md  # The 7 questions
│   │   ├── skill-scaffold.md
│   │   ├── tile-scaffold.json
│   │   └── eval-scaffold.yaml
│   │
│   └── util/
│       ├── fs.ts  # Project root detection, paths
│       ├── format.ts  # Tables, markdown rendering
│       └── config.ts  # .mmrc handling
│
├── evals/  # Default eval output dir
├── test/
│   ├── engine/  # Interview engine tests
│   ├── eval/  # Eval engine tests
│   ├── commands/  # Command tests
│   └── fixtures/  # Sample SKILL.md, eval YAML
```
~2,500 total lines across source + tests.
```
┌──────────────────────────────────────────────────────┐
│  CLI Layer (Commander.js)                            │
│  src/commands/*.ts: 16 commands                      │
│  Parses args → picks InterviewConfig → orchestrates  │
└──────────┬────────────────────┬──────────────────────┘
           │                    │
┌──────────▼──────────┐   ┌─────▼──────────────────────┐
│ Interview Engine    │   │ Eval Engine                │
│ src/engine/*.ts     │   │ src/eval/*.ts              │
│ Multi-phase Claude  │   │ A/B skill testing          │
│ interviews → files  │   │ Multi-Axis 5-dim           │
└──────────┬──────────┘   └─────┬──────────────────────┘
           │                    │
┌──────────▼────────────────────▼──────────────────────┐
│  Claude Client (@anthropic-ai/sdk)                   │
│  src/engine/claude-client.ts: single wrapper         │
└──────────────────────────────────────────────────────┘
```
Key insight: 7 of 9 prompt-based commands use the SAME interview engine. Each prompt becomes a declarative InterviewConfig — the engine sends the system prompt to Claude, Claude drives the conversation, the engine routes stdin responses back. Zero custom NLP logic.
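The routing described here can be sketched as a short loop. This is a hedged illustration, not the real `runInterview()`: the `Send` and `Ask` callbacks stand in for the Claude client and the stdin layer, and the `<artifact>` completion marker and `maxTurns` cap are assumptions (the spec does not say how the engine detects that Claude has finished).

```typescript
type Role = "user" | "assistant";
interface Message { role: Role; content: string }

// Both collaborators are injected, so the loop needs neither the SDK nor readline.
type Send = (system: string, messages: Message[]) => Promise<string>;
type Ask = (question: string) => Promise<string>;

// ASSUMPTION: Claude wraps the final document in <artifact> tags.
const ARTIFACT_RE = /<artifact>([\s\S]*?)<\/artifact>/;

async function runInterviewSketch(
  systemPrompt: string,
  send: Send,
  ask: Ask,
  maxTurns = 50, // ASSUMPTION: safety cap, not specified in the spec
): Promise<{ artifact: string; transcript: Message[] }> {
  // The conversation opens with a user turn; Claude then asks the first question.
  const transcript: Message[] = [{ role: "user", content: "Begin the interview." }];

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await send(systemPrompt, transcript);
    transcript.push({ role: "assistant", content: reply });

    // Claude decides when it has enough; the engine only looks for the marker.
    const match = ARTIFACT_RE.exec(reply);
    if (match) return { artifact: match[1].trim(), transcript };

    // Otherwise show Claude's question and route the typed answer back.
    const answer = await ask(reply);
    transcript.push({ role: "user", content: answer });
  }
  throw new Error(`Interview did not finish within ${maxTurns} turns`);
}
```

Because the engine never inspects the questions themselves, adding a new interview is a matter of supplying a different system prompt.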
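```typescript
// src/engine/claude-client.ts
import Anthropic from '@anthropic-ai/sdk';

// Builds an SDK client from whichever credential is present in the environment.
// Note: `model` is accepted for later use but does not affect client construction.
function createClient(model?: string): Anthropic {
  // OAuth-style tokens take priority over a plain API key.
  const oauthToken = process.env.CLAUDE_CODE_OAUTH_TOKEN || process.env.ANTHROPIC_SETUP_TOKEN;
  const apiKey = process.env.ANTHROPIC_API_KEY;

  if (oauthToken) {
    // OAuth tokens go through `authToken` and need the beta header.
    return new Anthropic({
      authToken: oauthToken,
      defaultHeaders: {
        'anthropic-beta': 'oauth-2025-04-20',
        'user-agent': 'mm-cli/0.1.0',
      },
    });
  }

  if (apiKey) {
    // API keys need no special headers.
    return new Anthropic({ apiKey });
  }

  throw new Error(
    'No auth configured. Set CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_SETUP_TOKEN, or ANTHROPIC_API_KEY'
  );
}
```

Priority: `CLAUDE_CODE_OAUTH_TOKEN` > `ANTHROPIC_SETUP_TOKEN` > `ANTHROPIC_API_KEY`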
OAuth tokens start with sk-ant-oat — use authToken param + beta headers.
API keys start with sk-ant-api — use apiKey param, no special headers.
```typescript
interface InterviewConfig {
  id: string;                 // e.g., "spec-new", "intent-init"
  systemPrompt: string;       // <role> + <instructions> text
  phases: InterviewPhase[];   // Ordered phases with question groups
  artifactTemplate: string;   // Output format from <output> block
  guardrails: string[];       // From <guardrails> block
  enableTools?: boolean;      // When true, Claude can explore the codebase
}

// Signature only; the body lives in src/engine/interview.ts.
async function runInterview(
  config: InterviewConfig,
  client: ClaudeClient,
  io: StdinIO
): Promise<{ artifact: string; transcript: Message[] }>
```

Commands that produce specifications, eval harnesses, or constraint docs must be able to explore the local codebase. Without this, the output is generic boilerplate that is not grounded in actual code.
Implementation: src/engine/tools.ts — three tools exposed to Claude via the Anthropic API's native tool use:
| Tool | What it does | Example |
|---|---|---|
| `read_file` | Read any file in the project | `read_file({ path: "lib/db/schema.ts" })` |
| `list_files` | Find files by name pattern | `list_files({ pattern: "*.ts", path: "src/" })` |
| `search_files` | Grep file contents with regex | `search_files({ pattern: "ffmpeg", file_pattern: "*.ts" })` |
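For reference, the three tools could be declared in the `name` / `description` / `input_schema` shape that the Anthropic Messages API uses for tool definitions. This is a sketch: the parameter names come from the examples in the table, while the descriptions and the choice of which parameters are required are assumptions.

```typescript
// Shape of a tool definition as sent to the Messages API (JSON Schema input).
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// ASSUMPTION: which parameters are required is not stated in the spec.
const TOOLS: ToolDefinition[] = [
  {
    name: "read_file",
    description: "Read any file in the project",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string", description: "Path relative to the project root" },
      },
      required: ["path"],
    },
  },
  {
    name: "list_files",
    description: "Find files by name pattern",
    input_schema: {
      type: "object",
      properties: {
        pattern: { type: "string", description: "File name pattern, e.g. *.ts" },
        path: { type: "string", description: "Directory to search in" },
      },
      required: ["pattern"],
    },
  },
  {
    name: "search_files",
    description: "Grep file contents with regex",
    input_schema: {
      type: "object",
      properties: {
        pattern: { type: "string", description: "Regular expression to search for" },
        file_pattern: { type: "string", description: "Limit the search to matching file names" },
      },
      required: ["pattern"],
    },
  },
];
```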
Which commands get tools:
| Command | Tools? | Why |
|---|---|---|
| `mm spec new` | Yes | Must read codebase to write grounded specs |
| `mm eval new` | Yes | Must read SKILL.md + codebase to generate test cases |
| `mm constraint` | Yes | Must understand codebase to define constraint architecture |
| `mm diagnose` | No | Assesses user practices, not code |
| `mm context build` | No | Captures personal/business context |
| `mm intent init` | No | Captures human priorities |
| `mm rewrite` | No | Rewrites a text prompt |
How it works: When InterviewConfig.enableTools is true, the interview engine:
- Appends a `<tools-context>` block to the system prompt telling Claude to use tools proactively
- Uses `ClaudeClient.sendWithTools()` instead of `send()` for API calls
- Maintains a separate `apiMessages` array (with tool_use/tool_result content blocks) alongside the simple transcript
- Prints `⚙ tool_name(detail)` to stderr when Claude uses a tool, so the user sees what's happening
- Handles the tool loop (Claude requests → execute locally → send result → Claude continues) with a 25-iteration safety limit
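The tool loop in the last bullet can be sketched as follows. The content-block shapes follow the Messages API tool-use format (`tool_use` blocks from the assistant, `tool_result` blocks sent back in a user message); `runToolLoop`, `SendWithTools`, and `ExecuteTool` are hypothetical names, and the model call is injected so the sketch does not depend on the SDK.

```typescript
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type ContentBlock = TextBlock | ToolUseBlock | ToolResultBlock;

interface ApiMessage { role: "user" | "assistant"; content: ContentBlock[] }
interface ModelTurn { stop_reason: string; content: ContentBlock[] }

type SendWithTools = (messages: ApiMessage[]) => Promise<ModelTurn>;
type ExecuteTool = (name: string, input: Record<string, unknown>) => Promise<string>;

const MAX_TOOL_ITERATIONS = 25; // safety limit from the spec

async function runToolLoop(
  apiMessages: ApiMessage[],
  send: SendWithTools,
  execute: ExecuteTool,
  log: (line: string) => void = console.error, // stderr, so stdout stays clean
): Promise<string> {
  for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
    const turn = await send(apiMessages);
    apiMessages.push({ role: "assistant", content: turn.content });

    const toolUses = turn.content.filter((b): b is ToolUseBlock => b.type === "tool_use");
    if (turn.stop_reason !== "tool_use" || toolUses.length === 0) {
      // No more tool requests: return the assistant's text for this turn.
      return turn.content
        .filter((b): b is TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("");
    }

    // Execute every requested tool locally and send all results back in one user turn.
    const results: ContentBlock[] = [];
    for (const use of toolUses) {
      log(`⚙ ${use.name}(${JSON.stringify(use.input)})`);
      const output = await execute(use.name, use.input);
      results.push({ type: "tool_result", tool_use_id: use.id, content: output });
    }
    apiMessages.push({ role: "user", content: results });
  }
  throw new Error(`Tool loop exceeded ${MAX_TOOL_ITERATIONS} iterations`);
}
```

Throwing on the 25th unfinished iteration is one way to enforce the limit; the real engine might instead ask Claude to wrap up.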
Security: Tools are sandboxed to the current working directory. Path traversal outside process.cwd() is rejected. File output is truncated at 10KB. Shell commands have timeouts (5s for find, 10s for grep).
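A minimal sketch of the path check and output truncation, using only Node's `path` module. `resolveInSandbox` and `truncateOutput` are hypothetical helper names, and the `[truncated]` marker is an assumption. The check is lexical only: it does not resolve symlinks, which a hardened version would also need to handle.

```typescript
import * as path from "node:path";

const MAX_OUTPUT_BYTES = 10 * 1024; // 10KB limit from the spec

// Resolve `requested` against `root` and reject anything that lands outside it.
function resolveInSandbox(root: string, requested: string): string {
  const absRoot = path.resolve(root);
  const target = path.resolve(absRoot, requested);
  const rel = path.relative(absRoot, target);
  // A relative path that climbs out starts with ".."; on Windows a different
  // drive yields an absolute path instead.
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`Path escapes sandbox: ${requested}`);
  }
  return target;
}

// Cap tool output by byte length. Cutting in the middle of a multi-byte
// character may leave one replacement character at the end.
function truncateOutput(text: string, maxBytes: number = MAX_OUTPUT_BYTES): string {
  const buf = Buffer.from(text, "utf8");
  if (buf.length <= maxBytes) return text;
  return buf.subarray(0, maxBytes).toString("utf8") + "\n[truncated]";
}
```

In the CLI, `root` would be `process.cwd()`, so `resolveInSandbox(process.cwd(), "../secrets.env")` throws while `resolveInSandbox(process.cwd(), "src/index.ts")` returns an absolute path.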
- Auto-save immediately: Write the artifact to disk as soon as Claude finishes generating it, BEFORE any follow-up prompt. Never gate the save behind a y/N question.
- Clear completion signal: Print `Saved to <filename>` immediately after artifact generation.
- Explicit follow-up prompt: If offering to continue, say exactly what continuing does: `"Follow up on your results? (Ask questions, refine the document, etc.) (y/N)"`, not a vague "Continue the conversation?"
- Default to exit: `N` is the default. The artifact is already saved. Pressing Enter exits cleanly.
- No dangling state: If the user presses Ctrl-C at any point after the artifact is generated, the file should already be on disk.
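The ordering these rules require (write, confirm, then ask) can be made explicit in code. A sketch, assuming a hypothetical `finishInterview` helper with injected I/O so the order can be checked in a test; in the CLI, `writeFile` would be a synchronous disk write.

```typescript
interface FinishDeps {
  writeFile: (path: string, data: string) => void; // synchronous, so the save completes first
  print: (line: string) => void;
  ask: (prompt: string) => Promise<string>;
}

const FOLLOW_UP_PROMPT =
  "Follow up on your results? (Ask questions, refine the document, etc.) (y/N) ";

// Returns true only if the user explicitly opts in to a follow-up.
async function finishInterview(
  artifact: string,
  filename: string,
  deps: FinishDeps,
): Promise<boolean> {
  // 1. Save before anything interactive, so Ctrl-C from here on loses nothing.
  deps.writeFile(filename, artifact);
  // 2. Clear completion signal.
  deps.print(`Saved to ${filename}`);
  // 3. Only now offer the follow-up; an empty answer (Enter) means "no".
  const answer = (await deps.ask(FOLLOW_UP_PROMPT)).trim().toLowerCase();
  return answer === "y" || answer === "yes";
}
```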
```typescript
// Multi-Axis 5 scoring dimensions (1-3 each, max 15)
interface ManifoldScore {
  selectiveTransfer: number;    // What still holds vs needs revision?
  causalTransparency: number;   // Can it explain WHY?
  creativeRerouting: number;    // Finds alternatives when blocked?
  degradationAwareness: number; // Flags harder/impossible?
  outputCoherence: number;      // Satisfies original + new constraint?
}

// A/B testing: with skill vs without skill
//   mm eval run <skill>                  → with SKILL.md loaded
//   mm eval run <skill> --without-skill  → baseline
//   mm eval compare <skill>              → delta table
```

Goal: Repo scaffold + CLI entry + preflight + skill management.
Build:
- Scaffold repo: `npm init`, install deps (`@anthropic-ai/sdk`, `commander`, `yaml`, `chalk`; dev: `typescript`, `tsx`, `vitest`)
- `src/index.ts`: Commander.js root with all subcommand registrations
- `src/commands/preflight.ts`: Print the 7 pre-flight questions
- `src/commands/skill.ts`: `skill new`, `skill list`, `skill validate`
- `src/skill/manager.ts`: CRUD for `.claude/skills/<name>/`
- `src/skill/validator.ts`: Check SKILL.md frontmatter, line count (<200), self-improvement section
- `src/util/fs.ts`: Project root detection
- `src/templates/`: Static templates (preflight.md, skill-scaffold.md, tile-scaffold.json)
- `CLAUDE.md` for mm-cli itself (dogfooding)

Verify:
- `mm preflight` prints the 7 pre-flight questions
- `mm skill new test-skill` creates `.claude/skills/test-skill/SKILL.md` + `tile.json`
- `mm skill list` shows table of skills in current project
- `mm skill validate` reports structural issues (missing frontmatter, oversized)
- `mm --help` shows all commands
- `vitest run` passes (>=3 tests)
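The three validator checks (frontmatter, line count under 200, self-improvement section) can be sketched as one pure function over the file contents. The detection rules are assumptions: frontmatter is taken to be a block delimited by `---` lines starting on line 1, and the self-improvement section is taken to be any markdown heading that mentions "self-improvement". `validateSkill` is a hypothetical name.

```typescript
const MAX_SKILL_LINES = 200; // spec: line count must stay under 200

// Returns a list of human-readable issues; an empty list means the file passes.
function validateSkill(markdown: string): string[] {
  const issues: string[] = [];
  // Drop one trailing newline so it does not count as an extra line.
  const lines = markdown.replace(/\r?\n$/, "").split(/\r?\n/);

  // ASSUMPTION: frontmatter opens with `---` on line 1 and closes with a later `---`.
  const closing = lines[0] === "---" ? lines.indexOf("---", 1) : -1;
  if (closing === -1) issues.push("missing frontmatter");

  if (lines.length >= MAX_SKILL_LINES) {
    issues.push(`oversized: ${lines.length} lines (limit <${MAX_SKILL_LINES})`);
  }

  // ASSUMPTION: the section is identified by a heading containing "self-improvement".
  const hasSelfImprovement = lines.some((l) => /^#{1,6}\s.*self-improvement/i.test(l));
  if (!hasSelfImprovement) issues.push("missing self-improvement section");

  return issues;
}
```

`mm skill validate` would then read each SKILL.md, call this function, and print the issues per skill.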
Goal: The shared interview engine + mm diagnose + mm rewrite proving it works.
Build:
- `src/engine/claude-client.ts`: Anthropic SDK wrapper
- `src/engine/stdin-io.ts`: readline-based user I/O
- `src/engine/interview-types.ts`: All types
- `src/engine/interview.ts`: Core `runInterview()` function
- `src/engine/interview-templates.ts`: Templates for `DIAGNOSE_QUICK` (Q1) and `REWRITE` (Q2)
- `src/engine/artifact-writer.ts`: Write output files
- `src/commands/diagnose.ts`: `mm diagnose` using DIAGNOSE_QUICK template
- `src/commands/rewrite.ts`: `mm rewrite` (reads from stdin or file arg)
- `src/util/config.ts`: `.mmrc` loading, env var resolution

Verify:
- `echo "update the dashboard" | mm rewrite` runs Q2 interview, outputs rewritten problem statement with gap map
- `mm diagnose` conducts 5-question interview, produces scored 4-discipline table + starter CONTEXT.md
- `--dry-run` flag prints the messages array without calling API
- Ctrl-C gracefully interrupts interview
- `vitest run` passes with mock Claude response tests
Goal: All 6 remaining interview commands. Fast because engine already exists — just new templates + thin wrappers.
Build:
- Add templates to `interview-templates.ts`:
  - `DIAGNOSE_DEEP`: 12-question deep diagnostic + 4-month roadmap
  - `CONTEXT_BUILD`: 7-domain personal context document
  - `SPEC_NEW`: Specification engineer, 3 phases
  - `INTENT_INIT`: Intent & delegation framework, 3 phases
  - `EVAL_HARNESS`: Eval harness builder, 2 phases
  - `CONSTRAINT_DESIGNER`: Constraint architecture, 3 phases
- `src/commands/context.ts`: `mm context build`
- `src/commands/spec.ts`: `mm spec new [name]`
- `src/commands/intent.ts`: `mm intent init`
- `src/commands/constraint.ts`: `mm constraint <task>`
- Update `src/commands/diagnose.ts` for `--deep` flag
- Stub `src/commands/eval.ts` with `eval new --interview` (interview mode only)

Verify:
- `mm diagnose --deep` produces DIAGNOSTIC.md + ROADMAP.md with 1-10 scoring
- `mm context build` produces CLAUDE.md (in a project that has none) through the 7-domain interview
- `mm spec new auth-system` produces `specs/auth-system.md` with all 7 sections
- `mm intent init` produces INTENT.md with Priority Hierarchy, Decision Authority Map, Quality Thresholds, Rigor Test
- `mm constraint deploy-pipeline` produces `constraints/deploy-pipeline.md` with 4-quadrant structure
- `mm eval new my-skill --interview` produces `evals/my-skill/eval.yaml`
Goal: Custom eval engine with A/B skill testing and Multi-Axis 5-dimension scoring.
Build:
- `src/eval/types.ts`: EvalSuite, EvalCase, ManifoldScore, EvalResult
- `src/eval/runner.ts`: Execute eval suite against Claude API (with/without skill)
- `src/eval/scorer.ts`: Quality checkbox scoring + Multi-Axis 5-dim scoring via Claude-as-judge
- `src/eval/comparator.ts`: A/B comparison table
- Complete `src/commands/eval.ts`:
  - `mm eval new <skill> --quick`: auto-generate eval from SKILL.md
  - `mm eval run <skill>`: execute with skill loaded
  - `mm eval run <skill> --without-skill`: execute baseline
  - `mm eval compare <skill>`: display delta table
- `src/templates/eval-scaffold.yaml`
- Tests with fixture data

Verify:
- `mm eval new my-skill --quick` reads SKILL.md, generates eval YAML with 3-5 test cases
- `mm eval run my-skill` executes all cases, writes results JSON
- `mm eval run my-skill --without-skill` runs same cases without SKILL.md
- `mm eval compare my-skill` shows side-by-side score deltas
- Multi-Axis dimensions scored when constraint variations present
- Eval engine total: <=400 lines
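The comparison step is mostly arithmetic over the five Multi-Axis dimensions (1-3 each, max 15). A sketch of how totals and per-dimension deltas could be computed; `totalScore`, `compareScores`, and `DimensionDelta` are hypothetical names, and the range check is an assumption about how invalid judge output should be handled.

```typescript
interface ManifoldScore {
  selectiveTransfer: number;
  causalTransparency: number;
  creativeRerouting: number;
  degradationAwareness: number;
  outputCoherence: number;
}

const DIMENSIONS = [
  "selectiveTransfer",
  "causalTransparency",
  "creativeRerouting",
  "degradationAwareness",
  "outputCoherence",
] as const;

// Sum the five dimensions, rejecting anything outside the 1-3 scale.
function totalScore(score: ManifoldScore): number {
  let total = 0;
  for (const dim of DIMENSIONS) {
    const value = score[dim];
    if (!Number.isInteger(value) || value < 1 || value > 3) {
      throw new RangeError(`${dim} must be an integer from 1 to 3, got ${value}`);
    }
    total += value;
  }
  return total;
}

interface DimensionDelta {
  dimension: keyof ManifoldScore;
  baseline: number;
  withSkill: number;
  delta: number; // positive means the skill helped on this dimension
}

// One row per dimension, ready to render as the comparison table.
function compareScores(withSkill: ManifoldScore, baseline: ManifoldScore): DimensionDelta[] {
  return DIMENSIONS.map((dimension) => ({
    dimension,
    baseline: baseline[dimension],
    withSkill: withSkill[dimension],
    delta: withSkill[dimension] - baseline[dimension],
  }));
}
```

A row with a negative delta is as informative as a positive one: it shows a dimension where loading the skill made the output worse.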
Goal: Multi-format export, global install, full dogfooding pass.
Build:
- `src/skill/exporter.ts`: Convert `.claude/skills/` to `.cursorrules` / `.windsurfrules`
- Update `src/commands/skill.ts` with `skill export --format cursor|windsurf|merged`
- Dogfooding pass: run `mm` against its own repo
- Global flags: `--verbose`, `--dry-run`, `--json`
- Error handling: missing API key, missing files, invalid YAML
- `npm link` for global install
- Finalize mm-cli's own CLAUDE.md, INTENT.md, SPEC.md

Verify:
- `mm skill export --format cursor` produces valid `.cursorrules`
- `npm link && mm --help` works globally
- All 16 commands functional
- `vitest run` passes (>=15 tests)
```yaml
name: database-skill-eval
skill: .claude/skills/database/SKILL.md
model: claude-sonnet-4-20250514
judge: claude-sonnet-4-20250514
scenarios:
  - name: prisma-query
    prompt: |
      Write a Prisma query that fetches all users with role 'admin'
      along with their most recent login sessions.
    context: |
      Node.js API using Prisma ORM with PostgreSQL.
    expected_qualities:
      - Uses Prisma client API, not raw SQL
      - References correct model/field names
      - Includes relation loading for sessions
    failure_modes:
      - Uses raw SQL instead of Prisma client
      - Wrong field names from hallucination
    scoring:
      excellent: 5
      acceptable: 3
      poor: 1
  - name: constraint-shift-pagination
    base_scenario: prisma-query
    constraint_change: |
      Also add cursor-based pagination with a limit of 20.
      Results must be sorted by lastLoginAt descending.
    manifold_dimensions:
      selective_transfer: "Original role filter + session include unchanged, only add pagination"
      causal_transparency: "Should explain cursor vs offset pagination tradeoffs"
      creative_rerouting: "If cursor field has duplicates, needs secondary sort key"
      degradation_awareness: "Flag that cursor pagination doesn't support arbitrary page jumps"
      output_coherence: "Must still filter admins + load sessions AND paginate correctly"
```

- Interview commands: 2-5 API calls per interview ($0.02-0.10)
- Eval run: 4 calls per scenario (baseline + skilled + judge x2). 3 scenarios = ~$0.10-0.50
- Total dev cost for building + testing: ~$5-10
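The eval-run estimate is simple arithmetic: 4 calls per scenario times 3 scenarios is 12 API calls, so the quoted ~$0.10-0.50 range implies roughly $0.008 to $0.042 per call. A small sketch of that calculation; `evalRunCalls` and `impliedCostPerCall` are hypothetical helpers, and the per-call figures are derived from the estimates above, not from a price list.

```typescript
const CALLS_PER_SCENARIO = 4; // baseline + skilled + 2 judge calls, per the estimate above

function evalRunCalls(scenarios: number): number {
  return scenarios * CALLS_PER_SCENARIO;
}

// Back out the per-call cost range implied by a total cost estimate.
function impliedCostPerCall(
  totalLow: number,
  totalHigh: number,
  scenarios: number,
): { low: number; high: number } {
  const calls = evalRunCalls(scenarios);
  return { low: totalLow / calls, high: totalHigh / calls };
}

const calls = evalRunCalls(3);
const perCall = impliedCostPerCall(0.1, 0.5, 3);
console.log(`${calls} calls, ~$${perCall.low.toFixed(3)}-$${perCall.high.toFixed(3)} per call`);
```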
| Out of Scope | Why |
|---|---|
| Web UI / dashboard | Dashboards are a dying paradigm. Agents don't need them. |
| promptfoo integration | Custom engine only, decision made |
| OpenAI / Gemini model support | Claude only for v1 |
| Resume interrupted interviews | Too complex for v1. Ctrl-C = restart |
| VS Code extension | Future. StdinIO abstraction enables it later |
| Auto-generate SKILL.md from codebase | Unbounded. mm skill new scaffolds; human fills |
| Cloud sync / team features | Local-first. Git is the sync mechanism. |
| Custom interview questions | Templates are fixed for v1. |
| CI integration | Manual mm eval run. Hosted API gets pipeline gates (future). |
| Internationalization | English only |