mm CLI — Build Specification

Context

mm-cli is the tool that manages skills, runs evals, and scaffolds all four disciplines of AI input. The core insight: skills without evals are just vibes, and evals without intent/spec engineering are just benchmarks.

The prompt kit contains 10 templates covering all 4 disciplines. This spec integrates all of them.

Command Map

| Command | Template | Input | Output File | Interview? |
|---|---|---|---|---|
| `mm preflight` | Pre-flight Checklist | None (prints checklist) | stdout (optionally `PREFLIGHT.md`) | No |
| `mm diagnose` | Rapid Diagnostic | 5-question interview | `CONTEXT.md` (starter) | Yes (5 Qs) |
| `mm diagnose --deep` | Deep Diagnostic | 12-question interview, 6 groups | `DIAGNOSTIC.md` + `ROADMAP.md` | Yes (12 Qs) |
| `mm rewrite` | Problem Statement Rewriter | stdin/file with vague requests | stdout or `REWRITE.md` | Partial (clarifying Qs) |
| `mm context build` | Context Doc Builder | 7-domain deep interview | Smart: see below | Yes (7 domains) |
| `mm spec new [name]` | Specification Engineer | 3-phase interview | `SPEC.md` or `specs/<name>.md` | Yes (3 phases) |
| `mm intent init` | Intent & Delegation Framework | 3-phase interview | `INTENT.md` | Yes (3 phases) |
| `mm eval new <skill>` | Eval Harness Builder | 2-phase interview | `evals/<skill>/eval.yaml` | Yes (2 phases) |
| `mm eval new <skill> --quick` | Eval (auto mode) | Reads SKILL.md | `evals/<skill>/eval.yaml` | No |
| `mm eval run <skill>` | Eval engine | eval.yaml + Claude API | `evals/<skill>/results/<ts>.json` | No |
| `mm eval compare <skill>` | Multi-Axis A/B | Two result sets | Comparison table (stdout) | No |
| `mm constraint <task>` | Constraint Architecture | 3-phase interview | `constraints/<task>.md` | Yes (3 phases) |
| `mm skill new <name>` | Skill management | Interactive prompts | `.claude/skills/<name>/SKILL.md` + `tile.json` | Minimal |
| `mm skill list` | Skill management | Reads filesystem | stdout table | No |
| `mm skill validate [name]` | Skill management | Reads SKILL.md files | stdout report | No |
| `mm skill export --format cursor` | Multi-format export | Reads `.claude/skills/` | `.cursorrules` or equivalent | No |

16 commands. Zero gaps. Every prompt mapped.

`mm context build` — Smart Output Logic

CLAUDE.md is a routing table (target 30-60 lines, with a hard ceiling of about 100). It points to skills. Skills hold the actual knowledge. `mm context build` respects this:

| Scenario | Output | Why |
|---|---|---|
| No CLAUDE.md exists | Scaffold `CLAUDE.md` (~60 lines: overview, skills table, key commands, git rules) | Bootstrap a new project |
| CLAUDE.md exists, no business-context skill | `.claude/skills/business-context/SKILL.md` + add row to CLAUDE.md routing table | Don't bloat the router; add a skill |
| CLAUDE.md exists + business-context skill | Update `.claude/skills/business-context/SKILL.md` | Refresh existing skill |

The interview output (7-domain personal context) always becomes a skill in existing projects. The skill activates on: strategy, marketing, audience, business, goals, priorities, planning.
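The three scenarios in the table above reduce to a small decision function. The following is a minimal sketch, not the actual implementation: the names `ContextAction`, `decideContextOutput`, and `inspectProject` are hypothetical, and only the two file paths come from this spec.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical result type: one variant per row of the scenario table.
type ContextAction =
  | { kind: "scaffold-claude-md" }                 // no CLAUDE.md yet
  | { kind: "create-skill"; addRoutingRow: true }  // CLAUDE.md exists, skill missing
  | { kind: "update-skill" };                      // both exist

// Pure decision logic, kept separate from filesystem access so it is easy to test.
function decideContextOutput(hasClaudeMd: boolean, hasBusinessSkill: boolean): ContextAction {
  if (!hasClaudeMd) return { kind: "scaffold-claude-md" };
  if (!hasBusinessSkill) return { kind: "create-skill", addRoutingRow: true };
  return { kind: "update-skill" };
}

// Thin wrapper that inspects a project root (paths taken from the table above).
function inspectProject(root: string): ContextAction {
  const hasClaudeMd = fs.existsSync(path.join(root, "CLAUDE.md"));
  const hasSkill = fs.existsSync(
    path.join(root, ".claude", "skills", "business-context", "SKILL.md"),
  );
  return decideContextOutput(hasClaudeMd, hasSkill);
}
```

Keeping the decision pure means the three rows can be unit-tested without touching disk.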

When scaffolding a fresh CLAUDE.md, the command also:

Detects project type (package.json → Node, Cargo.toml → Rust, etc.)
Suggests key commands based on detected tooling
Creates the .claude/skills/ directory
Adds the business-context skill as the first entry in the routing table
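Project-type detection as described above can be done by probing for marker files. This is a rough sketch under assumptions: only the two markers named in this spec are included, and the suggested commands in `SUGGESTED_COMMANDS` are illustrative guesses, not a list the spec defines.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type ProjectType = "node" | "rust" | "unknown";

// Marker files named in the spec. Further entries would be added the same way.
const MARKERS: Array<[file: string, type: ProjectType]> = [
  ["package.json", "node"],
  ["Cargo.toml", "rust"],
];

// Assumption: these command suggestions are placeholders for illustration.
const SUGGESTED_COMMANDS: Record<ProjectType, string[]> = {
  node: ["npm install", "npm test"],
  rust: ["cargo build", "cargo test"],
  unknown: [],
};

// Returns the type of the first marker file found in the project root.
function detectProjectType(root: string): ProjectType {
  for (const [file, type] of MARKERS) {
    if (fs.existsSync(path.join(root, file))) return type;
  }
  return "unknown";
}

function suggestKeyCommands(root: string): string[] {
  return SUGGESTED_COMMANDS[detectProjectType(root)];
}
```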

Key Decisions

Custom eval engine (NOT promptfoo) — @anthropic-ai/sdk only, ~350 lines
4 production deps — @anthropic-ai/sdk, commander, yaml, chalk
Claude-as-interviewer — send the prompt template as system prompt, let Claude drive the conversation adaptively. The engine just routes between Claude and stdin.
Default model: Sonnet — fast + cheap for interviews. --model flag overrides.
OAuth-first auth — Priority: CLAUDE_CODE_OAUTH_TOKEN > ANTHROPIC_SETUP_TOKEN > ANTHROPIC_API_KEY. OAuth tokens (sk-ant-oat*) use the authToken param plus beta headers. API keys use the apiKey param.

Project Structure

mm-cli/
├── package.json                     # bin: { mm: "./dist/cli.js" }
├── tsconfig.json
├── vitest.config.ts
├── CLAUDE.md                        # Dogfooding
├── INTENT.md                        # Dogfooding
├── SPEC.md                          # This file
│
├── src/
│   ├── index.ts                     # Commander.js entry point
│   │
│   ├── commands/
│   │   ├── preflight.ts             # mm preflight
│   │   ├── diagnose.ts              # mm diagnose [--deep]
│   │   ├── rewrite.ts               # mm rewrite
│   │   ├── context.ts               # mm context build
│   │   ├── spec.ts                  # mm spec new [name]
│   │   ├── intent.ts                # mm intent init
│   │   ├── eval.ts                  # mm eval new|run|compare
│   │   ├── constraint.ts            # mm constraint <task>
│   │   └── skill.ts                 # mm skill new|list|validate|export
│   │
│   ├── engine/                      # SHARED INTERVIEW ENGINE (core)
│   │   ├── interview.ts             # runInterview() orchestrator (~200 lines)
│   │   ├── interview-types.ts       # InterviewConfig, Phase, QuestionGroup
│   │   ├── interview-templates.ts   # ALL 8 prompt templates as InterviewConfig data (~500 lines)
│   │   ├── claude-client.ts         # @anthropic-ai/sdk wrapper (~80 lines)
│   │   ├── artifact-writer.ts       # Write SPEC.md, INTENT.md, etc. (~60 lines)
│   │   ├── stdin-io.ts              # readline I/O for interviews (~50 lines)
│   │   └── tools.ts                 # read_file, list_files, search_files (codebase tool use)
│   │
│   ├── eval/                        # EVAL ENGINE
│   │   ├── types.ts                 # EvalSuite, EvalCase, ManifoldScore (~40 lines)
│   │   ├── runner.ts                # Execute evals with/without skill (~150 lines)
│   │   ├── scorer.ts                # Quality checks + Multi-Axis 5-dim (~100 lines)
│   │   └── comparator.ts            # A/B comparison table (~50 lines)
│   │
│   ├── skill/                       # SKILL MANAGEMENT
│   │   ├── manager.ts               # CRUD for .claude/skills/
│   │   ├── validator.ts             # Check SKILL.md structure + tile.json
│   │   └── exporter.ts              # Convert to cursor/windsurf formats
│   │
│   ├── templates/                   # Static scaffolds
│   │   ├── preflight.md             # The 7 questions
│   │   ├── skill-scaffold.md
│   │   ├── tile-scaffold.json
│   │   └── eval-scaffold.yaml
│   │
│   └── util/
│       ├── fs.ts                    # Project root detection, paths
│       ├── format.ts                # Tables, markdown rendering
│       └── config.ts                # .mmrc handling
│
├── evals/                           # Default eval output dir
├── test/
│   ├── engine/                      # Interview engine tests
│   ├── eval/                        # Eval engine tests
│   ├── commands/                    # Command tests
│   └── fixtures/                    # Sample SKILL.md, eval YAML

~2,500 total lines across source + tests.

Architecture — Three Layers

┌──────────────────────────────────────────────────┐
│              CLI Layer (Commander.js)              │
│  src/commands/*.ts — 16 commands                  │
│  Parses args → picks InterviewConfig → orchestrates│
└──────────┬────────────────────┬───────────────────┘
           │                    │
┌──────────▼──────────┐  ┌─────▼──────────────────┐
│  Interview Engine    │  │    Eval Engine          │
│  src/engine/*.ts     │  │  src/eval/*.ts          │
│  Multi-phase Claude  │  │  A/B skill testing      │
│  interviews → files  │  │  Multi-Axis 5-dim       │
└──────────┬──────────┘  └─────┬──────────────────┘
           │                    │
┌──────────▼────────────────────▼──────────────────┐
│        Claude Client (@anthropic-ai/sdk)          │
│  src/engine/claude-client.ts — single wrapper     │
└──────────────────────────────────────────────────┘

Key insight: 7 of 9 prompt-based commands use the SAME interview engine. Each prompt becomes a declarative InterviewConfig — the engine sends the system prompt to Claude, Claude drives the conversation, the engine routes stdin responses back. Zero custom NLP logic.

Claude Client — OAuth-First Auth

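// src/engine/claude-client.ts
import Anthropic from '@anthropic-ai/sdk';

// Build an SDK client, preferring OAuth tokens over API keys.
// `model` is not used here: the SDK takes the model per request, not per client.
function createClient(model?: string): Anthropic {
  // OAuth sources, in priority order.
  const oauthToken = process.env.CLAUDE_CODE_OAUTH_TOKEN || process.env.ANTHROPIC_SETUP_TOKEN;
  const apiKey = process.env.ANTHROPIC_API_KEY;

  if (oauthToken) {
    // OAuth tokens go in authToken and need the beta header.
    return new Anthropic({
      authToken: oauthToken,
      defaultHeaders: {
        'anthropic-beta': 'oauth-2025-04-20',
        'user-agent': 'mm-cli/0.1.0',
      },
    });
  }

  if (apiKey) {
    // Plain API keys need no special headers.
    return new Anthropic({ apiKey });
  }

  throw new Error(
    'No auth configured. Set CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_SETUP_TOKEN, or ANTHROPIC_API_KEY'
  );
}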

Priority: CLAUDE_CODE_OAUTH_TOKEN > ANTHROPIC_SETUP_TOKEN > ANTHROPIC_API_KEY. OAuth tokens start with sk-ant-oat and use the authToken param plus beta headers. API keys start with sk-ant-api and use the apiKey param with no special headers.

Interview Engine Core Pattern

interface InterviewConfig {
  id: string;                     // e.g., "spec-new", "intent-init"
  systemPrompt: string;           // <role> + <instructions> text
  phases: InterviewPhase[];       // Ordered phases with question groups
  artifactTemplate: string;       // Output format from <output> block
  guardrails: string[];           // From <guardrails> block
  enableTools?: boolean;          // When true, Claude can explore the codebase
}

async function runInterview(
  config: InterviewConfig,
  client: ClaudeClient,
  io: StdinIO
): Promise<{ artifact: string; transcript: Message[] }>
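To make the routing idea concrete, here is one way the core loop behind `runInterview()` might look. It is a sketch under stated assumptions, not the spec's implementation: the `ClaudeClient` and `StdinIO` shapes shown are hypothetical minimal interfaces, the `<artifact>` tag used to detect completion is an invented convention, and `maxTurns` is an illustrative safety cap.

```typescript
// Hypothetical minimal interfaces; the real ClaudeClient and StdinIO may differ.
interface Message { role: "user" | "assistant"; content: string }
interface ClaudeClient { send(system: string, messages: Message[]): Promise<string> }
interface StdinIO { ask(prompt: string): Promise<string> }

// Assumption: Claude wraps the final document in <artifact>...</artifact>.
const ARTIFACT_RE = /<artifact>([\s\S]*?)<\/artifact>/;

async function runInterviewSketch(
  systemPrompt: string,
  client: ClaudeClient,
  io: StdinIO,
  maxTurns = 50, // illustrative cap so a stuck interview cannot loop forever
): Promise<{ artifact: string; transcript: Message[] }> {
  // Seed the conversation so Claude asks the first question.
  const transcript: Message[] = [{ role: "user", content: "Begin the interview." }];

  for (let turn = 0; turn < maxTurns; turn++) {
    const reply = await client.send(systemPrompt, transcript);
    transcript.push({ role: "assistant", content: reply });

    // Stop as soon as Claude emits the finished artifact.
    const match = ARTIFACT_RE.exec(reply);
    if (match) return { artifact: match[1].trim(), transcript };

    // Otherwise show Claude's question and route the user's answer back.
    const answer = await io.ask(reply);
    transcript.push({ role: "user", content: answer });
  }
  throw new Error(`Interview did not produce an artifact within ${maxTurns} turns`);
}
```

The engine holds no question logic of its own. Everything adaptive lives in the system prompt, which is what lets each command be a declarative config.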

Codebase Tool Use (Critical for Spec/Eval/Constraint)

Commands that produce specifications, eval harnesses, or constraint docs must be able to explore the local codebase. Without this, the output is generic boilerplate — not grounded in actual code.

Implementation: src/engine/tools.ts — three tools exposed to Claude via the Anthropic API's native tool use:

| Tool | What it does | Example |
|---|---|---|
| `read_file` | Read any file in the project | `read_file({ path: "lib/db/schema.ts" })` |
| `list_files` | Find files by name pattern | `list_files({ pattern: "*.ts", path: "src/" })` |
| `search_files` | Grep file contents with regex | `search_files({ pattern: "ffmpeg", file_pattern: "*.ts" })` |

Which commands get tools:

| Command | Tools? | Why |
|---|---|---|
| `mm spec new` | Yes | Must read codebase to write grounded specs |
| `mm eval new` | Yes | Must read SKILL.md + codebase to generate test cases |
| `mm constraint` | Yes | Must understand codebase to define constraint architecture |
| `mm diagnose` | No | Assesses user practices, not code |
| `mm context build` | No | Captures personal/business context |
| `mm intent init` | No | Captures human priorities |
| `mm rewrite` | No | Rewrites a text prompt |

How it works: When InterviewConfig.enableTools is true, the interview engine:

Appends a <tools-context> block to the system prompt telling Claude to use tools proactively
Uses ClaudeClient.sendWithTools() instead of send() for API calls
Maintains a separate apiMessages array (with tool_use/tool_result content blocks) alongside the simple transcript
Prints ⚙ tool_name(detail) to stderr when Claude uses a tool, so the user sees what's happening
Handles the tool loop (Claude requests → execute locally → send result → Claude continues) with a 25-iteration safety limit
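The tool loop in the last step could be sketched as follows. The block shapes (`tool_use`, `tool_result`, `stop_reason`) mirror the Anthropic Messages API, but the function and type names (`runToolLoop`, `SendWithTools`, `ExecuteTool`) are hypothetical, and the API call is injected as a function so the sketch stays self-contained.

```typescript
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type ApiMessage = {
  role: "user" | "assistant";
  content: string | Array<TextBlock | ToolUseBlock | ToolResultBlock>;
};
type ApiResponse = { stop_reason: string; content: Array<TextBlock | ToolUseBlock> };

// Hypothetical injection points: one API round trip, and one local tool execution.
type SendWithTools = (messages: ApiMessage[]) => Promise<ApiResponse>;
type ExecuteTool = (name: string, input: Record<string, unknown>) => Promise<string>;

const MAX_TOOL_ITERATIONS = 25; // safety limit from the spec

async function runToolLoop(
  apiMessages: ApiMessage[],
  send: SendWithTools,
  execute: ExecuteTool,
): Promise<string> {
  for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
    const response = await send(apiMessages);
    apiMessages.push({ role: "assistant", content: response.content });

    const toolUses = response.content.filter((b): b is ToolUseBlock => b.type === "tool_use");
    if (response.stop_reason !== "tool_use" || toolUses.length === 0) {
      // Claude is done with tools: return its text for the simple transcript.
      return response.content
        .filter((b): b is TextBlock => b.type === "text")
        .map((b) => b.text)
        .join("");
    }

    // Execute each requested tool locally and send all results back in one user turn.
    const results: ToolResultBlock[] = [];
    for (const use of toolUses) {
      process.stderr.write(`⚙ ${use.name}(${JSON.stringify(use.input)})\n`);
      results.push({
        type: "tool_result",
        tool_use_id: use.id,
        content: await execute(use.name, use.input),
      });
    }
    apiMessages.push({ role: "user", content: results });
  }
  throw new Error(`Tool loop exceeded ${MAX_TOOL_ITERATIONS} iterations`);
}
```

Only the returned text would go into the simple transcript. The `apiMessages` array keeps the full block history that the API needs on the next call.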

Security: Tools are sandboxed to the current working directory. Path traversal outside process.cwd() is rejected. File output is truncated at 10KB. Shell commands have timeouts (5s for find, 10s for grep).
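A minimal sketch of the path check and output truncation described above, assuming "10KB" means 10 × 1024 bytes. The helper names `resolveInsideRoot` and `truncateOutput` are hypothetical. This version does not resolve symlinks, so a real implementation would likely also want a `fs.realpathSync` check.

```typescript
import * as path from "node:path";

const MAX_OUTPUT_BYTES = 10 * 1024; // assumption: 10KB interpreted as 10,240 bytes

// Resolve a tool-supplied path and reject anything outside the project root.
function resolveInsideRoot(root: string, requested: string): string {
  const absRoot = path.resolve(root);
  const target = path.resolve(absRoot, requested);
  const rel = path.relative(absRoot, target);
  // rel is ".." or starts with "../" when target is above root;
  // it is absolute when target is on a different drive (Windows).
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`Path escapes project root: ${requested}`);
  }
  return target;
}

// Cap tool output so one large file cannot flood the context window.
function truncateOutput(text: string, maxBytes: number = MAX_OUTPUT_BYTES): string {
  const buf = Buffer.from(text, "utf8");
  if (buf.length <= maxBytes) return text;
  return buf.subarray(0, maxBytes).toString("utf8") + "\n[truncated]";
}
```

A tool handler would call `resolveInsideRoot(process.cwd(), input.path)` before any read, then pass the file contents through `truncateOutput`.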

Interview UX Rules

Auto-save immediately — Write the artifact to disk as soon as Claude finishes generating it, BEFORE any follow-up prompt. Never gate the save behind a y/N question.
Clear completion signal — Print Saved to <filename> immediately after artifact generation.
Explicit follow-up prompt — If offering to continue, say exactly what continuing does: "Follow up on your results? (Ask questions, refine the document, etc.) (y/N)" — not a vague "Continue the conversation?"
Default to exit — N is the default. The artifact is already saved. Pressing Enter exits cleanly.
No dangling state — If the user presses Ctrl-C at any point after the artifact is generated, the file should already be on disk.
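The ordering these rules require (write first, announce, then prompt) might look like the sketch below. The function name `saveThenOfferFollowUp` and the injected `confirm` and `log` callbacks are hypothetical. The prompt text is the one given in the rules.

```typescript
import * as fs from "node:fs";

// Saves the artifact BEFORE asking anything, so Ctrl-C at the prompt loses nothing.
// Returns true only if the user explicitly opts in to a follow-up.
async function saveThenOfferFollowUp(
  artifact: string,
  outPath: string,
  confirm: (question: string) => Promise<string>, // e.g. backed by readline
  log: (line: string) => void = console.log,
): Promise<boolean> {
  fs.writeFileSync(outPath, artifact, "utf8"); // 1. auto-save immediately
  log(`Saved to ${outPath}`);                  // 2. clear completion signal

  // 3. explicit follow-up prompt; anything other than y/yes (including Enter) exits
  const answer = await confirm(
    "Follow up on your results? (Ask questions, refine the document, etc.) (y/N) ",
  );
  return /^y(es)?$/i.test(answer.trim());
}
```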

Eval Engine — Multi-Axis Integration

// Multi-Axis 5 scoring dimensions (1-3 each, max 15)
interface ManifoldScore {
  selectiveTransfer: number;      // What still holds vs needs revision?
  causalTransparency: number;     // Can it explain WHY?
  creativeRerouting: number;      // Finds alternatives when blocked?
  degradationAwareness: number;   // Flags harder/impossible?
  outputCoherence: number;        // Satisfies original + new constraint?
}

// A/B testing: with skill vs without skill
// mm eval run <skill>                → with SKILL.md loaded
// mm eval run <skill> --without-skill → baseline
// mm eval compare <skill>            → delta table
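As an illustration of how the five dimensions could roll up into a total and an A/B delta, here is a small sketch. It repeats the `ManifoldScore` interface so the block is self-contained. `totalScore` and `compareScores` are hypothetical helper names, and treating each score as an integer from 1 to 3 is an assumption based on the "1-3 each, max 15" comment.

```typescript
interface ManifoldScore {
  selectiveTransfer: number;
  causalTransparency: number;
  creativeRerouting: number;
  degradationAwareness: number;
  outputCoherence: number;
}

const DIMENSIONS: ReadonlyArray<keyof ManifoldScore> = [
  "selectiveTransfer",
  "causalTransparency",
  "creativeRerouting",
  "degradationAwareness",
  "outputCoherence",
];

// Sum the five dimensions, rejecting values outside the 1-3 range.
function totalScore(score: ManifoldScore): number {
  let total = 0;
  for (const dim of DIMENSIONS) {
    const value = score[dim];
    if (!Number.isInteger(value) || value < 1 || value > 3) {
      throw new RangeError(`${dim} must be an integer from 1 to 3, got ${value}`);
    }
    total += value;
  }
  return total; // between 5 and 15
}

// Per-dimension and total delta: positive numbers mean the skill helped.
function compareScores(withSkill: ManifoldScore, baseline: ManifoldScore) {
  const perDimension = {} as Record<keyof ManifoldScore, number>;
  for (const dim of DIMENSIONS) perDimension[dim] = withSkill[dim] - baseline[dim];
  return { perDimension, total: totalScore(withSkill) - totalScore(baseline) };
}
```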

Implementation Sessions

Session 1: Foundation (~2-3 hours)

Goal: Repo scaffold + CLI entry + preflight + skill management.

Build:

Scaffold repo: npm init, install deps (@anthropic-ai/sdk, commander, yaml, chalk, dev: typescript, tsx, vitest)
src/index.ts — Commander.js root with all subcommand registrations
src/commands/preflight.ts — Print the 7 pre-flight questions
src/commands/skill.ts — skill new, skill list, skill validate
src/skill/manager.ts — CRUD for .claude/skills/<name>/
src/skill/validator.ts — Check SKILL.md frontmatter, line count (<200), self-improvement section
src/util/fs.ts — Project root detection
src/templates/ — Static templates (preflight.md, skill-scaffold.md, tile-scaffold.json)
CLAUDE.md for mm-cli itself (dogfooding)
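The three checks listed for `src/skill/validator.ts` could be sketched roughly as below. Two details are assumptions, not something this spec states: that frontmatter means a `---` delimited block at the top of the file, and that the self-improvement section is a Markdown heading containing the words "self-improvement". The function name `validateSkillMd` is hypothetical.

```typescript
const MAX_SKILL_LINES = 200; // SKILL.md must stay under this line count

// Returns a list of human-readable issues; an empty list means the file passed.
function validateSkillMd(content: string): string[] {
  const issues: string[] = [];
  const lines = content.trimEnd().split(/\r?\n/);

  // Assumption: frontmatter is a block opened and closed by '---' lines at the top.
  const hasFrontmatter =
    lines[0].trim() === "---" && lines.slice(1).some((l) => l.trim() === "---");
  if (!hasFrontmatter) issues.push("missing frontmatter");

  if (lines.length >= MAX_SKILL_LINES) {
    issues.push(`oversized: ${lines.length} lines (must be under ${MAX_SKILL_LINES})`);
  }

  // Assumption: the section is a heading whose text contains "self-improvement".
  if (!/^#+\s.*self-improvement/im.test(content)) {
    issues.push("missing self-improvement section");
  }
  return issues;
}
```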

Verify:

mm preflight prints the 7 pre-flight questions
mm skill new test-skill creates .claude/skills/test-skill/SKILL.md + tile.json
mm skill list shows table of skills in current project
mm skill validate reports structural issues (missing frontmatter, oversized)
mm --help shows all commands
vitest run passes (>=3 tests)

Session 2: Interview Engine + First Commands (~2-3 hours)

Goal: The shared interview engine + mm diagnose + mm rewrite proving it works.

Build:

src/engine/claude-client.ts — Anthropic SDK wrapper
src/engine/stdin-io.ts — readline-based user I/O
src/engine/interview-types.ts — All types
src/engine/interview.ts — Core runInterview() function
src/engine/interview-templates.ts — Templates for DIAGNOSE_QUICK (Q1) and REWRITE (Q2)
src/engine/artifact-writer.ts — Write output files
src/commands/diagnose.ts — mm diagnose using DIAGNOSE_QUICK template
src/commands/rewrite.ts — mm rewrite (reads from stdin or file arg)
src/util/config.ts — .mmrc loading, env var resolution

Verify:

echo "update the dashboard" | mm rewrite runs Q2 interview, outputs rewritten problem statement with gap map
mm diagnose conducts 5-question interview, produces scored 4-discipline table + starter CONTEXT.md
--dry-run flag prints the messages array without calling API
Ctrl-C gracefully interrupts interview
vitest run passes with mock Claude response tests

Session 3: All Remaining Interview Commands (~2-3 hours)

Goal: All 6 remaining interview commands. Fast because engine already exists — just new templates + thin wrappers.

Build:

Add templates to interview-templates.ts:
- DIAGNOSE_DEEP — 12-question deep diagnostic + 4-month roadmap
- CONTEXT_BUILD — 7-domain personal context document
- SPEC_NEW — Specification engineer, 3 phases
- INTENT_INIT — Intent & delegation framework, 3 phases
- EVAL_HARNESS — Eval harness builder, 2 phases
- CONSTRAINT_DESIGNER — Constraint architecture, 3 phases
src/commands/context.ts — mm context build
src/commands/spec.ts — mm spec new [name]
src/commands/intent.ts — mm intent init
src/commands/constraint.ts — mm constraint <task>
Update src/commands/diagnose.ts for --deep flag
Stub src/commands/eval.ts with eval new --interview (interview mode only)

Verify:

mm diagnose --deep produces DIAGNOSTIC.md + ROADMAP.md with 1-10 scoring
mm context build runs the 7-domain interview and produces CLAUDE.md in a fresh project, or the business-context skill when CLAUDE.md already exists
mm spec new auth-system produces specs/auth-system.md with all 7 sections
mm intent init produces INTENT.md with Priority Hierarchy, Decision Authority Map, Quality Thresholds, Rigor Test
mm constraint deploy-pipeline produces constraints/deploy-pipeline.md with 4-quadrant structure
mm eval new my-skill --interview produces evals/my-skill/eval.yaml

Session 4: Eval Engine + Multi-Axis (~2-3 hours)

Goal: Custom eval engine with A/B skill testing and Multi-Axis 5-dimension scoring.

Build:

src/eval/types.ts — EvalSuite, EvalCase, ManifoldScore, EvalResult
src/eval/runner.ts — Execute eval suite against Claude API (with/without skill)
src/eval/scorer.ts — Quality checkbox scoring + Multi-Axis 5-dim scoring via Claude-as-judge
src/eval/comparator.ts — A/B comparison table
Complete src/commands/eval.ts:
- mm eval new <skill> --quick — auto-generate eval from SKILL.md
- mm eval run <skill> — execute with skill loaded
- mm eval run <skill> --without-skill — execute baseline
- mm eval compare <skill> — display delta table
src/templates/eval-scaffold.yaml
Tests with fixture data

Verify:

mm eval new my-skill --quick reads SKILL.md, generates eval YAML with 3-5 test cases
mm eval run my-skill executes all cases, writes results JSON
mm eval run my-skill --without-skill runs same cases without SKILL.md
mm eval compare my-skill shows side-by-side score deltas
Multi-Axis dimensions scored when constraint variations present
Eval engine total: <=400 lines

Session 5: Export, Polish, Dogfooding (~2-3 hours)

Goal: Multi-format export, global install, full dogfooding pass.

Build:

src/skill/exporter.ts — Convert .claude/skills/ to .cursorrules / .windsurfrules
Update src/commands/skill.ts with skill export --format cursor|windsurf|merged
Dogfooding pass — run mm against its own repo
Global flags: --verbose, --dry-run, --json
Error handling: missing API key, missing files, invalid YAML
npm link for global install
Finalize mm-cli's own CLAUDE.md, INTENT.md, SPEC.md

Verify:

mm skill export --format cursor produces valid .cursorrules
npm link && mm --help works globally
All 16 commands functional
vitest run passes (>=15 tests)

Eval YAML Format

name: database-skill-eval
skill: .claude/skills/database/SKILL.md
model: claude-sonnet-4-20250514
judge: claude-sonnet-4-20250514

scenarios:
  - name: prisma-query
    prompt: |
      Write a Prisma query that fetches all users with role 'admin'
      along with their most recent login sessions.
    context: |
      Node.js API using Prisma ORM with PostgreSQL.
    expected_qualities:
      - Uses Prisma client API, not raw SQL
      - References correct model/field names
      - Includes relation loading for sessions
    failure_modes:
      - Uses raw SQL instead of Prisma client
      - Wrong field names from hallucination
    scoring:
      excellent: 5
      acceptable: 3
      poor: 1

  - name: constraint-shift-pagination
    base_scenario: prisma-query
    constraint_change: |
      Also add cursor-based pagination with a limit of 20.
      Results must be sorted by lastLoginAt descending.
    manifold_dimensions:
      selective_transfer: "Original role filter + session include unchanged, only add pagination"
      causal_transparency: "Should explain cursor vs offset pagination tradeoffs"
      creative_rerouting: "If cursor field has duplicates, needs secondary sort key"
      degradation_awareness: "Flag that cursor pagination doesn't support arbitrary page jumps"
      output_coherence: "Must still filter admins + load sessions AND paginate correctly"
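A typed view of the two scenario shapes above may help when writing the runner. The sketch below is an assumption-laden illustration: the type and function names are hypothetical, and composing the constraint-shift prompt as "base prompt followed by `constraint_change`" is a guess at how `base_scenario` is meant to be resolved. It operates on already-parsed objects, so the YAML parsing step is left out.

```typescript
// Field names follow the YAML example; which fields are optional is an assumption.
interface BaseScenario {
  name: string;
  prompt: string;
  context?: string;
  expected_qualities?: string[];
  failure_modes?: string[];
  scoring?: Record<string, number>;
}

interface ShiftScenario {
  name: string;
  base_scenario: string;
  constraint_change: string;
  manifold_dimensions: Record<string, string>;
}

type Scenario = BaseScenario | ShiftScenario;

function isShift(s: Scenario): s is ShiftScenario {
  return "base_scenario" in s;
}

// Assumption: a constraint-shift prompt is the base prompt plus the new constraint.
function resolvePrompt(scenario: Scenario, all: Scenario[]): string {
  if (!isShift(scenario)) return scenario.prompt;

  const base = all.find((c) => c.name === scenario.base_scenario);
  if (!base || isShift(base)) {
    throw new Error(
      `Scenario "${scenario.name}" references unknown base "${scenario.base_scenario}"`,
    );
  }
  return `${base.prompt.trim()}\n\n${scenario.constraint_change.trim()}`;
}
```

Under this reading, only scenarios with `manifold_dimensions` would be scored on the five Multi-Axis dimensions, and plain scenarios would use the quality checklist.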

Cost Estimate

Interview commands: ~2-5 API calls per interview (~$0.02-0.10)
Eval run: 4 calls per scenario (baseline + skilled + judge x2). 3 scenarios = ~$0.10-0.50
Total dev cost for building + testing: ~$5-10
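The eval-run figure can be checked with a few lines of arithmetic. This sketch only restates the numbers above. The per-call range it prints is implied by dividing the quoted $0.10-0.50 across the calls, not a figure the spec gives directly, and `evalRunCalls` is a hypothetical helper name.

```typescript
const CALLS_PER_SCENARIO = 4; // baseline + skilled + 2 judge calls

function evalRunCalls(scenarios: number): number {
  return scenarios * CALLS_PER_SCENARIO;
}

const calls = evalRunCalls(3); // 3 scenarios -> 12 API calls
const impliedPerCall = { low: 0.10 / calls, high: 0.50 / calls };

console.log(`${calls} calls, ~$${impliedPerCall.low.toFixed(4)}-${impliedPerCall.high.toFixed(4)} per call`);
// prints "12 calls, ~$0.0083-0.0417 per call"
```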

What We're NOT Building (v1 Open Source)

| Out of Scope | Why |
|---|---|
| Web UI / dashboard | Dashboards are a dying paradigm. Agents don't need them. |
| promptfoo integration | Custom engine only, decision made |
| OpenAI / Gemini model support | Claude only for v1 |
| Resume interrupted interviews | Too complex for v1. Ctrl-C = restart |
| VS Code extension | Future. StdinIO abstraction enables it later |
| Auto-generate SKILL.md from codebase | Unbounded. `mm skill new` scaffolds; human fills |
| Cloud sync / team features | Local-first. Git is the sync mechanism. |
| Custom interview questions | Templates are fixed for v1. |
| CI integration | Manual `mm eval run`. Hosted API gets pipeline gates (future). |
| Internationalization | English only |
