Skip to content

v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement#1313

Merged
garrytan merged 7 commits intomainfrom
garrytan/eng-review-auq-fix
May 4, 2026
Merged

v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement#1313
garrytan merged 7 commits intomainfrom
garrytan/eng-review-auq-fix

Conversation

@garrytan
Copy link
Copy Markdown
Owner

@garrytan garrytan commented May 4, 2026

Summary

Prompt fix (plan-eng-review):

  • plan-eng-review/SKILL.md.tmpl lines 116, 139, 152, 160, 169 ported to the office-hours b512be71 STOP-gate pattern. Adds tool_use reminder, names blocked next steps explicitly, anti-rationalization clause naming the precise transcript failure mode (loading the AskUserQuestion schema via ToolSearch and writing the recommendation as chat prose).

Test harness extensions (test/helpers/claude-pty-runner.ts):

  • runPlanSkillObservation gains initialPlanContent?: string — pre-pumps a user message containing a seeded draft plan, so STOP-gate regression tests have guaranteed-finding-triggering complexity to react to. claude has no --plan-file flag, so message-pump is the route.
  • ClassifyResult gains wrote_findings_before_asking outcome with companion strictPlanWrites? opt on classifyVisible. Fires when a Write/Edit to .claude/plans/* precedes any AskUserQuestion render. Default off — preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes.
  • New shared helper assertReportAtBottomIfPlanWritten(obs) wraps existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' whenever a plan was written.

E2E test wiring (4 plan-mode E2E test files):

  • All four skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts files now assert ## GSTACK REVIEW REPORT is the last ## section of the plan file whenever one was written. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract; nothing tested it until now.
  • Plan-eng additionally gains a third test case (STOP gate fires when seeded plan forces Step 0 findings) combining initialPlanContent + --disallowedTools AskUserQuestion.

Touchfiles dedupe (test/helpers/touchfiles.ts):

  • Deleted duplicate plan-design-review-plan-mode keys at line 243 (E2E_TOUCHFILES) and line 524 (E2E_TIERS). Effective tier was silently periodic, not gate. Three of four plan-mode siblings ran on PR CI; design ran weekly only. Now all four run on PR CI again.
  • Added scripts/resolvers/review.ts to all four plan-mode-test entries so changes to the {{PLAN_FILE_REVIEW_REPORT}} resolver text trigger all four siblings in bun run eval:select.

Test Coverage

CODE PATHS
[+] plan-eng-review/SKILL.md.tmpl
  ├── Step 0 STOP gate (line 116)                    [★★★ PAID-E2E] new seeded-plan test
  └── 4 review-section STOP gates                    [★★★ PAID-E2E] AUQ-blocked exercises one
[+] test/helpers/claude-pty-runner.ts
  ├── runPlanSkillObservation initialPlanContent     [★★★ PAID-E2E] seeded-plan test wires it
  ├── classifyVisible strictPlanWrites=true          [★★★ FREE-UNIT] 5 new test cases
  ├── classifyVisible strictPlanWrites=false         [★★★ FREE-UNIT] regression case
  ├── findFirstAuqRenderIndex                        [★★  FREE-UNIT] transitive via strict tests
  └── assertReportAtBottomIfPlanWritten              [★★  PAID-E2E] called from 4 sibling tests
[+] test/helpers/touchfiles.ts
  └── duplicate-key delete                           [★★★ FREE]      touchfiles.test.ts dedupe checks

COVERAGE: 7/8 paths tested (88%)  |  QUALITY: ★★★:6 ★★:1 ★:1

Tests: 203 → 203 files (+0 new files; 6 new test cases inside existing files + 1 paid E2E case)

Pre-Landing Review

No critical findings. Diff is tightly scoped (395 lines added across prompt + test helpers). Codex outside voice already reviewed during /plan-eng-review and found 8 issues (3 blockers); all addressed via D3/D4/D5 captured in ~/.claude/plans/system-instruction-you-are-working-fancy-aho.md.

Plan Completion

  • Tighten 4 review-section STOP gates (commit 25ac219)
  • Tighten Step 0 scope-reduction trigger (commit 25ac219)
  • Redesign runPlanSkillObservation w/ initialPlanContent (commit ece2650)
  • Add wrote_findings_before_asking outcome (commit ece2650)
  • Reuse assertReviewReportAtBottom in 4 sibling tests (commit 99833de)
  • Fix touchfiles.ts duplicate keys (commit 9df3e14)
  • Regenerate SKILL.md (rolled into 25ac219)

7/7 plan items DONE.

Documentation

  • CHANGELOG.md — Polished voice for v1.26.2.0 entry: moved the runPlanSkillObservation initialPlanContent bullet out of the user-facing "What you can now do" section into "For contributors" (it's a test-runner helper, not a user feature). Dropped the branch-development narrative bullet per the CHANGELOG-is-for-users rule.

No other doc updates warranted — the diff is internal hardening. README, ARCHITECTURE, CONTRIBUTING, and CLAUDE.md remain accurate.

TODOS

No TODO items completed in this PR. Adjacent TODO ("Per-finding AskUserQuestion count assertion for /plan-ceo-review") is related but distinct — that's a streaming-counter test, not a structural STOP-gate test.

Test plan

  • Free test suite: 768 of 769 pass. The 1 failure (plan-review generated preambles stay under the Option A budget) is pre-existing on main (verified by stashing changes — the failure persists). The codex-hardening test passes 25/25 in isolation; it timed out under full-suite load and re-ran clean.
  • Paid E2E (deferred to /land-and-deploy time per past convention): EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts — ~$4/run. The new seeded-plan test on plan-eng must land at 'asked' (most likely) or 'plan_ready' with ## Decisions to confirm. All four sibling tests must pass the new report-at-bottom assertion when they reach 'plan_ready'.

🤖 Generated with Claude Code


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

garrytan and others added 7 commits May 3, 2026 20:08
…ause

Five sites in SKILL.md.tmpl uplift to the office-hours b512be7 pattern:
the four review-section gates (Architecture, Code Quality, Test, Performance)
plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the
tool directly"), names blocked next steps explicitly, anti-rationalization
clause naming the precise failure mode (loading the schema via ToolSearch
and writing the recommendation as chat prose).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + shared report-at-bottom assertion

Three additions to claude-pty-runner.ts:

1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a
   user message containing the seeded plan before invoking the skill, with
   a 3s gap so the message renders before the slash command. claude has no
   --plan-file flag (verified via claude --help), so message-pump is the
   route. Lets STOP-gate regression tests force complexity findings.

2. ClassifyResult gains wrote_findings_before_asking with companion
   strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/
   Edit to .claude/plans/* precedes any AskUserQuestion render in the
   session window. Default off — preserves zero-findings → write plan →
   plan_ready as legitimate for unseeded smokes. Six new unit tests cover
   before/after-AUQ ordering, permission-dialog edge case, strict-off path.

3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing
   assertReviewReportAtBottom(content) and gates on obs.planFile (artifact
   existing), so the assertion fires under both 'asked' and 'plan_ready'
   when a plan was actually written.

Also: runPlanSkillObservation now captures obs.planFile on every classifier
outcome, not just 'plan_ready'. Catches the case where the skill wrote a
plan partway through then paused on a question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts + add seeded-plan STOP-gate case

Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts
that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last
## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract;
nothing tested it until now.

Plan-eng additionally gains a third test case: STOP gate fires when seeded
plan forces Step 0 findings. Combines the new initialPlanContent runner
option with --disallowedTools AskUserQuestion to force the Conductor
MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in
{wrote_findings_before_asking, auto_decided, silent_write, exited, timeout}
and that plan_ready outcomes carry a ## Decisions section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified duplicates in test/helpers/touchfiles.ts:
- E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps)
  AND line 243 (smaller deps); JS object literals: later wins.
- E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same
  later-wins rule.

Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings
ran on every PR; design ran weekly only.

Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and
line 399 ('gate'). Also extend the four plan-mode-test entries to include
scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}}
trigger all four siblings in bun run eval:select.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move contributor-flavored bullet (runPlanSkillObservation seeding) into
For contributors. Drop branch-internal narrative (Codex review pass,
plan iteration tracking) per CHANGELOG-for-users style.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uq-fix

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

E2E Evals: ✅ PASS

69/69 tests passed | $9.08 total cost | 12 parallel runners

Suite Result Status Cost
e2e-browse 7/7 $0.35
e2e-deploy 6/6 $1.16
e2e-design 4/4 $0.79
e2e-plan 8/8 $1.72
e2e-qa-workflow 3/3 $1.32
e2e-review 6/6 $1.5
e2e-workflow 4/4 $0.58
llm-judge 25/25 $0.5
e2e-deploy 6/6 $1.16

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

@garrytan garrytan merged commit 30fe6bb into main May 4, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant