v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement by garrytan · Pull Request #1313 · garrytan/gstack

garrytan · 2026-05-04T03:12:05Z

Summary

Prompt fix (plan-eng-review):

plan-eng-review/SKILL.md.tmpl lines 116, 139, 152, 160, 169 ported to the office-hours b512be71 STOP-gate pattern. Adds tool_use reminder, names blocked next steps explicitly, anti-rationalization clause naming the precise transcript failure mode (loading the AskUserQuestion schema via ToolSearch and writing the recommendation as chat prose).

Test harness extensions (test/helpers/claude-pty-runner.ts):

runPlanSkillObservation gains initialPlanContent?: string — pre-pumps a user message containing a seeded draft plan, so STOP-gate regression tests have guaranteed-finding-triggering complexity to react to. claude has no --plan-file flag, so message-pump is the route.
ClassifyResult gains wrote_findings_before_asking outcome with companion strictPlanWrites? opt on classifyVisible. Fires when a Write/Edit to .claude/plans/* precedes any AskUserQuestion render. Default off — preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes.
New shared helper assertReportAtBottomIfPlanWritten(obs) wraps existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' whenever a plan was written.

E2E test wiring (4 plan-mode E2E test files):

All four skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts files now assert ## GSTACK REVIEW REPORT is the last ## section of the plan file whenever one was written. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract; nothing tested it until now.
Plan-eng additionally gains a third test case (STOP gate fires when seeded plan forces Step 0 findings) combining initialPlanContent + --disallowedTools AskUserQuestion.

Touchfiles dedupe (test/helpers/touchfiles.ts):

Deleted duplicate plan-design-review-plan-mode keys at line 243 (E2E_TOUCHFILES) and line 524 (E2E_TIERS). Effective tier was silently periodic, not gate. Three of four plan-mode siblings ran on PR CI; design ran weekly only. Now all four run on PR CI again.
Added scripts/resolvers/review.ts to all four plan-mode-test entries so changes to the {{PLAN_FILE_REVIEW_REPORT}} resolver text trigger all four siblings in bun run eval:select.

Test Coverage

CODE PATHS
[+] plan-eng-review/SKILL.md.tmpl
  ├── Step 0 STOP gate (line 116)                    [★★★ PAID-E2E] new seeded-plan test
  └── 4 review-section STOP gates                    [★★★ PAID-E2E] AUQ-blocked exercises one
[+] test/helpers/claude-pty-runner.ts
  ├── runPlanSkillObservation initialPlanContent     [★★★ PAID-E2E] seeded-plan test wires it
  ├── classifyVisible strictPlanWrites=true          [★★★ FREE-UNIT] 5 new test cases
  ├── classifyVisible strictPlanWrites=false         [★★★ FREE-UNIT] regression case
  ├── findFirstAuqRenderIndex                        [★★  FREE-UNIT] transitive via strict tests
  └── assertReportAtBottomIfPlanWritten              [★★  PAID-E2E] called from 4 sibling tests
[+] test/helpers/touchfiles.ts
  └── duplicate-key delete                           [★★★ FREE]      touchfiles.test.ts dedupe checks

COVERAGE: 7/8 paths tested (88%)  |  QUALITY: ★★★:6 ★★:1 ★:1

Tests: 203 → 203 files (+0 new files; 6 new test cases inside existing files + 1 paid E2E case)

Pre-Landing Review

No critical findings. Diff is tightly scoped (395 lines added across prompt + test helpers). Codex outside voice already reviewed during /plan-eng-review and found 8 issues (3 blockers); all addressed via D3/D4/D5 captured in ~/.claude/plans/system-instruction-you-are-working-fancy-aho.md.

Plan Completion

Tighten 4 review-section STOP gates (commit 25ac219)
Tighten Step 0 scope-reduction trigger (commit 25ac219)
Redesign runPlanSkillObservation w/ initialPlanContent (commit ece2650)
Add wrote_findings_before_asking outcome (commit ece2650)
Reuse assertReviewReportAtBottom in 4 sibling tests (commit 99833de)
Fix touchfiles.ts duplicate keys (commit 9df3e14)
Regenerate SKILL.md (rolled into 25ac219)

7/7 plan items DONE.

Documentation

CHANGELOG.md — Polished voice for v1.26.2.0 entry: moved the runPlanSkillObservation initialPlanContent bullet out of the user-facing "What you can now do" section into "For contributors" (it's a test-runner helper, not a user feature). Dropped the branch-development narrative bullet per the CHANGELOG-is-for-users rule.

No other doc updates warranted — the diff is internal hardening. README, ARCHITECTURE, CONTRIBUTING, and CLAUDE.md remain accurate.

TODOS

No TODO items completed in this PR. Adjacent TODO ("Per-finding AskUserQuestion count assertion for /plan-ceo-review") is related but distinct — that's a streaming-counter test, not a structural STOP-gate test.

Test plan

Free test suite: 768 of 769 pass. The 1 failure (plan-review generated preambles stay under the Option A budget) is pre-existing on main (verified by stashing changes — the failure persists). The codex-hardening test passes 25/25 in isolation; it timed out under full-suite load and re-ran clean.
Paid E2E (deferred to /land-and-deploy time per past convention): EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts — ~$4/run. The new seeded-plan test on plan-eng must land at 'asked' (most likely) or 'plan_ready' with ## Decisions to confirm. All four sibling tests must pass the new report-at-bottom assertion when they reach 'plan_ready'.

🤖 Generated with Claude Code

^{Need help on this PR? Tag @codesmith with what you need.}

Let Codesmith autofix CI failures and bot reviews

…ause Five sites in SKILL.md.tmpl uplift to the office-hours b512be7 pattern: the four review-section gates (Architecture, Code Quality, Test, Performance) plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the tool directly"), names blocked next steps explicitly, anti-rationalization clause naming the precise failure mode (loading the schema via ToolSearch and writing the recommendation as chat prose). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… + shared report-at-bottom assertion Three additions to claude-pty-runner.ts: 1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a user message containing the seeded plan before invoking the skill, with a 3s gap so the message renders before the slash command. claude has no --plan-file flag (verified via claude --help), so message-pump is the route. Lets STOP-gate regression tests force complexity findings. 2. ClassifyResult gains wrote_findings_before_asking with companion strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/ Edit to .claude/plans/* precedes any AskUserQuestion render in the session window. Default off — preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes. Six new unit tests cover before/after-AUQ ordering, permission-dialog edge case, strict-off path. 3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' when a plan was actually written. Also: runPlanSkillObservation now captures obs.planFile on every classifier outcome, not just 'plan_ready'. Catches the case where the skill wrote a plan partway through then paused on a question. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ts + add seeded-plan STOP-gate case Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last ## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract; nothing tested it until now. Plan-eng additionally gains a third test case: STOP gate fires when seeded plan forces Step 0 findings. Combines the new initialPlanContent runner option with --disallowedTools AskUserQuestion to force the Conductor MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in {wrote_findings_before_asking, auto_decided, silent_write, exited, timeout} and that plan_ready outcomes carry a ## Decisions section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verified duplicates in test/helpers/touchfiles.ts: - E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps) AND line 243 (smaller deps); JS object literals: later wins. - E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same later-wins rule. Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings ran on every PR; design ran weekly only. Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and line 399 ('gate'). Also extend the four plan-mode-test entries to include scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}} trigger all four siblings in bun run eval:select. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move contributor-flavored bullet (runPlanSkillObservation seeding) into For contributors. Drop branch-internal narrative (Codex review pass, plan iteration tracking) per CHANGELOG-for-users style. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…uq-fix # Conflicts: # CHANGELOG.md # VERSION # package.json

github-actions · 2026-05-04T03:26:41Z

E2E Evals: ✅ PASS

69/69 tests passed | $9.08 total cost | 12 parallel runners

Suite	Result	Status	Cost
e2e-browse	7/7	✅	$0.35
e2e-deploy	6/6	✅	$1.16
e2e-design	4/4	✅	$0.79
e2e-plan	8/8	✅	$1.72
e2e-qa-workflow	3/3	✅	$1.32
e2e-review	6/6	✅	$1.5
e2e-workflow	4/4	✅	$0.58
llm-judge	25/25	✅	$0.5
e2e-deploy	6/6	✅	$1.16

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan and others added 7 commits May 3, 2026 20:08

chore: bump version and changelog (v1.26.2.0)

e213c10

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into garrytan/eng-review-a…

7acb75a

…uq-fix # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan merged commit 30fe6bb into main May 4, 2026
23 checks passed

norbert-pap mentioned this pull request May 4, 2026

weekly sync — 2026-05-04 norbert-pap/claudebert#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement#1313

v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement#1313
garrytan merged 7 commits intomainfrom
garrytan/eng-review-auq-fix

garrytan commented May 4, 2026 •

edited by blacksmith-sh Bot

Loading

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented May 4, 2026 • edited by blacksmith-sh Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Coverage

Pre-Landing Review

Plan Completion

Documentation

TODOS

Test plan

Uh oh!

github-actions Bot commented May 4, 2026

E2E Evals: ✅ PASS

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

garrytan commented May 4, 2026 •

edited by blacksmith-sh Bot

Loading