v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills by garrytan · Pull Request #1354 · garrytan/gstack

garrytan · 2026-05-07T02:51:50Z

Summary

Plan-mode reviews now refuse to dump findings without asking. Four gate-tier tests catch the regression on every PR.

The four /plan-*-review skills (eng, ceo, design, devex) gain an anti-shortcut clause baked in via a single shared resolver. The clause names the May 2026 transcript-bug failure mode directly: model explores, finds issues, dumps every finding into one plan write, calls ExitPlanMode without firing AskUserQuestion. The new clause closes that loophole: "the plan file is the OUTPUT of the interactive review, not a substitute for it." Future tightening edits one resolver, all four skills update on the next gen-skill-docs.

Four gate-tier E2E tests catch the regression class on every PR that touches the four templates, the shared resolver, or the seeds fixture. Each test drives the matching skill against a small "forcing finding" seed and asserts the agent fires at least one AskUserQuestion before reaching plan_ready.

Verified end-to-end: Eng floor 59s, CEO floor 197s, Design + Devex floor pass. All four assert auq_observed outcome via a focused observer (runPlanSkillFloorCheck) that exits at the first non-permission numbered-option render.

Architecture / Resolver wiring:

generateAntiShortcutClause lives alongside generateReviewDashboard and generatePlanFileReviewReport in scripts/resolvers/review.ts
Registered as {{ANTI_SHORTCUT_CLAUSE}} in the RESOLVERS map (scripts/resolvers/index.ts)
Each plan-* SKILL.md.tmpl includes one placeholder line right after the **Anti-skip rule:** paragraph
Anchored on the paragraph not the heading because the four templates use different section labels (eng "Review Sections (after scope is agreed)" vs ceo/design/devex variants)

Test architecture:

runPlanSkillFloorCheck is a deliberately narrower variant of runPlanSkillCounting. The counting helper handles 25-min periodic budgets and tries to fingerprint every AUQ across the session. The floor variant exits at the first AUQ render so it fits gate-tier wall-time constraints. Both live side-by-side in test/helpers/claude-pty-runner.ts.
The floor variant intentionally drops the COMPLETION_SUMMARY_RE check from its terminal detection — that regex matches "GSTACK REVIEW REPORT" anywhere in the visible buffer, including when the agent does recon by reading existing plan files. isPlanReadyVisible (claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking."

Test Coverage

[+] scripts/resolvers/review.ts
  └── generateAntiShortcutClause()         [★ smoke — single string return, covered by gen-skill-docs.test.ts]

[+] scripts/resolvers/index.ts (RESOLVERS map registration)
  └── ANTI_SHORTCUT_CLAUSE binding         [★ smoke — covered by gen-skill-docs --dry-run freshness check]

[+] test/helpers/claude-pty-runner.ts
  └── runPlanSkillFloorCheck()             [★★ tested — eng + ceo + design + devex live PTY runs]
      ├── auq_observed exit                [★★ tested — eng (59s), ceo (197s) live runs]
      ├── plan_ready terminal              [GAP — happy-path inverse, not exercised by current tests]
      ├── silent_write detection           [GAP — defensive code path, not exercised]
      ├── exited / unknown command         [GAP — error path, not exercised]
      └── timeout fallthrough              [GAP — soft fallback, not exercised]

[+] test/skill-e2e-plan-{eng,ceo,design,devex}-finding-floor.test.ts
  └── 4 floor tests                        [★★★ live E2E — all four pass against current branch]

USER FLOWS
[+] Plan-mode review of any plan-* skill
  └── Model fires at least one AUQ before ExitPlanMode  [★★★ TESTED — gate-tier]

COVERAGE: 4/8 paths tested (50% overall, 100% of happy-path)
QUALITY: ★★★:1 ★★:5 ★:2  |  GAPS: 4 (defensive/error paths in the helper)

The 4 gaps in runPlanSkillFloorCheck are defensive/error paths (silent_write, exited, timeout) that mirror existing logic in runPlanSkillCounting, which already has unit-test coverage for those paths via claude-pty-runner.unit.test.ts. The helper's logic is parallel.

Tests: 0 → 4 new floor tests + 1 new fixture file + 1 helper function (+ unit assertion in touchfiles.test.ts). The plan-mode template change is exercised by every existing periodic finding-count test in addition to the new gate-tier tests.

Pre-Landing Review

No issues found. The diff is additive infrastructure with no SQL, no LLM trust boundary changes, no auth changes, no migrations, no API surface changes.

Plan Completion

Plan file: ~/.claude/plans/system-instruction-you-are-working-gleaming-sonnet.md (revised after Codex consult).

Plan item	Status	Evidence
Add `forcing-finding-seeds.ts` fixture with 4 per-skill seeds	DONE	`test/fixtures/forcing-finding-seeds.ts`
Add `generateAntiShortcutClause` resolver	DONE	`scripts/resolvers/review.ts:161-163`
Register `ANTI_SHORTCUT_CLAUSE` in RESOLVERS map	DONE	`scripts/resolvers/index.ts:42`
Insert `{{ANTI_SHORTCUT_CLAUSE}}` after Anti-skip rule in 4 templates	DONE	4 .tmpl edits + 4 regen'd SKILL.md
Wire 4 finding-floor entries in touchfiles.ts (E2E_TOUCHFILES + E2E_TIERS=gate)	DONE	`test/helpers/touchfiles.ts`
Write 4 finding-floor test files using `runPlanSkillCounting`	CHANGED	Switched to focused `runPlanSkillFloorCheck` helper after the original approach hit the inline-options parsing bug Codex flagged in finding #4. Same contract (`reviewCount >= 1` ↔ `auq_observed`), simpler implementation, works on gate-tier wall time.
Run gate-tier paid E2E tests	DONE	All 4 pass: eng 59s, ceo 197s, design + devex pass

All 4 floor tests pass against the new template. Codex's 7 findings from the consult review are all addressed.

TODOS

No items completed in this PR. The plan filed one V2 follow-up TODO during the eng review (per-finding session-content assertion — match each finding paragraph in the final plan to a specific AUQ fingerprint observed earlier in the session). Not added to TODOS.md in this PR; will file on next /retro pass.

Test plan

bun test passes (free-tier: 93 pass, 0 fail in unit + touchfiles)
bun run gen:skill-docs --host all regenerates cleanly for all 7 hosts
All 4 plan-*-finding-floor.test.ts pass against the current branch (~$3 wall, ~5 min cumulative)
selectTests confirms each floor test triggers on its template / fixture / scripts/resolvers/review.ts / PTY runner helper

🤖 Generated with Claude Code

^{Need help on this PR? Tag @codesmith with what you need.}

Let Codesmith autofix CI failures and bot reviews

…floor observer Adds a focused PTY observer that exits at the first non-permission numbered-option render. Catches the May 2026 transcript-bug class (model wrote plan + ExitPlanMode without firing any AUQ) without needing to fingerprint or navigate past the AUQ. Why separate from runPlanSkillCounting: plan-mode AUQs render every option on a single logical line via cursor-positioning escapes that stripAnsi can't simulate, so parseNumberedOptions returns < 2 options and never records a fingerprint. Counting tests work on 25-min budgets because eventually one frame parses cleanly; gate-tier floor tests need to exit early on the first observation. Trades fingerprint precision for early-exit reliability. Also drops COMPLETION_SUMMARY_RE check from this helper — it matches "GSTACK REVIEW REPORT" anywhere in the buffer including when the agent does recon by reading existing plan files. plan_ready (claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver function in scripts/resolvers/review.ts. Plan-* review skills can now include the clause via one placeholder line in their .tmpl rather than cloning the paragraph four times. Future tightening edits one resolver, all four skills update on next gen-skill-docs. Wired into the existing RESOLVERS map alongside generateReviewDashboard and generatePlanFileReviewReport — no gen-skill-docs.ts change needed because the generator already does generic placeholder substitution against that map. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the **Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review SKILL.md.tmpl. The four templates use different surrounding section headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex variants), so anchoring on the paragraph rather than the heading works across all four. Closes the May 2026 transcript-bug loophole: existing STOP gates name forbidden actions only AFTER a per-section finding is identified. The anti-shortcut clause adds the pre-emptive rule — "the plan file is the OUTPUT of the interactive review, not a substitute for it" — covering the case the transcript exhibited (skip per-section walk, dump every finding into one plan write, call ExitPlanMode). Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds 4 finding-floor tests (one per plan-* skill) that catch the May 2026 transcript-bug class — model wrote a plan and called ExitPlanMode without firing any review-phase AskUserQuestion. Asserts via runPlanSkillFloorCheck that ANY non-permission AUQ render fires before the agent reaches plan_ready. Verified: - Eng floor: passed in 59s - CEO floor: passed in 197s - Design floor: passed - Devex floor: passed - Total ~$2-6 per CI run; only triggers on diff against the 4 plan-* templates, the shared resolver review.ts, the seeds fixture, or the PTY runner helper. Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant per skill. Each seed is engineered to force at least one obvious finding under that skill's review focus (architectural smell for eng, scope-creep for ceo, UI-slop for design, painful onboarding for devex). Touchfiles wiring: - E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the matching skill template, the shared resolver, the seeds fixture, and the PTY runner helper - E2E_TIERS: all 4 entries marked 'gate' - touchfiles.test.ts: count assertion bumped 21→22 with explicit plan-ceo-finding-floor containment check Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…skuser-fix # Conflicts: # test/helpers/touchfiles.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-07T03:03:16Z

E2E Evals: ✅ PASS

66/66 tests passed | $8.61 total cost | 12 parallel runners

Suite	Result	Status	Cost
e2e-browse	7/7	✅	$0.37
e2e-deploy	6/6	✅	$1.25
e2e-design	4/4	✅	$0.82
e2e-plan	8/8	✅	$1.74
e2e-qa-workflow	3/3	✅	$1.12
e2e-review	6/6	✅	$1.09
e2e-workflow	4/4	✅	$0.6
llm-judge	25/25	✅	$0.5
e2e-qa-workflow	3/3	✅	$1.12

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan and others added 6 commits May 6, 2026 19:47

Merge remote-tracking branch 'origin/main' into garrytan/eng-review-a…

2d400fb

…skuser-fix # Conflicts: # test/helpers/touchfiles.ts

chore: bump version and changelog (v1.27.1.0)

e4f8739

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan merged commit 7b4738b into main May 7, 2026
23 of 24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills#1354

v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills#1354
garrytan merged 6 commits intomainfrom
garrytan/eng-review-askuser-fix

garrytan commented May 7, 2026 •

edited by blacksmith-sh Bot

Loading

Uh oh!

github-actions Bot commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented May 7, 2026 • edited by blacksmith-sh Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Coverage

Pre-Landing Review

Plan Completion

TODOS

Test plan

Uh oh!

github-actions Bot commented May 7, 2026

E2E Evals: ✅ PASS

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

garrytan commented May 7, 2026 •

edited by blacksmith-sh Bot

Loading