Skip to content

v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills#1354

Merged
garrytan merged 6 commits intomainfrom
garrytan/eng-review-askuser-fix
May 7, 2026
Merged

v1.27.1.0 fix: anti-shortcut clause + gate-tier AskUserQuestion floor tests for all plan-* skills#1354
garrytan merged 6 commits intomainfrom
garrytan/eng-review-askuser-fix

Conversation

@garrytan
Copy link
Copy Markdown
Owner

@garrytan garrytan commented May 7, 2026

Summary

Plan-mode reviews now refuse to dump findings without asking. Four gate-tier tests catch the regression on every PR.

The four /plan-*-review skills (eng, ceo, design, devex) gain an anti-shortcut clause baked in via a single shared resolver. The clause names the May 2026 transcript-bug failure mode directly: model explores, finds issues, dumps every finding into one plan write, calls ExitPlanMode without firing AskUserQuestion. The new clause closes that loophole: "the plan file is the OUTPUT of the interactive review, not a substitute for it." Future tightening edits one resolver, all four skills update on the next gen-skill-docs.

Four gate-tier E2E tests catch the regression class on every PR that touches the four templates, the shared resolver, or the seeds fixture. Each test drives the matching skill against a small "forcing finding" seed and asserts the agent fires at least one AskUserQuestion before reaching plan_ready.

Verified end-to-end: Eng floor 59s, CEO floor 197s, Design + Devex floor pass. All four assert auq_observed outcome via a focused observer (runPlanSkillFloorCheck) that exits at the first non-permission numbered-option render.

Architecture / Resolver wiring:

  • generateAntiShortcutClause lives alongside generateReviewDashboard and generatePlanFileReviewReport in scripts/resolvers/review.ts
  • Registered as {{ANTI_SHORTCUT_CLAUSE}} in the RESOLVERS map (scripts/resolvers/index.ts)
  • Each plan-* SKILL.md.tmpl includes one placeholder line right after the **Anti-skip rule:** paragraph
  • Anchored on the paragraph not the heading because the four templates use different section labels (eng "Review Sections (after scope is agreed)" vs ceo/design/devex variants)

Test architecture:

  • runPlanSkillFloorCheck is a deliberately narrower variant of runPlanSkillCounting. The counting helper handles 25-min periodic budgets and tries to fingerprint every AUQ across the session. The floor variant exits at the first AUQ render so it fits gate-tier wall-time constraints. Both live side-by-side in test/helpers/claude-pty-runner.ts.
  • The floor variant intentionally drops the COMPLETION_SUMMARY_RE check from its terminal detection — that regex matches "GSTACK REVIEW REPORT" anywhere in the visible buffer, including when the agent does recon by reading existing plan files. isPlanReadyVisible (claude's actual "Ready to execute" confirmation) is the reliable terminal signal for "agent finished without asking."

Test Coverage

[+] scripts/resolvers/review.ts
  └── generateAntiShortcutClause()         [★ smoke — single string return, covered by gen-skill-docs.test.ts]

[+] scripts/resolvers/index.ts (RESOLVERS map registration)
  └── ANTI_SHORTCUT_CLAUSE binding         [★ smoke — covered by gen-skill-docs --dry-run freshness check]

[+] test/helpers/claude-pty-runner.ts
  └── runPlanSkillFloorCheck()             [★★ tested — eng + ceo + design + devex live PTY runs]
      ├── auq_observed exit                [★★ tested — eng (59s), ceo (197s) live runs]
      ├── plan_ready terminal              [GAP — happy-path inverse, not exercised by current tests]
      ├── silent_write detection           [GAP — defensive code path, not exercised]
      ├── exited / unknown command         [GAP — error path, not exercised]
      └── timeout fallthrough              [GAP — soft fallback, not exercised]

[+] test/skill-e2e-plan-{eng,ceo,design,devex}-finding-floor.test.ts
  └── 4 floor tests                        [★★★ live E2E — all four pass against current branch]

USER FLOWS
[+] Plan-mode review of any plan-* skill
  └── Model fires at least one AUQ before ExitPlanMode  [★★★ TESTED — gate-tier]

COVERAGE: 4/8 paths tested (50% overall, 100% of happy-path)
QUALITY: ★★★:1 ★★:5 ★:2  |  GAPS: 4 (defensive/error paths in the helper)

The 4 gaps in runPlanSkillFloorCheck are defensive/error paths (silent_write, exited, timeout) that mirror existing logic in runPlanSkillCounting, which already has unit-test coverage for those paths via claude-pty-runner.unit.test.ts. The helper's logic is parallel.

Tests: 0 → 4 new floor tests + 1 new fixture file + 1 helper function (+ unit assertion in touchfiles.test.ts). The plan-mode template change is exercised by every existing periodic finding-count test in addition to the new gate-tier tests.

Pre-Landing Review

No issues found. The diff is additive infrastructure with no SQL, no LLM trust boundary changes, no auth changes, no migrations, no API surface changes.

Plan Completion

Plan file: ~/.claude/plans/system-instruction-you-are-working-gleaming-sonnet.md (revised after Codex consult).

Plan item Status Evidence
Add forcing-finding-seeds.ts fixture with 4 per-skill seeds DONE test/fixtures/forcing-finding-seeds.ts
Add generateAntiShortcutClause resolver DONE scripts/resolvers/review.ts:161-163
Register ANTI_SHORTCUT_CLAUSE in RESOLVERS map DONE scripts/resolvers/index.ts:42
Insert {{ANTI_SHORTCUT_CLAUSE}} after Anti-skip rule in 4 templates DONE 4 .tmpl edits + 4 regen'd SKILL.md
Wire 4 finding-floor entries in touchfiles.ts (E2E_TOUCHFILES + E2E_TIERS=gate) DONE test/helpers/touchfiles.ts
Write 4 finding-floor test files using runPlanSkillCounting CHANGED Switched to focused runPlanSkillFloorCheck helper after the original approach hit the inline-options parsing bug Codex flagged in finding #4. Same contract (reviewCount >= 1auq_observed), simpler implementation, works on gate-tier wall time.
Run gate-tier paid E2E tests DONE All 4 pass: eng 59s, ceo 197s, design + devex pass

All 4 floor tests pass against the new template. Codex's 7 findings from the consult review are all addressed.

TODOS

No items completed in this PR. The plan filed one V2 follow-up TODO during the eng review (per-finding session-content assertion — match each finding paragraph in the final plan to a specific AUQ fingerprint observed earlier in the session). Not added to TODOS.md in this PR; will file on next /retro pass.

Test plan

  • bun test passes (free-tier: 93 pass, 0 fail in unit + touchfiles)
  • bun run gen:skill-docs --host all regenerates cleanly for all 7 hosts
  • All 4 plan-*-finding-floor.test.ts pass against the current branch (~$3 wall, ~5 min cumulative)
  • selectTests confirms each floor test triggers on its template / fixture / scripts/resolvers/review.ts / PTY runner helper

🤖 Generated with Claude Code


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

garrytan and others added 6 commits May 6, 2026 19:47
…floor observer

Adds a focused PTY observer that exits at the first non-permission
numbered-option render. Catches the May 2026 transcript-bug class
(model wrote plan + ExitPlanMode without firing any AUQ) without
needing to fingerprint or navigate past the AUQ.

Why separate from runPlanSkillCounting: plan-mode AUQs render every
option on a single logical line via cursor-positioning escapes that
stripAnsi can't simulate, so parseNumberedOptions returns < 2 options
and never records a fingerprint. Counting tests work on 25-min budgets
because eventually one frame parses cleanly; gate-tier floor tests
need to exit early on the first observation. Trades fingerprint
precision for early-exit reliability.

Also drops COMPLETION_SUMMARY_RE check from this helper — it matches
"GSTACK REVIEW REPORT" anywhere in the buffer including when the
agent does recon by reading existing plan files. plan_ready
(claude's actual "Ready to execute" confirmation) is the reliable
terminal signal for "agent finished without asking."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds {{ANTI_SHORTCUT_CLAUSE}} placeholder backed by a single resolver
function in scripts/resolvers/review.ts. Plan-* review skills can now
include the clause via one placeholder line in their .tmpl rather than
cloning the paragraph four times. Future tightening edits one resolver,
all four skills update on next gen-skill-docs.

Wired into the existing RESOLVERS map alongside generateReviewDashboard
and generatePlanFileReviewReport — no gen-skill-docs.ts change needed
because the generator already does generic placeholder substitution
against that map.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inserts {{ANTI_SHORTCUT_CLAUSE}} placeholder immediately after the
**Anti-skip rule:** paragraph in plan-{eng,ceo,design,devex}-review
SKILL.md.tmpl. The four templates use different surrounding section
headers (eng "Review Sections (after scope is agreed)" vs ceo/design/devex
variants), so anchoring on the paragraph rather than the heading works
across all four.

Closes the May 2026 transcript-bug loophole: existing STOP gates name
forbidden actions only AFTER a per-section finding is identified. The
anti-shortcut clause adds the pre-emptive rule — "the plan file is the
OUTPUT of the interactive review, not a substitute for it" — covering
the case the transcript exhibited (skip per-section walk, dump every
finding into one plan write, call ExitPlanMode).

Regenerated SKILL.md for all hosts via bun run gen:skill-docs --host all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds 4 finding-floor tests (one per plan-* skill) that catch the May
2026 transcript-bug class — model wrote a plan and called ExitPlanMode
without firing any review-phase AskUserQuestion. Asserts via
runPlanSkillFloorCheck that ANY non-permission AUQ render fires before
the agent reaches plan_ready.

Verified:
- Eng floor: passed in 59s
- CEO floor: passed in 197s
- Design floor: passed
- Devex floor: passed
- Total ~$2-6 per CI run; only triggers on diff against the 4 plan-*
  templates, the shared resolver review.ts, the seeds fixture, or the
  PTY runner helper.

Fixtures live in test/fixtures/forcing-finding-seeds.ts, one constant
per skill. Each seed is engineered to force at least one obvious
finding under that skill's review focus (architectural smell for eng,
scope-creep for ceo, UI-slop for design, painful onboarding for devex).

Touchfiles wiring:
- E2E_TOUCHFILES: 4 plan-*-finding-floor entries with deps on the
  matching skill template, the shared resolver, the seeds fixture,
  and the PTY runner helper
- E2E_TIERS: all 4 entries marked 'gate'
- touchfiles.test.ts: count assertion bumped 21→22 with explicit
  plan-ceo-finding-floor containment check

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…skuser-fix

# Conflicts:
#	test/helpers/touchfiles.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 7, 2026

E2E Evals: ✅ PASS

66/66 tests passed | $8.61 total cost | 12 parallel runners

Suite Result Status Cost
e2e-browse 7/7 $0.37
e2e-deploy 6/6 $1.25
e2e-design 4/4 $0.82
e2e-plan 8/8 $1.74
e2e-qa-workflow 3/3 $1.12
e2e-review 6/6 $1.09
e2e-workflow 4/4 $0.6
llm-judge 25/25 $0.5
e2e-qa-workflow 3/3 $1.12

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

@garrytan garrytan merged commit 7b4738b into main May 7, 2026
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant