…ause
Five sites in SKILL.md.tmpl uplift to the office-hours b512be7 pattern: the four review-section gates (Architecture, Code Quality, Test, Performance) plus the Step 0 complexity-check trigger. Adds a tool_use reminder ("call the tool directly"), names blocked next steps explicitly, and adds an anti-rationalization clause naming the precise failure mode (loading the schema via ToolSearch and writing the recommendation as chat prose).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… + shared report-at-bottom assertion
Three additions to claude-pty-runner.ts:
1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a user message containing the seeded plan before invoking the skill, with a 3s gap so the message renders before the slash command. claude has no --plan-file flag (verified via claude --help), so message-pump is the route. Lets STOP-gate regression tests force complexity findings.
2. ClassifyResult gains wrote_findings_before_asking with companion strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/Edit to .claude/plans/* precedes any AskUserQuestion render in the session window. Default off: preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes. Six new unit tests cover before/after-AUQ ordering, permission-dialog edge case, strict-off path.
3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' when a plan was actually written.
Also: runPlanSkillObservation now captures obs.planFile on every classifier outcome, not just 'plan_ready'. Catches the case where the skill wrote a plan partway through then paused on a question.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
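The ordering rule behind wrote_findings_before_asking can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in claude-pty-runner.ts: the SessionEvent shape and the function name wroteFindingsBeforeAsking are hypothetical, and the real classifier presumably works on rendered terminal output rather than a typed event list. Only the outcome name, the strictPlanWrites option, the .claude/plans/ path, and the tool names come from the commit message.

```typescript
// Hypothetical event shape, invented for this sketch.
type SessionEvent =
  | { kind: "tool"; name: "Write" | "Edit"; path: string }
  | { kind: "ask" }; // stands in for an AskUserQuestion render

interface ClassifyOptions {
  strictPlanWrites?: boolean; // default off, as described in the commit message
}

// True when a Write/Edit to .claude/plans/ appears before any question render.
function wroteFindingsBeforeAsking(
  events: SessionEvent[],
  opts: ClassifyOptions = {},
): boolean {
  // Strict mode off: a plan write without a prior question stays legitimate.
  if (!opts.strictPlanWrites) return false;
  for (const ev of events) {
    if (ev.kind === "ask") return false; // the question came first
    if (ev.path.includes(".claude/plans/")) return true; // plan write came first
  }
  return false; // neither a question nor a plan write was seen
}

const events: SessionEvent[] = [
  { kind: "tool", name: "Write", path: "/repo/.claude/plans/draft.md" },
  { kind: "ask" },
];
console.log(wroteFindingsBeforeAsking(events, { strictPlanWrites: true })); // prints true
console.log(wroteFindingsBeforeAsking(events)); // prints false
```

Keeping the check opt-in mirrors the stated default: unseeded smoke runs with zero findings may write a plan without ever asking a question.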
…ts + add seeded-plan STOP-gate case
Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts
that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last
## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract;
nothing tested it until now.
Plan-eng additionally gains a third test case: STOP gate fires when seeded
plan forces Step 0 findings. Combines the new initialPlanContent runner
option with --disallowedTools AskUserQuestion to force the Conductor
MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in
{wrote_findings_before_asking, auto_decided, silent_write, exited, timeout}
and that plan_ready outcomes carry a ## Decisions section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
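The two checks described above can be sketched roughly as follows. The function names lastH2, reportIsLastSection, and seededOutcomeOk are hypothetical and are not the helpers in the test files; the sketch also assumes headings are plain lines starting with "## " and does not skip headings inside fenced code blocks. The heading text and the outcome names are taken from the commit message.

```typescript
// Returns the text of the last "## " heading, or null when there is none.
function lastH2(markdown: string): string | null {
  const headings = markdown
    .split("\n")
    .filter((line) => line.startsWith("## ")) // "### " lines do not match
    .map((line) => line.slice(3).trim());
  return headings.pop() ?? null;
}

// The report-at-bottom contract: the review report is the final ## section.
function reportIsLastSection(planContent: string): boolean {
  return lastH2(planContent) === "GSTACK REVIEW REPORT";
}

// Outcomes the seeded-plan case must not land on.
const FORBIDDEN_OUTCOMES = new Set([
  "wrote_findings_before_asking",
  "auto_decided",
  "silent_write",
  "exited",
  "timeout",
]);

// A plan_ready outcome additionally needs a "## Decisions" section in the plan.
function seededOutcomeOk(outcome: string, planContent = ""): boolean {
  if (FORBIDDEN_OUTCOMES.has(outcome)) return false;
  if (outcome === "plan_ready") {
    return planContent.split("\n").some((l) => l.startsWith("## Decisions"));
  }
  return true;
}

const plan = [
  "## Summary",
  "...",
  "## Decisions to confirm",
  "...",
  "## GSTACK REVIEW REPORT",
  "...",
].join("\n");
console.log(reportIsLastSection(plan)); // prints true
console.log(seededOutcomeOk("plan_ready", plan)); // prints true
console.log(seededOutcomeOk("silent_write", plan)); // prints false
```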
Verified duplicates in test/helpers/touchfiles.ts:
- E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps)
AND line 243 (smaller deps); JS object literals: later wins.
- E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same
later-wins rule.
Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings
ran on every PR; design ran weekly only.
Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and
line 399 ('gate'). Also extend the four plan-mode-test entries to include
scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}}
trigger all four siblings in bun run eval:select.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
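The later-wins behaviour that silently demoted the tier can be reproduced in a few lines. The sketch below uses JSON.parse, which applies the same rule to duplicate keys as an object literal does, so the example does not need a duplicate key in TypeScript source. The findDuplicateKeys helper is a hypothetical guard, not something in touchfiles.ts.

```typescript
// Duplicate keys: the later value overwrites the earlier one.
const tiers = JSON.parse(
  '{"plan-design-review-plan-mode": "gate", "plan-design-review-plan-mode": "periodic"}',
) as Record<string, string>;
console.log(tiers["plan-design-review-plan-mode"]); // prints "periodic"

// Hypothetical guard: building the map from an entry list makes repeats detectable.
function findDuplicateKeys(entries: Array<[string, unknown]>): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const [key] of entries) {
    if (seen.has(key)) dupes.add(key);
    seen.add(key);
  }
  return Array.from(dupes);
}

console.log(
  findDuplicateKeys([
    ["plan-eng-review-plan-mode", "gate"],
    ["plan-design-review-plan-mode", "gate"],
    ["plan-design-review-plan-mode", "periodic"],
  ]),
); // prints [ "plan-design-review-plan-mode" ]
```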
Move contributor-flavored bullet (runPlanSkillObservation seeding) into For contributors. Drop branch-internal narrative (Codex review pass, plan iteration tracking) per CHANGELOG-for-users style.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uq-fix
# Conflicts:
#   CHANGELOG.md
#   VERSION
#   package.json
E2E Evals: ✅ PASS
69/69 tests passed | $9.08 total cost | 12 parallel runners
12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite
Summary
Prompt fix (plan-eng-review):
- plan-eng-review/SKILL.md.tmpl lines 116, 139, 152, 160, 169 ported to the office-hours b512be71 STOP-gate pattern. Adds a tool_use reminder, names blocked next steps explicitly, and adds an anti-rationalization clause naming the precise transcript failure mode (loading the AskUserQuestion schema via ToolSearch and writing the recommendation as chat prose).
Test harness extensions (test/helpers/claude-pty-runner.ts):
- runPlanSkillObservation gains initialPlanContent?: string, which pre-pumps a user message containing a seeded draft plan so that STOP-gate regression tests have complexity guaranteed to trigger findings. claude has no --plan-file flag, so message-pump is the route.
- ClassifyResult gains a wrote_findings_before_asking outcome with companion strictPlanWrites? opt on classifyVisible. Fires when a Write/Edit to .claude/plans/* precedes any AskUserQuestion render. Default off: preserves zero-findings → write plan → plan_ready as legitimate for unseeded smokes.
- assertReportAtBottomIfPlanWritten(obs) wraps the existing assertReviewReportAtBottom(content) and gates on obs.planFile (artifact existing), so the assertion fires under both 'asked' and 'plan_ready' whenever a plan was written.
E2E test wiring (4 plan-mode E2E test files):
- The skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts files now assert ## GSTACK REVIEW REPORT is the last ## section of the plan file whenever one was written. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract; nothing tested it until now.
- Plan-eng gains a third test case (STOP gate fires when seeded plan forces Step 0 findings) combining initialPlanContent + --disallowedTools AskUserQuestion.
Touchfiles dedupe (test/helpers/touchfiles.ts):
- Removed the duplicate plan-design-review-plan-mode keys at line 243 (E2E_TOUCHFILES) and line 524 (E2E_TIERS). Effective tier was silently periodic, not gate. Three of four plan-mode siblings ran on PR CI; design ran weekly only. Now all four run on PR CI again.
- Added scripts/resolvers/review.ts to all four plan-mode-test entries so changes to the {{PLAN_FILE_REVIEW_REPORT}} resolver text trigger all four siblings in bun run eval:select.
Test Coverage
Tests: 203 → 203 files (+0 new files; 6 new test cases inside existing files + 1 paid E2E case)
Pre-Landing Review
No critical findings. Diff is tightly scoped (395 lines added across prompt + test helpers). Codex outside voice already reviewed during /plan-eng-review and found 8 issues (3 blockers); all addressed via D3/D4/D5 captured in ~/.claude/plans/system-instruction-you-are-working-fancy-aho.md.
Plan Completion
- runPlanSkillObservation w/ initialPlanContent (commit ece2650)
- wrote_findings_before_asking outcome (commit ece2650)
- assertReviewReportAtBottom in 4 sibling tests (commit 99833de)
7/7 plan items DONE.
Documentation
- Moved the runPlanSkillObservation initialPlanContent bullet out of the user-facing "What you can now do" section into "For contributors" (it's a test-runner helper, not a user feature). Dropped the branch-development narrative bullet per the CHANGELOG-is-for-users rule.
No other doc updates warranted; the diff is internal hardening. README, ARCHITECTURE, CONTRIBUTING, and CLAUDE.md remain accurate.
TODOS
No TODO items completed in this PR. Adjacent TODO ("Per-finding AskUserQuestion count assertion for /plan-ceo-review") is related but distinct — that's a streaming-counter test, not a structural STOP-gate test.
Test plan
- The failing test (plan-review generated preambles stay under the Option A budget) is pre-existing on main (verified by stashing changes; the failure persists). The codex-hardening test passes 25/25 in isolation; it timed out under full-suite load and re-ran clean.
- EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts (~$4/run). The new seeded-plan test on plan-eng must land at 'asked' (most likely) or 'plan_ready' with ## Decisions to confirm. All four sibling tests must pass the new report-at-bottom assertion when they reach 'plan_ready'.
🤖 Generated with Claude Code