feat(map-efficient): wire deferred_nondeterministic into Monitor verdict path (part of #252) #299
Merged
…ict path (part of #252)

The flaky-triage primitives (run_/record_/validate_flaky_test_triage) and the defer_flaky_subtask close+advance command already existed, but were disjoint from the Monitor verdict path: Monitor could only emit valid:true/false, had no field to signal a flaky defer, and validate_step 2.4 could only pass or hard-stop. A confirmed flake therefore forced an out-of-band manual defer_flaky_subtask. This makes the third Monitor outcome part of the structured verdict.

- Monitor schema: optional structured `disposition: {kind, check_id}` field (kind enum {deferred_nondeterministic}), absent for normal verdicts, with guidance to emit it on confirmed mixed pass/fail evidence.
- Orchestrator: validate_step 2.4 --disposition routes to defer_flaky_subtask in-process (single owner of close+advance), BEFORE the recommendation gates so a defer with recommendation=needs_investigation is not hard-stopped.
- Anti-gaming: deferral honored only when the envelope structurally backs it (valid:false, non-empty failed_checks, matching disposition) AND the sidecar holds mixed pass/fail evidence for that check_id; revise/block + disposition is rejected as contradictory. failed_checks lists quality dimensions (a different namespace from a flaky check id), so the bind is admit-failure + dispositions-match rather than check_id-in-failed_checks.
- Verdict vs routing: a deferred run returns valid:false + deferred:true + non_green_outcome:true; CLI exits 0 on a deferral, 1 only on a true invalid verdict.
- Single-source MONITOR_DISPOSITIONS policy dict drives routing, the --disposition CLI surface, and a drift-guard test.
- Codex parity (separate source tree): monitor.toml + map-efficient skill docs.
- Tests: three-way split, anti-gaming rejections, CLI exit codes, drift guard. Docs: USAGE / ARCHITECTURE / CHANGELOG.

Design llm-council-reviewed (conv d3ddca63). make check green (ruff/mypy/pyright 0, 2961 tests, render check).
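For reviewers, the anti-gaming bind can be read as a chain of rejections that a deferral must survive. The following is only a minimal sketch of that reading, not the code in `map_orchestrator.py.jinja`: the function name `deferral_is_backed`, the `sidecar_runs` shape (check id mapped to a list of `"pass"`/`"fail"` outcomes), the reason strings, and the `"correctness"` dimension name in the usage note are all assumptions made for illustration.

```python
from collections.abc import Mapping, Sequence
from typing import Any

# Hypothetical constant: recommendations that contradict a flaky deferral.
CONTRADICTORY_RECOMMENDATIONS = frozenset({"revise", "block"})


def deferral_is_backed(
    envelope: Mapping[str, Any],
    sidecar_runs: Mapping[str, Sequence[str]],  # assumed sidecar shape
    kind: str,
    check_id: str,
) -> tuple[bool, str]:
    """Return (honored, reason) for a requested flaky deferral."""
    # The envelope must admit failure: a green verdict cannot be deferred.
    if envelope.get("valid") is not False:
        return False, "envelope must carry valid:false"
    # failed_checks lists quality dimensions, so only non-emptiness is checked.
    if not envelope.get("failed_checks"):
        return False, "failed_checks must be non-empty"
    # The structured disposition must match the CLI flags exactly.
    disposition = envelope.get("disposition")
    if not isinstance(disposition, Mapping):
        return False, "structured disposition missing"
    if disposition.get("kind") != kind or disposition.get("check_id") != check_id:
        return False, "disposition does not match the flags"
    # revise/block together with a disposition is contradictory.
    if envelope.get("recommendation") in CONTRADICTORY_RECOMMENDATIONS:
        return False, "revise/block plus a disposition is contradictory"
    # The sidecar must show both a pass and a fail for this check id.
    outcomes = set(sidecar_runs.get(check_id, ()))
    if not {"pass", "fail"} <= outcomes:
        return False, "sidecar lacks mixed pass/fail evidence"
    return True, "deferral honored"
```

Under these assumptions, an envelope with `valid: False`, `failed_checks: ["correctness"]`, `recommendation: "needs_investigation"` and a matching disposition is honored only when the sidecar lists at least one pass and one fail for the same check id; an all-fail history is treated as a deterministic failure and rejected.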
This was referenced Jun 27, 2026
What
Wires the third Monitor outcome, `deferred_nondeterministic` for confirmed flaky tests, into the core verdict path. Part of #252 (epic stays open).

The flaky-triage primitives (`run_`/`record_`/`validate_flaky_test_triage`) and the `defer_flaky_subtask` close+advance command already existed, but were disjoint from the Monitor verdict path: Monitor could only emit `valid:true`/`valid:false`, had no field to signal a flaky defer, and `validate_step 2.4` could only pass or hard-stop. A confirmed flake therefore forced an out-of-band manual `defer_flaky_subtask`. This makes the third outcome part of the structured verdict.

Changes
- Monitor schema (`monitor.md.jinja`): optional structured `disposition: {kind, check_id}` field (`kind` enum `{deferred_nondeterministic}`), absent for normal verdicts, with guidance to emit it on confirmed mixed pass/fail evidence instead of demanding a fake Actor fix.
- Orchestrator (`map_orchestrator.py.jinja`): `validate_step 2.4 --disposition deferred_nondeterministic --check-id <id> --monitor-envelope -` routes to the existing `defer_flaky_subtask` in-process (single owner of the close+advance transaction), placed BEFORE the recommendation gates so a defer carrying `recommendation=needs_investigation` is not hard-stopped.
- Anti-gaming: deferral is honored only when the envelope structurally backs it (`valid:false`, non-empty `failed_checks`, a structured `disposition` whose kind+check_id match the flags) AND the sidecar holds mixed pass/fail evidence for that `check_id` (re-validated from disk). A Monitor cannot dodge a real deterministic failure or a green check by claiming "flaky"; `recommendation in {revise, block}` + a disposition is rejected as contradictory. (`failed_checks` lists failed quality dimensions, a different namespace from a flaky check id, so the bind is "admit a dimension failure + dispositions match" rather than "check_id ∈ failed_checks".)
- Verdict vs routing: a deferred run returns `valid:false` + `deferred:true` + `non_green_outcome:true` (a deferral is NOT green: it is a routing decision, not a clean pass); the CLI exits `0` on a deferral, `1` only on a true invalid verdict.
- Single-source `MONITOR_DISPOSITIONS` policy dict drives routing, the `--disposition` CLI surface, and a drift-guard test (the Monitor prompt must name every supported disposition).

Design was llm-council-reviewed (conv `d3ddca63`): coupling (shared in-process owner, no reload hazard), anti-gaming bind, gate ordering, `valid:false`-not-`true` correction, and single-source policy dict all per the council synthesis.

Tests
New `TestValidateStepDisposition` + `TestMonitorDispositionSingleSource` in `tests/test_map_orchestrator.py`:

- … `valid:false` + `deferred:true`, advances
- `valid:true` / empty `failed_checks` / disposition-check_id mismatch → binding rejected
- `recommendation=revise` + disposition → rejected
- … `--monitor-envelope` / missing `--check-id` / unknown disposition → rejected
- … `MONITOR_DISPOSITIONS` key; CLI surface matches

Verification
`make check` green locally: ruff/mypy/pyright 0, 2961 passed, 3 skipped, render check ✅ (Generated trees match templates_src).

Closes the last core slice of #252.
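To make the verdict-vs-routing split and the single-source policy dict concrete, here is a rough sketch under stated assumptions. The real contents of `MONITOR_DISPOSITIONS` are not shown in this PR, so the `"route"` entry is a guess at its shape, and `cli_choices`, `exit_code`, and `missing_from_prompt` are hypothetical helper names, not functions from the orchestrator.

```python
from typing import Any

# Assumed shape: one entry per supported disposition kind.
MONITOR_DISPOSITIONS: dict[str, dict[str, str]] = {
    "deferred_nondeterministic": {"route": "defer_flaky_subtask"},
}


def cli_choices() -> list[str]:
    """Derive the --disposition choices from the policy dict (single source)."""
    return sorted(MONITOR_DISPOSITIONS)


def exit_code(result: dict[str, Any]) -> int:
    """Exit 0 on a pass or a deferral, 1 only on a true invalid verdict."""
    if result.get("valid") or result.get("deferred"):
        return 0
    return 1


def missing_from_prompt(prompt_text: str) -> list[str]:
    """Drift guard: list supported dispositions the Monitor prompt fails to name."""
    return [kind for kind in sorted(MONITOR_DISPOSITIONS) if kind not in prompt_text]
```

A deferred result such as `{"valid": False, "deferred": True, "non_green_outcome": True}` maps to exit code 0 here while still reporting `valid: False`, which keeps the verdict (not green) separate from the routing decision (advance). A drift-guard test in this style would assert that `missing_from_prompt(...)` returns an empty list for the rendered Monitor prompt.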