
feat(map-efficient): wire deferred_nondeterministic into Monitor verdict path (part of #252) #299

Merged
azalio merged 1 commit into main from feat/252-flaky-deferred-disposition-wiring
Jun 27, 2026

azalio (Owner) commented Jun 27, 2026

What

Wires the third Monitor outcome, deferred_nondeterministic (for confirmed flaky tests), into the core verdict path. Part of #252 (epic stays open).

The flaky-triage primitives (run_/record_/validate_flaky_test_triage) and the defer_flaky_subtask close+advance command already existed, but were disjoint from the Monitor verdict path: Monitor could only emit valid:true/valid:false, had no field to signal a flaky defer, and validate_step 2.4 could only pass or hard-stop. A confirmed flake therefore forced an out-of-band manual defer_flaky_subtask. This makes the third outcome part of the structured verdict.

Changes

  • Monitor schema (monitor.md.jinja): optional structured disposition: {kind, check_id} field (kind enum {deferred_nondeterministic}), absent for normal verdicts, with guidance to emit it on confirmed mixed pass/fail evidence instead of demanding a fake Actor fix.
  • Orchestrator (map_orchestrator.py.jinja): validate_step 2.4 --disposition deferred_nondeterministic --check-id <id> --monitor-envelope - routes to the existing defer_flaky_subtask in-process (single owner of the close+advance transaction), placed BEFORE the recommendation gates so a defer carrying recommendation=needs_investigation is not hard-stopped.
  • Anti-gaming: the deferral is honored ONLY when the Monitor envelope structurally backs it (valid:false, non-empty failed_checks, a structured disposition whose kind+check_id match the flags) AND the sidecar holds mixed pass/fail evidence for that check_id (re-validated from disk). A Monitor cannot dodge a real deterministic failure or a green check by claiming "flaky"; recommendation in {revise, block} + a disposition is rejected as contradictory. (failed_checks lists failed quality dimensions, a different namespace from a flaky check id — so the bind is "admit a dimension failure + dispositions match" rather than "check_id ∈ failed_checks".)
  • Verdict vs routing: a deferred run returns valid:false + deferred:true + non_green_outcome:true (a deferral is NOT green — a routing decision, not a clean pass); the CLI exits 0 on a deferral, 1 only on a true invalid verdict.
  • Single source of truth: MONITOR_DISPOSITIONS policy dict drives routing, the --disposition CLI surface, and a drift-guard test (the Monitor prompt must name every supported disposition).
  • Codex parity: separate codex source tree updated (monitor.toml + map-efficient skill docs).
  • Docs: USAGE / ARCHITECTURE / CHANGELOG.
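The anti-gaming bind described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in map_orchestrator.py.jinja: the function names, the reason strings, and the sidecar record shape (a list of dicts with `check_id` and `passed` keys) are assumptions made for the example. Only the envelope fields (`valid`, `failed_checks`, `recommendation`, `disposition.kind`, `disposition.check_id`) come from the description.

```python
from typing import Any

# Dispositions the sketch accepts; the PR names only this one kind.
SUPPORTED_KINDS = {"deferred_nondeterministic"}
# Recommendations that contradict a deferral, per the PR description.
CONTRADICTORY_RECOMMENDATIONS = {"revise", "block"}


def envelope_backs_deferral(
    envelope: dict[str, Any], kind: str, check_id: str
) -> tuple[bool, str]:
    """Return (ok, reason): does the Monitor envelope structurally back the deferral?"""
    if kind not in SUPPORTED_KINDS:
        return False, "unknown disposition"
    if envelope.get("valid") is not False:
        return False, "envelope must be valid:false"
    if not envelope.get("failed_checks"):
        return False, "failed_checks must be non-empty"
    if envelope.get("recommendation") in CONTRADICTORY_RECOMMENDATIONS:
        return False, "recommendation contradicts disposition"
    disposition = envelope.get("disposition")
    if not isinstance(disposition, dict):
        return False, "missing structured disposition"
    if disposition.get("kind") != kind or disposition.get("check_id") != check_id:
        return False, "disposition does not match flags"
    return True, "ok"


def sidecar_has_mixed_evidence(runs: list[dict[str, Any]], check_id: str) -> bool:
    """True only if the recorded runs for check_id include both a pass and a fail.

    The record shape here is hypothetical; the real sidecar format is not shown in this PR.
    """
    outcomes = {run.get("passed") for run in runs if run.get("check_id") == check_id}
    return outcomes == {True, False}
```

In this sketch a deferral would be honored only when both functions agree, which mirrors the "envelope backs it AND sidecar holds mixed evidence" rule: an all-fail sidecar (a deterministic failure) or an all-pass sidecar (a green check) does not qualify.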

Design was llm-council-reviewed (conv d3ddca63) — coupling (shared in-process owner, no reload hazard), anti-gaming bind, gate ordering, valid:false-not-true correction, and single-source policy dict all per the council synthesis.
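For readers unfamiliar with the single-source pattern, here is a minimal sketch of how one policy dict can drive both the CLI surface and the exit code. The dict contents (`handler`, `requires_check_id`), `build_parser`, and `exit_code` are hypothetical names chosen for illustration; only the MONITOR_DISPOSITIONS name, the three flags, and the exit-code rule (0 on a deferral, 1 on a true invalid verdict) are taken from the description.

```python
import argparse
from typing import Any

# Hypothetical policy entries; the real dict's value shape is not shown in this PR.
MONITOR_DISPOSITIONS: dict[str, dict[str, Any]] = {
    "deferred_nondeterministic": {
        "handler": "defer_flaky_subtask",
        "requires_check_id": True,
    },
}


def build_parser() -> argparse.ArgumentParser:
    """Derive the --disposition choices from the policy dict so the two cannot drift."""
    parser = argparse.ArgumentParser(prog="validate_step")
    parser.add_argument("--disposition", choices=sorted(MONITOR_DISPOSITIONS))
    parser.add_argument("--check-id")
    parser.add_argument("--monitor-envelope")
    return parser


def exit_code(result: dict[str, Any]) -> int:
    """Map a verdict result to a process exit code.

    A deferral is valid:false but still exits 0, because it is a routing
    decision rather than a failure; only a non-deferred invalid verdict exits 1.
    """
    if result.get("deferred"):
        return 0
    return 0 if result.get("valid") else 1
```

With this shape, adding a new disposition means adding one dict entry; the parser picks it up automatically, and argparse rejects any value that is not a key.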

Tests

New TestValidateStepDisposition + TestMonitorDispositionSingleSource in tests/test_map_orchestrator.py:

  • deferred disposition + valid sidecar + matching envelope → valid:false+deferred:true, advances
  • missing sidecar / deterministic-failure sidecar / check_id-not-in-sidecar → hard-stop, no advance
  • envelope valid:true / empty failed_checks / disposition-check_id mismatch → binding rejected
  • contradictory recommendation=revise + disposition → rejected
  • missing --monitor-envelope / missing --check-id / unknown disposition → rejected
  • normal disposition-less verdict unaffected
  • CLI subprocess: deferral exits 0, rejected deferral exits 1 (real entrypoint, project layout)
  • drift guard: Monitor prompt names every MONITOR_DISPOSITIONS key; CLI surface matches
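The drift guard in the last bullet amounts to a containment check between the policy dict and the rendered Monitor prompt. A minimal sketch, assuming the prompt is available as a plain string; `missing_dispositions` and the sample prompt text are hypothetical, and the real test in tests/test_map_orchestrator.py may be structured differently:

```python
from typing import Any

# Stand-in for the real policy dict; only the key is taken from the PR.
MONITOR_DISPOSITIONS: dict[str, dict[str, Any]] = {"deferred_nondeterministic": {}}


def missing_dispositions(prompt_text: str, dispositions: dict[str, Any]) -> list[str]:
    """Return the disposition keys the Monitor prompt fails to mention, sorted."""
    return sorted(key for key in dispositions if key not in prompt_text)


def test_prompt_names_every_disposition() -> None:
    # Hypothetical prompt excerpt; a real test would read the rendered template.
    prompt = "Emit disposition.kind = deferred_nondeterministic on mixed pass/fail evidence."
    assert missing_dispositions(prompt, MONITOR_DISPOSITIONS) == []
```

Returning the list of missing keys, rather than a bare boolean, lets the assertion failure name exactly which disposition the prompt forgot.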

Verification

make check green locally: ruff/mypy/pyright report 0 issues, 2961 passed, 3 skipped, render check ✅ (Generated trees match templates_src).

Closes the last core slice of #252.

…ict path (part of #252)

The flaky-triage primitives (run_/record_/validate_flaky_test_triage) and the
defer_flaky_subtask close+advance command already existed, but were disjoint
from the Monitor verdict path: Monitor could only emit valid:true/false, had no
field to signal a flaky defer, and validate_step 2.4 could only pass or
hard-stop. A confirmed flake therefore forced an out-of-band manual
defer_flaky_subtask. This makes the third Monitor outcome part of the
structured verdict.

- Monitor schema: optional structured `disposition: {kind, check_id}` field
  (kind enum {deferred_nondeterministic}), absent for normal verdicts, with
  guidance to emit it on confirmed mixed pass/fail evidence.
- Orchestrator: validate_step 2.4 --disposition routes to defer_flaky_subtask
  in-process (single owner of close+advance), BEFORE the recommendation gates so
  a defer with recommendation=needs_investigation is not hard-stopped.
- Anti-gaming: deferral honored only when the envelope structurally backs it
  (valid:false, non-empty failed_checks, matching disposition) AND the sidecar
  holds mixed pass/fail evidence for that check_id; revise/block + disposition is
  rejected as contradictory. failed_checks lists quality dimensions (a different
  namespace from a flaky check id), so the bind is admit-failure + dispositions-
  match rather than check_id-in-failed_checks.
- Verdict vs routing: a deferred run returns valid:false + deferred:true +
  non_green_outcome:true; CLI exits 0 on a deferral, 1 only on a true invalid
  verdict.
- Single-source MONITOR_DISPOSITIONS policy dict drives routing, the
  --disposition CLI surface, and a drift-guard test.
- Codex parity (separate source tree): monitor.toml + map-efficient skill docs.
- Tests: three-way split, anti-gaming rejections, CLI exit codes, drift guard.
  Docs: USAGE / ARCHITECTURE / CHANGELOG.

Design llm-council-reviewed (conv d3ddca63). make check green (ruff/mypy/pyright
0, 2961 tests, render check).
azalio merged commit 1ff4d22 into main on Jun 27, 2026
6 checks passed
azalio deleted the feat/252-flaky-deferred-disposition-wiring branch on June 27, 2026 15:18