Merged
11 changes: 8 additions & 3 deletions .agents/skills/map-efficient/SKILL.md
@@ -246,9 +246,14 @@ run evidence with `run_flaky_test_triage` (or `record_flaky_test_triage` if the
repeated runs were already collected) and validate
`flaky_test_triage.json` before reporting `deferred_nondeterministic`. This is
not a passing gate: do not weaken, skip, or delete the check, and do not return
a silent green. After validation, close with
`python3 .map/scripts/map_orchestrator.py defer_flaky_subtask "$SUBTASK_ID" --check-id "<check-id>"`,
not `validate_step 2.4 --recommendation proceed`.
a silent green. Monitor signals the defer as the third verdict outcome —
`valid:false` plus `disposition {kind:deferred_nondeterministic, check_id}`
(recommendation omitted or `needs_investigation`). Close via the verdict path:
`validate_step 2.4 --disposition deferred_nondeterministic --check-id "<check-id>" --monitor-envelope -`
(honored only when sidecar + envelope back it; deferral is `valid:false`+`deferred:true`,
non-green, exit 0). `defer_flaky_subtask "$SUBTASK_ID" --check-id "<check-id>"`
remains the lower-level direct close. Do not close this with
`validate_step 2.4 --recommendation proceed`.

On a clean pass, run the regression gate and record the subtask:

31 changes: 23 additions & 8 deletions .agents/skills/map-efficient/efficient-reference.md
@@ -44,8 +44,12 @@ python3 .map/scripts/map_step_runner.py run_flaky_test_triage \
--timeout 120 \
-- python -m pytest tests/test_file.py::test_name
python3 .map/scripts/map_step_runner.py validate_flaky_test_triage
python3 .map/scripts/map_orchestrator.py defer_flaky_subtask "$SUBTASK_ID" \
--check-id "pytest::test_name"
# Preferred close — the verdict path. Monitor emits valid:false plus
# disposition {kind: deferred_nondeterministic, check_id: ...}; close 2.4 with
# the same disposition piped through (see "Verdict-path route" below).
echo "$MONITOR_JSON" | python3 .map/scripts/map_orchestrator.py \
validate_step 2.4 --disposition deferred_nondeterministic \
--check-id "pytest::test_name" --monitor-envelope -
```

The runner executes argv with `shell=False`; shell syntax is not interpreted. If
@@ -61,8 +65,6 @@ python3 .map/scripts/map_step_runner.py record_flaky_test_triage \
--command "pytest tests/test_file.py::test_name" \
--reason "Mixed pass/fail outcomes across repeated runs."
python3 .map/scripts/map_step_runner.py validate_flaky_test_triage
python3 .map/scripts/map_orchestrator.py defer_flaky_subtask "$SUBTASK_ID" \
--check-id "pytest::test_name"
```

Mixed pass/fail evidence writes `.map/<branch>/flaky_test_triage.json`, updates
@@ -71,10 +73,23 @@ the `flaky_test_triage` manifest stage, and returns
gate: do not weaken, skip, or delete the check, and do not return a silent
green. Monitor must include the recorded defer evidence and
`monitor_verdict_policy=not_valid_without_explicit_triage` in its finding.
After validation, close the subtask via `defer_flaky_subtask`, not the clean-pass
close command; the orchestrator records
`status=deferred_nondeterministic` with evidence metadata and advances without
requeueing Actor.

**Verdict-path route (preferred).** The third Monitor outcome is wired into the
2.4 close: `validate_step 2.4 --disposition deferred_nondeterministic --check-id
<id> --monitor-envelope -`. The deferral is honored ONLY when (a) the Monitor
envelope is `valid:false` with a non-empty `failed_checks` and a structured
`disposition` matching the flags, and (b) the sidecar holds mixed pass/fail
evidence for that `check_id` — so a deterministic failure or a green check can
never be deferred. A deferred run returns `valid:false` + `deferred:true`
(non-green, exit 0, not a hard-stop), records `status=deferred_nondeterministic`
with evidence metadata, and advances without requeueing Actor. `recommendation`
may be omitted or `needs_investigation`; `revise`/`block` are rejected as
contradictory.
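
The two conditions above can be sketched as a single predicate. This is a minimal illustration, not the orchestrator's actual code: the function name `deferral_is_honored` and the `sidecar_outcomes` list of `"pass"`/`"fail"` strings are assumptions made for the example, since the real layout of `flaky_test_triage.json` is not shown here.

```python
from typing import Any, Mapping, Sequence


def deferral_is_honored(
    envelope: Mapping[str, Any],
    sidecar_outcomes: Sequence[str],  # hypothetical: one "pass"/"fail" per repeated run
    flag_kind: str,
    flag_check_id: str,
) -> bool:
    """Return True only when both the envelope and the sidecar back the deferral."""
    disposition = envelope.get("disposition")
    if not isinstance(disposition, Mapping):
        return False

    # (a) Envelope: non-green verdict, non-empty failed_checks, and a
    #     structured disposition that matches the CLI flags.
    envelope_ok = (
        envelope.get("valid") is False
        and bool(envelope.get("failed_checks"))
        and disposition.get("kind") == flag_kind
        and disposition.get("check_id") == flag_check_id
    )

    # recommendation may be omitted or needs_investigation; anything else
    # (including revise/block) is treated as contradictory in this sketch.
    recommendation_ok = envelope.get("recommendation") in (None, "needs_investigation")

    # (b) Sidecar: mixed evidence means at least one pass AND at least one fail,
    #     so an all-fail (deterministic) or all-pass (green) check is refused.
    outcomes = set(sidecar_outcomes)
    mixed = "pass" in outcomes and "fail" in outcomes

    return envelope_ok and recommendation_ok and mixed
```

Requiring both halves is what keeps the route from being gamed: an envelope alone cannot defer a check the sidecar shows as consistently failing or consistently passing.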

**Lower-level command.** `defer_flaky_subtask "$SUBTASK_ID" --check-id <id>`
performs the same close+advance directly (e.g. an operator deferral with no
Monitor envelope); the verdict-path route calls it internally after the
envelope/anti-gaming checks.

## Wave Execution

32 changes: 32 additions & 0 deletions .claude/agents/monitor.md
@@ -1607,6 +1607,21 @@ Do NOT invent issues to justify review effort. Empty `issues` array is valid.
`valid: false`. Do not emit `valid: true` + `recommendation: "revise"`
— it is a contradiction that downstream workflows treat as a clean
pass and silently skip the recommended revision.
- **Flaky / nondeterministic check → `disposition` (the third outcome).**
When a check fails but repeated runs of the EXACT command show mixed
pass/fail (real nondeterminism, NOT a deterministic regression you can
reproduce on demand), do NOT demand an Actor "fix" and do NOT return a
silent green. Emit `valid: false` PLUS
`"disposition": {"kind": "deferred_nondeterministic", "check_id": "<id>"}`,
list the failing dimension in `failed_checks`, and set `recommendation` to
`needs_investigation` or omit it (NEVER `revise`/`block` — that contradicts
the deferral). The `check_id` MUST match the id in the
`.map/<branch>/flaky_test_triage.json` sidecar. The skill closes the
subtask via `validate_step 2.4 --disposition deferred_nondeterministic
--check-id <id> --monitor-envelope -`, which honors the deferral ONLY when
the sidecar holds mixed pass/fail evidence — so you cannot defer a
deterministic failure or a green check. A deferral is a recorded non-green
outcome, never a pass.
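
As an illustration of the envelope shape the flaky-check rule above describes, the following sketch builds the deferral-related fields in Python. The helper name `build_deferral_envelope` and the placeholder `failed_checks` value are hypothetical, and the other fields the complete schema may require are deliberately left out.

```python
import json
from typing import Any, Dict, List


def build_deferral_envelope(
    check_id: str, failed_checks: List[str], summary: str
) -> Dict[str, Any]:
    """Build only the deferral-related part of a Monitor verdict (illustrative)."""
    if not check_id:
        raise ValueError("check_id must match the id recorded in the triage sidecar")
    if not failed_checks:
        raise ValueError("failed_checks must list the failing dimension")
    return {
        "valid": False,  # a deferral is never a pass
        "summary": summary,
        "failed_checks": list(failed_checks),
        # needs_investigation (or omitted) only; revise/block contradict the deferral.
        "recommendation": "needs_investigation",
        "disposition": {
            "kind": "deferred_nondeterministic",
            "check_id": check_id,
        },
    }


if __name__ == "__main__":
    envelope = build_deferral_envelope(
        "pytest::test_name",
        ["tests"],  # placeholder dimension name, not taken from the schema
        "Mixed pass/fail across repeated runs of the exact command.",
    )
    print(json.dumps(envelope, indent=2))
```

The printed JSON is the kind of payload that would be piped to `validate_step 2.4 ... --monitor-envelope -`.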

### JSON Schema Definition (Complete)

@@ -1812,6 +1827,23 @@ Do NOT invent issues to justify review effort. Empty `issues` array is valid.
"description": "ID of next subtask to mark as in_progress (optional)"
}
}
},
"disposition": {
"type": "object",
"description": "OPTIONAL non-binary verdict outcome. Include ONLY when valid:false AND the failure is a CONFIRMED flaky/nondeterministic check backed by repeated-run mixed pass/fail evidence (a flaky_test_triage sidecar) — never for a deterministic regression. Omit entirely for normal verdicts. Routes the subtask to a recorded deferral (non-green, not a hard-stop retry) instead of demanding an Actor fix.",
"required": ["kind", "check_id"],
"additionalProperties": false,
"properties": {
"kind": {
"type": "string",
"enum": ["deferred_nondeterministic"],
"description": "The deferral kind. deferred_nondeterministic = confirmed flaky, evidence recorded, advance without retry."
},
"check_id": {
"type": "string",
"description": "The flaky check id; MUST match the check_id recorded in .map/<branch>/flaky_test_triage.json."
}
}
}
}
}
2 changes: 1 addition & 1 deletion .claude/skills/map-efficient/SKILL.md
@@ -398,7 +398,7 @@ Return JSON with valid, summary, issues, files_changed, tests_run, and escalatio
intentional contract rewrite; see [efficient-reference.md](efficient-reference.md).
- If `valid=false`, write `code-review-N.md`, run `python3 .map/scripts/map_orchestrator.py monitor_failed --feedback "<feedback>"`, inspect `retry_isolation`, and invoke Predictor only when stuck/high-risk escalation rules apply. **Worktree isolation:** if enabled, run `discard_subtask_worktree "$SUBTASK_ID"` BEFORE retrying (atomic reject — a failed attempt is never merged; retry starts from a clean worktree). Recipe: [efficient-reference.md](efficient-reference.md#worktree-isolation). **If `monitor_failed` returns `status:"max_retries"` (budget exhausted), do NOT retry — run `python3 .map/scripts/map_step_runner.py build_escalation_outcome "$SUBTASK_ID" max_retries --retry-count <retry_count> --max-retries <max_retries>` and STOP with its `outcome` (surface the blocker to the user).**
- **Intra-run failure memory + bounded-effort escalation (MANDATORY on every `valid=false`):** record the rejection with `python3 .map/scripts/map_step_runner.py record_failure_signature "<monitor feedback>" "$SUBTASK_ID"`. If `armed:true`, prepend the block from `build_anti_repeat_constraint "$SUBTASK_ID"` (add `--quarantine-active` when CLEAN_RETRY is set) to the TOP of the next Actor prompt. If `escalation_recommended:true` (#255), the 3rd identical failure means the bounded recovery act did not work — do NOT retry and do NOT run the legacy retry-3 Stuck-Recovery for this identical loop; run `python3 .map/scripts/map_step_runner.py build_escalation_outcome "$SUBTASK_ID" repeated_failure` (add `--quarantine-active` on a CLEAN_RETRY iteration) and STOP with its `outcome:"BLOCKED"`. A `status:"not_escalated"` means the latest failure was a NEW signature (the Actor moved off the dead end) — resume normal retries. Full recipe: [efficient-reference.md](efficient-reference.md).
- If `retry_isolation=clean_retry_required`, validate `.map/<branch>/retry_quarantine.json` before CLEAN_RETRY. If a test/check fails inconsistently, collect repeated evidence with `run_flaky_test_triage ...` (or manually with `record_flaky_test_triage ...` if already collected), validate `.map/<branch>/flaky_test_triage.json`, then close via `python3 .map/scripts/map_orchestrator.py defer_flaky_subtask "$SUBTASK_ID" --check-id "<check-id>"`; this is not a passing gate and must not weaken/skip/delete the check. Full recipe: [efficient-reference.md](efficient-reference.md).
- If `retry_isolation=clean_retry_required`, validate `.map/<branch>/retry_quarantine.json` before CLEAN_RETRY. If a test/check fails inconsistently, collect repeated evidence with `run_flaky_test_triage ...` (or manually with `record_flaky_test_triage ...` if already collected), validate `.map/<branch>/flaky_test_triage.json`. Monitor must then emit `valid:false` + `disposition {kind:deferred_nondeterministic, check_id}`; close via the verdict-path route `validate_step 2.4 --disposition deferred_nondeterministic --check-id "<check-id>" --monitor-envelope -` (honored only when sidecar + envelope back it; deferral is `valid:false`+`deferred:true`, non-green, exit 0). `defer_flaky_subtask` remains the lower-level direct close. This is not a passing gate and must not weaken/skip/delete the check. Full recipe: [efficient-reference.md](efficient-reference.md).
- Treat test failures after Monitor approval as Monitor failure. **Cross-subtask regression gate (MANDATORY):** before the test gate, run `python3 .map/scripts/map_step_runner.py detect_cross_subtask_regression_risk "$BRANCH" "$SUBTASK_ID"`; if `recommended_gate == "full_suite"` you MUST run the FULL suite (never a `-k` subset) before commit / `record_subtask_result` — per-subtask Monitor is blind to regressions on prior subtasks' code. Recipe: [efficient-reference.md](efficient-reference.md).

### Phase: ADVANCE_SUBTASK (synthetic boundary)
30 changes: 23 additions & 7 deletions .claude/skills/map-efficient/efficient-reference.md
@@ -35,8 +35,13 @@ python3 .map/scripts/map_step_runner.py run_flaky_test_triage \
--timeout 120 \
-- python -m pytest tests/test_file.py::test_name
python3 .map/scripts/map_step_runner.py validate_flaky_test_triage
python3 .map/scripts/map_orchestrator.py defer_flaky_subtask "$SUBTASK_ID" \
--check-id "pytest::test_name"
# Preferred close — the verdict-path route. Monitor emits valid:false plus
# disposition {kind: deferred_nondeterministic, check_id: ...}; close 2.4 with
# the same disposition piped through. The orchestrator routes to deferral ONLY
# when the sidecar + envelope back it (see "Verdict-path route" below).
echo "$MONITOR_JSON" | python3 .map/scripts/map_orchestrator.py \
validate_step 2.4 --disposition deferred_nondeterministic \
--check-id "pytest::test_name" --monitor-envelope -
```

`run_flaky_test_triage` executes argv with `shell=False`; shell syntax is not
@@ -52,19 +57,30 @@ python3 .map/scripts/map_step_runner.py record_flaky_test_triage \
--command "pytest tests/test_file.py::test_name" \
--reason "Mixed pass/fail outcomes across repeated runs."
python3 .map/scripts/map_step_runner.py validate_flaky_test_triage
python3 .map/scripts/map_orchestrator.py defer_flaky_subtask "$SUBTASK_ID" \
--check-id "pytest::test_name"
```

Mixed pass/fail evidence is classified as `deferred_nondeterministic` and
stored in `.map/<branch>/flaky_test_triage.json` plus the `flaky_test_triage`
manifest stage. This is an explicit recorded defer, not a pass: the artifact
sets `monitor_verdict_policy=not_valid_without_explicit_triage`, and Monitor
must still report the deferred evidence rather than returning a silent green.
After validation, close the subtask via `defer_flaky_subtask`, not the
clean-pass close command; it records

**Verdict-path route (preferred).** The third Monitor outcome is wired into the
2.4 close itself: `validate_step 2.4 --disposition deferred_nondeterministic
--check-id <id> --monitor-envelope -`. The deferral is honored ONLY when (a)
the Monitor envelope is `valid:false` with a non-empty `failed_checks` and a
structured `disposition` matching the flags, and (b) the sidecar holds mixed
pass/fail evidence for that `check_id` — so a deterministic failure or a green
check can never be deferred. A deferred run returns `valid:false` +
`deferred:true` (non-green, exit 0, not a hard-stop); it records
`status=deferred_nondeterministic` plus evidence metadata in `step_state.json`
and advances without requeueing Actor.
and advances without requeueing Actor. `recommendation` may be omitted or
`needs_investigation`; `revise`/`block` are rejected as contradictory.

**Lower-level command.** `defer_flaky_subtask "$SUBTASK_ID" --check-id <id>`
performs the same close+advance directly (used when there is no Monitor
envelope to verify, e.g. an operator deferral); the verdict-path route above
calls it internally after the envelope/anti-gaming checks.
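
Both close routes depend on the sidecar holding mixed pass/fail evidence. A minimal sketch of that classification, assuming each repeated run is reduced to its process exit code (0 meaning pass), might look like the following. Only `deferred_nondeterministic` is a label from this document; `consistent_pass` and `consistent_fail` are hypothetical names used for illustration.

```python
from typing import Sequence


def classify_repeated_runs(exit_codes: Sequence[int]) -> str:
    """Classify repeated runs of the exact same command by their exit codes."""
    if len(exit_codes) < 2:
        # A single run cannot demonstrate nondeterminism.
        raise ValueError("need at least two runs to judge nondeterminism")

    passed = sum(1 for code in exit_codes if code == 0)
    failed = len(exit_codes) - passed

    if passed and failed:
        # Mixed outcomes: the only case eligible for a recorded deferral.
        return "deferred_nondeterministic"
    # Hypothetical labels: neither of these may be deferred.
    return "consistent_pass" if passed else "consistent_fail"
```

For example, exit codes `[0, 1, 0]` classify as `deferred_nondeterministic`, while `[1, 1, 1]` is a deterministic failure that still needs an Actor fix.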

If a command above ever returns `Unknown function`, grep `map_step_runner.py` for `func_name ==` to confirm the dispatch branch still exists. This list was accurate as of the PR that added it, but the dispatcher itself is the ground truth.

13 changes: 13 additions & 0 deletions .codex/agents/monitor.toml
@@ -112,6 +112,19 @@ Quality Gate Enforcement:
- If Actor trusts external input -> REJECT with security vulnerability details
- If tests missing critical scenarios -> WARN with test case suggestions

Flaky / nondeterministic check -> the third verdict outcome:
- When a check fails but repeated runs of the EXACT command show mixed pass/fail
(real nondeterminism, NOT a deterministic regression), do NOT demand an Actor
fix and do NOT return a silent green. Emit valid: false PLUS
disposition: {"kind": "deferred_nondeterministic", "check_id": "<id>"}, list the
failing dimension in failed_checks, and set recommendation to needs_investigation
or omit it (never revise/block -- that contradicts the deferral). The check_id
MUST match the id in .map/<branch>/flaky_test_triage.json. The skill closes via
validate_step 2.4 --disposition deferred_nondeterministic --check-id <id>
--monitor-envelope -, which honors the defer ONLY when the sidecar holds mixed
pass/fail evidence -- a deterministic failure or a green check can never be
deferred. A deferral is a recorded non-green outcome, never a pass.

---

# Review Process -- FOLLOW THIS ORDER