Evict pinned sessions and retry Mongo init in the idle reaper #320
lewtun wants to merge 3 commits
Production incident: users hit the per-user 10-session cap hours after their last activity. The reaper permanently skipped sessions with a pending approval (including usage-threshold and YOLO-cap prompts that are auto-created at turn end) and sessions stuck processing, and a single failed Mongo ping at boot disabled reaping until restart.

- Replace the reaper's idle predicate with a per-session verdict classifier: acknowledgement prompts (usage/YOLO) evict at the normal 15-min idle window, real tool approvals after 60 min (REAPER_TOOL_APPROVAL_IDLE_MINUTES), and processing turns that have emitted no event for 15 min (REAPER_STALLED_MINUTES) are evicted as stalled. Staleness is tracked via a new last_event_at stamped on the event broadcast path, so healthy long turns are spared.
- Stalled evictions set the cooperative interrupt flag before cancelling the task and ignore queued submissions (a hung turn never drains its queue).
- _reap_one re-checks the same verdict under the lock, so a session whose prompt is answered mid-reap aborts cleanly; evicted snapshots persist runtime_state waiting_approval/idle, never processing.
- Retry MongoSessionStore.init() at each sweep while disabled (maybe_reconnect), so one boot blip no longer kills reaping.
- Log a per-sweep counter line (evictions by reason, skips, aborts) and break down what holds the user's slots in the 503 capacity error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
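To make the classifier described above concrete, here is a minimal sketch in Python. It is not the PR's `_reaper_verdict`: the `Verdict` names, the `SessionView` stand-in, the `"ack"`/`"tool"` markers, and the order of the checks are all assumptions made for illustration. Only the three window constants and the `last_active_at`/`last_event_at` clocks come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Window defaults quoted in the description (minutes).
REAPER_IDLE_MINUTES = 15
REAPER_TOOL_APPROVAL_IDLE_MINUTES = 60
REAPER_STALLED_MINUTES = 15


class Verdict(Enum):
    """Hypothetical verdict names; the real classifier may use others."""
    KEEP = "keep"
    EVICT_IDLE = "evict_idle"
    EVICT_PENDING_ACK = "evict_pending_ack"
    EVICT_TOOL_APPROVAL = "evict_tool_approval"
    EVICT_STALLED = "evict_stalled"


@dataclass
class SessionView:
    """Hypothetical minimal stand-in for a live session."""
    last_active_at: datetime
    last_event_at: datetime
    is_processing: bool = False
    pending_approval: Optional[str] = None  # None, "ack", or "tool" (assumed)


def reaper_verdict(session: SessionView, now: datetime) -> Verdict:
    # A processing turn is judged by event silence, not by user activity,
    # so a long turn that keeps emitting events is spared.
    if session.is_processing:
        silent_for = now - session.last_event_at
        if silent_for >= timedelta(minutes=REAPER_STALLED_MINUTES):
            return Verdict.EVICT_STALLED
        return Verdict.KEEP

    idle_for = now - session.last_active_at

    # Real tool approvals get the longer window.
    if session.pending_approval == "tool":
        limit = timedelta(minutes=REAPER_TOOL_APPROVAL_IDLE_MINUTES)
        return Verdict.EVICT_TOOL_APPROVAL if idle_for >= limit else Verdict.KEEP

    # Acknowledgement prompts and plain idle sessions share the idle window.
    if idle_for < timedelta(minutes=REAPER_IDLE_MINUTES):
        return Verdict.KEEP
    if session.pending_approval == "ack":
        return Verdict.EVICT_PENDING_ACK
    return Verdict.EVICT_IDLE
```

Because the function is pure (state in, verdict out), the same call can be repeated under the session lock just before eviction, which is how a mid-reap state change would surface as a verdict mismatch.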
Claude finished @lewtun's task in 4m 33s

PR Review: Evict pinned sessions and retry Mongo init in the idle reaper

No blocking issues: 1 P1, 2 P2. The verdict-classifier design is sound: the fixed-

P1: "heartbeats keep a turn alive" is inaccurate; the stalled window only watches streamed events

Consequence: a turn parked in a single tool call that produces no output for

Not blocking (it's a documented tradeoff), but worth either (a) fixing the comments to say "streamed events," not "heartbeats," and (b) confirming long bash/sandbox tool calls actually emit periodic events, or raising the default window.

P2: sweep log fires every interval on any busy server

The log gate

P2: capacity message labels auto-created ack prompts as "awaiting your approval"
Verified the verdict ordering (
Co-authored-by: OpenAI Codex <codex@openai.com>
Claude finished @lewtun's task in 4m 7s

Re-review:
Claude finished @lewtun's task in 41s

Re-review:
| Action | Before | After |
|---|---|---|
| actions/checkout | v4 | v6 |
| astral-sh/setup-uv | v5 | v7 |
| actions/setup-python | v5 | v6 |
| anthropics/claude-code-action | v1 | v1.0.137 |

No provenance or routing concerns. Per the re-review convergence rule I'm suppressing prior P1s; the one open follow-up from my last review still stands as a non-blocking note:
- P1 (non-blocking, unchanged): `_expire_pending_tool_approval` mutates the live session (`session_manager.py:1828-1829`) before `persist_session_snapshot` (`:1833`); if the persist raises, the pending approval is already destroyed. This is degraded UX on a rare error, not data loss (the next sweep reaps it as `evict_idle`). Optional to defer the mutation until after the snapshot write.
Nothing in this merge changes that. No P0.
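
The optional follow-up (defer the mutation until after the snapshot write) could look roughly like the sketch below. It is an illustration only: the `Session` dataclass, the `persist` callable, the choice to persist a copy carrying the post-expiry state, and the decision to return `False` rather than re-raise are all assumptions, not the code in `session_manager.py`.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Session:
    """Hypothetical minimal session; the real object has many more fields."""
    session_id: str
    pending_approval: Optional[dict] = None
    runtime_state: str = "waiting_approval"


def expire_pending_tool_approval(
    session: Session, persist: Callable[[Session], None]
) -> bool:
    """Expire a pending approval, touching the live session only on success."""
    # Build the post-expiry state on a copy so a failed write leaves the
    # live session, and its still-answerable prompt, exactly as it was.
    snapshot = copy.deepcopy(session)
    snapshot.pending_approval = None
    snapshot.runtime_state = "idle"

    try:
        persist(snapshot)
    except Exception:
        # Assumed policy: report failure and let a later sweep retry.
        return False

    # The snapshot is durable; now apply the same change to the live session.
    session.pending_approval = None
    session.runtime_state = "idle"
    return True
```

With this ordering a persist failure costs one sweep interval instead of silently dropping the prompt.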
- Diff merge commit vs prior reviewed commit
- Confirm no reaper/backend code changed
- Verify workflow action bumps' provenance
Branch: claude/gracious-banach-1ae430
Why
Production incident on the deployed Space: users hit "maximum of 10 live sessions" hours after their last activity. Diagnosis from the Space logs and code:
- The reaper permanently skipped any session with `pending_approval` set. Since #310 (Add session usage threshold approvals) and #313 (Enforce YOLO cap across session usage), usage-threshold and YOLO-cap prompts are auto-created at turn end with no user action, so an unattended user accumulates unreapable sessions until they fill all 10 slots. Unanswered tool-permission prompts pin slots the same way.
- A session stuck with `is_processing` set pins its slot forever; there was no hung-turn detection.
- `MongoSessionStore.init()` gives up permanently after one failed ping at boot; the reaper no-ops while the store is disabled, so a single Mongo blip at startup silently killed reaping until the next restart.

What changed
- Per-session verdict classifier (`_reaper_verdict`) replaces the single idle predicate; `_reap_one` re-checks the same verdict under the lock, so a session whose state changes mid-reap (e.g. prompt answered) aborts cleanly:

| Session state | Evicted after | Clock |
|---|---|---|
| Idle, no pending approval | 15 min (`REAPER_IDLE_MINUTES`, unchanged) | `last_active_at` |
| Pending acknowledgement prompt (usage/YOLO) | 15 min (normal idle window) | `last_active_at` |
| Pending tool approval | 60 min (`REAPER_TOOL_APPROVAL_IDLE_MINUTES`) | `last_active_at` |
| Processing, no event emitted | 15 min (`REAPER_STALLED_MINUTES`) | `last_event_at` |

- `last_event_at` is stamped as events flow through the `EventBroadcaster` (every chunk, tool update, heartbeat), so legitimate long-running jobs are spared while hung turns go silent and get evicted. Stalled evictions set the cooperative interrupt flag before `task.cancel()` and ignore queued submissions (a hung turn never drains its queue; queue depth is logged).
- Evicted snapshots persist `status="active"` with `runtime_state="waiting_approval"`/`"idle"` (never `"processing"`), and pending approvals round-trip through the existing serialize/restore path so prompts remain answerable after restore.
- While the store is disabled, each sweep calls `store.maybe_reconnect()`, re-attempting `init()` every 5-min sweep instead of staying dead until restart.
- Logs a per-sweep counter line (`Reaper sweep: evicted_idle=… evicted_pending_ack=… evicted_stalled=… skipped_processing=… aborted=…`, only when nonzero), a warning per stalled eviction, and the per-user 503 now appends `Currently held: N awaiting your approval, M still processing, K idle`.

Accepted tradeoffs
Testing
`tests/unit/test_session_reaper.py`: 32 tests (13 new/reworked) covering ack/tool-approval eviction windows, stalled-turn eviction and healthy-long-turn sparing, `last_event_at` stamping, verdict-mismatch aborts, Mongo retry recovery/no-op, capacity-message breakdown, and sweep-log gating. `ruff check` and `ruff format --check` are clean.

🤖 Generated with Claude Code
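
As an illustration of the "Mongo retry recovery/no-op" case listed under Testing, the sketch below models the retry-per-sweep behaviour with a fake store. `RetryingStore`, its `ping` callable, the `enabled` flag, and the `sweep` helper are hypothetical stand-ins; only the `init()` and `maybe_reconnect()` method names come from the PR description, and their real signatures may differ.

```python
from typing import Callable


class RetryingStore:
    """Hypothetical stand-in for a session store that can be disabled."""

    def __init__(self, ping: Callable[[], None]) -> None:
        self._ping = ping  # raises on connection failure
        self.enabled = False

    def init(self) -> bool:
        # A failed ping disables the store instead of raising.
        try:
            self._ping()
        except Exception:
            self.enabled = False
            return False
        self.enabled = True
        return True

    def maybe_reconnect(self) -> bool:
        # No-op while healthy; otherwise try init() again.
        if self.enabled:
            return True
        return self.init()


def sweep(store: RetryingStore) -> str:
    """One reaper sweep: retry the store first, skip reaping if still down."""
    if not store.maybe_reconnect():
        return "skipped"
    return "reaped"
```

A test in this style would drive `init()` with a ping that fails once, then assert that the next sweep recovers and that later sweeps do not ping again.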