Skip to content

Evict pinned sessions and retry Mongo init in the idle reaper#320

Open
lewtun wants to merge 3 commits into
mainfrom
claude/gracious-banach-1ae430
Open

Evict pinned sessions and retry Mongo init in the idle reaper#320
lewtun wants to merge 3 commits into
mainfrom
claude/gracious-banach-1ae430

Conversation

@lewtun

@lewtun lewtun commented Jun 12, 2026

Copy link
Copy Markdown
Member

Why

Production incident on the deployed Space: users hit "maximum of 10 live sessions" hours after their last activity. Diagnosis from the Space logs and code:

  1. The idle reaper permanently skips sessions with pending_approval set. Since Add session usage threshold approvals #310/Enforce YOLO cap across session usage #313, usage-threshold and YOLO-cap prompts are auto-created at turn end with no user action, so an unattended user accumulates unreapable sessions until they fill all 10 slots. Unanswered tool-permission prompts pin slots the same way.
  2. A turn stuck is_processing pins its slot forever — there was no hung-turn detection.
  3. MongoSessionStore.init() gives up permanently after one failed ping at boot; the reaper no-ops while the store is disabled, so a single Mongo blip at startup silently killed reaping until the next restart.
  4. The reaper logged nothing about skipped sessions, so none of this was diagnosable from logs.

What changed

Per-session verdict classifier (_reaper_verdict) replaces the single idle predicate; _reap_one re-checks the same verdict under the lock, so a session whose state changes mid-reap (e.g. prompt answered) aborts cleanly:

State Eviction window Clock
Plain idle 15 min (REAPER_IDLE_MINUTES, unchanged) last_active_at
Usage-threshold / YOLO ack prompt 15 min last_active_at
Real tool-permission prompt 60 min (REAPER_TOOL_APPROVAL_IDLE_MINUTES) last_active_at
Processing, event-silent 15 min (REAPER_STALLED_MINUTES) new last_event_at
Processing, emitting events never (healthy long run)
  • last_event_at is stamped as events flow through the EventBroadcaster (every chunk, tool update, heartbeat), so legitimate long-running jobs are spared while hung turns go silent and get evicted. Stalled evictions set the cooperative interrupt flag before task.cancel() and ignore queued submissions (a hung turn never drains its queue; queue depth is logged).
  • Evicted sessions stay fully resumable: snapshots persist status="active" with runtime_state="waiting_approval"/"idle" (never "processing"), and pending approvals round-trip through the existing serialize/restore path so prompts remain answerable after restore.
  • Mongo retry: the sweep calls store.maybe_reconnect() while the store is disabled, re-attempting init() every 5-min sweep instead of staying dead until restart.
  • Observability: one per-sweep counter line (Reaper sweep: evicted_idle=… evicted_pending_ack=… evicted_stalled=… skipped_processing=… aborted=…, only when nonzero), a warning per stalled eviction, and the per-user 503 now appends Currently held: N awaiting your approval, M still processing, K idle.

Accepted tradeoffs

  • A single silent tool call blocking >15 min with no events is treated as hung and evicted (session stays resumable).
  • Evicting a session parked on a real tool approval destroys its sandbox; on restore the sandbox is recreated empty (pre-existing behavior for restored pending-approval sessions, now more common at the 60-min mark).

Testing

  • tests/unit/test_session_reaper.py: 32 tests (13 new/reworked) covering ack/tool-approval eviction windows, stalled-turn eviction + healthy-long-turn sparing, last_event_at stamping, verdict-mismatch aborts, Mongo retry recovery/no-op, capacity-message breakdown, and sweep-log gating.
  • Full unit suite: 510 passed. ruff check + ruff format --check clean.

🤖 Generated with Claude Code

Production incident: users hit the per-user 10-session cap hours after
their last activity. The reaper permanently skipped sessions with a
pending approval (including usage-threshold and YOLO-cap prompts that
are auto-created at turn end) and sessions stuck processing, and a
single failed Mongo ping at boot disabled reaping until restart.

- Replace the reaper's idle predicate with a per-session verdict
  classifier: acknowledgement prompts (usage/YOLO) evict at the normal
  15-min idle window, real tool approvals after 60 min
  (REAPER_TOOL_APPROVAL_IDLE_MINUTES), and processing turns that have
  emitted no event for 15 min (REAPER_STALLED_MINUTES) are evicted as
  stalled — tracked via a new last_event_at stamped on the event
  broadcast path, so healthy long turns are spared.
- Stalled evictions set the cooperative interrupt flag before
  cancelling the task and ignore queued submissions (a hung turn never
  drains its queue).
- _reap_one re-checks the same verdict under the lock, so a session
  whose prompt is answered mid-reap aborts cleanly; evicted snapshots
  persist runtime_state waiting_approval/idle, never processing.
- Retry MongoSessionStore.init() at each sweep while disabled
  (maybe_reconnect), so one boot blip no longer kills reaping.
- Log a per-sweep counter line (evictions by reason, skips, aborts)
  and break down what holds the user's slots in the 503 capacity
  error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @lewtun's task in 4m 33s —— View job


PR Review: Evict pinned sessions and retry Mongo init in the idle reaper

No blocking issues — 1 P1, 2 P2. The verdict-classifier design is sound: the fixed-now re-check under the lock in _reap_one (session_manager.py:1735, :1747) correctly aborts any session whose state changes mid-reap, the stalled path sets the cooperative flag before task.cancel() (:1764), the "ended" overwrite is still identity-gated (:1887), and the Mongo retry path no-ops cleanly on a NoopSessionStore. Test coverage is thorough (stalled eviction, healthy-long-turn sparing, queued-message-ignore, verdict-mismatch abort, Mongo recover/no-op, capacity breakdown, log gating). I could not run the suite here (Bash sandbox denied pytest), so I'm trusting the stated 510 passed.


P1 — "heartbeats keep a turn alive" is inaccurate; the stalled window only watches streamed events

session_manager.py:135-137 and :222-224 say healthy long turns are spared because they "stream chunks/tool updates/heartbeats." But last_event_at is only stamped when the broadcaster drains an event off event_queue (session_manager.py:101-103session.py:207). The heartbeat (HeartbeatSaver.maybe_fire, telemetry.py:406) is invoked from inside send_event (session.py:213) and only does a save+upload — it never independently emits an event, so it cannot stamp last_event_at. The real liveness signal is streamed output alone.

Consequence: a turn parked in a single tool call that produces no output for REAPER_STALLED_MINUTES (default 15 min) — a long compile, a training/eval run that buffers stdout, a sleep — is classified evict_stalled (:519) and cancelled mid-run. The PR lists this as an accepted tradeoff and the session "stays resumable," but combined with the other accepted tradeoff (restored sandbox recreated empty) the in-progress work in that sandbox is lost. For an agent that plausibly runs long ML jobs, 15 min is aggressive.

Not blocking — it's a documented tradeoff — but worth either (a) fixing the comments to say "streamed events," not "heartbeats," and (b) confirming long bash/sandbox tool calls actually emit periodic events, or raising the default window. Fix this →

P2 — sweep log fires every interval on any busy server

The log gate if any(counters.values()) (session_manager.py:1700) includes the skipped_* counters. So any sweep where a session is merely is_processing or parked within its approval window emits the Reaper sweep: … INFO line — i.e. roughly every REAPER_INTERVAL_S (5 min) on any server with live traffic, even when nothing was evicted. That somewhat undercuts the "only when nonzero" intent. Consider gating on the evicted_* / aborted counters only, and keeping the skip counts in the line when it does fire.

P2 — capacity message labels auto-created ack prompts as "awaiting your approval"

_user_slot_breakdown (session_manager.py:281-300) counts every pending_approval as awaiting, including the usage-threshold / YOLO ack prompts that are auto-created at turn end with no user action. The 503 then reads "N awaiting your approval" for prompts the user never consciously opened. Minor UX wording — these slots are held by an unanswered prompt — but it may read as confusing. Optional: split ack-prompts out of the "awaiting your approval" bucket.


Verified the verdict ordering (is_processing before pending_approval, :518) is intentional and the turn-start _stamp_event at :1835 runs with no intervening await after is_processing=True is set (:1830), so a resumed approval can't be momentarily misread as stalled. evicted_<reason> counter-key construction at :1693 matches the dict keys. No correctness, auth, injection, or routing concerns found.

  • Read full diff against origin/main
  • Review backend/session_manager.py reaper logic
  • Review agent/core/session_persistence.py changes
  • Review test coverage
  • Post review

Co-authored-by: OpenAI Codex <codex@openai.com>
@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown

Claude finished @lewtun's task in 4m 7s —— View job


Re-review: ef7536b Stabilize session reaper semantics

No blocking issues (no new P0). This commit reworks the reaper substantially and resolves all three findings from the prior review.

Prior findings — resolved

  • P1 (heartbeats / stalled-turn eviction) — Fixed by removal. The entire last_event_at / REAPER_STALLED / evict_stalled mechanism is gone (EventBroadcaster no longer takes on_event; _stamp_event deleted). A processing session now unconditionally returns skip_processing (session_manager.py:526-527) — "in-flight work keeps the slot until it finishes, fails, or is interrupted explicitly." This eliminates the inaccurate-comment concern and the aggressive 15-min cancellation of long silent tool calls.
  • P2 (sweep log fires every interval) — Fixed. The gate now keys only on evicted_idle | evicted_pending_ack | evicted_pending_tool | aborted (session_manager.py:1773-1781); pure skip_* sweeps stay silent.
  • P2 (ack prompts mislabeled "awaiting your approval") — Fixed. _user_slot_breakdown now splits tool_approvals from ack_prompts (session_manager.py:268-300) and the 503 reads N awaiting tool approval, M usage/cost prompts, ….

New code — verified correct

The new evict_pending_tool verdict (60-min window, session_manager.py:536-537) and _expire_pending_tool_approval (:564-624) are sound: the assistant message carrying the tool calls is already in context (agent_loop.py:1551) while the approval-required tools' results are not, so synthesizing one role="tool" result per pending call before clearing pending_approval keeps the LLM history valid and restores the session as idle, not actionable. _tool_call_identity handles both ToolCall objects and dicts; _reaper_verdict_still_matches correctly expects the post-expiry verdict to be evict_idle (:633-634). Test coverage is good (test_reaper_evicts_tool_approval_after_long_window exercises expiry + serialize/restore round-trip; verdict-mismatch abort, ack windows, Mongo retry, capacity breakdown, and log gating all covered). I could not run the suite here (sandbox), so I'm trusting the stated 510 passed.

One thing worth a follow-up (P1, non-blocking)

_expire_pending_tool_approval mutates the live session — clears pending_approval and appends the synthetic "expired" tool messages — at session_manager.py:1828-1829, before persist_session_snapshot at :1833. If that persist raises (a Mongo blip — exactly the failure class this PR otherwise hardens against), the except block resets is_reaping=False and leaves the session live (:1839-1848), but the pending approval has already been destroyed and synthetic results injected. The session stays coherent and resumable (next sweep reaps it as evict_idle), so this is a degraded-UX-on-rare-error-path issue, not data loss — but the user's real in-flight tool approval is silently lost without the reap having committed. Consider deferring the mutation until after the snapshot write succeeds (or snapshotting from an expired copy). Fix this →


  • Fetch diff vs origin/main
  • Review ef7536b against prior review findings
  • Check for new P0-class issues
  • Post review

@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown

Claude finished @lewtun's task in 41s —— View job


Re-review: 168d21c Merge branch 'main'

LGTM — no new findings.

The only change since my last review (ef7536b) is the merge of main (168d21c). Its entire diff is Dependabot GitHub Actions version bumps (#319) — no reaper or backend code changed.

git diff ef7536b..HEAD --stat
 .github/workflows/ci.yml            | 12 +++---
 .github/workflows/claude-review.yml |  4 +-
 .github/workflows/claude.yml        |  4 +-

The bumps are all trusted first-party / well-known publishers, pinned to plausible existing releases:

Action Before After
actions/checkout v4 v6
astral-sh/setup-uv v5 v7
actions/setup-python v5 v6
anthropics/claude-code-action v1 v1.0.137

No provenance or routing concerns. Per the re-review convergence rule I'm suppressing prior P1s; the one open follow-up from my last review still stands as a non-blocking note:

  • P1 (non-blocking, unchanged)_expire_pending_tool_approval mutates the live session (session_manager.py:1828-1829) before persist_session_snapshot (:1833); if the persist raises, the pending approval is already destroyed. Degraded-UX-on-rare-error, not data loss (next sweep reaps it as evict_idle). Optional to defer the mutation until after the snapshot write.

Nothing in this merge changes that. No P0.

  • Diff merge commit vs prior reviewed commit
  • Confirm no reaper/backend code changed
  • Verify workflow action bumps' provenance
    · branch claude/gracious-banach-1ae430

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant