Problem
A single transient failure (VLM timeout, SSL error, etc.) during Phase 2 (memory extraction) writes a .failed.json marker that permanently blocks all future commits on that session. There is no auto-recovery, no retry, and no admin API to clear it.
Reproduction
- Session commits successfully through Phase 1 (archive)
- Phase 2 memory extraction fails (e.g. VLM 180s timeout, SSL cert error)
- .failed.json is written to the archive directory
- All subsequent POST /sessions/{id}/commit calls return FAILED_PRECONDITION
- The session is permanently locked; only manual deletion of .failed.json fixes it
Impact
Any transient infrastructure failure (network, LLM timeout, SSL) blocks every further commit on the session until someone manually deletes the marker.
Proposed Solution
Add a force: bool = False parameter to the commit flow that skips the blocking-failed-archive check. When force=True, commit proceeds despite unresolved .failed.json markers.
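The guard described above can be sketched in isolation. This is a minimal model, not the project's code: the marker name .failed.json comes from this issue, while commit, blocking_marker, FailedPreconditionError, and the single-directory archive layout are hypothetical stand-ins chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Marker name taken from the issue; the directory layout is an assumption.
FAILED_MARKER = ".failed.json"


class FailedPreconditionError(RuntimeError):
    """Hypothetical stand-in for the FAILED_PRECONDITION response."""


def blocking_marker(archive_dir: Path) -> Optional[Path]:
    """Return the failure marker if one exists in the archive directory."""
    marker = archive_dir / FAILED_MARKER
    return marker if marker.exists() else None


def commit(archive_dir: Path, force: bool = False) -> str:
    """Hypothetical commit entry point; only the guard logic is modelled."""
    if not force:
        marker = blocking_marker(archive_dir)
        if marker is not None:
            raise FailedPreconditionError(
                f"unresolved failure marker: {marker.name}"
            )
    # With force=True the marker is ignored and the commit proceeds.
    return "committed"


def demo() -> Tuple[bool, str]:
    """Write a marker, then commit without and with force."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp)
        (archive / FAILED_MARKER).write_text(json.dumps({"error": "VLM timeout"}))
        try:
            commit(archive)
            blocked = False
        except FailedPreconditionError:
            blocked = True
        return blocked, commit(archive, force=True)
```

In this sketch the marker is left in place after a forced commit; whether a successful forced commit should also clear the marker is a separate design question that this issue does not settle.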
Changes Required
- session.py: commit_async(force=False) skips _get_blocking_failed_archive_ref() when force=True
- session.py: _run_memory_extraction(force=False) skips _wait_for_previous_archive_done() when force=True (this is the Phase 2 check that was missed in the initial fix)
- session_service.py: commit_async(force=False) passes force through to the session
- sessions.py router: add a force: bool = False field to CommitRequest
The default force=False preserves existing behavior. Clients that understand the failure mode can opt in with force=True.
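To illustrate why the flag has to reach both checks, here is a rough sketch of the pass-through from request to session. The method names mirror the ones listed in this issue, but the bodies are assumptions: CommitRequest is modelled as a plain dataclass rather than the router's real request model, has_failed_archive replaces the actual marker lookup, and handle_commit and PreviousArchiveFailed are hypothetical.

```python
import asyncio
from dataclasses import dataclass


class PreviousArchiveFailed(RuntimeError):
    """Hypothetical error raised by the Phase 2 wait check."""


@dataclass
class CommitRequest:
    # Defaults to False so existing clients see no change in behavior.
    force: bool = False


@dataclass
class Session:
    # Stand-in for "a .failed.json marker exists for a previous archive".
    has_failed_archive: bool = False

    async def _wait_for_previous_archive_done(self) -> None:
        if self.has_failed_archive:
            raise PreviousArchiveFailed("previous archive has an unresolved failure")

    async def _run_memory_extraction(self, force: bool = False) -> None:
        # Phase 2 check: must also honor force, or a forced commit still fails here.
        if not force:
            await self._wait_for_previous_archive_done()
        # Memory extraction itself is omitted from this sketch.

    async def commit_async(self, force: bool = False) -> str:
        # Phase 1 check: the blocking-failed-archive lookup.
        if not force and self.has_failed_archive:
            return "FAILED_PRECONDITION"
        await self._run_memory_extraction(force=force)
        return "OK"


@dataclass
class SessionService:
    session: Session

    async def commit_async(self, force: bool = False) -> str:
        # Pure pass-through to the session.
        return await self.session.commit_async(force=force)


async def handle_commit(service: SessionService, request: CommitRequest) -> str:
    """Hypothetical router handler: forwards the request's force flag."""
    return await service.commit_async(force=request.force)
```

If commit_async honored force but _run_memory_extraction did not, the forced call in this model would pass the first check and then raise PreviousArchiveFailed, which corresponds to the gap noted for the initial fix.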
Additional Context
The impact is greatest when running as an OpenClaw context engine plugin, where the afterTurn hook auto-commits. A single VLM timeout then locks the session with no user-visible way to fix it.