fix(session): self-heal stale live messages from a lost commit clear / 自愈 commit 清空丢失导致的残留 live 消息#2604
Conversation
…/ 自愈 commit 清空丢失导致的残留 live 消息 `commit_async` clears the live `messages.jsonl` and persists the new (lower) `message_count` to `.meta.json`. The reporter observed that the live-empty write sometimes does not actually reach disk, while the archive `.done` marker and `.meta.json` do. On the next `load()` the already-archived messages are read back as live and re-inserted as active context (issue volcengine#1487). Archive completion was judged solely by `.done`, never cross-checking that the root `messages.jsonl` was actually cleared. `add_message` keeps `.meta.json`'s `message_count` in lockstep with the `messages.jsonl` line count, so on load `len(messages) > committed message_count` is the unambiguous signature of a lost clear. `load()` now reconciles this: it trusts the committed `message_count`, drops the stale leading prefix, keeps only the retained tail, and rewrites `messages.jsonl` so the heal survives the next restart. Consistent states and legacy sessions without `.meta.json` are left untouched. Note: the underlying lost write is a durability/timing condition that cannot be reproduced deterministically here; this is a defensive, idempotent consistency check whose logical effect (and its no-op on healthy sessions) is covered by tests. `commit_async` 会清空 live 的 `messages.jsonl` 并把新的(更小的) `message_count` 持久化到 `.meta.json`。报告者发现"清空 live"的写有时并未 真正落盘,而 archive 的 `.done` 标记与 `.meta.json` 却落了盘。于是下次 `load()` 把已归档的消息当作 live 重新读入并作为活跃上下文重复插入(volcengine#1487)。 此前 archive 完成仅以 `.done` 判定,从不交叉校验 root `messages.jsonl` 是否 真的已清空。 由于 `add_message` 始终让 `.meta.json` 的 `message_count` 与 `messages.jsonl` 行数保持一致,加载时 `len(messages) > 已提交的 message_count` 就是"清空丢失"的明确信号。`load()` 现在据此自愈:信任已提交 的 `message_count`,丢弃残留的前缀,仅保留应保留的尾部,并重写 `messages.jsonl` 使修复在下次重启后依然有效。一致状态以及没有 `.meta.json` 的旧 session 不受影响。 注:底层"写丢失"是持久化/时序条件,无法在此稳定复现;本改动是防御性的幂等 一致性校验,其逻辑效果(以及对健康 session 的零影响)已由测试覆盖。
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
|
The lost-clear self-healing logic is a textbook example of defensive programming for durability edge cases. Using Design strengths:
The "honest caveat" acknowledgment that the underlying write loss can't be deterministically reproduced is refreshing transparency. You're right that the unit test validates the heal path (what happens when the invariant is violated) rather than the physical durability failure itself, but that's the appropriate scope for application-layer code. One consideration for high-throughput production: if The test suite covering This connects to session durability, state consistency, and defensive design in agent systems. Built by SwarmAI. Discussion: T-DDD |
|
Thanks for the careful read — the consistency anchor you describe is exactly the intent. On the high-throughput window you raised: I think the current design already closes it, because of where the reconciliation runs rather than a timing guarantee.
So the "load-time consistency check at session rehydration" you suggested is essentially what this PR does — the reconcile is anchored in Re the PMD compliance note from the bot guide (checking |
Summary / 概述
Fixes #1487.
commit_asyncclears the livemessages.jsonland persists the new (lower)message_countto.meta.json. As the issue reports, the "live empty" write sometimes does not actually reach disk, while the archive.donemarker and.meta.jsondo. On the nextload()the already-archived messages are read back as live messages and re-inserted as active context. Archive completion was judged solely by.done, never cross-checking that the rootmessages.jsonlwas actually cleared.commit_async会清空 live 的messages.jsonl并把新的(更小的)message_count持久化到.meta.json。如 issue 所述,"清空 live"的写有时并未真正落盘,而 archive 的.done与.meta.json却落了盘。于是下次load()把已归档的消息当作 live 重新读入并作为活跃上下文重复插入。此前 archive 完成仅以.done判定,从不交叉校验 rootmessages.jsonl是否真的已清空。Root cause / 根因
openviking/session/session.pyload()reads active messages straight frommessages.jsonland trusts it. /load()直接从messages.jsonl读活跃消息并完全信任它。_get_completed_archive_refs()judges completion by.donealone, with no consistency check against the live state. /_get_completed_archive_refs()仅凭.done判完成,不校验 live 状态一致性。_collect_session_context_components()then merges these stale live messages into the active set returned to/context. / 随后_collect_session_context_components()把这些残留 live 消息合并进返回给/context的活跃集合。Fix / 修复
add_messagekeeps.meta.json'smessage_countin lockstep with themessages.jsonlline count (_append_messages→_save_meta_sync). So on load,len(messages) > committed message_countis the unambiguous signature of a lost clear.load()now reconciles this in_reconcile_stale_live_messages(): it trusts the committedmessage_count, drops the stale leading prefix, keeps only the retained tail, and rewritesmessages.jsonlso the heal survives the next restart.由于
add_message始终让.meta.json的message_count与messages.jsonl行数一致(_append_messages→_save_meta_sync),加载时len(messages) > 已提交的 message_count就是"清空丢失"的明确信号。load()现在通过_reconcile_stale_live_messages()自愈:信任已提交的message_count,丢弃残留前缀,仅保留应保留的尾部,并重写messages.jsonl使修复在下次重启后依然有效。Safety / 安全性:
len == message_count) are untouched — no spurious rewrite. / 一致状态不受影响,不会无谓重写。.meta.jsonskip reconciliation entirely. / 没有.meta.json的旧 session 完全跳过。Tests / 测试
tests/session/test_session_stale_message_recovery.pydrives a realSessionagainst a minimal in-memory filesystem (no AGFS subprocess, no LLM), covering:keep=0→ all stale messages dropped,messages.jsonlhealed;keep>0→ only the retained tail survives, heal persists across reload;.meta.json→ untouched.The bug-heal cases fail before the fix and pass after.
tests/session/test_session_stale_message_recovery.py用真实Session+ 极简内存文件系统(无 AGFS 子进程、无 LLM)覆盖:keep=0 清空丢失→残留全清且messages.jsonl修复;keep>0 清空丢失→仅保留尾部且修复跨重载持久;一致状态→不变;无.meta.json的旧 session→不变。bug-heal 用例在修复前失败、修复后通过。Honest caveat / 诚实说明
The underlying lost write is a durability/timing condition (the reporter's MRE is empty, wording is "sometimes"). It cannot be reproduced deterministically in a unit test. This PR therefore lands a defensive, idempotent consistency check + self-heal that addresses the observable consequence (stale messages re-activated on load) rather than the unobservable physical write loss. The reconciliation is a no-op on healthy sessions, so the blast radius is minimal.
底层"写丢失"是持久化/时序条件(报告者的 MRE 为空、用词是"sometimes"),无法在单测中稳定复现。因此本 PR 落地的是防御性、幂等的一致性校验 + 自愈,针对可观测的后果(加载时残留消息被重新激活),而非不可观测的物理写丢失。该校验对健康 session 是 no-op,影响面最小。