Skip to content

fix(session): self-heal stale live messages from a lost commit clear / 自愈 commit 清空丢失导致的残留 live 消息#2604

Open
Ryannnice wants to merge 1 commit into
volcengine:mainfrom
Ryannnice:fix/1487-stale-live-message-recovery
Open

fix(session): self-heal stale live messages from a lost commit clear / 自愈 commit 清空丢失导致的残留 live 消息#2604
Ryannnice wants to merge 1 commit into
volcengine:mainfrom
Ryannnice:fix/1487-stale-live-message-recovery

Conversation

@Ryannnice

Copy link
Copy Markdown

Summary / 概述

Fixes #1487.

commit_async clears the live messages.jsonl and persists the new (lower) message_count to .meta.json. As the issue reports, the "live empty" write sometimes does not actually reach disk, while the archive .done marker and .meta.json do. On the next load() the already-archived messages are read back as live messages and re-inserted as active context. Archive completion was judged solely by .done, never cross-checking that the root messages.jsonl was actually cleared.

commit_async 会清空 live 的 messages.jsonl 并把新的(更小的)message_count 持久化到 .meta.json。如 issue 所述,"清空 live"的写有时并未真正落盘,而 archive 的 .done.meta.json 却落了盘。于是下次 load() 把已归档的消息当作 live 重新读入并作为活跃上下文重复插入。此前 archive 完成仅以 .done 判定,从不交叉校验 root messages.jsonl 是否真的已清空。

Root cause / 根因

  • openviking/session/session.py load() reads active messages straight from messages.jsonl and trusts it. / load() 直接从 messages.jsonl 读活跃消息并完全信任它。
  • _get_completed_archive_refs() judges completion by .done alone, with no consistency check against the live state. / _get_completed_archive_refs() 仅凭 .done 判完成,不校验 live 状态一致性。
  • _collect_session_context_components() then merges these stale live messages into the active set returned to /context. / 随后 _collect_session_context_components() 把这些残留 live 消息合并进返回给 /context 的活跃集合。

Fix / 修复

add_message keeps .meta.json's message_count in lockstep with the messages.jsonl line count (_append_messages_save_meta_sync). So on load, len(messages) > committed message_count is the unambiguous signature of a lost clear. load() now reconciles this in _reconcile_stale_live_messages(): it trusts the committed message_count, drops the stale leading prefix, keeps only the retained tail, and rewrites messages.jsonl so the heal survives the next restart.

由于 add_message 始终让 .meta.jsonmessage_countmessages.jsonl 行数一致(_append_messages_save_meta_sync),加载时 len(messages) > 已提交的 message_count 就是"清空丢失"的明确信号load() 现在通过 _reconcile_stale_live_messages() 自愈:信任已提交的 message_count,丢弃残留前缀,仅保留应保留的尾部,并重写 messages.jsonl 使修复在下次重启后依然有效。

Safety / 安全性:

  • Consistent states (len == message_count) are untouched — no spurious rewrite. / 一致状态不受影响,不会无谓重写。
  • Legacy sessions without .meta.json skip reconciliation entirely. / 没有 .meta.json 的旧 session 完全跳过。
  • Only ever trims a stale prefix when the live file has more messages than committed; never deletes legitimately-added post-commit messages. / 仅在 live 文件比已提交时裁剪残留前缀,绝不删除 commit 后合法新增的消息。

Tests / 测试

tests/session/test_session_stale_message_recovery.py drives a real Session against a minimal in-memory filesystem (no AGFS subprocess, no LLM), covering:

  • lost clear with keep=0 → all stale messages dropped, messages.jsonl healed;
  • lost clear with keep>0 → only the retained tail survives, heal persists across reload;
  • consistent state → untouched;
  • legacy session without .meta.json → untouched.

The bug-heal cases fail before the fix and pass after.

tests/session/test_session_stale_message_recovery.py 用真实 Session + 极简内存文件系统(无 AGFS 子进程、无 LLM)覆盖:keep=0 清空丢失→残留全清且 messages.jsonl 修复;keep>0 清空丢失→仅保留尾部且修复跨重载持久;一致状态→不变;无 .meta.json 的旧 session→不变。bug-heal 用例在修复前失败、修复后通过。

Honest caveat / 诚实说明

The underlying lost write is a durability/timing condition (the reporter's MRE is empty, wording is "sometimes"). It cannot be reproduced deterministically in a unit test. This PR therefore lands a defensive, idempotent consistency check + self-heal that addresses the observable consequence (stale messages re-activated on load) rather than the unobservable physical write loss. The reconciliation is a no-op on healthy sessions, so the blast radius is minimal.

底层"写丢失"是持久化/时序条件(报告者的 MRE 为空、用词是"sometimes"),无法在单测中稳定复现。因此本 PR 落地的是防御性、幂等的一致性校验 + 自愈,针对可观测的后果(加载时残留消息被重新激活),而非不可观测的物理写丢失。该校验对健康 session 是 no-op,影响面最小。

…/ 自愈 commit 清空丢失导致的残留 live 消息

`commit_async` clears the live `messages.jsonl` and persists the new (lower)
`message_count` to `.meta.json`. The reporter observed that the live-empty
write sometimes does not actually reach disk, while the archive `.done` marker
and `.meta.json` do. On the next `load()` the already-archived messages are
read back as live and re-inserted as active context (issue volcengine#1487). Archive
completion was judged solely by `.done`, never cross-checking that the root
`messages.jsonl` was actually cleared.

`add_message` keeps `.meta.json`'s `message_count` in lockstep with the
`messages.jsonl` line count, so on load `len(messages) > committed
message_count` is the unambiguous signature of a lost clear. `load()` now
reconciles this: it trusts the committed `message_count`, drops the stale
leading prefix, keeps only the retained tail, and rewrites `messages.jsonl`
so the heal survives the next restart. Consistent states and legacy sessions
without `.meta.json` are left untouched.

Note: the underlying lost write is a durability/timing condition that cannot be
reproduced deterministically here; this is a defensive, idempotent consistency
check whose logical effect (and its no-op on healthy sessions) is covered by
tests.

`commit_async` 会清空 live 的 `messages.jsonl` 并把新的(更小的)
`message_count` 持久化到 `.meta.json`。报告者发现"清空 live"的写有时并未
真正落盘,而 archive 的 `.done` 标记与 `.meta.json` 却落了盘。于是下次
`load()` 把已归档的消息当作 live 重新读入并作为活跃上下文重复插入(volcengine#1487)。
此前 archive 完成仅以 `.done` 判定,从不交叉校验 root `messages.jsonl` 是否
真的已清空。

由于 `add_message` 始终让 `.meta.json` 的 `message_count` 与
`messages.jsonl` 行数保持一致,加载时 `len(messages) > 已提交的
message_count` 就是"清空丢失"的明确信号。`load()` 现在据此自愈:信任已提交
的 `message_count`,丢弃残留的前缀,仅保留应保留的尾部,并重写
`messages.jsonl` 使修复在下次重启后依然有效。一致状态以及没有 `.meta.json`
的旧 session 不受影响。

注:底层"写丢失"是持久化/时序条件,无法在此稳定复现;本改动是防御性的幂等
一致性校验,其逻辑效果(以及对健康 session 的零影响)已由测试覆盖。
@github-actions

Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

1487 - Partially compliant

Compliant requirements:

  • Check that .meta.json message_count matches root messages.jsonl row count
  • Prevent old messages from being inserted as active messages

Non-compliant requirements:

  • Check that archive_xxx/messages.jsonl exists
  • Check that archive_xxx/.done exists
  • Check that archive_xxx/.abstract.md exists
  • Check that archive_xxx/.overview.md exists
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 92
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

Copy link
Copy Markdown

PR Code Suggestions ✨

No code suggestions found for the PR.

@chenjw chenjw requested a review from qin-ctx June 15, 2026 03:52
@xg-gh-25

Copy link
Copy Markdown

The lost-clear self-healing logic is a textbook example of defensive programming for durability edge cases. Using message_count from .meta.json as the source of truth — since it's kept in lockstep with actual appends via _save_meta_sync — is the right consistency anchor.

Design strengths:

  1. Unambiguous detectionlen(messages) > committed message_count is a clean signal that doesn't require heuristics or timestamp reasoning
  2. Idempotent reconciliation_reconcile_stale_live_messages() is a no-op on healthy sessions, so the check costs nothing in the common case
  3. Minimal blast radius — only trims the stale prefix when the invariant is violated; never touches legitimately-added post-commit messages

The "honest caveat" acknowledgment that the underlying write loss can't be deterministically reproduced is refreshing transparency. You're right that the unit test validates the heal path (what happens when the invariant is violated) rather than the physical durability failure itself, but that's the appropriate scope for application-layer code.

One consideration for high-throughput production: if commit_async loses the clear but the .done marker is written, you might have a brief window where _get_completed_archive_refs() sees the archive as complete while messages.jsonl still contains stale data. The heal triggers on the next load(), but if another message is added before that load, the stale prefix gets mixed with new messages and the message_count check won't fire cleanly. Consider adding a load-time consistency check at startup or session rehydration to catch this earlier.

The test suite covering keep=0, keep>0, consistent state, and legacy sessions is comprehensive coverage for the heal logic.


This connects to session durability, state consistency, and defensive design in agent systems. Built by SwarmAI. Discussion: T-DDD

@Ryannnice

Copy link
Copy Markdown
Author

Thanks for the careful read — the consistency anchor you describe is exactly the intent.

On the high-throughput window you raised: I think the current design already closes it, because of where the reconciliation runs rather than a timing guarantee.

  • Same process, no restart: commit_async clears the in-memory self._messages itself (it sets the live list to the retained tail / []). A lost disk clear doesn't affect the in-memory list, and add_message_append_messages derives message_count from len(self._messages), i.e. the already-cleared in-memory state. So a post-commit add_message in the same process appends onto clean state — no stale prefix is ever in memory to mix with.
  • After a restart: the stale prefix only materializes when load() reads it back from disk. _reconcile_stale_live_messages() runs inside load(), before self._loaded is set to True, so the heal completes before the session is usable and before any add_message can interleave. add_message doesn't lazy-trigger load() mid-append, so there's no path where a fresh message lands between "read stale jsonl" and "reconcile".

So the "load-time consistency check at session rehydration" you suggested is essentially what this PR does — the reconcile is anchored in load() precisely so the next-message-before-heal interleaving can't occur.

Re the PMD compliance note from the bot guide (checking messages.jsonl / .done / .abstract.md / .overview.md existence): I deliberately scoped this to the one invariant that actually drives the reported symptom — len(messages) > committed message_count re-inserting archived messages as active. The archive-completeness checks are a separate concern (archive integrity vs. live-state correctness) and would widen the blast radius; happy to split that into a follow-up if maintainers want it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Bug]: not confirm empty message.jsonl state before commit return

2 participants