
Disk KV eviction preferentially removes short-prefix cache files causing disk cache miss & full pre-fill when Disk KV Store is full #444

Description

@srinathh

Background

As a good balance of speed and granularity, I am running a Disk KV configuration with a 4,096-token interval (--kv-cache-continued-interval-tokens 4096, --kv-cache-boundary-align-tokens 4096) backing OpenClaw. DS4 writes a new disk checkpoint every 4,096 tokens during prefill:

0–4,096 tokens   → one .kv file  (~76 MiB)
0–8,192 tokens   → one .kv file  (~130 MiB)
0–12,288 tokens  → one .kv file  (~184 MiB)
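
The three sizes above are consistent with a fixed per-file overhead plus a constant per-token cost. A minimal sketch of that fit, assuming growth is exactly linear (an assumption; only these three data points are reported here, and `fit_linear` is a hypothetical helper, not part of DS4):

```python
# Checkpoint sizes reported above: (tokens, size in MiB).
checkpoints = [(4096, 76), (8192, 130), (12288, 184)]

def fit_linear(points):
    """Fit size = overhead + per_token * tokens using the first and last point."""
    (t0, s0), (t1, s1) = points[0], points[-1]
    per_token = (s1 - s0) / (t1 - t0)   # MiB per token
    overhead = s0 - per_token * t0      # MiB of fixed cost per file
    return overhead, per_token

overhead_mib, per_token_mib = fit_linear(checkpoints)
print(f"fixed overhead ~{overhead_mib:.0f} MiB")            # ~22 MiB
print(f"per-token cost ~{per_token_mib * 1024:.1f} KiB")    # ~13.5 KiB
```

Under this model, a 481 MiB file would hold roughly 34,800 tokens, which lines up with the ~35,000-token prompt described in the Issue section.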

Issue

I was running with a 256 GB Disk KV cache store limit, which I ran out of today. When the disk budget runs out, DS4 must evict files to make room. The eviction score formula is:

score = (effective_hits + 1.0) * tokens / file_size

When there are no hits, tokens / file_size is token density: tokens saved per byte of disk used. Because all KV files for the same model have roughly the same per-token size, density increases slightly with file size as the fixed header overhead is amortized over more tokens. With equal hit counts, small files therefore always score lower than large files and are evicted first.
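
To make the ranking concrete, here is a small sketch that applies the quoted formula to the three checkpoint sizes from the Background section. It assumes zero hits for every file and measures file_size in MiB for readability (the real implementation may use bytes; the ordering does not depend on the unit). The function name `eviction_score` is hypothetical.

```python
def eviction_score(effective_hits: float, tokens: int, file_size: float) -> float:
    """Score formula quoted in this issue; the lowest score is evicted first."""
    return (effective_hits + 1.0) * tokens / file_size

# (tokens, size in MiB) from the Background section; hits assumed to be 0.
files = [(4096, 76), (8192, 130), (12288, 184)]
scored = sorted((eviction_score(0.0, t, s), t, s) for t, s in files)

for score, tokens, size in scored:
    print(f"{tokens:>6} tokens, {size:>3} MiB -> score {score:.1f}")
# The 4,096-token file has the lowest score (53.9), then 63.0, then 66.8.
```

The shortest prefix, which is the one most likely to stay valid across turns, is the first eviction candidate.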

This creates a self-defeating cycle once the disk KV store hits its limit:

  1. New turn arrives. DS4 starts prefilling a ~35,000-token prompt.
  2. At each 4,096-token boundary it writes a small checkpoint file.
  3. Disk is full → the small checkpoint (76 MiB, 4,096 tokens) is immediately evicted to make room for the next larger one.
  4. By the end of prefill, only large files survive (e.g. the 481 MiB full-context snapshot).
  5. Next turn: the new prompt differs slightly from the saved one — the client strips metadata from old messages as they become history, so the rendered text diverges around token ~34,000 out of ~35,000.
  6. DS4 checks disk. The only surviving file is the 481 MiB snapshot, but its stored text extends past the divergence point, so the SHA1 doesn't match.
  7. The smaller files (e.g. 0–32,768 tokens) would have matched, since their text falls entirely within the stable prefix — but they were evicted in step 3.
  8. Full re-prefill from scratch: ~100–200 seconds instead of ~8 seconds.
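
The lookup failure in steps 5 to 7 can be sketched as a longest-matching-prefix search. This is an illustrative model only, not DS4's actual lookup code: token-id lists stand in for the rendered text, and `prefix_sha1` and `longest_usable` are hypothetical names.

```python
import hashlib

def prefix_sha1(tokens, n):
    """SHA1 over the first n token ids (stand-in for the rendered prompt text)."""
    data = ",".join(map(str, tokens[:n])).encode()
    return hashlib.sha1(data).hexdigest()

def longest_usable(checkpoints, prompt):
    """Return the largest checkpoint length whose hash matches the prompt's prefix."""
    best = 0
    for n, digest in checkpoints.items():
        if n <= len(prompt) and n > best and prefix_sha1(prompt, n) == digest:
            best = n
    return best

# Previous turn: 35,000 tokens. The new turn diverges at token 34,000.
old = list(range(35_000))
new = old[:34_000] + [-1] * 1_000

# Checkpoints at every 4,096-token boundary plus the full-context snapshot.
lengths = list(range(4096, 35_000, 4096)) + [35_000]
all_ckpts = {n: prefix_sha1(old, n) for n in lengths}
survivors = {35_000: all_ckpts[35_000]}  # only the large snapshot survived eviction

print(longest_usable(all_ckpts, new))   # 32768: resume from disk
print(longest_usable(survivors, new))   # 0: full re-prefill
```

With every checkpoint present, the 32,768-token file matches and only the tail needs prefilling. With only the full snapshot left, nothing matches.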

Temporary Workaround

I nuked the KV store and started with a fresh 400 GB store to avoid these eviction issues. This will stop working again once the KV store hits its limit, but at least now I know what to monitor.

Pre-Post Results

| State | Turn-start time | Disk KV loaded |
|---|---|---|
| Fresh disk / well under budget | ~8–12 s | 32,768 tokens from disk ✓ |
| Disk at capacity (256 GB, 441 stale files) | 100–208 s | 0 ✗ |
| After nuke + budget increase to 400 GB | ~8–12 s | 32,768 tokens from disk ✓ |

Long Term Fix needed

We need to revisit the KV eviction score formula to re-prioritize what gets evicted. I will also experiment with some options.
