Hi CodeWhale team,
I would like to propose a feature for improving the cost, latency, and continuity of long-running agentic coding sessions: persistent agent state, with a possible future extension to server-signed compressed KV cache capsules.
Problem
Many coding-agent workflows are not short interactive chats. They are long-running engineering tasks.
A common example from machine learning / algorithm engineering:
- I ask the coding agent to inspect code, design experiments, and submit multiple training jobs to a GPU cluster.
- The jobs run for several hours.
- I come back with logs, metrics, failed runs, TensorBoard summaries, or model comparison results.
- The agent needs to continue from the previous reasoning context and propose the next experimental plan.
The problem is that after a few hours, the previous prompt/cache state may no longer be reusable. The agent or the underlying LLM then has to reprocess the entire long context from scratch: repo context, task history, tool outputs, previous plans, experiment configs, and user decisions.
This creates two issues:
- High cost from repeatedly recreating long-context prompt caches.
- High latency from repeatedly pre-filling the same large context.
This is especially painful for agentic workflows where users naturally pause for hours or days while external jobs, CI pipelines, training tasks, or data processing tasks complete.
Proposed feature
CodeWhale could support a persistent task/session state mechanism with two complementary layers:
1. Local structured agent state
CodeWhale writes a durable local checkpoint such as:
- task objective
- repo / branch / commit snapshot
- experiment plan
- submitted jobs
- artifact paths
- previous decisions
- current known issues
- next actions
This makes the task recoverable and auditable even if no model-side cache is available.
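As a rough illustration, the checkpoint above could be a small JSON file written atomically. This is only a sketch: the `AgentCheckpoint` class, its field names, and the `schema_version` field are hypothetical and do not reflect any existing CodeWhale format.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class AgentCheckpoint:
    """Hypothetical checkpoint schema; fields mirror the list above."""
    task_objective: str
    repo: str
    branch: str
    commit: str
    experiment_plan: list[str] = field(default_factory=list)
    submitted_jobs: list[dict] = field(default_factory=list)
    artifact_paths: list[str] = field(default_factory=list)
    previous_decisions: list[str] = field(default_factory=list)
    known_issues: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)
    schema_version: int = 1  # assumption: lets later versions migrate old files


def save_checkpoint(cp: AgentCheckpoint, path: Path) -> None:
    """Write to a temp file in the same directory, then rename over the target,
    so a crash mid-write never leaves a half-written checkpoint."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(asdict(cp), f, indent=2, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path: Path) -> AgentCheckpoint:
    """Read a checkpoint back into the dataclass."""
    with path.open(encoding="utf-8") as f:
        return AgentCheckpoint(**json.load(f))
```

Plain JSON keeps the state human-readable and diffable, which is what makes it auditable without any model in the loop.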
2. Optional server-signed compressed KV cache capsule
For deployments where CodeWhale controls or integrates with the model-serving layer, a future extension could be:
- Before a session becomes inactive or before a server-side cache expires, CodeWhale requests the model server to export a compressed inference-state capsule.
- The server compresses the reusable prefix KV cache.
- The server signs the capsule and binds it to metadata such as:
  - user / org scope
  - model ID and model/runtime version
  - tokenizer version
  - compression algorithm
  - prefix hash
  - position range
  - creation time and expiration time
  - permission scope
- CodeWhale stores the opaque capsule locally.
- When the user resumes the task hours or days later, CodeWhale uploads the capsule plus the new user input.
- The server verifies the signature and metadata, restores the KV cache if it is compatible, and pre-fills only the new incremental context.
In other words, CodeWhale would not need to replay the entire historical conversation or rebuild the same cache from scratch every time a long-running task resumes.
Why this may be feasible
Recent KV cache compression research suggests that compressed KV state can be made much smaller while preserving quality reasonably well, especially with quantization and low-rank methods. Relevant directions include:
- KV cache quantization, for example KIVI and KVQuant.
- Low-rank / SVD-style KV compression, for example Palu, SVDq, ReCalKV.
- Cache streaming / transport formats, for example CacheGen.
- Prefix and non-prefix cache reuse, for example CacheBlend.
- Serving-system-level prefix caching and cache offloading.
The key point is that this does not have to expose raw model internals to the client. The exported cache could be an opaque, encrypted, server-signed blob. The client stores it but cannot read it (it is encrypted) and cannot modify it without the server detecting the change (it is signed).
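To give a sense of the capsule sizes involved, here is a back-of-the-envelope estimate. It uses the standard KV cache size formula (keys and values, per layer, per KV head, per token). The model shape and context length are made-up illustrative numbers, not any specific model, and the estimate ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bits_per_value: int) -> int:
    """Size of a dense KV cache: 2 tensors (K and V) per layer."""
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value // 8


# Hypothetical model shape and context length, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=100_000)

for bits in (16, 4, 2):
    size = kv_cache_bytes(**shape, bits_per_value=bits)
    print(f"{bits:>2}-bit KV cache: {size / 1e9:.1f} GB")
# 16-bit KV cache: 13.1 GB
#  4-bit KV cache: 3.3 GB
#  2-bit KV cache: 1.6 GB
```

Under these assumptions, compression decides whether a capsule is practical to store and upload at all, and the upload time would need to be weighed against the pre-fill time it saves.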
Security / integrity model
A possible design:
server creates compressed cache capsule
server signs capsule + metadata with private key
client stores capsule locally
client later uploads capsule
server verifies signature with public key / key registry
server checks metadata, permissions, expiry, model/runtime compatibility
server restores cache only if all checks pass
This does not solve every compatibility or quality issue, but it should address tampering and integrity concerns. The server remains the only party that creates, validates, and interprets the cache capsule.
Suggested UX shape
Possible user-facing commands:
codewhale checkpoint
codewhale resume
codewhale status
Possible internal behavior:
on session pause:
    write local structured state
    optionally request server-side cache capsule export
on session resume:
    load local structured state
    optionally upload compatible cache capsule
    attach new user message / new artifacts
    continue from restored task state
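The resume branch above could be sketched as a small client-side decision: use the capsule only if it still matches the local history, otherwise fall back to replaying the full context. The `Capsule` type, `prefix_hash`, and `plan_resume` are hypothetical names. The server would still perform the authoritative checks; this pre-check only avoids uploading a capsule that is certain to be rejected.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capsule:
    """Hypothetical client-side view of a stored capsule (blob stays opaque)."""
    model_id: str
    prefix_hash: str
    expires_at: float
    blob: bytes


def prefix_hash(history: list[str]) -> str:
    """Hash the ordered history; length-prefix each turn so boundaries matter."""
    h = hashlib.sha256()
    for turn in history:
        data = turn.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def plan_resume(history: list[str], new_message: str,
                capsule: Optional[Capsule], *, model_id: str, now: float) -> dict:
    """Decide what to send: capsule + new input, or the full history."""
    usable = (
        capsule is not None
        and capsule.model_id == model_id
        and now < capsule.expires_at
        and capsule.prefix_hash == prefix_hash(history)
    )
    if usable:
        return {"mode": "restore", "capsule": capsule.blob, "prefill": [new_message]}
    # Fallback: no compatible capsule, so the whole context is pre-filled again.
    return {"mode": "replay", "capsule": None, "prefill": history + [new_message]}
```

The fallback path means the capsule is a pure optimization: a missing, expired, or mismatched capsule costs a full pre-fill but never blocks the resume.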
For users, this could simply feel like:
CodeWhale can resume long-running coding tasks cheaply and quickly, even after hours or days.
Why this matters
This would be especially useful for:
- ML / algorithm engineering workflows
- long-running GPU training jobs
- CI / build / test pipelines
- large repo refactoring
- data processing tasks
- research agents
- multi-step debugging sessions
- tasks where the user returns hours or days later
In these workflows, the cost is not just generation tokens. A large part of the cost comes from repeatedly reconstructing the same long context.
Request
Would the CodeWhale team consider adding a persistent agent state / cache-resume mechanism?
Even a first version with only local structured checkpoints would be valuable. A later version with server-signed compressed KV cache capsules could significantly reduce cost and latency for long-running agent tasks.
Thanks for considering this.