[Bug?] CPU KV buffer size is huge (8x what llama-server creates) for Gemma 4 31B when the model is loaded and started #2235

**Describe the Issue**
For a while I have noticed that I cannot run the dense Gemma-3, and later Gemma-4, models with long contexts because `koboldcpp` "wanted" a huge amount of RAM for the KV cache. As a newbie, I assumed this made sense (many parameters), even though Qwen 3.6 asked for several times less RAM. I put that down to a different architecture, a difference I still had to learn about.

Several days ago I started trying `llama-server`, and today I loaded the same Gemma-4-31B model with the same context size and the same default KV type (f16). Per the terminal output, `koboldcpp` tried to allocate 8x as much RAM for the "CPU KV buffer" as `llama-server` allocated to start the model. `koboldcpp` then exited, most probably because Linux could not allocate such a huge amount.

For "Qwen 3.6 27B", both `koboldcpp` and `llama-server` allocate about the same amount of RAM for the KV cache at startup (~1.5x less than `llama-server` allocates for Gemma 4 31B).
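
For reference, here is a rough sketch of how a full-size KV cache is commonly estimated, assuming every layer caches K and V for the whole context. The function name `kv_cache_bytes` and all parameter values are hypothetical and chosen for illustration only; they are not the real Gemma 4 31B configuration, and this is not taken from the `koboldcpp` or `llama-server` source.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV cache estimate: one K and one V vector per layer per token.

    bytes_per_elem=2 corresponds to f16. Assumes every layer caches the
    full context, which may not hold for every architecture or backend.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical values for illustration, NOT the real Gemma 4 31B config.
size = kv_cache_bytes(n_layers=48, n_ctx=32768, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # prints "6.00 GiB"
```

Comparing such an estimate (with the model's actual parameters) against the buffer size each program reports might show which of the two deviates from the full-context figure.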

**Additional Information:**
`koboldcpp-linux-x64-nocuda-v1.113.2`

