[Bug?] CPU KV buffer size is huge (8x of what llama-server creates) for Gemma 4 31B when model is loaded and started #2235

@alex-ie

Describe the Issue
For a while I have noticed that I cannot run Gemma-3, and later Gemma-4 dense, with long contexts because koboldcpp "wanted" huge amounts of RAM for the KV cache. As a newbie, I thought that made sense (many parameters), even though Qwen 3.6 asked for several times less RAM. I assumed this was down to a different architecture, a difference I still had to learn about.

Several days ago I started trying llama-server, and today I loaded the same Gemma-4-31B model with the same context size and the same default KV type (f16). According to the terminal output, koboldcpp tried to allocate 8x as much RAM for the "CPU KV buffer" as llama-server allocated to start the model, and then exited, most probably because Linux could not allocate such a huge amount.

For "Qwen 3.6 27B", koboldcpp and llama-server both allocate about the same amount of RAM for the KV cache at startup (about 1.5x less than llama-server allocates for Gemma 4 31B).
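
For reference, here is a rough sketch of how a KV cache size is usually estimated, and of how much smaller it gets if some layers only cache a fixed sliding window of tokens instead of the full context. Every number below (layer counts, KV heads, head size, window length) is a hypothetical placeholder, not the real Gemma or Qwen configuration, and the function names are made up for illustration. Whether a difference like this explains the 8x gap reported here is an unconfirmed assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size when every layer caches the full context.

    Each layer stores one K and one V tensor of shape
    (n_ctx, n_kv_heads, head_dim); f16 uses 2 bytes per element.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def kv_cache_bytes_swa(n_full_layers, n_swa_layers, n_kv_heads, head_dim,
                       n_ctx, window, bytes_per_elem=2):
    """Estimate KV cache size when some layers cache only a sliding window.

    Full-attention layers cache n_ctx tokens; sliding-window layers cache
    at most `window` tokens.
    """
    full = kv_cache_bytes(n_full_layers, n_kv_heads, head_dim, n_ctx,
                          bytes_per_elem)
    swa = kv_cache_bytes(n_swa_layers, n_kv_heads, head_dim,
                         min(n_ctx, window), bytes_per_elem)
    return full + swa


if __name__ == "__main__":
    # Hypothetical model: 60 layers, 8 KV heads, head size 128, 32k context.
    everything_full = kv_cache_bytes(60, 8, 128, 32768)
    # Same model, assuming 50 of the 60 layers use a 1024-token window.
    with_window = kv_cache_bytes_swa(10, 50, 8, 128, 32768, 1024)
    gib = 1024 ** 3
    print(f"all layers full context: {everything_full / gib:.2f} GiB")
    print(f"with sliding window:     {with_window / gib:.2f} GiB")
    print(f"ratio:                   {everything_full / with_window:.2f}x")
```

Comparing the "CPU KV buffer" lines printed by both programs against an estimate like this, using the model's real parameters, may show which of the two is caching more than the architecture needs.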

Additional Information:
koboldcpp-linux-x64-nocuda-v1.113.2
