Describe the Issue
I have noticed for a while that I cannot run Gemma-3, and later Gemma-4 dense, with long contexts because koboldcpp "wanted" huge amounts of RAM for the KV cache. As a newbie, I thought it made sense (many parameters), even though Qwen 3.6 asked for several times less RAM. I assumed the architecture was different and that I would understand the difference as I learned more.
Several days ago I started trying llama-server, and today I tried to load the same Gemma-4-31B model with the same context size and the same default KV type (f16). Per the terminal output, koboldcpp wanted to allocate 8x as much RAM for the "CPU KV buffer" as llama-server allocated to start the model. koboldcpp then exited, most probably because Linux could not allocate such a huge amount.
For "Qwen 3.6 27B", both koboldcpp and llama-server allocate roughly the same amount of RAM for the KV cache at startup (~1.5x less than llama-server allocates for Gemma 4 31B).
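For anyone who wants to sanity-check the buffer sizes printed in the terminal, here is a rough sketch of how an f16 KV cache is commonly estimated when every layer caches the full context. The function name and the numbers below are hypothetical placeholders, not the real Gemma or Qwen hyperparameters, and neither koboldcpp nor llama-server is guaranteed to compute the size exactly this way.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Rough KV cache estimate, assuming every layer caches the full context.

    Each layer stores one K and one V tensor of shape
    (n_ctx, n_kv_heads, head_dim); f16 uses 2 bytes per element.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical values for illustration only (NOT taken from any real model).
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
print(f"{size / 2**30:.2f} GiB")  # prints "6.00 GiB"
```

Plugging in the values from a model's metadata and comparing the result with the "CPU KV buffer" line from each program should show which of the two deviates from this simple estimate.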
Additional Information:
koboldcpp-linux-x64-nocuda-v1.113.2