Update to llama.cpp 2026-01-01 #2108
Conversation
Force-pushed from 502532a to 23c10e8.
Tested on macOS using `CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages`
This will need at least one more (very important) change, as the layout of `mtmd_context_params` is now:

```python
class mtmd_context_params(Structure):
    _fields_ = [
        ("use_gpu", c_bool),
        ("print_timings", c_bool),
        ("n_threads", c_int),
        ("image_marker", c_char_p),
        ("media_marker", c_char_p),
        ("flash_attn_type", c_int),
        ("warmup", c_bool),
        ("image_min_tokens", c_int),
        ("image_max_tokens", c_int),
    ]
```
More changes needed, as the layout of

Also the
Force-pushed from 23c10e8 to a070f61.
```python
@ctypes_function("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
def llama_max_tensor_buft_overrides() -> int:
    """Get maximum number of tensor buffer type overrides"""
    ...
```
Stray ellipsis operator (which does nothing, but still)
Sorry! The issue here isn't the ellipsis operator; it's that at some point there were two of them. You shouldn't change this to `pass`, because that implies the function returns `None`, which will cause type checking to fail.
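For readers following along, a minimal sketch (not from the PR) of the concern: these binding stubs declare a return type and let the `@ctypes_function` decorator supply the real implementation at runtime, so the Python body is only a placeholder.

```python
# Minimal sketch, not the PR's code: why the stub body matters for type checking.

def stub_with_ellipsis() -> int:
    ...   # conventional stub body; the declared `-> int` is taken at face value

def stub_with_pass() -> int:
    pass  # per the comment above, this reads as implicitly returning None,
          # which conflicts with the declared `-> int`
```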
Thank you, and I understand now. Done.
Force-pushed from 5042296 to d14a24f.
@dhdaines thanks for the review. I need some time to incorporate your comments; setting the PR to draft in the meantime.
Force-pushed from 64b087c to 3ffec02.
I think I have fixed this; could you check?
Force-pushed from 6dbddac to 39a2ee8.
Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks, though...
Force-pushed from 39a2ee8 to 103f671.
My intention was to sweep in like a hero and save the day. It didn't work as planned :/ I've rewritten the PR; there's much less whitespace noise and it's cleaner. All review comments are incorporated.
Thank you so much avion23 for your efforts to update the Python bindings to a recent llama.cpp version! I'm trying to use them in a Jupyter notebook (in Docker) on an Nvidia 5090 GPU. Although the latest locally built llama-cli runs in that same environment (see attached llama-cli.txt) and the problems discussed above are gone, the freshly built bindings crash the kernel when loading models to the GPU (after the weights are loaded, maybe a context issue; see attached build.txt). It may well be my mistake when installing your branch for GPU support: any ideas what I did wrong?

Edit (new findings): The above GPU build works with `n_gpu_layers=0` (CPU only). This narrows down the problem to context handling in the GPU code path.
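For reference, a minimal sketch of the repro described above; the model path, context size, and layer counts below are placeholders, not taken from the report:

```python
# Hypothetical repro sketch; model path and parameters are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # placeholder path
    n_gpu_layers=0,           # works: CPU only
    # n_gpu_layers=-1,        # reported to crash the kernel once weights are offloaded to the GPU
    n_ctx=4096,
)
```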
@abetlen Thank you for your work! Please keep this repo alive and merge avion23's updates into the main branch.
Force-pushed from e351642 to 235a3d4.
@oss-roettger thank you for all the testing. I found a bug with
- Update vendor/llama.cpp submodule to commit be47fb92 (2026-01-01)
- Bump version 0.3.16 -> 0.4.0

Critical fixes:
- Remove phantom flash_attn field from llama_context_params (caused segfaults)
- Add 3 missing params to llama_params_fit (margin, n_ctx_min, log_level)
- Migrate flash_attn bool -> flash_attn_type enum (BREAKING CHANGE)
- Add flash_attn_type to TYPE_CHECKING block
- Fix test: use flash_attn_type instead of removed flash_attn field

API changes:
- flash_attn: bool field REMOVED from structs
- flash_attn_type: int enum ADDED (AUTO=-1, DISABLED=0, ENABLED=1)
- High-level API maintains backward compatibility via wrapper
- Server default changed: flash_attn=False -> flash_attn=None (AUTO mode)

New features:
- 20+ new functions (memory API, state management, samplers, vocab queries)
- 5 new enums (flash_attn_type, params_fit_status, model_meta_key, etc.)
- 6 new struct fields across llama_model_params, llama_context_params, mtmd_context_params

Deprecated removals:
- 11 llama_kv_self_* functions (replaced by llama_memory_*)
- llama_sampler_init_softmax
- verbosity field from mtmd_context_params
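As an illustration of the backward-compatibility wrapper mentioned above, here is a minimal sketch; the enum values come from the commit message (AUTO=-1, DISABLED=0, ENABLED=1), but the helper name and structure are assumptions, not the PR's exact code:

```python
# Sketch only; resolve_flash_attn_type is a hypothetical helper name.
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1

def resolve_flash_attn_type(flash_attn):
    """Map the legacy flash_attn bool (or None) to the new flash_attn_type enum value."""
    if flash_attn is None:
        return LLAMA_FLASH_ATTN_TYPE_AUTO  # new server default: let llama.cpp decide
    return LLAMA_FLASH_ATTN_TYPE_ENABLED if flash_attn else LLAMA_FLASH_ATTN_TYPE_DISABLED
```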
Force-pushed from 235a3d4 to 17aae47.
@avion23 once again respect for your dedication. I have tested the new version after building it with

`!CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01`

Good news first: the update runs out of the box (with all values for `flash_attn` in the constructor: `flash_attn=None`/`True`/`False`/not present).

But I think I have discovered an additional issue (with and without your latest update; on GPU and CPU): https://huggingface.co/ggml-org/Nemotron-Nano-3-30B-A3B-GGUF produces a KV cache issue on the second dialog turn. I guess a cache initialization parameter is missing in the Python bindings, since the llama-cli command of the same llama build (= same libllama.so) works on multi-turn dialogs with the same Nemotron model (see the attached Llama_test.txt for minimal code to reproduce the error).

Edit: Log added.
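For context, a hedged sketch of what a second-turn repro could look like; the actual minimal code is in the attached Llama_test.txt, and the model filename and context size below are placeholders:

```python
# Hypothetical sketch of a two-turn chat; Llama_test.txt has the real repro.
from llama_cpp import Llama

llm = Llama(model_path="Nemotron-Nano-3-30B-A3B.gguf", n_ctx=8192)  # placeholder path

messages = [{"role": "user", "content": "Hello!"}]
first = llm.create_chat_completion(messages=messages)   # first turn works
messages.append(first["choices"][0]["message"])
messages.append({"role": "user", "content": "And a follow-up question?"})
second = llm.create_chat_completion(messages=messages)  # second turn hits the reported KV cache issue
```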
Thank you. Could you attach the log as before? Then it'll be quicker for me and you get a nice solution. Let's just pray this gets merged; I don't want to maintain my fork.
👍 Model loading log.txt attached above. Error log in Llama_test.txt.
Bindings were 5 months outdated, preventing newer model architectures from loading.
Updates bindings to llama.cpp commit be47fb92 (2026-01-01).
Removed

- `llama_kv_self_*` functions (use `llama_memory_*` API)
- `llama_sampler_init_softmax()`

Added
Enums:

- `LLAMA_ROPE_TYPE_IMROPE`
- `llama_flash_attn_type`
- `llama_params_fit_status`
- `llama_model_meta_key`

Struct fields:
- `llama_model_params`: `no_host`, `no_alloc`
- `llama_context_params`: `flash_attn_type` (replaced `flash_attn` bool)

Functions:
- `llama_max_tensor_buft_overrides`, `llama_n_ctx_seq`, `llama_model_n_embd_inp`, `llama_model_is_hybrid`, `llama_flash_attn_type_name`, `llama_model_meta_key_str`, `llama_adapter_meta_*` (5 functions), `llama_log_get`, `llama_log_set`, `llama_memory_breakdown_print`
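As a quick usage note, the new low-level functions can be called directly; this assumes the binding shown earlier in the review is re-exported at package level like the other `llama_*` symbols:

```python
# Calls the low-level binding quoted in the review above.
import llama_cpp

print(llama_cpp.llama_max_tensor_buft_overrides())  # max number of tensor buffer type overrides
```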
Breaking Changes

flash_attn parameter: the `flash_attn` bool field is removed from the structs and replaced by the `flash_attn_type` enum; the high-level API keeps backward compatibility via a wrapper, and the server default changed from `flash_attn=False` to `flash_attn=None` (AUTO).

KV cache API: the `llama_kv_self_*` functions are removed; use the `llama_memory_*` API instead.
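A hedged migration sketch for the KV cache change, assuming the low-level Python bindings mirror the C names `llama_get_memory` and `llama_memory_clear` (check `llama_cpp/llama_cpp.py` in this branch for the exact signatures):

```python
# Assumption: the bindings expose llama_get_memory / llama_memory_clear mirroring llama.cpp's C API.
import llama_cpp

def clear_kv_cache(ctx):
    # Old (removed): llama_cpp.llama_kv_self_clear(ctx)
    mem = llama_cpp.llama_get_memory(ctx)    # memory handle owned by the context
    llama_cpp.llama_memory_clear(mem, True)  # True: also clear the data buffers
```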
Other

- `ggml_log_callback` typedef
- `LLAMA_INSTALL_VERSION` (set before the subdirectory include)

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B