models : fix the attn_factor for mistral3 graphs + improve consistency #17945

Conversation
convert_hf_to_gguf.py (outdated diff):
        yarn_params = self.hparams["yarn"]
        self.gguf_writer.add_attn_temperature_length(yarn_params["original_max_position_embeddings"])
    -   self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1)  # mscale_all_dim * 0.1
    +   self.gguf_writer.add_rope_scaling_yarn_log_mul(1.0)  # mscale_all_dim
I believe the old version of this was incorrect. AFAICT this affects Mistral 3 Large models. Since it's difficult to test these and given that they are recent and likely haven't proliferated much, I think it's best to accept this breaking change.
I think the old version should be correct, because I copied it from this LOC of DeepseekV2:

llama.cpp/convert_hf_to_gguf.py, line 7289 in a81a569:

    self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1 * rope_scaling["mscale_all_dim"])

Also note that the current LOC is used by MistralMoeModel, which is based on DeepseekV2Model; it is different from the normal MistralModel, which is based on LlamaModel.
Otherwise, it's probably better to use another metadata key in mistral3; I would suggest {arch}.rope.scaling.yarn_mscale_all_dims.
It can default to 1.0 to avoid breaking changes for existing GGUFs.
Correct, I missed that Mistral 3 Large implements the DS2 arch. Reverted the change and added a comment.
    // special-case DEEPSEEK v2:
    // https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat/blob/main/config.json#L42-L43
    if (arch == LLM_ARCH_DEEPSEEK2 && mscale_all_dims != 1.0f) {
        mscale = mscale_all_dims;
For clarification, I think for deepseek we can do mscale_all_dims = mscale_all_dims * 10.0f, because it was scaled by 0.1 before being written into the GGUF.
Already done above on line 1642
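To make the convention concrete, here is a minimal standalone sketch (not the actual llama.cpp loader code; the example value is made up): since the convert script stores mscale_all_dim scaled by 0.1 under rope_yarn_log_mul for DeepSeek v2, the loader recovers the original value by multiplying back by 10.

```cpp
// Standalone illustration (not the actual llama.cpp loader code) of undoing the
// 0.1 scaling that the convert script applies to mscale_all_dim for DeepSeek v2.
#include <cstdio>

int main() {
    // value read from an existing DeepSeek v2 GGUF: 0.1 * mscale_all_dim
    // (0.0707f is a made-up example, not taken from a real model config)
    const float rope_yarn_log_mul = 0.0707f;

    // undo the 0.1 factor applied at conversion time
    const float mscale_all_dims = rope_yarn_log_mul * 10.0f;

    printf("rope_yarn_log_mul = %.4f -> mscale_all_dims = %.4f\n",
           rope_yarn_log_mul, mscale_all_dims);
    return 0;
}
```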
        return scale <= 1.0f ? 1.0f : (0.1f * mscale * logf(scale) + 1.0f);
    };

    hparams.yarn_attn_factor = get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dims);
We're calculating 0.1f * mscale * logf(scale) + 1.0f in 4 places in the code base:
- here
- yarn_attn_factor_adjust
- inside the model graph build function
- inside the rope kernel

I think there is a way to simplify it:
- leave yarn_attn_factor as-is (because it's also used by LLM_ARCH_GROK, so probably better not to adjust it?)
- leave yarn_attn_factor_adjust to adjust everything
- remove the adjustment from the deepseek2 cgraph?
In short, I think this LOC: hparams.yarn_attn_factor = get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dims) can be redundant, because:
- The numerator get_mscale(factor, mscale) is already handled inside the rope kernel (llama.cpp/ggml/src/ggml-cuda/rope.cu, line 34 in c33a58b):
      mscale *= 1.0f + 0.1f * logf(1.0f / freq_scale);
- The denominator get_mscale(factor, mscale_all_dims) will be handled by yarn_attn_factor_adjust()
> The numerator get_mscale(factor, mscale) is already handled inside the rope kernel

This numerator assumes mscale == 1.0f. I am not sure why it was added, but effectively we have to cancel it whenever we want to support mscale != 1.0f.
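To make the cancellation concrete, here is a small numerical sketch (not the actual llama.cpp implementation; the factor and mscale values are made up): the rope kernel always multiplies by get_mscale(factor, 1.0f), so to end up with the desired get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dims) ratio, the externally supplied attention factor has to divide that built-in term back out.

```cpp
// Numerical sketch of the cancellation discussed above (illustrative only).
#include <cmath>
#include <cstdio>

// YaRN magnitude scale: 0.1 * mscale * ln(scale) + 1.0 for scale > 1
static float get_mscale(float scale, float mscale) {
    return scale <= 1.0f ? 1.0f : 0.1f * mscale * logf(scale) + 1.0f;
}

int main() {
    const float factor          = 40.0f;  // hypothetical YaRN scaling factor
    const float mscale          = 0.707f; // hypothetical, not from a real config
    const float mscale_all_dims = 0.707f;

    // desired overall attention scale for DeepSeek-style YaRN
    const float target = get_mscale(factor, mscale) / get_mscale(factor, mscale_all_dims);

    // the rope kernel bakes in get_mscale(factor, 1.0f), i.e. it assumes mscale == 1
    const float kernel_builtin = get_mscale(factor, 1.0f);

    // so the externally supplied factor has to cancel the kernel's assumption
    const float attn_factor = target / kernel_builtin;

    printf("target = %f, kernel builtin = %f, supplied attn_factor = %f\n",
           target, kernel_builtin, attn_factor);
    printf("check: supplied * builtin = %f (should equal target)\n",
           attn_factor * kernel_builtin);
    return 0;
}
```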
> because it's also used by LLM_ARCH_GROK, so probably better not to adjust it?

The hparams.yarn_attn_factor = get_mscale(... branch is only entered when hparams.rope_yarn_log_mul != 0.0f, so it should not trigger for GROK.
ah yeah, right. that sounds good then
ggerganov left a comment:
The logic is still quite messy, but at least I hope we are handling YaRN correctly now in all cases. @ngxson Thanks for the help and investigation of this issue.
Tested that logprobs with Ministral 3 and Devstral 2 are improved and DS2 are the same as before. Not sure which other models would be affected. Plan to merge this later today.
Is this supposed to affect Devstral Small 2 too? Getting improved generation now that this PR was merged; just wondering whether I'm just hitting a statistical anomaly or if this PR helped.
cont #17644
https://github.com/huggingface/transformers/blob/6d00f6b0a5679c36510f203e4226e36f517c3032/src/transformers/modeling_rope_utils.py#L336-L348
- Apply the YaRN attention factor whenever yarn_ext_factor != 0.0f, instead of on a case-by-case basis
- Add llama_hparams::yarn_attn_factor_adjust() and reuse it both in llama_graph and llama_kv_cache::build_rope_shift() for consistency (see the sketch after this list)
- … the hparams.rope_yarn_log_mul parameter (a.k.a. mscale_all_dims)
- Keep scaling hparams.rope_yarn_log_mul with 0.1 in the convert script for Deepseek v2 for backwards compatibility. Negate it during model load. (see [TAG_DEEPSEEK2_YARN_LOG_MUL_FIX])
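For illustration, a hypothetical sketch of the consolidation described in the second bullet: one shared helper that performs the YaRN attention-factor adjustment, called from both the graph build and the KV-cache rope shift instead of duplicating the formula. The member name follows the PR description, but the struct layout, signature, and exact adjustment here are assumptions, not the actual llama.cpp code.

```cpp
// Hypothetical sketch of a shared YaRN attention-factor adjustment helper
// (illustrative only; not the real llama_hparams layout or API).
#include <cmath>

struct hparams_sketch {
    // assumed to already hold the raw mscale_all_dims value (any conversion-time
    // 0.1 scaling undone at model load, as discussed earlier in the thread)
    float mscale_all_dims = 1.0f;

    // shared adjustment, meant to be called from both the graph build code and
    // the KV-cache rope-shift code so the two paths stay consistent
    float yarn_attn_factor_adjust(float attn_factor, float freq_scale) const {
        // nothing to adjust, or no YaRN interpolation is active
        if (mscale_all_dims == 1.0f || freq_scale >= 1.0f) {
            return attn_factor;
        }
        // divide out get_mscale(1/freq_scale, mscale_all_dims) so the all-dims
        // magnitude scale is not applied a second time along the rope-kernel path
        return attn_factor / (0.1f * mscale_all_dims * logf(1.0f / freq_scale) + 1.0f);
    }
};

int main() {
    hparams_sketch hp;
    hp.mscale_all_dims = 0.707f;           // made-up example value
    const float freq_scale = 1.0f / 40.0f; // made-up YaRN interpolation scale

    // both the graph build and the rope-shift path would funnel through the same call
    const float adjusted = hp.yarn_attn_factor_adjust(1.0f, freq_scale);
    (void) adjusted;
    return 0;
}
```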