
Conversation

@ngxson (Collaborator) commented Dec 11, 2025

Compare logits between llama.cpp and another inference engine using OpenAI-compatible server endpoints.

Unlike compare-logits.py, it allows dumping logits from a hosted API endpoint. Useful when it's not possible to run both models locally.

Example usage:
    Step 1: Dump logits from two different servers
        python scripts/compare-logprobs.py dump logits_llama.log http://localhost:8080/v1/completions
        python scripts/compare-logprobs.py dump logits_other.log http://other-engine:8000/v1/completions

        (optionally, you can add --api-key <key> if the endpoint requires authentication)

    Step 2: Compare the dumped logits
        python scripts/compare-logprobs.py compare logits_llama.log logits_other.log report.md

@ngxson requested a review from ggerganov on December 11, 2025 23:30
"top_k": 1,
"max_tokens": 1,
"logprobs": 1,
"stream": False,
@ngxson (Collaborator, Author) commented on the snippet above:

Just a note that I no longer use the echo option because its support is hit-or-miss across frameworks.
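
For context, each dump boils down to a single-token completion request per probe point. Below is a minimal sketch of such a request against an OpenAI-compatible /v1/completions endpoint; the payload fields mirror the snippet above, while the function name, prompt handling, and response parsing are illustrative rather than the script's actual code.

    # Minimal sketch of one dump request. The payload fields mirror the snippet
    # above (top_k, max_tokens, logprobs, stream); everything else (function
    # name, prompt, api_key handling) is illustrative only.
    import requests

    def fetch_next_token_logprobs(endpoint: str, prompt: str, api_key: str | None = None):
        headers = {"Content-Type": "application/json"}
        if api_key:
            headers["Authorization"] = f"Bearer {api_key}"
        payload = {
            "prompt": prompt,
            "top_k": 1,
            "max_tokens": 1,
            "logprobs": 1,
            "stream": False,
        }
        resp = requests.post(endpoint, json=payload, headers=headers, timeout=60)
        resp.raise_for_status()
        # OpenAI-style completions return per-token logprobs under choices[0].logprobs;
        # with max_tokens=1 and no echo, this covers only the next predicted token.
        return resp.json()["choices"][0]["logprobs"]

    # Example:
    # lp = fetch_next_token_logprobs("http://localhost:8080/v1/completions", "The quick brown fox")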

The github-actions bot added the script (Script related) and python (python script changes) labels on Dec 11, 2025.
@pwilkin (Collaborator) commented Dec 12, 2025

If this is going to be a general tool, I'd drop the prompt fragment about the tool call syntax, especially since we're not providing any tools.

@ngxson (Collaborator, Author) commented Dec 12, 2025

Tools and chat templates are not handled here; we use the raw completions endpoint.

The goal is a bit like a perplexity test: we don't actually care about the input text, only about the logits of the next predicted token.

@ngxson changed the title from "scripts: add script to compare logits of llama.cpp against other frameworks" to "scripts: add script to compare logprobs of llama.cpp against other frameworks" on Dec 12, 2025
@ggerganov (Member) left a comment

Tangentially, I'm thinking about whether we can prepare a script that automates the comparison for a set of the most prominent models (a rough sketch follows below):

  1. Download latest HF repo of original model
  2. Convert to GGUF BF16
  3. Compare logprobs, generate report
  4. Delete models
  5. Goto 1. for next model

The main issue is to get hardware to run this periodically.
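
A rough sketch of what that loop could look like, reusing the existing convert_hf_to_gguf.py converter and the compare-logprobs.py script from this PR; the model list, paths, and server URLs are placeholders, and both servers are assumed to already be running:

    # Rough sketch of the periodic comparison loop described above. The model
    # list, paths, and server URLs are placeholders; llama-server and the
    # reference engine are assumed to be started elsewhere.
    import shutil
    import subprocess
    from pathlib import Path

    MODELS = ["Qwen/Qwen2.5-7B-Instruct"]  # placeholder list of "prominent" models

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    for repo in MODELS:
        work = Path("work") / repo.replace("/", "_")
        gguf = work / "model-bf16.gguf"

        # 1. Download the latest HF repo of the original model
        run(["huggingface-cli", "download", repo, "--local-dir", str(work / "hf")])

        # 2. Convert to GGUF BF16 (verify flags against the current converter)
        run(["python", "convert_hf_to_gguf.py", str(work / "hf"),
             "--outfile", str(gguf), "--outtype", "bf16"])

        # 3. Compare logprobs and generate a report: llama-server serves the GGUF,
        #    the reference engine serves the original HF weights
        run(["python", "scripts/compare-logprobs.py", "dump",
             str(work / "logits_llama.log"), "http://localhost:8080/v1/completions"])
        run(["python", "scripts/compare-logprobs.py", "dump",
             str(work / "logits_ref.log"), "http://localhost:8000/v1/completions"])
        run(["python", "scripts/compare-logprobs.py", "compare",
             str(work / "logits_llama.log"), str(work / "logits_ref.log"),
             str(work / "report.md")])

        # 4. Delete the models to reclaim disk before the next iteration
        shutil.rmtree(work / "hf", ignore_errors=True)
        gguf.unlink(missing_ok=True)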


One more thing - this roughly tests the logprobs at a mostly empty context. It would be useful to be able to run the comparison at larger depth - e.g. logprobs for 100 tokens at 32k tokens of depth. The reason is that some models change their parameters at larger depths, and we should be able to test that too.

@pwilkin (Collaborator) commented Dec 12, 2025

@ggerganov The scripts by @danbev in examples/model-conversion already do a lot of that. I routinely run causal-verify-logits.

As for the second task, I've been thinking about the same thing recently; in fact, I've already started working on a branch for it. Since the llama-logits tool is way too restricted for this, I've employed llama-cli (or the new llama-completion) and made it so that if you run it with --verbose and --n-predict 1, it dumps logits the way llama-logits does. I've also refactored the run-org-model.py script to run long queries for the original transformer in batched chunks, similar to the llama-cli behavior.

The biggest issue, though, is that after processing 20k+ tokens the logits will almost never match exactly - the slight numerical differences accumulate too much for this test to be meaningful (even the fact that we do norm calculations in F16 in CUDA while Transformers does them in F32 will probably be enough). Unless we somehow make an exact copy of the KV cache and only process the tokens, I don't think we can reliably get a reasonable result from that.

@ngxson (Collaborator, Author) commented Dec 12, 2025

> The main issue is to get hardware to run this periodically.

I think we could already somewhat automate this workflow by deploying everything to HF Inference Endpoints, which support both vLLM and llama.cpp.

The main downside is that we can only do this with publicly available weights, and it can be quite tricky to test a specific PR (since HFE only accepts publicly available Docker images).

But still, I think we can already test some of the existing models (let me know which ones you think we should test).

I will modify the script to pick N tokens at a deeper context length (unfortunately we cannot count exactly by tokens, because the runtime may not expose an API for counting them).

@ggerganov (Member) commented

@ngxson I'll see if I can make a setup on my DGX Spark - need to learn how to run vllm though.

@pwilkin There are various advantages of this comparison:

  • Works through OAI API
  • Can be used to verify against vllm which is what most model providers develop against

> The biggest issue though is after processing 20k+ tokens the logits will almost never match exactly

The logprobs actually align quite well at long contexts since the result for token N is no longer a function of the result for token N-1 and hence there is no accumulation of numerical differences.

@ngxson (Collaborator, Author) commented Dec 13, 2025

I added a --pattern option:

  --pattern PATTERN  Pattern n_get,n_skip,... where n_get is number of words to get and n_skip is number of words
                     to skip (num of words, NOT num of tokens)

The default pattern is 10,1000,10,4000,10.
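
For illustration, one way to read that pattern (not necessarily how the script implements it): alternate between taking n_get probe words and skipping n_skip words, so the default places a block of probes near the start plus further blocks at progressively deeper context:

    # Illustration of the get/skip pattern semantics (not the script's actual code):
    # even entries are n_get (words to probe), odd entries are n_skip (words to skip).
    def probe_positions(pattern: str = "10,1000,10,4000,10") -> list[int]:
        counts = [int(x) for x in pattern.split(",")]
        positions: list[int] = []
        cursor = 0
        for i, n in enumerate(counts):
            if i % 2 == 0:           # even entries are n_get: probe these words
                positions.extend(range(cursor, cursor + n))
            cursor += n              # both kinds of entry advance past their words
        return positions

    # With the default pattern, probes land at words 0-9, 1010-1019 and 5020-5029,
    # i.e. one shallow block plus two blocks at deeper context.
    print(probe_positions()[:12])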
