
Conversation

@ngxson (Collaborator) commented Dec 11, 2025

Compare logits between llama.cpp and another inference engine using OpenAI-compatible server endpoints.

Unlike compare-logits.py, it allows dumping logits from a hosted API endpoint. Useful when it's not possible to run both models locally.

Example usage:
    Step 1: Dump logits from two different servers
        python scripts/compare-logprobs.py dump logits_llama.log http://localhost:8080/v1/completions
        python scripts/compare-logprobs.py dump logits_other.log http://other-engine:8000/v1/completions

        (optionally, you can add --api-key <key> if the endpoint requires authentication)

    Step 2: Compare the dumped logits
        python scripts/compare-logprobs.py compare logits_llama.log logits_other.log report.md

@ngxson requested a review from ggerganov on December 11, 2025 23:30
"top_k": 1,
"max_tokens": 1,
"logprobs": 1,
"stream": False,
@ngxson (Collaborator, Author) commented on the snippet above:

Just a note that I no longer use the echo option because its support is hit-or-miss across frameworks.
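
For context, each dump boils down to a single-token completion request per probe point. Below is a minimal sketch of such a request against an OpenAI-compatible /v1/completions endpoint; the payload fields mirror the snippet above, while the function name, prompt handling, and response parsing are illustrative rather than the script's actual code.

    # Minimal sketch of one dump request. The payload fields mirror the snippet
    # above (top_k, max_tokens, logprobs, stream); everything else (function
    # name, prompt, api_key handling) is illustrative only.
    import requests

    def fetch_next_token_logprobs(endpoint: str, prompt: str, api_key: str | None = None):
        headers = {"Content-Type": "application/json"}
        if api_key:
            headers["Authorization"] = f"Bearer {api_key}"
        payload = {
            "prompt": prompt,
            "top_k": 1,
            "max_tokens": 1,
            "logprobs": 1,
            "stream": False,
        }
        resp = requests.post(endpoint, json=payload, headers=headers, timeout=60)
        resp.raise_for_status()
        # OpenAI-style completions return per-token logprobs under choices[0].logprobs;
        # with max_tokens=1 and no echo, this covers only the next predicted token.
        return resp.json()["choices"][0]["logprobs"]

    # Example:
    # lp = fetch_next_token_logprobs("http://localhost:8080/v1/completions", "The quick brown fox")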

The github-actions bot added the script (Script related) and python (python script changes) labels on Dec 11, 2025.
@pwilkin (Collaborator) commented Dec 12, 2025

If this is going to be a general tool, I'd drop the prompt fragment about the tool call syntax, especially since we're not providing any tools.

@ngxson (Collaborator, Author) commented Dec 12, 2025

Tools and chat templates are not handled here; we use the raw completions endpoint.

The goal is a bit like a perplexity test: we don't actually care about the input text, only about the logits of the next predicted token.

@ngxson changed the title from "scripts: add script to compare logits of llama.cpp against other frameworks" to "scripts: add script to compare logprobs of llama.cpp against other frameworks" on Dec 12, 2025
@ggerganov (Member) left a comment

Tangentially, I'm thinking about whether we can prepare a script that automates the comparison for a set of the most prominent models (a rough sketch follows below):

  1. Download latest HF repo of original model
  2. Convert to GGUF BF16
  3. Compare logprobs, generate report
  4. Delete models
  5. Goto 1. for next model

The main issue is to get hardware to run this periodically.
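
A rough sketch of what that loop could look like, reusing the existing convert_hf_to_gguf.py converter and the compare-logprobs.py script from this PR; the model list, paths, and server URLs are placeholders, and both servers are assumed to already be running:

    # Rough sketch of the periodic comparison loop described above. The model
    # list, paths, and server URLs are placeholders; llama-server and the
    # reference engine are assumed to be started elsewhere.
    import shutil
    import subprocess
    from pathlib import Path

    MODELS = ["Qwen/Qwen2.5-7B-Instruct"]  # placeholder list of "prominent" models

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    for repo in MODELS:
        work = Path("work") / repo.replace("/", "_")
        gguf = work / "model-bf16.gguf"

        # 1. Download the latest HF repo of the original model
        run(["huggingface-cli", "download", repo, "--local-dir", str(work / "hf")])

        # 2. Convert to GGUF BF16 (verify flags against the current converter)
        run(["python", "convert_hf_to_gguf.py", str(work / "hf"),
             "--outfile", str(gguf), "--outtype", "bf16"])

        # 3. Compare logprobs and generate a report: llama-server serves the GGUF,
        #    the reference engine serves the original HF weights
        run(["python", "scripts/compare-logprobs.py", "dump",
             str(work / "logits_llama.log"), "http://localhost:8080/v1/completions"])
        run(["python", "scripts/compare-logprobs.py", "dump",
             str(work / "logits_ref.log"), "http://localhost:8000/v1/completions"])
        run(["python", "scripts/compare-logprobs.py", "compare",
             str(work / "logits_llama.log"), str(work / "logits_ref.log"),
             str(work / "report.md")])

        # 4. Delete the models to reclaim disk before the next iteration
        shutil.rmtree(work / "hf", ignore_errors=True)
        gguf.unlink(missing_ok=True)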


One more thing - this roughly tests the logprobs at a mostly empty context. It would be useful to be able to run the comparison at larger depth - e.g. logprobs for 100 tokens at 32k tokens of depth. The reason is that some models change their parameters at larger depths, and we should be able to test that too.

@pwilkin (Collaborator) commented Dec 12, 2025

@ggerganov The scripts by @danbev in examples/model-conversion already do a lot of that. I routinely run causal-verify-logits.

As for the second task, I've been thinking about the same thing recently; in fact, I've already started working on a branch for it. Since the llama-logits tool is way too restricted for this, I've employed llama-cli (or the new llama-completion) and made it so that if you run it with --verbose and --n-predict 1, it dumps logits the way llama-logits does. I've also refactored the run-org-model.py script to run long queries for the original transformer in batched chunks, similar to the llama-cli behavior.

The biggest issue, though, is that after processing 20k+ tokens the logits will almost never match exactly - the slight numerical differences accumulate too much for this test to be meaningful (even the fact that we do norm calculations in F16 in CUDA while Transformers does them in F32 will probably be enough). Unless we somehow make an exact copy of the KV cache and only process the tokens, I don't think we can reliably get a reasonable result from that.

@ngxson (Collaborator, Author) commented Dec 12, 2025

> The main issue is to get hardware to run this periodically.

I think we could already somewhat automate this workflow by deploying everything to HF Inference Endpoints, which support both vLLM and llama.cpp.

The main downside is that we can only do this with publicly available weights, and it can be quite tricky to test a specific PR (since HFE only accepts publicly available Docker images).

But still, I think we can already test some of the existing models (let me know which ones you think we should test).

I will modify the script to pick N tokens at a deeper context length (unfortunately we cannot count exactly by tokens, because the runtime may not expose an API for counting them).

@ggerganov (Member) commented

@ngxson I'll see if I can make a setup on my DGX Spark - need to learn how to run vllm though.

@pwilkin There are various advantages of this comparison:

  • Works through OAI API
  • Can be used to verify against vllm which is what most model providers develop against

> The biggest issue though is after processing 20k+ tokens the logits will almost never match exactly

The logprobs actually align quite well at long contexts since the result for token N is no longer a function of the result for token N-1 and hence there is no accumulation of numerical differences.

@ngxson (Collaborator, Author) commented Dec 13, 2025

I added a --pattern option:

  --pattern PATTERN  Pattern n_get,n_skip,... where n_get is number of words to get and n_skip is number of words
                     to skip (num of words, NOT num of tokens)

The default pattern is 10,1000,10,4000,10.
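
For illustration, one way to read that pattern (not necessarily how the script implements it): alternate between taking n_get probe words and skipping n_skip words, so the default places a block of probes near the start plus further blocks at progressively deeper context:

    # Illustration of the get/skip pattern semantics (not the script's actual code):
    # even entries are n_get (words to probe), odd entries are n_skip (words to skip).
    def probe_positions(pattern: str = "10,1000,10,4000,10") -> list[int]:
        counts = [int(x) for x in pattern.split(",")]
        positions: list[int] = []
        cursor = 0
        for i, n in enumerate(counts):
            if i % 2 == 0:           # even entries are n_get: probe these words
                positions.extend(range(cursor, cursor + n))
            cursor += n              # both kinds of entry advance past their words
        return positions

    # With the default pattern, probes land at words 0-9, 1010-1019 and 5020-5029,
    # i.e. one shallow block plus two blocks at deeper context.
    print(probe_positions()[:12])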
