CUDA: fix overflow in MMA kernel without stream-k #17939

JohannesGaessler · 2025-12-11T13:50:49Z

On Ampere and older GPUs, a stream-k decomposition is only used if the tiling efficiency is poor. The kernel code that runs is exactly the same either way, though; only the launch configuration changes, so that each CUDA block works on a single tile. Since the number of tiles is proportional to the physical batch size, raising the batch size can cause an integer overflow in an intermediate result. This PR makes that part use 64-bit integers rather than 32-bit integers.
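To illustrate the kind of overflow described above, here is a minimal host-side C++ sketch. It is not the actual kernel code: the partitioning formula, the function names, and the parameters (`block_idx`, `num_tiles`, `k_blocks_per_tile`, `num_blocks`) are assumptions chosen for illustration. The point is only that when one block is launched per tile, a product of the form block index × tile count grows quadratically with the number of tiles and can exceed the 32-bit range even though the final result after division is small.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical work-partitioning formula (an assumption, not the real kernel):
// the start of the work range assigned to a block. The product is formed
// before the division, so it is the intermediate value that can get large.
// Casting the first operand to int64_t makes the whole product 64-bit.
constexpr int64_t work_start_64(int block_idx, int num_tiles,
                                int k_blocks_per_tile, int num_blocks) {
    return (int64_t) block_idx * num_tiles * k_blocks_per_tile / num_blocks;
}

// Checks whether the same intermediate product would have fit in a signed
// 32-bit integer. The product is computed in 64 bits here so that the check
// itself never overflows (signed overflow is undefined behavior in C++).
constexpr bool fits_in_int32(int block_idx, int num_tiles, int k_blocks_per_tile) {
    return (int64_t) block_idx * num_tiles * k_blocks_per_tile
        <= std::numeric_limits<int32_t>::max();
}

// One block per tile: num_blocks == num_tiles, last block index == num_tiles - 1.
static_assert( fits_in_int32(1023, 1024, 90), "small tile count fits in 32 bits");
static_assert(!fits_in_int32(8191, 8192, 90), "large tile count overflows 32 bits");
```

With the made-up value of 90 k-blocks per tile, the last of 8192 blocks produces an intermediate product of 8191 × 8192 × 90 = 6,039,060,480, well above the 32-bit limit of 2,147,483,647, while `work_start_64(8191, 8192, 90, 8192)` still returns the small, correct value 737,190.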

CUDA: fix overflow in MMA kernel without stream-k

ed39524

JohannesGaessler mentioned this pull request Dec 11, 2025

Misc. bug: When running the gpt‑oss‑120b model on a 3060 GPU, using -b 4096 -ub 4096 causes it to crash and exit #17931

Closed

loci-dev mentioned this pull request Dec 11, 2025

UPSTREAM PR #17939: CUDA: fix overflow in MMA kernel without stream-k auroralabs-loci/llama.cpp#524

Open

github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) Dec 11, 2025

ggerganov approved these changes Dec 12, 2025

JohannesGaessler merged commit 4822114 into ggml-org:master Dec 12, 2025
58 of 67 checks passed

2 participants
