
Conversation

@JohannesGaessler (Collaborator)

Fixes #17931.

On Ampere and older GPUs, a stream-k decomposition is only used if the efficiency of the regular tiling would be bad. However, the code that is run is still exactly the same; there is only a change in the launch configuration so that each CUDA block works on a single tile. Since the number of tiles is proportional to the physical batch size, raising the batch size can cause a numerical overflow in an intermediate result. This PR simply makes that part use 64-bit integers rather than 32-bit integers.

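To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the actual ggml CUDA kernel code; the names `ntiles` and `elements_per_tile` are made up for the example. It only shows how a product that grows with the number of tiles can overflow a 32-bit intermediate, and how widening one operand to 64 bits (the approach described above) avoids it:

```cpp
// Minimal sketch, not the actual ggml CUDA code; names are hypothetical.
// It illustrates the overflow pattern the PR description refers to:
// an intermediate product that grows with the number of tiles.
#include <cstdint>
#include <cstdio>

int main() {
    // With -b 4096 -ub 4096 the number of tiles grows with the physical
    // batch size, so a product like ntiles * elements_per_tile can exceed
    // the range of a 32-bit int.
    const int ntiles            = 70000;  // hypothetical tile count
    const int elements_per_tile = 65536;  // hypothetical tile size

    // 32-bit intermediate: the true product (4'587'520'000) does not fit,
    // so the value is wrong (shown here via an explicit truncating cast
    // to avoid signed-overflow undefined behavior in the example).
    const int32_t bad  = (int32_t) ((int64_t) ntiles * elements_per_tile);

    // 64-bit intermediate, as in the fix: widen one operand before
    // multiplying so the product is computed in 64 bits.
    const int64_t good = (int64_t) ntiles * elements_per_tile;

    printf("32-bit intermediate: %d\n", bad);
    printf("64-bit intermediate: %lld\n", (long long) good);
    return 0;
}
```
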
@JohannesGaessler merged commit 4822114 into ggml-org:master on Dec 12, 2025
58 of 67 checks passed

Labels

ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Misc. bug: When running the gpt-oss-120b model on a 3060 GPU, using -b 4096 -ub 4096 causes it to crash and exit

2 participants