Misc. bug: Running the gpt-oss-120b model on an RTX 3060 GPU with -b 4096 -ub 4096 causes it to crash and exit #17931

@zts9989

Name and Version

llama.cpp-b7342/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with GNU 11.2.1 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 -t 8 -tb 8 -ot exps=CPU -fa on  -c 131072 -b 4096 -ub 4096
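
For reference, the flags most relevant to this report, annotated by us (the annotations reflect our reading of the llama.cpp CLI and are not part of the original command):

  -ngl 100       offload up to 100 layers to the GPU (more than the model has, so all of them)
  -ot exps=CPU   override-tensor: keep the MoE expert tensors (names matching "exps") in host memory
  -fa on         enable flash attention
  -c 131072      context size of 131072 tokens
  -b 4096        logical batch size
  -ub 4096       physical (micro) batch size, i.e. how many tokens are pushed through the GPU per step
  -t 8 / -tb 8   CPU threads for generation / prompt (batch) processing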

Problem description & steps to reproduce

On Debian 12, CentOS 7.9, and similar systems, with CUDA 12.4, 12.8, or 13.1 (and the default latest GPU driver), llama.cpp builds b6432, b7211, and b7342 (plus a few other builds tested sporadically) crash when the gpt-oss-120b-Q4_K_M model is run with -ub 4096 -b 4096. Once the prompt being processed exceeds roughly 40960 tokens, llama-server aborts.

The same llama.cpp builds, on the same OS and CUDA versions, work fine with -ub 4096 -b 4096 on RTX 40-series cards such as the 4070 and 4090.

We suspect that something in the Turing-specific code path in fattn-mma-f16.cuh conflicts with the RTX 3060 (compute capability 8.6), perhaps a shared-memory issue.

With -ub 1024 -b 1024, the same setup runs normally on the 3060 GPU.
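
Since the failure is a CUDA illegal memory access, one way to narrow down the faulting kernel would be to rerun the crashing command under compute-sanitizer from the CUDA toolkit. This is only a diagnostic sketch on our part; we have not captured sanitizer output yet:

compute-sanitizer --tool memcheck \
    llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
    --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 \
    -t 8 -tb 8 -ot exps=CPU -fa on -c 131072 -b 4096 -ub 4096

memcheck should report the name of the kernel and the address of the first out-of-bounds access, which would confirm or rule out the flash-attention kernels in fattn-mma-f16.cuh.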

First Bad Commit

No response

Relevant log output

slot update_slots: id  3 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 4096, progress = 0.283075
slot update_slots: id  3 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 40960, batch.n_tokens = 4096, progress = 0.314528
slot update_slots: id  3 | task 0 | n_tokens = 40960, memory_seq_rm [40960, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 45056, batch.n_tokens = 4096, progress = 0.345980
/home/debUser/llama.cpp-b7342/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /home/debUser/llama.cpp-b7342/ggml/src/ggml-cuda/ggml-cuda.cu:2847
  cudaStreamSynchronize(cuda_ctx->stream())
No symbol table is loaded.  Use the "file" command.
[New LWP 11536]
[New LWP 11535]
[New LWP 11534]
[New LWP 11533]
[New LWP 11532]
[New LWP 11531]
[New LWP 11530]
[New LWP 11529]
[New LWP 11528]
[New LWP 11527]
[New LWP 11526]
[New LWP 11525]
[New LWP 11524]
[New LWP 11523]
[New LWP 11522]
[New LWP 11521]
[New LWP 11520]
[New LWP 11519]
[New LWP 11518]
[New LWP 11517]
[New LWP 11516]
[New LWP 11515]
[New LWP 11514]
[New LWP 11504]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f5281a741d9 in waitpid () from /lib64/libpthread.so.0
No symbol table is loaded.  Use the "file" command.
[Inferior 1 (process 11503) detached]
Aborted (core dumped)
llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 -t 8 -tb 8 -ot exps=CPU -fa on  -c 131072 -b 4096 -ub 4096

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   53C    P0             N/A /  170W |       1MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+


slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 130163, batch.n_tokens = 115, progress = 0.999509
slot update_slots: id  3 | task 0 | n_tokens = 130163, memory_seq_rm [130163, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 130227, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 130227, batch.n_tokens = 64
slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 128627, pos_max = 130162, size = 54.018 MiB)
slot print_timing: id  3 | task 0 | 
prompt eval time =  626310.71 ms / 130227 tokens (    4.81 ms per token,   207.93 tokens per second)
       eval time =   55423.91 ms /   845 tokens (   65.59 ms per token,    15.25 tokens per second)
      total time =  681734.62 ms / 131072 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 131071, truncated = 1
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 3060)   | 12037 = 4510 + ( 7256 =  1224 +    4662 +    1369) +         270 |
llama_memory_breakdown_print: |   - Host               |                 59790 = 59261 +       0 +     529                |
(base) root@DebianX:~$ llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 -t 8 -tb 8 -ot exps=CPU -fa on  -c 131072 -b 1024 -ub 1024
