Description
Name and Version
llama.cpp-b7342/build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with GNU 11.2.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 -t 8 -tb 8 -ot exps=CPU -fa on -c 131072 -b 4096 -ub 4096
Problem description & steps to reproduce
On Debian 12, CentOS 7.9 and similar systems, using CUDA 12.4, 12.8 or 13.1 (each with its default, latest GPU driver), llama.cpp builds b6432, b7211 and b7342 (plus a few other builds tested sporadically) crash with a CUDA illegal-memory-access error when the gpt-oss-120b-Q4_K_M model is run on an RTX 3060 with -ub 4096 -b 4096. Once roughly 40960 prompt tokens have been processed, llama-server aborts.
The same llama.cpp versions, on the same OS and CUDA versions, work fine with -ub 4096 -b 4096 on RTX 40-series cards such as the 4070 or 4090.
We suspect that the Turing-and-newer MMA code path in fattn-mma-f16.cuh contains something that conflicts with the RTX 3060 (Ampere, compute capability 8.6), perhaps a shared-memory issue?
-ub 1024 -b 1024 runs normally on a 3060 GPU.
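If shared memory really is the limiting factor, one quick way to see what the FlashAttention kernels have to fit into on the affected GPU is to query the device limits directly. Below is a minimal stand-alone diagnostic sketch (not part of llama.cpp; the file name and structure are ours) that prints the per-block and per-SM shared-memory limits, which could be compared between the RTX 3060 and the 40-series cards that do not crash.

```cpp
// smem_limits.cu (hypothetical helper, not from the llama.cpp tree)
// Build with: nvcc -o smem_limits smem_limits.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "failed to query device %d\n", dev);
        return 1;
    }

    // Maximum dynamic shared memory per block when a kernel opts in via
    // cudaFuncAttributeMaxDynamicSharedMemorySize.
    int smem_optin = 0;
    cudaDeviceGetAttribute(&smem_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);

    printf("device %d: %s (compute capability %d.%d)\n", dev, prop.name, prop.major, prop.minor);
    printf("shared memory per block (default): %zu bytes\n", prop.sharedMemPerBlock);
    printf("shared memory per block (opt-in):  %d bytes\n", smem_optin);
    printf("shared memory per SM:              %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```

Comparing the opt-in per-block limit reported here against whatever the kernel configuration selected for -ub 4096 requests might help confirm or rule out the shared-memory hypothesis; we have not verified this ourselves.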
First Bad Commit
No response
Relevant log output
slot update_slots: id 3 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 4096, progress = 0.283075
slot update_slots: id 3 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 40960, batch.n_tokens = 4096, progress = 0.314528
slot update_slots: id 3 | task 0 | n_tokens = 40960, memory_seq_rm [40960, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 45056, batch.n_tokens = 4096, progress = 0.345980
/home/debUser/llama.cpp-b7342/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /home/debUser/llama.cpp-b7342/ggml/src/ggml-cuda/ggml-cuda.cu:2847
cudaStreamSynchronize(cuda_ctx->stream())
No symbol table is loaded. Use the "file" command.
[New LWP 11536]
[New LWP 11535]
[New LWP 11534]
[New LWP 11533]
[New LWP 11532]
[New LWP 11531]
[New LWP 11530]
[New LWP 11529]
[New LWP 11528]
[New LWP 11527]
[New LWP 11526]
[New LWP 11525]
[New LWP 11524]
[New LWP 11523]
[New LWP 11522]
[New LWP 11521]
[New LWP 11520]
[New LWP 11519]
[New LWP 11518]
[New LWP 11517]
[New LWP 11516]
[New LWP 11515]
[New LWP 11514]
[New LWP 11504]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f5281a741d9 in waitpid () from /lib64/libpthread.so.0
No symbol table is loaded. Use the "file" command.
[Inferior 1 (process 11503) detached]
Aborted (core dumped)
llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 -t 8 -tb 8 -ot exps=CPU -fa on -c 131072 -b 4096 -ub 4096
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 550.163.01 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A |
| 0% 53C P0 N/A / 170W | 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 130163, batch.n_tokens = 115, progress = 0.999509
slot update_slots: id 3 | task 0 | n_tokens = 130163, memory_seq_rm [130163, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 130227, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 130227, batch.n_tokens = 64
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 8 (pos_min = 128627, pos_max = 130162, size = 54.018 MiB)
slot print_timing: id 3 | task 0 |
prompt eval time = 626310.71 ms / 130227 tokens ( 4.81 ms per token, 207.93 tokens per second)
eval time = 55423.91 ms / 845 tokens ( 65.59 ms per token, 15.25 tokens per second)
total time = 681734.62 ms / 131072 tokens
slot release: id 3 | task 0 | stop processing: n_tokens = 131071, truncated = 1
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
^Csrv operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3060) | 12037 = 4510 + ( 7256 = 1224 + 4662 + 1369) + 270 |
llama_memory_breakdown_print: | - Host | 59790 = 59261 + 0 + 529 |
(base) root@DebianX:~$ llama.cpp-b7342/build/bin/llama-server -m gpt-oss-120b-Q4_K_M-00001-of-00002.gguf --temp 0.9 --top-p 0.95 --top-k 39 -s 3047 --no-warmup -ngl 100 --host 0.0.0.0 -t 8 -tb 8 -ot exps=CPU -fa on -c 131072 -b 1024 -ub 1024