The posted benchmarks cover Apple Metal (M5 Max, Mac Studio M3 Ultra) and
DGX Spark GB10 (sm_121, ARM64). This is the first discrete x86_64 Blackwell
workstation data point — TL;DR: builds + runs correctly on sm_120, generation
throughput beats the posted Metal/Spark q2 numbers, but the 96 GB card is tight:
DeepSeek-V4-Flash q2 (80.76 GB) only runs full-residency with the fp16 dequant
cache disabled, and the heavier ds4_test sessions OOM.
Environment
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation, 96 GB (compute cap sm_120)
- Host: x86_64, Linux 6.12 (Rocky 10), driver 610.43.02 / CUDA UMD 13.3, ~50 GB host RAM
- Toolkit: CUDA 13.0.1 via nvidia/cuda:13.0.1-devel-ubuntu24.04 (host has runtime libs only, no nvcc; ds4 built in-container)
- Model: DeepSeek-V4-Flash q2-imatrix (~80.76 GB) from antirez/deepseek-v4-gguf
Build — clean on sm_120, one cosmetic Makefile note
make ds4 ds4-server ds4-bench ds4-eval ds4-agent ds4_test CUDA_ARCH=sm_120 —
nvcc (CUDA 13.0.1) compiled and linked everything, no source changes. 👍
Minor: CUDA_LDLIBS hardcodes -L$(CUDA_HOME)/targets/sbsa-linux/lib (ARM64).
On x86_64 that path is absent, so it's a dead -L (the link still succeeds via
lib64, with no warning). Suggest deriving the directory from $(shell uname -m):
x86_64 maps to targets/x86_64-linux/lib, while aarch64 has to keep mapping to
sbsa-linux, since uname -m never prints sbsa. Also, make cuda-generic uses
-arch=native (needs a visible GPU at build time); make cuda CUDA_ARCH=sm_120 is
the one to document for containerized/GPU-less build hosts.
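To make the path suggestion concrete, here is a minimal sketch of the arch-to-directory mapping, written in Python rather than Make just to show the logic. The table is my assumption about the usual CUDA toolkit layout (x86_64-linux on x86_64, sbsa-linux on ARM64 servers such as the Spark); `cuda_target_libdir` is a hypothetical helper, not something in the ds4 tree, and other layouts (e.g. Jetson) are deliberately left out.

```python
import platform
from typing import Optional

# Assumed mapping from `uname -m` output to the CUDA toolkit's targets/ subdirectory.
# sbsa-linux is what the Makefile hardcodes today; x86_64-linux is the usual
# x86_64 layout. Anything else is rejected rather than guessed.
CUDA_TARGET_DIRS = {
    "x86_64": "x86_64-linux",
    "aarch64": "sbsa-linux",
}


def cuda_target_libdir(cuda_home: str, machine: Optional[str] = None) -> str:
    """Return the targets/<arch>/lib directory for the given machine type."""
    machine = machine or platform.machine()
    try:
        target = CUDA_TARGET_DIRS[machine]
    except KeyError:
        raise ValueError(f"no known CUDA targets directory for {machine!r}") from None
    return f"{cuda_home}/targets/{target}/lib"


if __name__ == "__main__":
    for arch in ("x86_64", "aarch64"):
        print(arch, "->", cuda_target_libdir("/usr/local/cuda", arch))
```

The same two-entry lookup could be expressed in the Makefile with a conditional on $(shell uname -m); a plain string substitution would produce aarch64-linux on the Spark, which is not the directory currently used there.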
Correctness (ds4_test, sm_120) — PASSES
--server, --logprob-vectors, --local-golden-vectors,
--streaming-decode-prefill-correctness: OK. The official top-logprob /
golden continuation vectors match on sm_120, so the CUDA path is numerically
correct on Blackwell. (One vector, long_memory_archive, self-skips:
"API/official graph mismatch" — appears pre-existing, not sm_120-specific.)
--long-context, --think-tool-recovery, --tool-call-quality,
--mtp-verify-depth: FAIL, but purely OOM — they abort at
ds4_session_create(engine, 100000|32768) == 0 (and mtp-verify at model+MTP
load). Not numerical failures: the session/activation buffers for those tests
don't fit on top of 80.76 GB of resident weights in 96 GB. Loading MTP for
every test (we set DS4_TEST_MTP) tightened this further. These tests assume
more headroom than a 96 GB card has.
Memory / fit (the main finding for 96 GB discrete cards)
Full residency of the 80.76 GB q2 weights + the q8→fp16 dequant cache
(auto-budgets to 12 GB; ~5.7 GB grown) + the CUDA session workspace exceeds
96 GB → failed to create cuda session (a wall of CUDA tensor alloc failed).
Two knobs make it run fully resident:
- DS4_CUDA_Q8_F16_CACHE_MB=0: disable the fp16 dequant cache (falls back to
q8 kernels), frees ~6 GB → loads at ~91 GB peak. This is what made everything
below work.
- MTP off — the +3.5 GB head + draft state pushes back over 96 GB.
- KV is not a constraint: MLA keeps it tiny (~1 GB @ 28k ctx, ~2.4 GB @ 131k), so
large context is essentially free. SSD/KV-disk offload wasn't useful here (and
this host has no fast NVMe — virtiofs only).
So on 96 GB, ds4 runs but with both speed features (MTP, fp16 cache) off. A card
with ~8–12 GB more VRAM (or a slightly smaller quant) would keep them on.
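The fit arithmetic above can be sketched as a simple additive budget. This is only a back-of-the-envelope model using the figures reported in this post (91 GB observed peak with both features off, ~5.7 GB of grown fp16 cache, +3.5 GB for the MTP head); the additive assumption and the constant names are mine, the MTP draft state is not included because I did not measure it, and the card may expose somewhat less than the nominal 96 GB.

```python
# All figures in GB, taken from the measurements reported in this post.
CARD_GB = 96.0               # nominal VRAM; usable capacity may be slightly lower
PEAK_CACHE_OFF_GB = 91.0     # observed peak: q2 weights + session, cache off, MTP off
FP16_CACHE_GROWN_GB = 5.7    # observed growth of the q8->fp16 dequant cache
MTP_HEAD_GB = 3.5            # MTP head only; draft state is NOT included (unmeasured)


def projected_peak_gb(fp16_cache: bool, mtp: bool) -> float:
    """Naive additive estimate of peak VRAM for a given feature combination."""
    peak = PEAK_CACHE_OFF_GB
    if fp16_cache:
        peak += FP16_CACHE_GROWN_GB
    if mtp:
        peak += MTP_HEAD_GB
    return peak


if __name__ == "__main__":
    for fp16_cache in (False, True):
        for mtp in (False, True):
            peak = projected_peak_gb(fp16_cache, mtp)
            print(f"fp16_cache={fp16_cache!s:5} mtp={mtp!s:5} "
                  f"peak~{peak:6.1f} GB headroom~{CARD_GB - peak:+5.1f} GB")
```

In this simple sum the MTP-only case still lands a little under 96 GB, whereas in practice it went over; the gap is presumably the draft state and allocator overhead that the model leaves out, so treat the output as a lower bound rather than a prediction.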
Throughput (ds4-bench, q2, sm_120, fp16 cache off, no MTP, greedy)
30,493-token prompt, --ctx-start 2048 --ctx-max 28672 --step-mul 2 --gen-tokens 128:
| ctx | prefill t/s | gen t/s |
|---|---|---|
| 2048 | 362.3 | 45.9 |
| 4096 | 360.1 | 42.6 |
| 8192 | 358.0 | 41.6 |
| 16384 | 352.4 | 41.7 |
| 28672 | 345.5 | 40.6 |
Model load ~38 s. For comparison with the posted q2 numbers: M5 Max 87/463 t/s
prefill (short/long) & 34/26 t/s gen; DGX Spark GB10 344 t/s prefill & 13.8 t/s
gen. So this card's generation (~41–46 t/s) beats Metal and Spark, and prefill
(~350 t/s) ≈ Spark — and this is with the fp16 cache and MTP disabled, so the
ceiling is higher on a roomier card.
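For anyone who wants the comparison as ratios, a small sketch follows. It uses only the numbers quoted in this post; the posted Metal/Spark figures were not necessarily taken at the same context sizes as mine, so the helper (a hypothetical `speedup_range`) reports a loose worst-case/best-case range rather than a like-for-like speedup.

```python
# Generation and prefill throughput measured on the RTX PRO 6000 (t/s), keyed by ctx.
OURS_GEN = {2048: 45.9, 4096: 42.6, 8192: 41.6, 16384: 41.7, 28672: 40.6}
OURS_PREFILL = {2048: 362.3, 4096: 360.1, 8192: 358.0, 16384: 352.4, 28672: 345.5}

# Posted q2 numbers quoted in this post (t/s).
POSTED_GEN = {"M5 Max": (34.0, 26.0), "DGX Spark GB10": (13.8,)}
POSTED_PREFILL = {"DGX Spark GB10": (344.0,)}


def speedup_range(ours, theirs):
    """Return (worst case, best case) ratio of our throughput to theirs."""
    ours, theirs = list(ours), list(theirs)
    return min(ours) / max(theirs), max(ours) / min(theirs)


if __name__ == "__main__":
    for name, values in POSTED_GEN.items():
        lo, hi = speedup_range(OURS_GEN.values(), values)
        print(f"gen vs {name}: {lo:.2f}x to {hi:.2f}x")
    for name, values in POSTED_PREFILL.items():
        lo, hi = speedup_range(OURS_PREFILL.values(), values)
        print(f"prefill vs {name}: {lo:.2f}x to {hi:.2f}x")
```

With the figures above, generation comes out at roughly 1.2–1.8× the M5 Max and about 2.9–3.3× the Spark, while prefill is within about 5% of the Spark.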
Real-world note (optional, our use case)
We A/B'd DeepSeek-V4-Flash (via ds4) vs gpt-oss-120b (vLLM+EAGLE3) as the
orchestrator in a multi-tool due-diligence RAG pipeline (/v1/chat/completions,
131k ctx). Quality was near-parity on answered queries (LLM-judge 4.46 vs 4.52),
but end-to-end it was 3.3–4.5× slower — because synthesis emits many tokens at
~41–46 t/s (no MTP) and thinking-mode (deepseek-reasoner/default) exceeded our
per-call timeout on the heaviest 23-tool queries. deepseek-chat (no-think) was
100% reliable. Raw engine speed was fine; token count was the cost.
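To illustrate why token count dominated, here is a rough sketch of decode time as output tokens divided by generation rate. The rates are the measured range from the table; the output lengths are hypothetical round numbers picked for illustration, not measurements from our pipeline, and prefill and tool latency are ignored.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    slow_rate, fast_rate = 40.6, 45.9        # measured gen t/s range (no MTP)
    for tokens in (1_000, 4_000, 16_000):    # hypothetical output lengths
        worst = decode_seconds(tokens, slow_rate)
        best = decode_seconds(tokens, fast_rate)
        print(f"{tokens:>6} tokens: {best:6.1f} s to {worst:6.1f} s")
```

A thinking-mode call that emits several thousand reasoning tokens before the answer therefore spends minutes in decode alone, which is consistent with the timeouts we saw on the heaviest queries.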
Questions for the maintainer
- Any plan to bound the fp16 dequant cache automatically when free VRAM is low,
rather than needing DS4_CUDA_Q8_F16_CACHE_MB=0? (It already falls back to q8
kernels at runtime, but the startup growth + session alloc race to OOM.)
- The heavy ds4_test sessions (100k/32k ctx) assume headroom a 96 GB card
doesn't have with q2 resident: is that intended, or is a smaller default test ctx worthwhile?
- Is generation expected to scale further with MTP on sm_120, and is MTP's extra
VRAM expected to be ~3.5 GB + draft state?