Metrics & feedback on RTX PRO 6000 Blackwell (sm_120, x86_64) #449

Description

@bryanvine

The posted benchmarks cover Apple Metal (M5 Max, Mac Studio M3 Ultra) and
DGX Spark GB10 (sm_121, ARM64). This is the first discrete x86_64 Blackwell
workstation data point — TL;DR: builds + runs correctly on sm_120, generation
throughput beats the posted Metal/Spark q2 numbers, but the 96 GB card is tight:
DeepSeek-V4-Flash q2 (80.76 GB) only runs full-residency with the fp16 dequant
cache disabled, and the heavier ds4_test sessions OOM.

Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation, 96 GB (compute cap sm_120)
  • Host: x86_64, Linux 6.12 (Rocky 10), driver 610.43.02 / CUDA UMD 13.3, ~50 GB host RAM
  • Toolkit: CUDA 13.0.1 via nvidia/cuda:13.0.1-devel-ubuntu24.04 (host has runtime libs only, no nvcc; ds4 built in-container)
  • Model: DeepSeek-V4-Flash q2-imatrix (~80.76 GB) from antirez/deepseek-v4-gguf

Build — clean on sm_120, one cosmetic Makefile note

make ds4 ds4-server ds4-bench ds4-eval ds4-agent ds4_test CUDA_ARCH=sm_120
nvcc (CUDA 13.0.1) compiled and linked everything, no source changes. 👍
Minor: CUDA_LDLIBS hardcodes -L$(CUDA_HOME)/targets/sbsa-linux/lib (ARM64).
On x86_64 that path is absent, so it's a dead -L (the link succeeds via lib64,
with no warning). Suggest deriving it from the host arch, e.g.
targets/$(shell uname -m)-linux/lib, while keeping sbsa-linux for ARM64, since
uname -m reports aarch64 there rather than sbsa. Also, make cuda-generic uses
-arch=native, which needs a visible GPU at build time; make cuda CUDA_ARCH=sm_120
is the one to document for containerized/GPU-less build hosts.
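To make the path suggestion concrete, here is a minimal Python sketch of the arch-to-directory mapping a Makefile would need. The function name and the fallback to lib64 are hypothetical, and the aarch64 to sbsa-linux mapping is an assumption based on the path the Makefile currently hardcodes; other ARM64 toolkit layouts may differ.

```python
import platform

# Assumed mapping from `uname -m` output to the CUDA toolkit targets/ subdirectory.
# aarch64 -> sbsa-linux mirrors what the Makefile hardcodes today (assumption).
CUDA_TARGET_DIRS = {
    "x86_64": "x86_64-linux",
    "aarch64": "sbsa-linux",
}


def cuda_lib_dir(cuda_home, machine=None):
    """Return the library directory to pass to -L for the given machine type."""
    machine = machine or platform.machine()
    target = CUDA_TARGET_DIRS.get(machine)
    if target is None:
        # Unknown arch: fall back to the generic lib64 directory (hypothetical choice).
        return f"{cuda_home}/lib64"
    return f"{cuda_home}/targets/{target}/lib"


if __name__ == "__main__":
    print(cuda_lib_dir("/usr/local/cuda"))
```

The same two-entry mapping could be written directly in the Makefile with a conditional on $(shell uname -m); the point is only that a plain substitution is not enough for ARM64.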

Correctness (ds4_test, sm_120): numerical checks PASS, large-session tests OOM

  • --server, --logprob-vectors, --local-golden-vectors,
    --streaming-decode-prefill-correctness: OK. The official top-logprob /
    golden continuation vectors match on sm_120, so the CUDA path is numerically
    correct on Blackwell. (One vector, long_memory_archive, self-skips:
    "API/official graph mismatch" — appears pre-existing, not sm_120-specific.)
  • --long-context, --think-tool-recovery, --tool-call-quality,
    --mtp-verify-depth: FAIL, but purely OOM — they abort at
    ds4_session_create(engine, 100000|32768) == 0 (and mtp-verify at model+MTP
    load). Not numerical failures: the session/activation buffers for those tests
    don't fit on top of 80.76 GB of resident weights in 96 GB. Loading MTP for
    every test (we set DS4_TEST_MTP) tightened this further. These tests assume
    more headroom than a 96 GB card has.

Memory / fit (the main finding for 96 GB discrete cards)

Full residency of the 80.76 GB q2 weights + the q8→fp16 dequant cache
(auto-budgets to 12 GB; ~5.7 GB grown) + the CUDA session workspace exceeds
96 GB, so session creation fails (failed to create cuda session, preceded by a
wall of CUDA tensor alloc failed messages). Two knobs make it run fully resident;
the third point below covers why KV is not one of them:

  • DS4_CUDA_Q8_F16_CACHE_MB=0 — disable the fp16 dequant cache (falls back to
    q8 kernels), frees ~6 GB → loads at ~91 GB peak. This is what made everything
    below work.
  • MTP off: the extra ~3.5 GB MTP head plus draft state pushes the total back over 96 GB.
  • KV is not a constraint: MLA keeps it tiny (~1 GB @ 28k ctx, ~2.4 GB @ 131k), so
    large context is essentially free. SSD/KV-disk offload wasn't useful here (and
    this host has no fast NVMe — virtiofs only).
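For a rough feel of KV size at other context lengths, the two reported points can be joined with a straight line. This is only a two-point interpolation sketch: it assumes "28k" and "131k" mean 28672 and 131072 tokens and that growth between them is linear, neither of which was measured here.

```python
# Two reported data points: (context tokens, KV size in GB).
# Assumption: "28k" and "131k" correspond to 28672 and 131072 tokens.
P1 = (28_672, 1.0)
P2 = (131_072, 2.4)


def kv_gb(ctx_tokens):
    """Linear interpolation (or extrapolation) through the two reported points."""
    slope = (P2[1] - P1[1]) / (P2[0] - P1[0])  # GB per token
    return P1[1] + slope * (ctx_tokens - P1[0])


if __name__ == "__main__":
    for ctx in (28_672, 65_536, 131_072):
        print(f"{ctx:>7} tokens -> ~{kv_gb(ctx):.2f} GB")
```

Both reported sizes are approximate, so anything outside the 28k to 131k range is a guess.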

So on 96 GB, ds4 runs but with both speed features (MTP, fp16 cache) off. A card
with ~8–12 GB more VRAM (or a slightly smaller quant) would keep them on.
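The fit arithmetic can be checked with a small budget sketch. It uses only the figures reported above (96 GB capacity, ~91 GB peak with the cache and MTP off, ~5.7 GB of grown cache, ~3.5 GB MTP head) and treats them as additive, which is an assumption; MTP draft state is not quantified here and is left out.

```python
CAPACITY_GB = 96.0
PEAK_CACHE_OFF_GB = 91.0    # reported peak with fp16 cache off and MTP off
FP16_CACHE_GROWN_GB = 5.7   # reported grown size of the dequant cache
MTP_HEAD_GB = 3.5           # reported MTP head size; draft state NOT included


def headroom_gb(*extras_gb):
    """Remaining VRAM after adding optional components to the measured peak."""
    return CAPACITY_GB - (PEAK_CACHE_OFF_GB + sum(extras_gb))


CONFIGS = {
    "cache off, MTP off": (),
    "cache on,  MTP off": (FP16_CACHE_GROWN_GB,),
    "cache off, MTP head only": (MTP_HEAD_GB,),
    "cache on,  MTP head only": (FP16_CACHE_GROWN_GB, MTP_HEAD_GB),
}

if __name__ == "__main__":
    for name, extras in CONFIGS.items():
        print(f"{name:<26} headroom {headroom_gb(*extras):+.1f} GB")
```

Under these assumptions the cache alone overshoots by well under 1 GB, and the MTP head alone leaves about 1.5 GB, so draft state would have to account for the rest of the reported overrun.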

Throughput (ds4-bench, q2, sm_120, fp16 cache off, no MTP, greedy)

30,493-token prompt, --ctx-start 2048 --ctx-max 28672 --step-mul 2 --gen-tokens 128:

ctx      prefill t/s   gen t/s
2048     362.3         45.9
4096     360.1         42.6
8192     358.0         41.6
16384    352.4         41.7
28672    345.5         40.6

Model load ~38 s. For comparison with the posted q2 numbers: M5 Max 87/463 t/s
prefill (short/long) & 34/26 t/s gen; DGX Spark GB10 344 t/s prefill & 13.8 t/s
gen. So this card's generation (~41–46 t/s) beats Metal and Spark, and prefill
(~350 t/s) ≈ Spark — and this is with the fp16 cache and MTP disabled, so the
ceiling is higher on a roomier card.
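The comparison can be reproduced from the table with a few lines of Python. The sketch below computes how much throughput drops between the smallest and largest context and the generation speedup over the posted Spark figure; it uses only numbers quoted above.

```python
# (ctx, prefill t/s, gen t/s) rows from the ds4-bench table above.
RUNS = [
    (2048, 362.3, 45.9),
    (4096, 360.1, 42.6),
    (8192, 358.0, 41.6),
    (16384, 352.4, 41.7),
    (28672, 345.5, 40.6),
]
SPARK_GEN_TPS = 13.8  # posted DGX Spark GB10 q2 generation figure


def pct_drop(first, last):
    """Percentage decrease from first to last."""
    return 100.0 * (first - last) / first


prefill_drop = pct_drop(RUNS[0][1], RUNS[-1][1])
gen_drop = pct_drop(RUNS[0][2], RUNS[-1][2])
spark_speedups = [gen / SPARK_GEN_TPS for _, _, gen in RUNS]

if __name__ == "__main__":
    print(f"prefill drop 2k -> 28k ctx: {prefill_drop:.1f}%")
    print(f"gen drop     2k -> 28k ctx: {gen_drop:.1f}%")
    print(f"gen speedup vs Spark: {min(spark_speedups):.2f}x to {max(spark_speedups):.2f}x")
```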

Real-world note (optional, our use case)

We A/B'd DeepSeek-V4-Flash (via ds4) vs gpt-oss-120b (vLLM+EAGLE3) as the
orchestrator in a multi-tool due-diligence RAG pipeline (/v1/chat/completions,
131k ctx). Quality was near parity on answered queries (LLM-judge 4.46 vs 4.52),
but the ds4 setup was 3.3–4.5× slower end to end, because synthesis emits many
tokens at ~41–46 t/s (no MTP). Separately, thinking mode
(deepseek-reasoner/default) exceeded our per-call timeout on the heaviest
23-tool queries, while deepseek-chat (no-think) was 100% reliable. Raw engine
speed was fine; token count was the cost.
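A back-of-envelope latency model shows why token count dominates. The sketch assumes a call costs prompt prefill plus generation at the rates measured above, with no prompt caching and no tool latency; the example token counts are hypothetical, not figures from our pipeline.

```python
def call_seconds(prompt_tokens, output_tokens, prefill_tps=350.0, gen_tps=41.0):
    """Estimated wall time for one call: prefill time plus generation time.

    Default rates are the approximate ds4-bench figures reported above.
    """
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


if __name__ == "__main__":
    # Hypothetical workloads: same prompt, short answer vs long thinking output.
    for out_tokens in (500, 4_000):
        secs = call_seconds(30_000, out_tokens)
        print(f"30000 prompt + {out_tokens} output tokens -> ~{secs:.0f} s")
```

With these assumed sizes, generation time overtakes prefill time once output grows to a few thousand tokens, which fits the observation that thinking mode is what hit the timeout.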

Questions for the maintainer

  1. Any plan to bound the fp16 dequant cache automatically when free VRAM is low,
    rather than needing DS4_CUDA_Q8_F16_CACHE_MB=0? (It already falls back to q8
    kernels at runtime, but the startup growth + session alloc race to OOM.)
  2. The heavy ds4_test sessions (100k/32k ctx) assume headroom a 96 GB card
    doesn't have with q2 resident — intended, or worth a smaller default test ctx?
  3. Is generation expected to scale further with MTP on sm_120, and is MTP's extra
    VRAM expected to be ~3.5 GB + draft state?
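To illustrate what question 1 is asking for, here is a sketch of one possible bounding policy. It is not ds4's actual logic: the function name, the reserve parameter, and the choice to clamp at zero are all hypothetical. In a real CUDA program the free-memory figure would come from something like cudaMemGetInfo.

```python
def bounded_cache_mb(free_mb, session_reserve_mb, default_budget_mb=12 * 1024):
    """Pick a dequant-cache budget that leaves room for the session workspace.

    free_mb:            free VRAM after weights are resident
    session_reserve_mb: VRAM to hold back for session/activation buffers
    default_budget_mb:  the usual auto-budget (12 GB per the report above)
    Returns 0 when there is no room, which would mean falling back to q8 kernels.
    """
    available = free_mb - session_reserve_mb
    return max(0, min(default_budget_mb, available))


if __name__ == "__main__":
    # Hypothetical numbers: ~15 GB free after weights, ~10 GB held for the session.
    print(bounded_cache_mb(free_mb=15_000, session_reserve_mb=10_000))
```

The idea is that the cache shrinks to whatever is left after the session reservation, so startup growth could not race the session allocation into OOM.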
