Metrics & feedback on RTX PRO 6000 Blackwell (sm_120, x86_64) #449

The posted benchmarks cover Apple Metal (M5 Max, Mac Studio M3 Ultra) and
**DGX Spark GB10 (sm_121, ARM64)**. This is the first discrete x86_64 Blackwell
workstation data point — TL;DR: **builds + runs correctly on sm_120, generation
throughput beats the posted Metal/Spark q2 numbers, but the 96 GB card is tight:
DeepSeek-V4-Flash q2 (80.76 GB) only runs full-residency with the fp16 dequant
cache disabled, and the heavier `ds4_test` sessions OOM.**

## Environment
- GPU: **NVIDIA RTX PRO 6000 Blackwell Workstation, 96 GB** (compute cap **sm_120**)
- Host: x86_64, Linux 6.12 (Rocky 10), driver 610.43.02 / CUDA UMD 13.3, ~50 GB host RAM
- Toolkit: **CUDA 13.0.1** via `nvidia/cuda:13.0.1-devel-ubuntu24.04` (host has runtime libs only, no `nvcc`; ds4 built in-container)
- Model: `DeepSeek-V4-Flash` **q2-imatrix (~80.76 GB)** from `antirez/deepseek-v4-gguf`

## Build — clean on sm_120, one cosmetic Makefile note
`make ds4 ds4-server ds4-bench ds4-eval ds4-agent ds4_test CUDA_ARCH=sm_120` —
`nvcc` (CUDA 13.0.1) compiled and linked everything, no source changes. 👍
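Minor: `CUDA_LDLIBS` hardcodes `-L$(CUDA_HOME)/targets/sbsa-linux/lib` (ARM64).
On x86_64 that path is absent, so it is a dead `-L` (the link resolves via `lib64`, with no warning).
Suggest selecting the target directory by host architecture, e.g.
`targets/x86_64-linux/lib` on x86_64. A bare `targets/$(shell uname -m)-linux/lib`
would not be enough on its own: `uname -m` prints `aarch64` on ARM64, so the
existing `sbsa-linux` path needs an explicit mapping. Also, `make cuda-generic` uses
`-arch=native` (needs a visible GPU at build time); `make cuda CUDA_ARCH=sm_120` is the
one to document for containerized/GPU-less build hosts.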
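
The architecture-to-directory mapping can be sketched as follows, in Python for illustration only. The function name `cuda_target_libdir` and the `/usr/local/cuda` default are our assumptions, not ds4 code; `sbsa-linux` is taken from the existing Makefile, and `x86_64-linux` is the matching directory in the x86_64 toolkit layout.

```python
import platform

# Machine name (as printed by `uname -m`) -> CUDA toolkit target directory.
# ARM64 hosts report "aarch64", but the existing Makefile path is "sbsa-linux",
# so the machine name cannot be pasted into the path directly.
# Only the two hosts discussed here are covered; other toolkits may differ.
CUDA_TARGETS = {
    "x86_64": "x86_64-linux",
    "aarch64": "sbsa-linux",
}

def cuda_target_libdir(cuda_home="/usr/local/cuda", machine=None):
    """Return <cuda_home>/targets/<target>/lib, or None for an unknown host."""
    machine = machine or platform.machine()
    target = CUDA_TARGETS.get(machine)
    if target is None:
        return None  # caller would rely on lib64 alone
    return f"{cuda_home}/targets/{target}/lib"

if __name__ == "__main__":
    for m in ("x86_64", "aarch64"):
        print(m, "->", cuda_target_libdir(machine=m))
```

The same two-entry lookup could live in the Makefile as a conditional on `$(shell uname -m)`; the point is only that the mapping is a table, not a string substitution.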

## Correctness (`ds4_test`, sm_120) — PASSES
- `--server`, `--logprob-vectors`, `--local-golden-vectors`,
  `--streaming-decode-prefill-correctness`: **OK.** The official top-logprob /
  golden continuation vectors match on sm_120, so the CUDA path is numerically
  correct on Blackwell. (One vector, `long_memory_archive`, self-skips:
  "API/official graph mismatch" — appears pre-existing, not sm_120-specific.)
- `--long-context`, `--think-tool-recovery`, `--tool-call-quality`,
  `--mtp-verify-depth`: **FAIL, but purely OOM** — they abort at
  `ds4_session_create(engine, 100000|32768) == 0` (and `mtp-verify` at model+MTP
  load). Not numerical failures: the session/activation buffers for those tests
  don't fit on top of 80.76 GB of resident weights in 96 GB. Loading MTP for
  every test (we set `DS4_TEST_MTP`) tightened this further. **These tests assume
  more headroom than a 96 GB card has.**

## Memory / fit (the main finding for 96 GB discrete cards)
Full residency of the 80.76 GB q2 weights + the **q8→fp16 dequant cache**
(auto-budgets to 12 GB; ~5.7 GB grown) + the CUDA session workspace exceeds
96 GB → `failed to create cuda session` (a wall of `CUDA tensor alloc failed`).
Two knobs make it run **fully resident**:
- **`DS4_CUDA_Q8_F16_CACHE_MB=0`**: disables the fp16 dequant cache (falls back to
  q8 kernels) and frees ~6 GB, so the model loads at **~91 GB peak**. This is what
  made everything below work.
- **MTP off**: the +3.5 GB head plus draft state pushes usage back over 96 GB.

KV is *not* a constraint: MLA keeps it tiny (~1 GB @ 28k ctx, ~2.4 GB @ 131k), so
large context is essentially free. SSD/KV-disk offload wasn't useful here (and
this host has no fast NVMe, only virtiofs).

So on 96 GB, ds4 runs but with both speed features (MTP, fp16 cache) off. A card
with ~8–12 GB more VRAM (or a slightly smaller quant) would keep them on.
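
To make the arithmetic explicit, here is a rough budget sketch using only the figures quoted in this section. The "other" term (session workspace, KV, runtime overhead) is inferred as the ~91 GB observed peak minus the 80.76 GB of weights; it was not measured separately, and the MTP draft state is left out because we don't have a number for it.

```python
CARD_GB = 96.0            # nominal capacity as quoted in this report
WEIGHTS_GB = 80.76        # q2-imatrix weights
PEAK_NO_CACHE_GB = 91.0   # approximate observed peak, fp16 cache off, MTP off

# Inferred, not measured: everything resident besides the weights in that run.
OTHER_GB = PEAK_NO_CACHE_GB - WEIGHTS_GB

def peak_gb(fp16_cache_gb=0.0, mtp_gb=0.0):
    """Estimated peak VRAM for a given cache size and MTP head size."""
    return WEIGHTS_GB + OTHER_GB + fp16_cache_gb + mtp_gb

def fits(total_gb, capacity_gb=CARD_GB):
    """True if the estimated peak stays within the card's capacity."""
    return total_gb <= capacity_gb

if __name__ == "__main__":
    configs = {
        "cache off, MTP off": peak_gb(),
        "cache ~5.7 GB grown, MTP off": peak_gb(fp16_cache_gb=5.7),
        "cache at 12 GB budget, MTP off": peak_gb(fp16_cache_gb=12.0),
        "cache off, MTP head 3.5 GB (no draft state)": peak_gb(mtp_gb=3.5),
    }
    for name, total in configs.items():
        verdict = "fits" if fits(total) else "over"
        print(f"{name}: {total:.1f} GB ({verdict}, headroom {CARD_GB - total:+.1f} GB)")
```

Under these assumptions the MTP-only case lands just under the nominal 96 GB before draft state, which matches what we saw: it is the draft state on top that tips it over. Usable VRAM is also somewhat below the nominal figure, so treat the headroom values as optimistic.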

## Throughput (`ds4-bench`, q2, sm_120, **fp16 cache off, no MTP**, greedy)
30,493-token prompt, `--ctx-start 2048 --ctx-max 28672 --step-mul 2 --gen-tokens 128`:

| ctx | prefill t/s | gen t/s |
|---|---|---|
| 2048 | 362.3 | 45.9 |
| 4096 | 360.1 | 42.6 |
| 8192 | 358.0 | 41.6 |
| 16384 | 352.4 | 41.7 |
| 28672 | 345.5 | 40.6 |
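
For anyone reproducing the sweep, the row contexts above are consistent with a multiply-and-clamp schedule. This is a sketch of that schedule inferred from the table rows, not taken from the `ds4-bench` source; the function name `ctx_schedule` is ours.

```python
def ctx_schedule(ctx_start, ctx_max, step_mul):
    """Multiply the context by step_mul each step; clamp the last point to ctx_max.

    Assumed behavior, inferred from the contexts reported in the table.
    """
    if ctx_start <= 0 or ctx_start > ctx_max:
        raise ValueError("need 0 < ctx_start <= ctx_max")
    if step_mul <= 1:
        raise ValueError("step_mul must be greater than 1")
    points = []
    ctx = ctx_start
    while ctx < ctx_max:
        points.append(ctx)
        ctx *= step_mul
    points.append(ctx_max)  # final point is always the maximum itself
    return points

if __name__ == "__main__":
    print(ctx_schedule(2048, 28672, 2))
```

With the flags used here, the doubling would overshoot to 32768, so the last measured point is the 28672 cap.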

Model load ~38 s. For comparison with the posted q2 numbers: **M5 Max** 87/463 t/s
prefill (short/long) & 34/26 t/s gen; **DGX Spark GB10** 344 t/s prefill & 13.8 t/s
gen. So this card's **generation (~41–46 t/s) beats Metal and Spark**, and prefill
(~350 t/s) ≈ Spark — *and this is with the fp16 cache and MTP disabled*, so the
ceiling is higher on a roomier card.
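
The comparison can be reduced to ratios with a few lines. All inputs are the numbers quoted in this section; pairing the M5 Max "short" figure with our 2048 row and "long" with our 28672 row is our assumption, since the context lengths behind the posted Metal numbers are not restated here.

```python
# Our measurements: ctx -> (prefill t/s, gen t/s)
RTX = {
    2048: (362.3, 45.9),
    4096: (360.1, 42.6),
    8192: (358.0, 41.6),
    16384: (352.4, 41.7),
    28672: (345.5, 40.6),
}
SPARK_PREFILL, SPARK_GEN = 344.0, 13.8   # posted DGX Spark GB10 q2 numbers
M5_GEN_SHORT, M5_GEN_LONG = 34.0, 26.0   # posted M5 Max q2 gen numbers

def speedup(ours, theirs):
    """How many times faster `ours` is than `theirs`."""
    return ours / theirs

def drop_pct(first, last):
    """Percentage lost going from the first to the last sweep point."""
    return 100.0 * (first - last) / first

if __name__ == "__main__":
    gens = [g for _, g in RTX.values()]
    print(f"gen vs Spark: {speedup(min(gens), SPARK_GEN):.2f}x to {speedup(max(gens), SPARK_GEN):.2f}x")
    print(f"gen vs M5 Max short: {speedup(RTX[2048][1], M5_GEN_SHORT):.2f}x")
    print(f"gen vs M5 Max long:  {speedup(RTX[28672][1], M5_GEN_LONG):.2f}x")
    print(f"prefill vs Spark at 28672: {speedup(RTX[28672][0], SPARK_PREFILL):.2f}x")
    print(f"gen drop 2048 -> 28672: {drop_pct(RTX[2048][1], RTX[28672][1]):.1f}%")
    print(f"prefill drop 2048 -> 28672: {drop_pct(RTX[2048][0], RTX[28672][0]):.1f}%")
```

Generation comes out at roughly 3x Spark across the sweep, and both prefill and generation degrade only modestly between 2k and 28k context.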

## Real-world note (optional, our use case)
We A/B'd DeepSeek-V4-Flash (via ds4) vs `gpt-oss-120b` (vLLM+EAGLE3) as the
orchestrator in a multi-tool due-diligence RAG pipeline (`/v1/chat/completions`,
131k ctx). Quality was near-parity on answered queries (LLM-judge 4.46 vs 4.52),
but the ds4 side was 3.3–4.5× slower end-to-end: synthesis emits many tokens at
~41–46 t/s (no MTP), and thinking mode (`deepseek-reasoner`/default) exceeded our
per-call timeout on the heaviest 23-tool queries. `deepseek-chat` (no-think) was
100% reliable. Raw engine speed was fine; token count was the cost.
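
A back-of-envelope model shows why token count dominates. The default rates are the 28672-ctx row from our benchmark; the timeout and prompt size in the usage lines are hypothetical placeholders (we haven't listed our real values here), and the model ignores any prefix/KV reuse between calls.

```python
def call_seconds(prompt_tokens, output_tokens, prefill_tps=345.5, gen_tps=40.6):
    """Estimated wall time for one call: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

def max_output_tokens(timeout_s, prompt_tokens, prefill_tps=345.5, gen_tps=40.6):
    """Output tokens that fit in the timeout after prefill (0 if prefill alone exceeds it)."""
    remaining_s = timeout_s - prompt_tokens / prefill_tps
    return max(0, int(remaining_s * gen_tps))

if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    timeout_s, prompt_tokens = 300, 20_000
    print("output-token budget:", max_output_tokens(timeout_s, prompt_tokens))
    print("seconds for 4000 output tokens:", round(call_seconds(prompt_tokens, 4000), 1))
```

Any thinking-mode call whose reasoning plus answer exceeds that budget times out regardless of answer quality, which is the pattern we saw on the heaviest queries.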

## Questions for the maintainer
1. Any plan to bound the fp16 dequant cache automatically when free VRAM is low,
   rather than needing `DS4_CUDA_Q8_F16_CACHE_MB=0`? (It already falls back to q8
   kernels at runtime, but the startup growth + session alloc race to OOM.)
2. The heavy `ds4_test` sessions (100k/32k ctx) assume headroom a 96 GB card
   doesn't have with q2 resident — intended, or worth a smaller default test ctx?
3. Is generation expected to scale further with MTP on sm_120, and is MTP's extra
   VRAM expected to be ~3.5 GB + draft state?
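
To make question 1 concrete, this is the kind of policy we have in mind. It is purely a hypothetical sketch: the function, its parameters, and the thresholds are ours and do not describe how ds4 sizes the cache today. The only figures taken from this report are the 12 GB auto-budget and the fact that a budget of 0 disables the cache.

```python
def fp16_cache_budget_mb(free_mb, session_reserve_mb,
                         max_budget_mb=12 * 1024, min_useful_mb=512):
    """Pick an fp16 dequant cache budget from free VRAM measured after weights load.

    free_mb:            free VRAM once weights are resident
    session_reserve_mb: VRAM to hold back for the session workspace and KV
    max_budget_mb:      upper bound (12 GB is the auto-budget we observed)
    min_useful_mb:      below this, disable the cache rather than thrash (illustrative)
    """
    available_mb = free_mb - session_reserve_mb
    if available_mb < min_useful_mb:
        return 0  # same effect as DS4_CUDA_Q8_F16_CACHE_MB=0
    return min(available_mb, max_budget_mb)

if __name__ == "__main__":
    # Illustrative inputs: a roomy card, a tight card, and a card with no slack.
    for free_mb, reserve_mb in ((40_000, 10_000), (14_000, 10_000), (10_200, 10_000)):
        print(free_mb, reserve_mb, "->", fp16_cache_budget_mb(free_mb, reserve_mb))
```

Sizing the cache only after reserving session memory would avoid the startup race described in question 1, at the cost of needing an estimate of the session footprint up front.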

