Support SSD streaming for Q4_K routed experts on ROCm by kmc6042 · Pull Request #451 · antirez/ds4

kmc6042 · 2026-06-24T14:34:05Z

Summary

Enables running Q4_K routed-expert GGUFs under --ssd-streaming on the ROCm
(Strix Halo) backend, and warms the routed-expert cache so streaming decode
takes roughly half as many cache misses on identical prompts (see Testing).

Before this change the ROCm streaming MoE paths were gated to the
IQ2_XXS/Q2_K expert quant pair, so a Q4_K expert GGUF failed prefill with
"missing compact selected experts ... full expert table is not mapped" and
could not run under --ssd-streaming at all.

What changed

Route Q4_K through the quant-agnostic streaming machinery instead of the
IQ2-only selected/split kernels:

- Prefill: allow the full-layer streaming path for Q4_K. It stages a whole
  layer's expert table into a contiguous buffer and runs the standard matmul, so
  it is used for any multi-token prefill (Q4_K has no batched selected-gather
  kernel).
- Decode: route Q4_K through the shared-overlap selected-load path and force
  the selected-expert loader to build a full contiguous compact buffer, since the
  split decode kernels exist only for the IQ2_XXS/Q2_K pair.

Cache warm-up to speed up streaming:

- Implement the previously stubbed ROCm seed_experts() as a real bulk
  sequential preload of the popularity hotlist into the resident cache, which is
  far cheaper than the scattered first-touch random reads it replaces. Read
  failures release the resident cache so partially-filled entries are never
  served as hits.
- Allow the hotlist/prefill cache seed for Q4_K layers, and warm the cache at the
  start of decode-style prefill so short prompts benefit too.
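
A minimal host-side sketch of the preload idea, assuming a plain file and heap buffers in place of the real GPU-resident cache. The types and functions here (hot_expert, resident_cache, seed_cache) are hypothetical and are not the ROCm implementation; the sketch only shows the two properties named above: reads are issued in ascending file order, and any failed read drops the whole cache rather than leaving a partial entry behind.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types for illustration; a real cache would also key each
 * slot by layer and expert id. Assumes bytes > 0 for every entry. */
typedef struct {
    long   offset;   /* byte offset of the expert's weights in the file */
    size_t bytes;    /* size of the expert's weights */
} hot_expert;

typedef struct {
    unsigned char **slots;  /* one heap buffer per cached expert */
    size_t          count;  /* number of valid slots; 0 means empty */
} resident_cache;

static int by_offset(const void *a, const void *b) {
    long x = ((const hot_expert *)a)->offset;
    long y = ((const hot_expert *)b)->offset;
    return (x > y) - (x < y);
}

static void cache_release(resident_cache *c) {
    for (size_t i = 0; i < c->count; i++) free(c->slots[i]);
    free(c->slots);
    c->slots = NULL;
    c->count = 0;
}

/* Returns 1 on success, 0 on failure. On failure the cache is left empty,
 * so a partially read expert can never be served as a hit. */
static int seed_cache(FILE *f, hot_expert *hot, size_t n, resident_cache *c) {
    c->slots = NULL;
    c->count = 0;
    if (n == 0) return 1;
    /* Sort by file offset so the reads sweep the file front to back. */
    qsort(hot, n, sizeof *hot, by_offset);
    c->slots = calloc(n, sizeof *c->slots);
    if (!c->slots) return 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char *buf = malloc(hot[i].bytes);
        int ok = buf != NULL &&
                 fseek(f, hot[i].offset, SEEK_SET) == 0 &&
                 fread(buf, 1, hot[i].bytes, f) == hot[i].bytes;
        if (!ok) {
            free(buf);
            cache_release(c);
            return 0;
        }
        c->slots[c->count++] = buf;
    }
    return 1;
}
```

Releasing everything on the first failure is the simplest way to keep the invariant that every cached entry is complete; a finer-grained design could drop only the failed entry, at the cost of tracking per-slot validity.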

Testing

On an AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151), 123 GiB RAM, with the
153 GiB DeepSeek-V4-Flash-Q4KExperts-... GGUF on NVMe:

- Before: prefill aborts with "missing compact selected experts".
- After: runs end to end with correct output for both the short-prompt decode
  path and the long-prompt (>64 token) full-layer prefill path.
- Popularity preload cuts decode cache misses ~50% on identical prompts
  (e.g. 3432 → 1720 misses), measured via DS4_ROCM_STREAM_CACHE_STATS=1.
- IQ2 streaming path re-verified (no regression); q4k-dot unit test passes.
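
For reference, the "~50%" figure follows directly from the two miss counts quoted above. The helper below is a hypothetical convenience for comparing two runs' miss counters and is not part of the PR.

```c
/* Hypothetical helper: percentage reduction in cache misses between a
 * baseline run and a run with the preload enabled. */
static double miss_reduction_pct(unsigned long before, unsigned long after) {
    if (before == 0) return 0.0;  /* nothing to reduce */
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

With the quoted counts, miss_reduction_pct(3432, 1720) is 100 × 1712 / 3432, or about 49.9%.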

Escape hatches

- DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP=1 disables the Q4_K decode path.
- Existing --ssd-streaming-cold / DS4_METAL_DISABLE_STREAMING_EXPERT_HOTLIST
  skip the preload.
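
Judging from the predicate in the diff, which tests getenv("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP") == NULL, the switch appears to be presence-based: any value, including 0 or an empty string, would disable the path. A minimal sketch of that check, with a hypothetical function name:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical wrapper around the same getenv() test used in the diff.
 * The path is enabled only when the variable is not set at all. */
static bool q4_selected_overlap_enabled(void) {
    return getenv("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP") == NULL;
}
```

To re-enable the path after testing, unset the variable rather than setting it to 0.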

🤖 Generated with Claude Code

The ROCm streaming MoE paths were gated to the IQ2_XXS/Q2_K expert quant pair, so Q4_K expert GGUFs failed prefill with "missing compact selected experts" and could not run under --ssd-streaming at all.

Route Q4_K through the quant-agnostic machinery instead of the IQ2-only selected/split kernels:

- Prefill: allow the full-layer streaming path for Q4_K. It stages a whole layer's expert table contiguously and runs the standard matmul, so use it for any multi-token prefill since Q4_K has no batched selected-gather kernel.
- Decode: route Q4_K through the shared-overlap selected-load path and force the selected-expert loader to build a full contiguous compact buffer, since the split decode kernels only exist for the IQ2_XXS/Q2_K pair.

Also speed up Q4_K streaming by warming the routed-expert cache from the popularity hotlist:

- Implement the previously stubbed ROCm seed_experts() as a real bulk sequential preload into the resident cache, which is far cheaper than the scattered first-touch random reads it replaces. Read failures release the resident cache so partially-filled entries are never served as hits.
- Allow the hotlist/prefill cache seed for Q4_K layers, and warm the cache at the start of decode-style prefill so short prompts benefit too.

On an AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151) with the 153 GiB Q4_K DeepSeek-V4-Flash GGUF and 123 GiB RAM, this takes the model from failing to start to producing correct output, and the preload cuts decode cache misses roughly in half.

Escape hatches: DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP=1 plus the existing --ssd-streaming-cold / DS4_METAL_DISABLE_STREAMING_EXPERT_HOTLIST.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Enables SSD streaming execution for ROCm (Strix Halo) routed-expert GGUFs quantized as Q4_K by routing Q4_K through quant-agnostic streaming paths and adding an expert-cache warm-up pass to reduce first-touch decode misses.

Changes:

- Adds a ROCm selected-expert loader mode that forces contiguous compact buffers (avoids IQ2-only split decode paths).
- Extends ROCm streaming prefill/decode routing to support Q4_K, including full-layer prefill enablement for multi-token prompts.
- Implements popularity-based expert cache seeding on ROCm and triggers warm-up for short decode-style prefill.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| rocm/ds4_rocm_runtime.cuh | Adds `g_stream_selected_force_contiguous` and gates the async split pending path for selected-expert loads. |
| rocm/ds4_rocm_current_api_compat.cuh | Exposes a setter for the contiguous mode and implements ROCm `seed_experts()` warm-up via bulk sequential reads. |
| ds4.c | Updates ROCm streaming routing for Q4_K (prefill + decode), enables cache warm-up at decode-style prefill start, and broadens seeding applicability. |
| ds4_gpu.h | Adds a public GPU API declaration for the new contiguous-mode toggle. |


 void ds4_gpu_set_quality(bool quality);
 void ds4_gpu_set_ssd_streaming(bool enabled);
 void ds4_gpu_set_streaming_expert_cache_budget(uint32_t experts);
 void ds4_gpu_set_streaming_expert_cache_expert_bytes(uint64_t bytes);
+void ds4_gpu_stream_set_selected_force_contiguous(int enabled);


+static bool metal_graph_use_rocm_q4_selected_shared_overlap(
+        const ds4_gpu_graph     *g,
+        const ds4_layer_weights *layer) {
+    return g &&
+           g->ssd_streaming &&
+           !g->quality &&
+           layer &&
+           layer->ffn_gate_exps &&
+           layer->ffn_up_exps &&
+           layer->ffn_down_exps &&
+           layer->ffn_gate_exps->type == DS4_TENSOR_Q4_K &&
+           layer->ffn_up_exps->type == DS4_TENSOR_Q4_K &&
+           layer->ffn_down_exps->type == DS4_TENSOR_Q4_K &&
+           DS4_N_EXPERT_USED == 6 &&
+           DS4_N_EXPERT >= 128 &&
+           getenv("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP") == NULL;
+}


+        jobs[job_count++] = {entry.gate, gate_offset + gate_rel, gate_expert_bytes,
+                             NULL, NULL, 0, 0, 0};
+        jobs[job_count++] = {entry.up, up_offset + gate_rel, gate_expert_bytes,
+                             NULL, NULL, 0, 0, 0};
+        jobs[job_count++] = {entry.down, down_offset + down_rel, down_expert_bytes,
+                             NULL, NULL, 0, 0, 0};
+    }
+
+    if (job_count != 0) {
+        const int flushed =
+            cuda_stream_read_jobs_parallel(jobs, job_count) &&
+            cuda_stream_selected_upload_read_jobs(jobs, job_count);
+        cuda_stream_read_jobs_free(jobs, job_count);
+        if (!flushed) {
+            cuda_stream_resident_cache_release();
+            free(jobs);
+            return 1;
+        }
+    }
+    if (cuda_stream_cache_stats_on()) {
+        g_stream_cache_stats.seed_calls++;
+        g_stream_cache_stats.seed_unique += n_experts;
+    }


Copilot AI review requested due to automatic review settings June 24, 2026 14:34

Copilot started reviewing on behalf of kmc6042 June 24, 2026 14:34

Copilot AI reviewed Jun 24, 2026

kmc6042 wants to merge 1 commit into antirez:main from kmc6042:q4k-ssd-streaming-rocm
