Support SSD streaming for Q4_K routed experts on ROCm #451

Open
kmc6042 wants to merge 1 commit into antirez:main from kmc6042:q4k-ssd-streaming-rocm

Conversation

kmc6042 commented Jun 24, 2026

Summary

Enables running Q4_K routed-expert GGUFs under --ssd-streaming on the ROCm
(Strix Halo) backend, and warms the routed-expert cache so streaming decode is
meaningfully faster.

Before this change, the ROCm streaming MoE paths were gated to the
IQ2_XXS/Q2_K expert quant pair, so a Q4_K expert GGUF failed prefill with
"missing compact selected experts ... full expert table is not mapped" and could
not run under --ssd-streaming at all.

What changed

Route Q4_K through the quant-agnostic streaming machinery instead of the
IQ2-only selected/split kernels:

  • Prefill: allow the full-layer streaming path for Q4_K. It stages a whole
    layer's expert table into a contiguous buffer and runs the standard matmul, so
    it is used for any multi-token prefill (Q4_K has no batched selected-gather
    kernel).
  • Decode: route Q4_K through the shared-overlap selected-load path and force
    the selected-expert loader to build a full contiguous compact buffer, since the
    split decode kernels exist only for the IQ2_XXS/Q2_K pair.

Cache warm-up to speed up streaming:

  • Implement the previously stubbed ROCm seed_experts() as a real bulk
    sequential preload of the popularity hotlist into the resident cache — far
    cheaper than the scattered first-touch random reads it replaces. Read failures
    release the resident cache so partially-filled entries are never served as hits.
  • Allow the hotlist/prefill cache seed for Q4_K layers, and warm the cache at the
    start of decode-style prefill so short prompts benefit too.
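
As a rough illustration of the preload idea, the sketch below reads a hotlist in ascending file-offset order and drops the whole cache on any short read. It is not the ROCm seed_experts() implementation: the struct layouts, the function names, the use of stdio instead of the runtime's read jobs, and the choice to sort by offset are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical types: one hot expert = a byte range in the GGUF file. */
typedef struct { long offset; size_t bytes; } hot_expert;
typedef struct { long offset; unsigned char *data; size_t bytes; } cache_entry;
typedef struct { cache_entry *entries; size_t count; } resident_cache;

static int by_offset(const void *a, const void *b) {
    long x = ((const hot_expert *)a)->offset;
    long y = ((const hot_expert *)b)->offset;
    return (x > y) - (x < y);
}

/* Free every entry so nothing partially filled can be served as a hit. */
static void cache_release(resident_cache *c) {
    for (size_t i = 0; i < c->count; i++) free(c->entries[i].data);
    free(c->entries);
    c->entries = NULL;
    c->count = 0;
}

/* Returns 1 on success, 0 on failure (the cache is left empty on failure).
 * Assumes every hot[i].bytes is greater than zero. */
static int seed_cache(FILE *f, hot_expert *hot, size_t n, resident_cache *c) {
    assert(f && c && c->entries == NULL && c->count == 0);
    if (n == 0) return 1;
    /* Ascending offsets turn scattered reads into one forward pass. */
    qsort(hot, n, sizeof *hot, by_offset);
    c->entries = calloc(n, sizeof *c->entries);
    if (!c->entries) return 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char *buf = malloc(hot[i].bytes);
        int ok = buf != NULL &&
                 fseek(f, hot[i].offset, SEEK_SET) == 0 &&
                 fread(buf, 1, hot[i].bytes, f) == hot[i].bytes;
        if (!ok) {
            free(buf);
            cache_release(c);   /* read failure: release everything */
            return 0;
        }
        c->entries[c->count].offset = hot[i].offset;
        c->entries[c->count].data = buf;
        c->entries[c->count].bytes = hot[i].bytes;
        c->count++;
    }
    return 1;
}
```

The all-or-nothing release mirrors the behaviour stated above: a failed read leaves no entries behind, so later lookups fall back to ordinary first-touch loads.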

Testing

On an AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151), 123 GiB RAM, with the
153 GiB DeepSeek-V4-Flash-Q4KExperts-... GGUF on NVMe:

  • Before: prefill aborts with "missing compact selected experts".
  • After: runs end to end with correct output for both the short-prompt decode
    path and the long-prompt (>64 token) full-layer prefill path.
  • Popularity preload cuts decode cache misses ~50% on identical prompts
    (e.g. 3432 → 1720 misses), measured via DS4_ROCM_STREAM_CACHE_STATS=1.
  • IQ2 streaming path re-verified (no regression); q4k-dot unit test passes.
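
For reference, the "~50%" figure follows directly from the two miss counts quoted above. The helper below is a hypothetical convenience function, not part of the PR; it only reproduces the arithmetic.

```c
#include <assert.h>

/* Percentage reduction in cache misses between two runs of the same prompt. */
static double miss_reduction_pct(unsigned long before, unsigned long after) {
    assert(before > 0 && after <= before);
    return 100.0 * (double)(before - after) / (double)before;
}
```

With the counts from the example run, `miss_reduction_pct(3432, 1720)` is 1712 / 3432, or about 49.9%.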

Escape hatches

  • DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP=1 disables the Q4_K decode path.
  • Existing --ssd-streaming-cold / DS4_METAL_DISABLE_STREAMING_EXPERT_HOTLIST
    skip the preload.
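
The gating predicate in this PR tests the decode-path variable with `getenv(...) == NULL`, so the variable's presence is what matters, not its value. A minimal sketch of that convention follows; the helper names `env_flag_set` and `q4_overlap_enabled` are hypothetical and do not appear in the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* True when the variable exists in the environment, whatever its value. */
static bool env_flag_set(const char *name) {
    assert(name != NULL);
    return getenv(name) != NULL;
}

/* The Q4_K decode path stays enabled unless the escape hatch is present. */
static bool q4_overlap_enabled(void) {
    return !env_flag_set("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP");
}
```

One consequence of a presence-only check is that setting the variable to `0` or to an empty string still disables the path; it has to be unset to re-enable it.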

🤖 Generated with Claude Code

The ROCm streaming MoE paths were gated to the IQ2_XXS/Q2_K expert quant
pair, so Q4_K expert GGUFs failed prefill with "missing compact selected
experts" and could not run under --ssd-streaming at all.

Route Q4_K through the quant-agnostic machinery instead of the IQ2-only
selected/split kernels:

- Prefill: allow the full-layer streaming path for Q4_K. It stages a whole
  layer's expert table contiguously and runs the standard matmul, so use it
  for any multi-token prefill since Q4_K has no batched selected-gather kernel.
- Decode: route Q4_K through the shared-overlap selected-load path and force
  the selected-expert loader to build a full contiguous compact buffer, since
  the split decode kernels only exist for the IQ2_XXS/Q2_K pair.

Also speed up Q4_K streaming by warming the routed-expert cache from the
popularity hotlist:

- Implement the previously stubbed ROCm seed_experts() as a real bulk
  sequential preload into the resident cache, which is far cheaper than the
  scattered first-touch random reads it replaces. Read failures release the
  resident cache so partially-filled entries are never served as hits.
- Allow the hotlist/prefill cache seed for Q4_K layers, and warm the cache at
  the start of decode-style prefill so short prompts benefit too.

On an AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151) with the 153 GiB Q4_K
DeepSeek-V4-Flash GGUF and 123 GiB RAM, this takes the model from failing to
start to producing correct output, and the preload cuts decode cache misses
roughly in half.

Escape hatches: DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP=1 plus the
existing --ssd-streaming-cold / DS4_METAL_DISABLE_STREAMING_EXPERT_HOTLIST.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 24, 2026 14:34

Copilot AI left a comment

Pull request overview

Enables SSD streaming execution for ROCm (Strix Halo) routed-expert GGUFs quantized as Q4_K by routing Q4_K through quant-agnostic streaming paths and adding an expert-cache warm-up pass to reduce first-touch decode misses.

Changes:

  • Adds a ROCm selected-expert loader mode that forces contiguous compact buffers (avoids IQ2-only split decode paths).
  • Extends ROCm streaming prefill/decode routing to support Q4_K, including full-layer prefill enablement for multi-token prompts.
  • Implements popularity-based expert cache seeding on ROCm and triggers warm-up for short decode-style prefill.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File | Description
rocm/ds4_rocm_runtime.cuh | Adds g_stream_selected_force_contiguous and gates the async split pending path for selected-expert loads.
rocm/ds4_rocm_current_api_compat.cuh | Exposes a setter for the contiguous mode and implements ROCm seed_experts() warm-up via bulk sequential reads.
ds4.c | Updates ROCm streaming routing for Q4_K (prefill + decode), enables cache warm-up at decode-style prefill start, and broadens seeding applicability.
ds4_gpu.h | Adds a public GPU API declaration for the new contiguous-mode toggle.

Comment thread ds4_gpu.h
Comment on lines 73 to +77
void ds4_gpu_set_quality(bool quality);
void ds4_gpu_set_ssd_streaming(bool enabled);
void ds4_gpu_set_streaming_expert_cache_budget(uint32_t experts);
void ds4_gpu_set_streaming_expert_cache_expert_bytes(uint64_t bytes);
void ds4_gpu_stream_set_selected_force_contiguous(int enabled);
Comment thread ds4.c
Comment on lines +13644 to +13660
static bool metal_graph_use_rocm_q4_selected_shared_overlap(
        const ds4_gpu_graph *g,
        const ds4_layer_weights *layer) {
    return g &&
           g->ssd_streaming &&
           !g->quality &&
           layer &&
           layer->ffn_gate_exps &&
           layer->ffn_up_exps &&
           layer->ffn_down_exps &&
           layer->ffn_gate_exps->type == DS4_TENSOR_Q4_K &&
           layer->ffn_up_exps->type == DS4_TENSOR_Q4_K &&
           layer->ffn_down_exps->type == DS4_TENSOR_Q4_K &&
           DS4_N_EXPERT_USED == 6 &&
           DS4_N_EXPERT >= 128 &&
           getenv("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP") == NULL;
}
Comment on lines +420 to +442
        jobs[job_count++] = {entry.gate, gate_offset + gate_rel, gate_expert_bytes,
                             NULL, NULL, 0, 0, 0};
        jobs[job_count++] = {entry.up, up_offset + gate_rel, gate_expert_bytes,
                             NULL, NULL, 0, 0, 0};
        jobs[job_count++] = {entry.down, down_offset + down_rel, down_expert_bytes,
                             NULL, NULL, 0, 0, 0};
    }

    if (job_count != 0) {
        const int flushed =
            cuda_stream_read_jobs_parallel(jobs, job_count) &&
            cuda_stream_selected_upload_read_jobs(jobs, job_count);
        cuda_stream_read_jobs_free(jobs, job_count);
        if (!flushed) {
            cuda_stream_resident_cache_release();
            free(jobs);
            return 1;
        }
    }
    if (cuda_stream_cache_stats_on()) {
        g_stream_cache_stats.seed_calls++;
        g_stream_cache_stats.seed_unique += n_experts;
    }