Adapt with #20522 #26877 in mamba by IzacharyI · Pull Request #288 · zejunchen-zejun/sglang

IzacharyI · 2026-06-13T10:15:28Z

Motivation

Fix Qwen3.5 GDN linear-attention issues when prefix/radix cache and HIP/flydsl decode are enabled.
This PR addresses two problems:

- Accuracy corruption: GDN SSM state can be stored in KV layout for prefill/extend and in VK layout for HIP/flydsl decode. Prefix-cache reuse copies mamba state rows between pool slots, but the per-slot layout bitmap was not moved with those rows. A stale layout bit can make the next KV↔VK transpose start from the wrong assumed layout, which corrupts the SSM state and produces garbled output that does not recover.
- Prefill host bubble: mamba tracking repeatedly called mamba_track_mask.any() / nonzero() inside the GDN layer forwards, and each call forced a D2H synchronization. This showed up as a prefix-cache performance gap.
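
The accuracy problem can be reproduced in miniature. The sketch below is a toy model, not SGLang code: `ToyPool`, `ensure_layout`, and the `copy_layout` switch are hypothetical names chosen for illustration, and plain nested lists stand in for the state tensors. It only assumes what the description states: a per-slot layout bit (0=KV, 1=VK), a transpose that runs when the recorded layout differs from the wanted one, and a slot-to-slot copy during prefix-cache reuse.

```python
KV, VK = 0, 1  # layout codes, matching the PR's 0=KV, 1=VK convention


def transpose(m):
    """Return the transpose of a 2-D list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


class ToyPool:
    """Hypothetical stand-in for a mamba state pool (not the SGLang class)."""

    def __init__(self, num_slots, copy_layout):
        self.state = [None] * num_slots   # per-slot SSM state matrix
        self.layout = [KV] * num_slots    # per-slot layout bit
        self.copy_layout = copy_layout    # True models the behavior after the fix

    def write(self, slot, matrix, layout):
        self.state[slot] = [list(r) for r in matrix]
        self.layout[slot] = layout

    def copy_from(self, src, dst):
        # Prefix-cache reuse copies the state row into another slot.
        self.state[dst] = [list(r) for r in self.state[src]]
        if self.copy_layout:
            self.layout[dst] = self.layout[src]  # move the bit with the row

    def ensure_layout(self, slot, wanted):
        # Transpose only when the recorded layout differs from the wanted one.
        if self.layout[slot] != wanted:
            self.state[slot] = transpose(self.state[slot])
            self.layout[slot] = wanted
        return self.state[slot]


def run(copy_layout):
    pool = ToyPool(num_slots=2, copy_layout=copy_layout)
    logical_kv = [[1, 2, 3], [4, 5, 6]]   # the state as seen in KV layout
    # Slot 0 was last touched by decode, so it physically holds VK data.
    pool.write(0, transpose(logical_kv), VK)
    pool.copy_from(src=0, dst=1)          # without the fix, slot 1 still says KV
    # Decode wants VK; a stale KV bit triggers a spurious transpose.
    vk_view = pool.ensure_layout(1, VK)
    return vk_view == transpose(logical_kv)


stale_ok = run(copy_layout=False)
fixed_ok = run(copy_layout=True)
print(stale_ok, fixed_ok)  # False True
```

With the stale bit, the copied slot is transposed even though it already holds VK data, so decode reads the wrong matrix. Copying the bit together with the row removes the spurious transpose.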

Modifications

Move GDN KV/VK slot layout ownership to MambaPool:
- Add MambaPool.state_layout (0=KV, 1=VK).
- Reset allocated/recycled mamba slots to KV.
- Copy state_layout together with mamba state rows in MambaPool.copy_from() / fork_from().
Make GDNAttnBackend share mamba_pool.state_layout instead of allocating a private bitmap.
Keep layout metadata synchronized for speculative verify scatter paths by marking scattered verify states as KV.
Integrate the mamba tracking optimization from upstream:
- Add has_mamba_track_mask, mamba_track_mask_indices, and conv_states_mask_indices to ForwardMetadata.
- Compute mamba tracking gates/indices once during metadata initialization.
- Use the precomputed metadata in GDN forward_extend and _track_mamba_state_extend, avoiding repeated per-layer D2H checks.
- Propagate has_mamba_track_mask through Mamba2Metadata.prepare_decode() and prepare_mixed().
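
The tracking change can be sketched as follows. This is a simplified model under stated assumptions, not the actual SGLang implementation: `FakeDeviceMask` and `HostSyncCounter` are hypothetical helpers that count simulated host reads in place of real D2H synchronizations, `ToyForwardMetadata` borrows two field names from the description with assumed types, and the layer count is arbitrary.

```python
from dataclasses import dataclass
from typing import List, Optional


class HostSyncCounter:
    """Counts simulated device-to-host reads (stands in for real D2H syncs)."""
    count = 0


class FakeDeviceMask:
    """Hypothetical stand-in for a boolean device tensor."""

    def __init__(self, values):
        self._values = list(values)

    def any(self):
        HostSyncCounter.count += 1  # reading the result would block the host
        return any(self._values)

    def nonzero(self):
        HostSyncCounter.count += 1
        return [i for i, v in enumerate(self._values) if v]


@dataclass
class ToyForwardMetadata:
    # Field names follow the PR description; the types are assumptions.
    has_mamba_track_mask: bool
    mamba_track_mask_indices: Optional[List[int]]


def init_metadata(mask):
    """Resolve the gate and the indices once per forward batch."""
    has_mask = mask.any()
    indices = mask.nonzero() if has_mask else None
    return ToyForwardMetadata(has_mask, indices)


def layer_forward_old(mask):
    # Per-layer check: every layer pays for its own host reads.
    return mask.nonzero() if mask.any() else None


def layer_forward_new(meta):
    # Per-layer code only reads plain Python values from the metadata.
    return meta.mamba_track_mask_indices if meta.has_mamba_track_mask else None


def count_syncs(num_layers, use_metadata):
    HostSyncCounter.count = 0
    mask = FakeDeviceMask([False, True, True, False])
    if use_metadata:
        meta = init_metadata(mask)
        outputs = [layer_forward_new(meta) for _ in range(num_layers)]
    else:
        outputs = [layer_forward_old(mask) for _ in range(num_layers)]
    return HostSyncCounter.count, outputs


old_syncs, old_out = count_syncs(num_layers=4, use_metadata=False)
new_syncs, new_out = count_syncs(num_layers=4, use_metadata=True)
print(old_syncs, new_syncs)  # 8 2
```

In this toy model the per-layer version performs two host reads per layer, while the metadata version performs two per batch regardless of layer count, and both return the same indices.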

Accuracy Tests

| Model | Path | Without fix | With fix |
| --- | --- | --- | --- |
| Qwen3.5-27B-FP8, tp=4 | flydsl VK decode | garbled around ~iter 50, then persistent | 250/250 clean |
| Qwen3.5-397B-A17B-PTPC, tp=8 | HIP inline-asm VK decode | garbled at iter 61, same token-noise pattern as customer log | 200/200 clean |

Benchmarking and Profiling

Qwen3.5-27B-FP8, tp=4

| Version | Input throughput (tok/s) | Mean TTFT (ms) |
| --- | --- | --- |
| baseline | 5320 | 1430 |
| patched | 7289 | 1014 |

Qwen3.5-397B-A17B-PTPC, tp=8

| Version | Input throughput (tok/s) | Mean TTFT (ms) |
| --- | --- | --- |
| baseline | 14079 | 544 |
| patched | 19386 | 385 |
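
For a quick read of the relative gains, the numbers above can be turned into percentages. The snippet below is only arithmetic on the reported figures; the `relative_change` helper and the dictionary layout are illustrative choices, not part of the PR.

```python
# (input throughput in tok/s, mean TTFT in ms), copied from the tables above
results = {
    "Qwen3.5-27B-FP8, tp=4": {"baseline": (5320, 1430), "patched": (7289, 1014)},
    "Qwen3.5-397B-A17B-PTPC, tp=8": {"baseline": (14079, 544), "patched": (19386, 385)},
}


def relative_change(baseline, patched):
    """Percentage change from baseline to patched."""
    return (patched - baseline) / baseline * 100.0


summary = {}
for model, runs in results.items():
    base_tput, base_ttft = runs["baseline"]
    new_tput, new_ttft = runs["patched"]
    summary[model] = (
        round(relative_change(base_tput, new_tput), 1),
        round(relative_change(base_ttft, new_ttft), 1),
    )

for model, (tput_pct, ttft_pct) in summary.items():
    print(f"{model}: throughput {tput_pct:+.1f}%, mean TTFT {ttft_pct:+.1f}%")
```

On both models this works out to about 37% higher input throughput and about 29% lower mean TTFT.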

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

Adapt with sgl-project#20522 sgl-project#26877 in mamba

327d7f7

IzacharyI wants to merge 1 commit into zejunchen-zejun:Qwen3.5_v0.5.9 from IzacharyI:merged_main_mamba_layout_data_change
