Adapt with #20522 #26877 in mamba#288

Open
IzacharyI wants to merge 1 commit into zejunchen-zejun:Qwen3.5_v0.5.9 from IzacharyI:merged_main_mamba_layout_data_change

@IzacharyI

Motivation

Fix Qwen3.5 GDN linear-attention issues when prefix/radix cache and HIP/flydsl decode are enabled.
There are two problems addressed here:

  1. Accuracy corruption: GDN SSM state can be stored in KV layout for prefill/extend and VK layout for HIP/flydsl decode. Prefix-cache reuse copies mamba state rows between pool slots, but the per-slot layout bitmap was not moved with those rows. A stale layout bit can make the next KV↔VK transpose run against the wrong baseline, corrupting the SSM state and producing non-recoverable garbled output.
  2. Prefill host bubble: mamba tracking called mamba_track_mask.any() / nonzero() inside every GDN layer forward, and each call triggered a device-to-host (D2H) synchronization. The resulting stalls showed up as a prefix-cache performance gap.
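
The first failure mode can be reproduced in miniature. The sketch below is a pure-Python toy, not the actual SGLang code: `ToyPool`, `copy_rows`, `to_vk`, and `run` are hypothetical names, and a 2x3 nested list stands in for an SSM state tensor. It only illustrates how copying a state row without its layout bit makes the next KV to VK conversion transpose data that is already in VK layout.

```python
KV, VK = 0, 1  # layout codes, following the 0=KV, 1=VK convention in this PR


def transpose(m):
    """Return the transpose of a 2-D nested list."""
    return [list(col) for col in zip(*m)]


class ToyPool:
    """Hypothetical stand-in for a mamba state pool with a per-slot layout bitmap."""

    def __init__(self, num_slots):
        self.state = [None] * num_slots
        self.layout = [KV] * num_slots

    def copy_rows(self, src, dst, copy_layout):
        # Prefix-cache reuse: duplicate the state stored in slot `src` into `dst`.
        self.state[dst] = [row[:] for row in self.state[src]]
        if copy_layout:
            # The fix: the layout bit travels with the state row.
            self.layout[dst] = self.layout[src]

    def to_vk(self, slot):
        # Decode path: transpose only when the bitmap says the slot is still KV.
        if self.layout[slot] == KV:
            self.state[slot] = transpose(self.state[slot])
            self.layout[slot] = VK


def run(copy_layout):
    pool = ToyPool(num_slots=2)
    pool.state[0] = [[1, 2, 3], [4, 5, 6]]  # written by prefill in KV layout
    pool.to_vk(0)                           # decode converts slot 0 to VK
    pool.copy_rows(src=0, dst=1, copy_layout=copy_layout)
    pool.to_vk(1)                           # decode on the reused slot
    return pool.state[1]


expected_vk = transpose([[1, 2, 3], [4, 5, 6]])
print(run(copy_layout=True) == expected_vk)   # layout bit copied with the row
print(run(copy_layout=False) == expected_vk)  # stale bit: state is transposed twice
```

With the layout bit copied, the reused slot is recognized as VK and left alone. With a stale bit, the slot is transposed a second time and ends up holding KV-ordered data that is labeled VK, which is the kind of silent mismatch described in item 1.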

Modifications

  • Move GDN KV/VK slot layout ownership to MambaPool:
    • Add MambaPool.state_layout (0=KV, 1=VK).
    • Reset allocated/recycled mamba slots to KV.
    • Copy state_layout together with mamba state rows in MambaPool.copy_from() / fork_from().
  • Make GDNAttnBackend share mamba_pool.state_layout instead of allocating a private bitmap.
  • Keep layout metadata synchronized for speculative verify scatter paths by marking scattered verify states as KV.
  • Integrate the mamba tracking optimization from upstream:
    • Add has_mamba_track_mask, mamba_track_mask_indices, and conv_states_mask_indices to ForwardMetadata.
    • Compute mamba tracking gates/indices once during metadata initialization.
    • Use the precomputed metadata in GDN forward_extend and _track_mamba_state_extend, avoiding repeated per-layer D2H checks.
    • Propagate has_mamba_track_mask through Mamba2Metadata.prepare_decode() and prepare_mixed().
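
The tracking optimization in the last group of bullets amounts to hoisting a mask query out of the per-layer loop. The following is a minimal pure-Python simulation under stated assumptions: `SyncCountingMask` is a hypothetical stand-in for a device tensor whose `any()` and `nonzero()` each cost one host synchronization, and `ToyForwardMetadata` only borrows the field names `has_mamba_track_mask` and `mamba_track_mask_indices` from this PR. It is not the real `ForwardMetadata`, and the layer count is arbitrary.

```python
from dataclasses import dataclass
from typing import List


class SyncCountingMask:
    """Hypothetical device mask: every query is counted as one D2H sync."""

    def __init__(self, bits: List[bool]):
        self._bits = bits
        self.syncs = 0

    def any(self) -> bool:
        self.syncs += 1
        return any(self._bits)

    def nonzero(self) -> List[int]:
        self.syncs += 1
        return [i for i, b in enumerate(self._bits) if b]


@dataclass
class ToyForwardMetadata:
    has_mamba_track_mask: bool
    mamba_track_mask_indices: List[int]


def init_metadata(mask: SyncCountingMask) -> ToyForwardMetadata:
    # Query the mask once per forward pass, before any layer runs.
    has_mask = mask.any()
    indices = mask.nonzero() if has_mask else []
    return ToyForwardMetadata(has_mask, indices)


def layer_forward_naive(mask: SyncCountingMask) -> List[int]:
    # Old pattern: each layer queries the mask again.
    return mask.nonzero() if mask.any() else []


def layer_forward_cached(meta: ToyForwardMetadata) -> List[int]:
    # New pattern: each layer reads the precomputed metadata only.
    return meta.mamba_track_mask_indices if meta.has_mamba_track_mask else []


NUM_LAYERS = 48  # arbitrary layer count for the illustration

naive_mask = SyncCountingMask([False, True, True, False])
naive_out = [layer_forward_naive(naive_mask) for _ in range(NUM_LAYERS)]

cached_mask = SyncCountingMask([False, True, True, False])
meta = init_metadata(cached_mask)
cached_out = [layer_forward_cached(meta) for _ in range(NUM_LAYERS)]

print(naive_mask.syncs, cached_mask.syncs)
```

Both variants produce the same indices for every layer, but in this toy the naive loop pays two simulated syncs per layer while the cached variant pays two in total. In the real backend the indices would presumably stay on the device as tensors; plain lists are used here only to keep the sketch self-contained.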

Accuracy Tests

| Model | Path | Without fix | With fix |
| --- | --- | --- | --- |
| Qwen3.5-27B-FP8, tp=4 | flydsl VK decode | garbled around iter 50, then persistent | 250/250 clean |
| Qwen3.5-397B-A17B-PTPC, tp=8 | HIP inline-asm VK decode | garbled at iter 61, same token-noise pattern as customer log | 200/200 clean |

Benchmarking and Profiling

Qwen3.5-27B-FP8, tp=4

| Version | Input throughput | Mean TTFT |
| --- | --- | --- |
| baseline | 5320 tok/s | 1430 ms |
| patched | 7289 tok/s | 1014 ms |

Qwen3.5-397B-A17B-PTPC, tp=8

| Version | Input throughput | Mean TTFT |
| --- | --- | --- |
| baseline | 14079 tok/s | 544 ms |
| patched | 19386 tok/s | 385 ms |
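
For readers who want the relative changes rather than the raw numbers, the short script below derives them from the four baseline/patched pairs reported above. The helper names `relative_gain` and `relative_reduction` are illustrative only; no measurements beyond those listed in this section are used.

```python
def relative_gain(baseline, patched):
    """Fractional increase of `patched` over `baseline`."""
    return patched / baseline - 1.0


def relative_reduction(baseline, patched):
    """Fractional decrease of `patched` relative to `baseline`."""
    return 1.0 - patched / baseline


# (baseline, patched) pairs copied from the benchmark figures in this section.
results = {
    "Qwen3.5-27B-FP8, tp=4": {
        "throughput_tok_s": (5320, 7289),
        "mean_ttft_ms": (1430, 1014),
    },
    "Qwen3.5-397B-A17B-PTPC, tp=8": {
        "throughput_tok_s": (14079, 19386),
        "mean_ttft_ms": (544, 385),
    },
}

for model, r in results.items():
    gain = relative_gain(*r["throughput_tok_s"])
    cut = relative_reduction(*r["mean_ttft_ms"])
    print(f"{model}: input throughput +{gain:.1%}, mean TTFT -{cut:.1%}")
```

Both configurations come out at roughly a 37% increase in input throughput and a 29% reduction in mean TTFT.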

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
      - /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.
