Accelerate module cache copying, support RoPE encoding and Qwen3 LLM #21

Open
Panxy777 wants to merge 1 commit into yale-sys:main from Panxy777:pr

Conversation


Panxy777 commented Apr 6, 2026

This PR improves the efficiency and compatibility of PromptCache.update by reducing redundant KV copy operations, minimizing CPU–GPU transfer overhead, and adding support for Qwen3 series models and RoPE encoding.

Key changes:

  • Retain only the common prefix with the previously staged cache; re-copy KV for divergent suffix segments.
  • Prefetch module caches to the target GPU before reuse, avoiding repeated CPU-to-GPU transfers.
  • Replace per-module copy with per-layer batched copy:
    • Upload all updated modules to the target device in advance.
    • Concatenate KV tensors across modules along the sequence dimension per layer.
    • Perform a single write per layer into the destination cache slice.
  • Add a fast path for single-module updates to avoid unnecessary concatenation.
  • Extend support to Qwen3 models and RoPE encoding, ensuring compatibility with their KV structures.
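The prefix-reuse, per-layer batching, and single-module fast path described above can be sketched as follows. This is a minimal illustration, not the actual PromptCache API: the names `common_prefix_len` and `batched_update` are hypothetical, and KV segments are modeled as plain Python lists (standing in for tensors) so the sequence-dimension concatenation and the single write per layer are easy to follow.

```python
# Illustrative sketch of the update strategy: reuse the common prefix with
# the previously staged cache, then re-copy KV only for the divergent
# suffix modules, batching the copy so each layer gets one write.
# KV segments are plain lists here; in practice they would be GPU tensors.

def common_prefix_len(staged_ids, new_ids):
    """Count leading module IDs shared with the previously staged cache."""
    n = 0
    for a, b in zip(staged_ids, new_ids):
        if a != b:
            break
        n += 1
    return n


def batched_update(dest, staged_ids, modules, num_layers):
    """Write divergent-suffix KV into `dest` with one write per layer.

    dest:       per-layer destination lists (the cache being updated).
    staged_ids: module IDs currently staged in the cache.
    modules:    dict of module_id -> per-layer KV segment lists.
    Returns the number of prefix modules reused without copying.
    """
    new_ids = list(modules)
    keep = common_prefix_len(staged_ids, new_ids)
    suffix = new_ids[keep:]
    for layer in range(num_layers):
        if len(suffix) == 1:
            # Fast path: single updated module, no concatenation needed.
            seg = modules[suffix[0]][layer]
        else:
            # Concatenate suffix KV segments along the sequence dimension.
            seg = [tok for mid in suffix for tok in modules[mid][layer]]
        # Single write per layer into the destination cache slice,
        # starting right after the reused common prefix.
        start = sum(len(modules[mid][layer]) for mid in new_ids[:keep])
        dest[layer][start:start + len(seg)] = seg
    return keep
```

With tensors, the list comprehension would become a `torch.cat` along the sequence dimension followed by one slice assignment per layer, which is where the kernel-launch savings come from.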

Benefits:

  • Reduce copy operations from O(#modules × #layers) to ~O(#layers).
  • Lower kernel launch overhead and Python loop cost.
  • Significantly improve performance in multi-segment suffix updates (e.g., tag switching scenarios).
  • Broader model support including Qwen3 and RoPE-based architectures.
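The copy-count reduction claimed above can be made concrete with a small back-of-the-envelope helper; the counts are hypothetical and only illustrate the asymptotic claim, not measured performance.

```python
# Hedged illustration of the O(#modules x #layers) -> ~O(#layers) claim:
# per-module copying issues one write per (module, layer) pair, while
# per-layer batching issues a single write per layer.

def copy_ops(num_suffix_modules, num_layers, batched):
    """Number of destination writes issued during an update."""
    if batched:
        return num_layers
    return num_suffix_modules * num_layers

# e.g. 8 divergent suffix modules across a 32-layer model:
# per-module copy -> 256 writes, per-layer batched copy -> 32 writes.
```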
