Accelerate module cache copying, support RoPE encoding and Qwen3 LLM #21

Open
Panxy777 wants to merge 1 commit into yale-sys:main from Panxy777:pr

Conversation


Panxy777 commented Apr 6, 2026

This PR improves the efficiency and compatibility of PromptCache.update by reducing redundant KV copy operations, minimizing CPU–GPU transfer overhead, and adding support for Qwen3 series models and RoPE encoding.

Key changes:

  • Retain only the common prefix with the previously staged cache; re-copy KV for divergent suffix segments.
  • Prefetch module caches to the target GPU before reuse, avoiding repeated CPU-to-GPU transfers.
  • Replace per-module copy with per-layer batched copy:
    • Upload all updated modules to the target device in advance.
    • Concatenate KV tensors across modules along the sequence dimension per layer.
    • Perform a single write per layer into the destination cache slice.
  • Add a fast path for single-module updates to avoid unnecessary concatenation.
  • Extend support to Qwen3 models and RoPE encoding, ensuring compatibility with their KV structures.
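The prefix-reuse, per-layer batching, and single-module fast path described above can be sketched as follows. This is a minimal illustration, not the actual PromptCache API: the names `common_prefix_len` and `batched_update` are hypothetical, and KV segments are modeled as plain Python lists (standing in for tensors) so the sequence-dimension concatenation and the single write per layer are easy to follow.

```python
# Illustrative sketch of the update strategy: reuse the common prefix with
# the previously staged cache, then re-copy KV only for the divergent
# suffix modules, batching the copy so each layer gets one write.
# KV segments are plain lists here; in practice they would be GPU tensors.

def common_prefix_len(staged_ids, new_ids):
    """Count leading module IDs shared with the previously staged cache."""
    n = 0
    for a, b in zip(staged_ids, new_ids):
        if a != b:
            break
        n += 1
    return n


def batched_update(dest, staged_ids, modules, num_layers):
    """Write divergent-suffix KV into `dest` with one write per layer.

    dest:       per-layer destination lists (the cache being updated).
    staged_ids: module IDs currently staged in the cache.
    modules:    dict of module_id -> per-layer KV segment lists.
    Returns the number of prefix modules reused without copying.
    """
    new_ids = list(modules)
    keep = common_prefix_len(staged_ids, new_ids)
    suffix = new_ids[keep:]
    for layer in range(num_layers):
        if len(suffix) == 1:
            # Fast path: single updated module, no concatenation needed.
            seg = modules[suffix[0]][layer]
        else:
            # Concatenate suffix KV segments along the sequence dimension.
            seg = [tok for mid in suffix for tok in modules[mid][layer]]
        # Single write per layer into the destination cache slice,
        # starting right after the reused common prefix.
        start = sum(len(modules[mid][layer]) for mid in new_ids[:keep])
        dest[layer][start:start + len(seg)] = seg
    return keep
```

With tensors, the list comprehension would become a `torch.cat` along the sequence dimension followed by one slice assignment per layer, which is where the kernel-launch savings come from.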

Benefits:

  • Reduce copy operations from O(#modules × #layers) to ~O(#layers).
  • Lower kernel launch overhead and Python loop cost.
  • Significantly improve performance in multi-segment suffix updates (e.g., tag switching scenarios).
  • Broader model support including Qwen3 and RoPE-based architectures.
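The copy-count reduction claimed above can be made concrete with a small back-of-the-envelope helper; the counts are hypothetical and only illustrate the asymptotic claim, not measured performance.

```python
# Hedged illustration of the O(#modules x #layers) -> ~O(#layers) claim:
# per-module copying issues one write per (module, layer) pair, while
# per-layer batching issues a single write per layer.

def copy_ops(num_suffix_modules, num_layers, batched):
    """Number of destination writes issued during an update."""
    if batched:
        return num_layers
    return num_suffix_modules * num_layers

# e.g. 8 divergent suffix modules across a 32-layer model:
# per-module copy -> 256 writes, per-layer batched copy -> 32 writes.
```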
