feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments by naroam1 · Pull Request #2146 · vllm-project/aibrix

naroam1 · 2026-04-27T08:42:11Z

Summary

Reintroduces the work from #2119 (Dockerfile fix + scheduler-side prefix-cache tracker), but with the tracker auto-disabled when an L2 backend is configured. Matches the architectural constraint @DwyaneShi flagged in his revert rationale (#2119 (comment)) and preserves the upstream worker-reconciliation model bit-for-bit for L1+L2 deployments.

@DwyaneShi pre-approved this direction in #2119 (comment).

Motivation

@DwyaneShi's revert (#2132) correctly identified that a scheduler-side tracker assumes exclusive ownership of the external cache — valid for L1 local but not for L2 distributed (shared across engines). This follow-up makes the tracker opt-out-by-default for L2 users:

The scheduler auto-detects L2 presence via AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND and disables its tracker when L2 is configured, yielding the full upstream worker-reconciliation model unchanged.
L1-only deployments (common for single-engine setups) regain the vllm:external_prefix_cache_hits_total visibility and the warm-TTFT speedup we measured originally.

The Dockerfile fix from #2119 is included here per @DwyaneShi's note that "the Dockerfile improvement for handling missing patch files across newer vLLM versions is useful and can be kept separately in a follow-up PR" — bundling for one-shot review.

Change

`aibrix_offloading_connector_type1.py`

AIBrixOffloadingConnectorScheduler.__init__ computes self._tracker_enabled = not bool(aibrix_kvcache.envs.AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND). When True (L1-only), the scheduler maintains _cached_block_hashes, _request_block_hashes, and _request_external_tokens. When False (L2-backed), all tracker entry points short-circuit to no-ops.
get_num_new_matched_tokens, update_state_after_alloc, and receive_connector_worker_meta all gate on _tracker_enabled. When disabled they return early — equivalent to the current upstream behavior introduced by the revert.
_cached_block_hashes is now a bounded OrderedDict (cap _MAX_CACHED_BLOCK_HASHES = 100_000), addressing the unbounded-growth concern @varungup90 raised in the original review.
AIBrixWorkerMeta carries saved_tokens and failed_load_tokens. aggregate() uses intersection + min() for saved_tokens and union + max() for failed_load_tokens for TP correctness.
Worker _newly_saved_tokens / _newly_failed_load_tokens are populated unconditionally in _send_kv_sync_impl / _recv_kv_sync_impl. The scheduler discards these reports when its tracker is disabled — safe to always populate.

`kv_connector/init.py`

Reuses the simplified monkey-patch from fix: Dockerfile for new vLLM versions + implement get_num_new_matched_tokens #2119 (start_load_kv_before_update + wait_for_save with metadata re-bind), which works in tandem with the scheduler-side tracker via vLLM's get_num_new_matched_tokens path.

`Dockerfile.vllm`

Conditional patch step: skip the source patch when no patch file exists for the current VLLM_VERSION (e.g., v0.19.0). Adds apt-get update before debug tool install. Defaults VLLM_VERSION=v0.19.0.

Tests

tests/test_connector_inheritance.py gains a fourth AST-based assertion verifying that Scheduler.__init__ assigns _tracker_enabled and that the gate is driven by AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND.

Untouched

BaseKVCacheManager public API: no changes.
L1Cache callback mechanism: no changes (the existing _l2_ingestion_callback slot is preserved).
Type2/Type1 inheritance scaffolding from refactor(kv_connector): Type2 Connector inherits from Type1 Connector #2125: already on main, untouched.

Testing

Smoke-tested on an L4 GPU with Qwen3-4B-AWQ in L1-only configuration (AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED=1, no L2 backend), 500-prompt flood pattern (same harness as #2119):

Metric	Value
`vllm:external_prefix_cache_hits_total`	1280
Warm TTFT avg (10 refs)	0.146 s
Cold TTFT avg (excl warmup)	0.213 s
Speedup	-31%

AST-based structural tests pass on Python 3.11 + 3.12, no vLLM / GPU dependency required.

L2-backed deployments path: not exercised in our cluster (no L2 backend available), but the gate's no-op behavior is mechanical — when _tracker_enabled = False, all entry points return early and the connector reduces back to the current upstream code path.

Closes the regression introduced by the revert of #2119, keeping @DwyaneShi's L2-safe semantics intact.

@DwyaneShi

…eployments Reintroduces the scheduler-side prefix-cache tracker removed in the revert of vllm-project#2119 (vllm-project#2132), but auto-disabled when an L2 backend is configured. Matches the architectural constraint @DwyaneShi flagged in his revert rationale and preserves the upstream worker-reconciliation model bit-for-bit for L1+L2 deployments. Design: - AIBrixOffloadingConnectorScheduler.__init__ computes self._tracker_enabled = not bool( aibrix_kvcache.envs.AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND ). When True (L1-only), the scheduler maintains _cached_block_hashes (bounded LRU OrderedDict, capacity 100k entries), _request_block_hashes, and _request_external_tokens. When False (L2-backed), all tracker entry points short-circuit to no-ops. - get_num_new_matched_tokens, update_state_after_alloc, and receive_connector_worker_meta all gate on _tracker_enabled. With the gate disabled they return early — equivalent to the current upstream behavior introduced by the revert. - _cached_block_hashes is bounded via OrderedDict + a 100k LRU cap, addressing the unbounded-growth concern raised by @varungup90 in the original review. - AIBrixWorkerMeta carries saved_tokens and failed_load_tokens. aggregate() uses intersection + min() for saved_tokens and union + max() for failed_load_tokens for TP correctness. - Worker _newly_saved_tokens / _newly_failed_load_tokens are populated unconditionally in _send_kv_sync_impl / _recv_kv_sync_impl. The scheduler discards these reports when its tracker is disabled — safe to always populate. - The kv_connector/__init__.py monkey-patch reuses the simplified flow from vllm-project#2119 (start_load_kv_before_update + wait_for_save with metadata re-bind), which works in tandem with the scheduler-side tracker via vLLM's get_num_new_matched_tokens path. No changes to BaseKVCacheManager public API or the L1 callback mechanism. Type2/Type1 inheritance scaffolding from vllm-project#2125 is preserved (already on origin/main). AST-based structural tests in tests/test_connector_inheritance.py gain a fourth assertion verifying that Scheduler.__init__ assigns _tracker_enabled and that the gate is driven by AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND. Signed-off-by: naroa-lurnova <naroa@lurnova.ai>

The Dockerfile currently runs `patch -p1 -l -i <patch-file>` unconditionally, which fails for any VLLM_VERSION that doesn't have a matching patch file shipped in the repo (e.g., v0.19.0). Since PR vllm-project#2060 made per-version source patches unnecessary for newer releases (the monkey-patch replaces them), the Dockerfile should handle the missing-patch case gracefully so the same Dockerfile works for: - Older versions with a shipped patch file (applies it) - Newer versions without one (falls through to monkey-patch mode) Also: - Add `apt-get update` before `apt install` (missing update caused the debug tools install to fail on some base images) - Make the debug tools install best-effort (they're RDMA/NIXL debug utilities, not required for runtime) - Bump default VLLM_VERSION to v0.19.0 Signed-off-by: naroa-lurnova <naroa@lurnova.ai>

gemini-code-assist

Code Review

This pull request updates the vLLM version to v0.19.0 and refactors the KV cache connector integration. Key changes include a more robust monkey-patching mechanism for GPUModelRunner and the introduction of a scheduler-side cache tracker to optimize token matching. Review feedback identified several issues in the new tracker logic: incorrect block indexing when updating saved tokens, improper invalidation ranges for failed loads, and a bug in failure detection that fails to account for unaligned local contexts.

…review) Address 3 high-priority bugs Gemini flagged on PR vllm-project#2146 — all pre-existing from vllm-project#2119, all in the scheduler<->worker reporting path of the gated tracker: 1. saved_tokens offset: scheduler indexed block_hashes[0..N) ignoring the worker's actual save offset (aligned_context_len). Wrong as soon as context_len > 0. 2. failed_load tail-invalidation: scheduler dropped the last N entries of block_hashes regardless of where the failed range actually sat; the failure window is the promised load range, not the request tail. 3. shift_len double-count: worker's "seq_recv_len < aligned_query_len" check produced a false positive whenever the local-context tail wasn't block-aligned (seq_recv_len already excludes shift_len, so its max is aligned_query_len - shift_len). Fix by switching the worker->scheduler protocol to range-based reporting: - AIBrixWorkerMeta now carries `saved_blocks` and `failed_load_blocks` as `dict[req_id, (start_block_idx, num_blocks)]` in vLLM block units (engine_block_ntokens), so each entry indexes block_hashes directly. - Worker computes start_block from aligned_context_len // block_size for saves and from local_context_len + seq_recv_len for failures. - Worker captures initial_shift_len before the chunk loop resets shift_len to 0 and uses (aligned_query_len - initial_shift_len) as the partial-load threshold. - aggregate() updated for tuple semantics: intersection + min(count) for saved_blocks, union + max(count) for failed_load_blocks. start_block is deterministic across TP ranks (same scheduler input) so it's preserved on merge. - Scheduler receive_connector_worker_meta consumes the offset directly: marks block_hashes[start..start+count] as cached for saves; pops that exact slice for failures. Also adds an AST test in test_connector_inheritance.py asserting AIBrixWorkerMeta carries the new field names and not the old token- count names, so the offset bug can't regress silently. Signed-off-by: naroa-lurnova <naroa@lurnova.ai>

…review) Address 3 high-priority bugs Gemini flagged in the gated tracker scheduler<->worker reporting path. All 3 are pre-existing from the original vllm-project#2119 implementation (carried over verbatim when reintroducing the tracker), and they didn't surface in the original smoke tests because: - Bug 1 was masked by flood-pattern requests with context_len=0, so the (always-zero) save offset happened to be correct. - Bug 2 only fires when the worker fails to load blocks, which is rare on L1-only. - Bug 3 only fires when local_context_len isn't engine-block- aligned (shift_len > 0), which most prompts in the harness avoided by accident. Switched the worker->scheduler protocol to range-based reporting, which fixes all three at once: - AIBrixWorkerMeta now carries `saved_blocks` and `failed_load_blocks` as `dict[req_id, (start_block_idx, num_blocks)]` in vLLM block units (engine_block_ntokens), so each entry indexes block_hashes directly. - Worker `_send_kv_sync_impl` reports the actual save range [aligned_context_len, aligned_context_len + total_sent), converted to vLLM blocks. Fixes the saved_tokens-offset bug (Gemini comment on line 848). - Worker `_recv_kv_sync_impl` captures `initial_shift_len` before the chunk loop resets `shift_len` to 0, then uses `seq_recv_len < (aligned_query_len - initial_shift_len)` as the partial-load threshold. The failed range starts at `local_context_len + seq_recv_len` (not aligned_context_len + seq_recv_len) so unaligned local context doesn't shift the range. Fixes the shift_len false-positive (Gemini comment on line 1287) and the failed-load tail-of-range invalidation (Gemini comment on line 859). - aggregate() updated for tuple semantics: intersection + min(num_blocks) for saved_blocks, union + max(num_blocks) for failed_load_blocks. start_block is deterministic across TP ranks (same scheduler input) so it's preserved on merge. - Scheduler `receive_connector_worker_meta` consumes the offset directly: marks `block_hashes[start..start+count]` as cached for saves; pops that exact slice for failures. The change subsumes Gemini's suggested local fix for Bug 3 (seq_recv_len < (aligned_query_len - initial_shift)): the math for the failure threshold and the failed-token count is the same; the report just carries the precise block range instead of a count, so the scheduler can invalidate the right slice (Bug 2) instead of falling back to `block_hashes[-failed_blocks:]`. Adds an AST regression test asserting AIBrixWorkerMeta carries the new `saved_blocks` / `failed_load_blocks` fields and not the old token-count names. Smoke-tested on L4 / Qwen3-4B-AWQ / L1-only — same flood harness as vllm-project#2119: external_prefix_cache_hits_total = 1280, warm TTFT avg ~0.154s, no regression. Signed-off-by: naroa-lurnova <naroa@lurnova.ai>

naroam1 · 2026-04-27T09:47:14Z

Smoke-tested again on L4 / Qwen3-4B-AWQ / L1-only after fa4727a (same flood harness as #2119): vllm:external_prefix_cache_hits_total = 1280, warm TTFT avg ≈ 0.154 s. AST regression added at tests/test_connector_inheritance.py::test_aibrix_worker_meta_uses_block_offset_schema to lock the new schema in.

Jeffwan · 2026-04-28T11:23:53Z

/cc @DwyaneShi

DwyaneShi · 2026-05-07T23:18:54Z

+        # are already counted in context_len. We need to LOAD them from
+        # L1 into GPU KV buffer. Tokens to the LEFT of these are local
+        # (computed in a previous step).
+        load_len = seq_request_meta.load_len


Seems not working if L2 is used? load_len would be zero and therefore the loading will be skipped?

DwyaneShi · 2026-05-07T23:21:35Z


 def _apply_gpu_model_runner_patches(module):
-    """Apply patches to an already-imported gpu_model_runner module."""
+    """Apply AIBrix patches to vLLM's GPUModelRunner.


The modified monkey-patch does not support L2 cache anymore. It needs to be compatible with any setup combinations, such as L1 only, L2 only and L1 + L2.

naroam1 added 2 commits April 27, 2026 09:24

gemini-code-assist Bot reviewed Apr 27, 2026

View reviewed changes

Merge branch 'main' into feat/kv-connector-l2-safe-scheduler-tracker

e29fdb5

Jeffwan requested a review from DwyaneShi April 28, 2026 11:23

DwyaneShi reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments#2146

feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments#2146
naroam1 wants to merge 4 commits into
vllm-project:mainfrom
lurnova:feat/kv-connector-l2-safe-scheduler-tracker

naroam1 commented Apr 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

naroam1 commented Apr 27, 2026

Uh oh!

Jeffwan commented Apr 28, 2026

Uh oh!

DwyaneShi May 7, 2026

Uh oh!

DwyaneShi May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

naroam1 commented Apr 27, 2026

Summary

Motivation

Change

aibrix_offloading_connector_type1.py

kv_connector/__init__.py

Dockerfile.vllm

Tests

Untouched

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

naroam1 commented Apr 27, 2026

Uh oh!

Jeffwan commented Apr 28, 2026

Uh oh!

DwyaneShi May 7, 2026

Choose a reason for hiding this comment

Uh oh!

DwyaneShi May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`aibrix_offloading_connector_type1.py`

`kv_connector/init.py`

`Dockerfile.vllm`