feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments#2146
feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments#2146naroam1 wants to merge 4 commits into
Conversation
…eployments Reintroduces the scheduler-side prefix-cache tracker removed in the revert of vllm-project#2119 (vllm-project#2132), but auto-disabled when an L2 backend is configured. Matches the architectural constraint @DwyaneShi flagged in his revert rationale and preserves the upstream worker-reconciliation model bit-for-bit for L1+L2 deployments. Design: - AIBrixOffloadingConnectorScheduler.__init__ computes self._tracker_enabled = not bool( aibrix_kvcache.envs.AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND ). When True (L1-only), the scheduler maintains _cached_block_hashes (bounded LRU OrderedDict, capacity 100k entries), _request_block_hashes, and _request_external_tokens. When False (L2-backed), all tracker entry points short-circuit to no-ops. - get_num_new_matched_tokens, update_state_after_alloc, and receive_connector_worker_meta all gate on _tracker_enabled. With the gate disabled they return early — equivalent to the current upstream behavior introduced by the revert. - _cached_block_hashes is bounded via OrderedDict + a 100k LRU cap, addressing the unbounded-growth concern raised by @varungup90 in the original review. - AIBrixWorkerMeta carries saved_tokens and failed_load_tokens. aggregate() uses intersection + min() for saved_tokens and union + max() for failed_load_tokens for TP correctness. - Worker _newly_saved_tokens / _newly_failed_load_tokens are populated unconditionally in _send_kv_sync_impl / _recv_kv_sync_impl. The scheduler discards these reports when its tracker is disabled — safe to always populate. - The kv_connector/__init__.py monkey-patch reuses the simplified flow from vllm-project#2119 (start_load_kv_before_update + wait_for_save with metadata re-bind), which works in tandem with the scheduler-side tracker via vLLM's get_num_new_matched_tokens path. No changes to BaseKVCacheManager public API or the L1 callback mechanism. Type2/Type1 inheritance scaffolding from vllm-project#2125 is preserved (already on origin/main). AST-based structural tests in tests/test_connector_inheritance.py gain a fourth assertion verifying that Scheduler.__init__ assigns _tracker_enabled and that the gate is driven by AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND. Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
The Dockerfile currently runs `patch -p1 -l -i <patch-file>` unconditionally, which fails for any VLLM_VERSION that doesn't have a matching patch file shipped in the repo (e.g., v0.19.0). Since PR vllm-project#2060 made per-version source patches unnecessary for newer releases (the monkey-patch replaces them), the Dockerfile should handle the missing-patch case gracefully so the same Dockerfile works for: - Older versions with a shipped patch file (applies it) - Newer versions without one (falls through to monkey-patch mode) Also: - Add `apt-get update` before `apt install` (missing update caused the debug tools install to fail on some base images) - Make the debug tools install best-effort (they're RDMA/NIXL debug utilities, not required for runtime) - Bump default VLLM_VERSION to v0.19.0 Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
There was a problem hiding this comment.
Code Review
This pull request updates the vLLM version to v0.19.0 and refactors the KV cache connector integration. Key changes include a more robust monkey-patching mechanism for GPUModelRunner and the introduction of a scheduler-side cache tracker to optimize token matching. Review feedback identified several issues in the new tracker logic: incorrect block indexing when updating saved tokens, improper invalidation ranges for failed loads, and a bug in failure detection that fails to account for unaligned local contexts.
…review) Address 3 high-priority bugs Gemini flagged on PR vllm-project#2146 — all pre-existing from vllm-project#2119, all in the scheduler<->worker reporting path of the gated tracker: 1. saved_tokens offset: scheduler indexed block_hashes[0..N) ignoring the worker's actual save offset (aligned_context_len). Wrong as soon as context_len > 0. 2. failed_load tail-invalidation: scheduler dropped the last N entries of block_hashes regardless of where the failed range actually sat; the failure window is the promised load range, not the request tail. 3. shift_len double-count: worker's "seq_recv_len < aligned_query_len" check produced a false positive whenever the local-context tail wasn't block-aligned (seq_recv_len already excludes shift_len, so its max is aligned_query_len - shift_len). Fix by switching the worker->scheduler protocol to range-based reporting: - AIBrixWorkerMeta now carries `saved_blocks` and `failed_load_blocks` as `dict[req_id, (start_block_idx, num_blocks)]` in vLLM block units (engine_block_ntokens), so each entry indexes block_hashes directly. - Worker computes start_block from aligned_context_len // block_size for saves and from local_context_len + seq_recv_len for failures. - Worker captures initial_shift_len before the chunk loop resets shift_len to 0 and uses (aligned_query_len - initial_shift_len) as the partial-load threshold. - aggregate() updated for tuple semantics: intersection + min(count) for saved_blocks, union + max(count) for failed_load_blocks. start_block is deterministic across TP ranks (same scheduler input) so it's preserved on merge. - Scheduler receive_connector_worker_meta consumes the offset directly: marks block_hashes[start..start+count] as cached for saves; pops that exact slice for failures. Also adds an AST test in test_connector_inheritance.py asserting AIBrixWorkerMeta carries the new field names and not the old token- count names, so the offset bug can't regress silently. Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
…review) Address 3 high-priority bugs Gemini flagged in the gated tracker scheduler<->worker reporting path. All 3 are pre-existing from the original vllm-project#2119 implementation (carried over verbatim when reintroducing the tracker), and they didn't surface in the original smoke tests because: - Bug 1 was masked by flood-pattern requests with context_len=0, so the (always-zero) save offset happened to be correct. - Bug 2 only fires when the worker fails to load blocks, which is rare on L1-only. - Bug 3 only fires when local_context_len isn't engine-block- aligned (shift_len > 0), which most prompts in the harness avoided by accident. Switched the worker->scheduler protocol to range-based reporting, which fixes all three at once: - AIBrixWorkerMeta now carries `saved_blocks` and `failed_load_blocks` as `dict[req_id, (start_block_idx, num_blocks)]` in vLLM block units (engine_block_ntokens), so each entry indexes block_hashes directly. - Worker `_send_kv_sync_impl` reports the actual save range [aligned_context_len, aligned_context_len + total_sent), converted to vLLM blocks. Fixes the saved_tokens-offset bug (Gemini comment on line 848). - Worker `_recv_kv_sync_impl` captures `initial_shift_len` before the chunk loop resets `shift_len` to 0, then uses `seq_recv_len < (aligned_query_len - initial_shift_len)` as the partial-load threshold. The failed range starts at `local_context_len + seq_recv_len` (not aligned_context_len + seq_recv_len) so unaligned local context doesn't shift the range. Fixes the shift_len false-positive (Gemini comment on line 1287) and the failed-load tail-of-range invalidation (Gemini comment on line 859). - aggregate() updated for tuple semantics: intersection + min(num_blocks) for saved_blocks, union + max(num_blocks) for failed_load_blocks. start_block is deterministic across TP ranks (same scheduler input) so it's preserved on merge. - Scheduler `receive_connector_worker_meta` consumes the offset directly: marks `block_hashes[start..start+count]` as cached for saves; pops that exact slice for failures. The change subsumes Gemini's suggested local fix for Bug 3 (seq_recv_len < (aligned_query_len - initial_shift)): the math for the failure threshold and the failed-token count is the same; the report just carries the precise block range instead of a count, so the scheduler can invalidate the right slice (Bug 2) instead of falling back to `block_hashes[-failed_blocks:]`. Adds an AST regression test asserting AIBrixWorkerMeta carries the new `saved_blocks` / `failed_load_blocks` fields and not the old token-count names. Smoke-tested on L4 / Qwen3-4B-AWQ / L1-only — same flood harness as vllm-project#2119: external_prefix_cache_hits_total = 1280, warm TTFT avg ~0.154s, no regression. Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
|
Smoke-tested again on L4 / Qwen3-4B-AWQ / L1-only after fa4727a (same flood harness as #2119): |
|
/cc @DwyaneShi |
| # are already counted in context_len. We need to LOAD them from | ||
| # L1 into GPU KV buffer. Tokens to the LEFT of these are local | ||
| # (computed in a previous step). | ||
| load_len = seq_request_meta.load_len |
There was a problem hiding this comment.
Seems not working if L2 is used? load_len would be zero and therefore the loading will be skipped?
|
|
||
| def _apply_gpu_model_runner_patches(module): | ||
| """Apply patches to an already-imported gpu_model_runner module.""" | ||
| """Apply AIBrix patches to vLLM's GPUModelRunner. |
There was a problem hiding this comment.
The modified monkey-patch does not support L2 cache anymore. It needs to be compatible with any setup combinations, such as L1 only, L2 only and L1 + L2.
Summary
Reintroduces the work from #2119 (Dockerfile fix + scheduler-side prefix-cache tracker), but with the tracker auto-disabled when an L2 backend is configured. Matches the architectural constraint @DwyaneShi flagged in his revert rationale (#2119 (comment)) and preserves the upstream worker-reconciliation model bit-for-bit for L1+L2 deployments.
@DwyaneShi pre-approved this direction in #2119 (comment).
Motivation
@DwyaneShi's revert (#2132) correctly identified that a scheduler-side tracker assumes exclusive ownership of the external cache — valid for L1 local but not for L2 distributed (shared across engines). This follow-up makes the tracker opt-out-by-default for L2 users:
AIBRIX_KV_CACHE_OL_L2_CACHE_BACKENDand disables its tracker when L2 is configured, yielding the full upstream worker-reconciliation model unchanged.vllm:external_prefix_cache_hits_totalvisibility and the warm-TTFT speedup we measured originally.The Dockerfile fix from #2119 is included here per @DwyaneShi's note that "the Dockerfile improvement for handling missing patch files across newer vLLM versions is useful and can be kept separately in a follow-up PR" — bundling for one-shot review.
Change
aibrix_offloading_connector_type1.pyAIBrixOffloadingConnectorScheduler.__init__computesself._tracker_enabled = not bool(aibrix_kvcache.envs.AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND). When True (L1-only), the scheduler maintains_cached_block_hashes,_request_block_hashes, and_request_external_tokens. When False (L2-backed), all tracker entry points short-circuit to no-ops.get_num_new_matched_tokens,update_state_after_alloc, andreceive_connector_worker_metaall gate on_tracker_enabled. When disabled they return early — equivalent to the current upstream behavior introduced by the revert._cached_block_hashesis now a boundedOrderedDict(cap_MAX_CACHED_BLOCK_HASHES = 100_000), addressing the unbounded-growth concern @varungup90 raised in the original review.AIBrixWorkerMetacarriessaved_tokensandfailed_load_tokens.aggregate()uses intersection + min() forsaved_tokensand union + max() forfailed_load_tokensfor TP correctness._newly_saved_tokens/_newly_failed_load_tokensare populated unconditionally in_send_kv_sync_impl/_recv_kv_sync_impl. The scheduler discards these reports when its tracker is disabled — safe to always populate.kv_connector/__init__.pystart_load_kv_before_update+wait_for_savewith metadata re-bind), which works in tandem with the scheduler-side tracker via vLLM'sget_num_new_matched_tokenspath.Dockerfile.vllmVLLM_VERSION(e.g.,v0.19.0). Addsapt-get updatebefore debug tool install. DefaultsVLLM_VERSION=v0.19.0.Tests
tests/test_connector_inheritance.pygains a fourth AST-based assertion verifying thatScheduler.__init__assigns_tracker_enabledand that the gate is driven byAIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND.Untouched
BaseKVCacheManagerpublic API: no changes._l2_ingestion_callbackslot is preserved).main, untouched.Testing
Smoke-tested on an L4 GPU with Qwen3-4B-AWQ in L1-only configuration (
AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED=1, no L2 backend), 500-prompt flood pattern (same harness as #2119):vllm:external_prefix_cache_hits_totalAST-based structural tests pass on Python 3.11 + 3.12, no vLLM / GPU dependency required.
L2-backed deployments path: not exercised in our cluster (no L2 backend available), but the gate's no-op behavior is mechanical — when
_tracker_enabled = False, all entry points return early and the connector reduces back to the current upstream code path.Closes the regression introduced by the revert of #2119, keeping @DwyaneShi's L2-safe semantics intact.