Skip to content

feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments#2146

Open
naroam1 wants to merge 4 commits into
vllm-project:mainfrom
lurnova:feat/kv-connector-l2-safe-scheduler-tracker
Open

feat(kv_connector): gated scheduler-side prefix tracker for L1-only deployments#2146
naroam1 wants to merge 4 commits into
vllm-project:mainfrom
lurnova:feat/kv-connector-l2-safe-scheduler-tracker

Conversation

@naroam1
Copy link
Copy Markdown
Contributor

@naroam1 naroam1 commented Apr 27, 2026

Summary

Reintroduces the work from #2119 (Dockerfile fix + scheduler-side prefix-cache tracker), but with the tracker auto-disabled when an L2 backend is configured. Matches the architectural constraint @DwyaneShi flagged in his revert rationale (#2119 (comment)) and preserves the upstream worker-reconciliation model bit-for-bit for L1+L2 deployments.

@DwyaneShi pre-approved this direction in #2119 (comment).

Motivation

@DwyaneShi's revert (#2132) correctly identified that a scheduler-side tracker assumes exclusive ownership of the external cache — valid for L1 local but not for L2 distributed (shared across engines). This follow-up makes the tracker opt-out-by-default for L2 users:

  • The scheduler auto-detects L2 presence via AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND and disables its tracker when L2 is configured, yielding the full upstream worker-reconciliation model unchanged.
  • L1-only deployments (common for single-engine setups) regain the vllm:external_prefix_cache_hits_total visibility and the warm-TTFT speedup we measured originally.

The Dockerfile fix from #2119 is included here per @DwyaneShi's note that "the Dockerfile improvement for handling missing patch files across newer vLLM versions is useful and can be kept separately in a follow-up PR" — bundling for one-shot review.

Change

aibrix_offloading_connector_type1.py

  • AIBrixOffloadingConnectorScheduler.__init__ computes self._tracker_enabled = not bool(aibrix_kvcache.envs.AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND). When True (L1-only), the scheduler maintains _cached_block_hashes, _request_block_hashes, and _request_external_tokens. When False (L2-backed), all tracker entry points short-circuit to no-ops.
  • get_num_new_matched_tokens, update_state_after_alloc, and receive_connector_worker_meta all gate on _tracker_enabled. When disabled they return early — equivalent to the current upstream behavior introduced by the revert.
  • _cached_block_hashes is now a bounded OrderedDict (cap _MAX_CACHED_BLOCK_HASHES = 100_000), addressing the unbounded-growth concern @varungup90 raised in the original review.
  • AIBrixWorkerMeta carries saved_tokens and failed_load_tokens. aggregate() uses intersection + min() for saved_tokens and union + max() for failed_load_tokens for TP correctness.
  • Worker _newly_saved_tokens / _newly_failed_load_tokens are populated unconditionally in _send_kv_sync_impl / _recv_kv_sync_impl. The scheduler discards these reports when its tracker is disabled — safe to always populate.

kv_connector/__init__.py

Dockerfile.vllm

  • Conditional patch step: skip the source patch when no patch file exists for the current VLLM_VERSION (e.g., v0.19.0). Adds apt-get update before debug tool install. Defaults VLLM_VERSION=v0.19.0.

Tests

  • tests/test_connector_inheritance.py gains a fourth AST-based assertion verifying that Scheduler.__init__ assigns _tracker_enabled and that the gate is driven by AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND.

Untouched

Testing

Smoke-tested on an L4 GPU with Qwen3-4B-AWQ in L1-only configuration (AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED=1, no L2 backend), 500-prompt flood pattern (same harness as #2119):

Metric Value
vllm:external_prefix_cache_hits_total 1280
Warm TTFT avg (10 refs) 0.146 s
Cold TTFT avg (excl warmup) 0.213 s
Speedup -31%

AST-based structural tests pass on Python 3.11 + 3.12, no vLLM / GPU dependency required.

L2-backed deployments path: not exercised in our cluster (no L2 backend available), but the gate's no-op behavior is mechanical — when _tracker_enabled = False, all entry points return early and the connector reduces back to the current upstream code path.

Closes the regression introduced by the revert of #2119, keeping @DwyaneShi's L2-safe semantics intact.

naroam1 added 2 commits April 27, 2026 09:24
…eployments

Reintroduces the scheduler-side prefix-cache tracker removed in the
revert of vllm-project#2119 (vllm-project#2132), but auto-disabled when an L2 backend is
configured. Matches the architectural constraint @DwyaneShi flagged in
his revert rationale and preserves the upstream worker-reconciliation
model bit-for-bit for L1+L2 deployments.

Design:

- AIBrixOffloadingConnectorScheduler.__init__ computes
  self._tracker_enabled = not bool(
      aibrix_kvcache.envs.AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND
  ).
  When True (L1-only), the scheduler maintains _cached_block_hashes
  (bounded LRU OrderedDict, capacity 100k entries),
  _request_block_hashes, and _request_external_tokens. When False
  (L2-backed), all tracker entry points short-circuit to no-ops.
- get_num_new_matched_tokens, update_state_after_alloc, and
  receive_connector_worker_meta all gate on _tracker_enabled. With
  the gate disabled they return early — equivalent to the current
  upstream behavior introduced by the revert.
- _cached_block_hashes is bounded via OrderedDict + a 100k LRU cap,
  addressing the unbounded-growth concern raised by @varungup90 in
  the original review.
- AIBrixWorkerMeta carries saved_tokens and failed_load_tokens.
  aggregate() uses intersection + min() for saved_tokens and
  union + max() for failed_load_tokens for TP correctness.
- Worker _newly_saved_tokens / _newly_failed_load_tokens are
  populated unconditionally in _send_kv_sync_impl /
  _recv_kv_sync_impl. The scheduler discards these reports when its
  tracker is disabled — safe to always populate.
- The kv_connector/__init__.py monkey-patch reuses the simplified
  flow from vllm-project#2119 (start_load_kv_before_update + wait_for_save with
  metadata re-bind), which works in tandem with the scheduler-side
  tracker via vLLM's get_num_new_matched_tokens path.

No changes to BaseKVCacheManager public API or the L1 callback
mechanism. Type2/Type1 inheritance scaffolding from vllm-project#2125 is
preserved (already on origin/main).

AST-based structural tests in
tests/test_connector_inheritance.py gain a fourth assertion verifying
that Scheduler.__init__ assigns _tracker_enabled and that the gate is
driven by AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND.

Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
The Dockerfile currently runs `patch -p1 -l -i <patch-file>` unconditionally,
which fails for any VLLM_VERSION that doesn't have a matching patch file
shipped in the repo (e.g., v0.19.0).

Since PR vllm-project#2060 made per-version source patches unnecessary for newer
releases (the monkey-patch replaces them), the Dockerfile should handle
the missing-patch case gracefully so the same Dockerfile works for:
- Older versions with a shipped patch file (applies it)
- Newer versions without one (falls through to monkey-patch mode)

Also:
- Add `apt-get update` before `apt install` (missing update caused the
  debug tools install to fail on some base images)
- Make the debug tools install best-effort (they're RDMA/NIXL debug
  utilities, not required for runtime)
- Bump default VLLM_VERSION to v0.19.0

Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the vLLM version to v0.19.0 and refactors the KV cache connector integration. Key changes include a more robust monkey-patching mechanism for GPUModelRunner and the introduction of a scheduler-side cache tracker to optimize token matching. Review feedback identified several issues in the new tracker logic: incorrect block indexing when updating saved tokens, improper invalidation ranges for failed loads, and a bug in failure detection that fails to account for unaligned local contexts.

naroam1 added a commit to lurnova/aibrix that referenced this pull request Apr 27, 2026
…review)

Address 3 high-priority bugs Gemini flagged on PR vllm-project#2146 — all
pre-existing from vllm-project#2119, all in the scheduler<->worker reporting
path of the gated tracker:

1. saved_tokens offset: scheduler indexed block_hashes[0..N) ignoring
   the worker's actual save offset (aligned_context_len). Wrong as
   soon as context_len > 0.
2. failed_load tail-invalidation: scheduler dropped the last N entries
   of block_hashes regardless of where the failed range actually sat;
   the failure window is the promised load range, not the request tail.
3. shift_len double-count: worker's "seq_recv_len < aligned_query_len"
   check produced a false positive whenever the local-context tail
   wasn't block-aligned (seq_recv_len already excludes shift_len, so
   its max is aligned_query_len - shift_len).

Fix by switching the worker->scheduler protocol to range-based
reporting:

- AIBrixWorkerMeta now carries `saved_blocks` and `failed_load_blocks`
  as `dict[req_id, (start_block_idx, num_blocks)]` in vLLM block units
  (engine_block_ntokens), so each entry indexes block_hashes
  directly.
- Worker computes start_block from aligned_context_len // block_size
  for saves and from local_context_len + seq_recv_len for failures.
- Worker captures initial_shift_len before the chunk loop resets
  shift_len to 0 and uses (aligned_query_len - initial_shift_len) as
  the partial-load threshold.
- aggregate() updated for tuple semantics: intersection + min(count)
  for saved_blocks, union + max(count) for failed_load_blocks.
  start_block is deterministic across TP ranks (same scheduler input)
  so it's preserved on merge.
- Scheduler receive_connector_worker_meta consumes the offset
  directly: marks block_hashes[start..start+count] as cached for
  saves; pops that exact slice for failures.

Also adds an AST test in test_connector_inheritance.py asserting
AIBrixWorkerMeta carries the new field names and not the old token-
count names, so the offset bug can't regress silently.

Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
…review)

Address 3 high-priority bugs Gemini flagged in the gated tracker
scheduler<->worker reporting path. All 3 are pre-existing from the
original vllm-project#2119 implementation (carried over verbatim when
reintroducing the tracker), and they didn't surface in the original
smoke tests because:

  - Bug 1 was masked by flood-pattern requests with context_len=0,
    so the (always-zero) save offset happened to be correct.
  - Bug 2 only fires when the worker fails to load blocks, which is
    rare on L1-only.
  - Bug 3 only fires when local_context_len isn't engine-block-
    aligned (shift_len > 0), which most prompts in the harness
    avoided by accident.

Switched the worker->scheduler protocol to range-based reporting,
which fixes all three at once:

  - AIBrixWorkerMeta now carries `saved_blocks` and
    `failed_load_blocks` as `dict[req_id, (start_block_idx,
    num_blocks)]` in vLLM block units (engine_block_ntokens), so
    each entry indexes block_hashes directly.
  - Worker `_send_kv_sync_impl` reports the actual save range
    [aligned_context_len, aligned_context_len + total_sent),
    converted to vLLM blocks. Fixes the saved_tokens-offset bug
    (Gemini comment on line 848).
  - Worker `_recv_kv_sync_impl` captures `initial_shift_len` before
    the chunk loop resets `shift_len` to 0, then uses
    `seq_recv_len < (aligned_query_len - initial_shift_len)` as the
    partial-load threshold. The failed range starts at
    `local_context_len + seq_recv_len` (not aligned_context_len +
    seq_recv_len) so unaligned local context doesn't shift the
    range. Fixes the shift_len false-positive (Gemini comment on
    line 1287) and the failed-load tail-of-range invalidation
    (Gemini comment on line 859).
  - aggregate() updated for tuple semantics: intersection +
    min(num_blocks) for saved_blocks, union + max(num_blocks) for
    failed_load_blocks. start_block is deterministic across TP
    ranks (same scheduler input) so it's preserved on merge.
  - Scheduler `receive_connector_worker_meta` consumes the offset
    directly: marks `block_hashes[start..start+count]` as cached
    for saves; pops that exact slice for failures.

The change subsumes Gemini's suggested local fix for Bug 3
(seq_recv_len < (aligned_query_len - initial_shift)): the math for
the failure threshold and the failed-token count is the same; the
report just carries the precise block range instead of a count, so
the scheduler can invalidate the right slice (Bug 2) instead of
falling back to `block_hashes[-failed_blocks:]`.

Adds an AST regression test asserting AIBrixWorkerMeta carries the
new `saved_blocks` / `failed_load_blocks` fields and not the old
token-count names.

Smoke-tested on L4 / Qwen3-4B-AWQ / L1-only — same flood harness as
vllm-project#2119: external_prefix_cache_hits_total = 1280, warm TTFT avg
~0.154s, no regression.

Signed-off-by: naroa-lurnova <naroa@lurnova.ai>
@naroam1
Copy link
Copy Markdown
Contributor Author

naroam1 commented Apr 27, 2026

Smoke-tested again on L4 / Qwen3-4B-AWQ / L1-only after fa4727a (same flood harness as #2119): vllm:external_prefix_cache_hits_total = 1280, warm TTFT avg ≈ 0.154 s. AST regression added at tests/test_connector_inheritance.py::test_aibrix_worker_meta_uses_block_offset_schema to lock the new schema in.

@Jeffwan
Copy link
Copy Markdown
Collaborator

Jeffwan commented Apr 28, 2026

/cc @DwyaneShi

@Jeffwan Jeffwan requested a review from DwyaneShi April 28, 2026 11:23
# are already counted in context_len. We need to LOAD them from
# L1 into GPU KV buffer. Tokens to the LEFT of these are local
# (computed in a previous step).
load_len = seq_request_meta.load_len
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not working if L2 is used? load_len would be zero and therefore the loading will be skipped?


def _apply_gpu_model_runner_patches(module):
"""Apply patches to an already-imported gpu_model_runner module."""
"""Apply AIBrix patches to vLLM's GPUModelRunner.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modified monkey-patch does not support L2 cache anymore. It needs to be compatible with any setup combinations, such as L1 only, L2 only and L1 + L2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants