[FEAT] RFC-0001: Cache-Aware Admission Control for LLM Serving #1

@NasitSony

Problem

Current admission control mechanisms primarily rely on queue length, request count, or static concurrency limits.

These approaches treat all requests equally despite significant differences in cache reuse opportunities.

As a result, cache-friendly requests may be delayed while cache-unfriendly requests consume GPU memory and increase latency.

Observation

Experimental results from the LLM Serving Cache prototype showed:

  • Cold request latency: ~8.5s
  • Warm request latency: ~5.5s
  • Prefix-reuse latency: ~1.9s

These results suggest that cache state strongly influences request cost: the prefix-reuse request completed roughly 4.5x faster than the cold request.
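The relative cost of each case can be derived directly from the three approximate latencies listed above. The following is a minimal sketch; the dictionary keys and function names are illustrative and not taken from the prototype.

```python
# Approximate latencies reported in the Observation section, in seconds.
latency_s = {"cold": 8.5, "warm": 5.5, "prefix_reuse": 1.9}


def speedup_vs_cold(latencies):
    """Return each case's speedup factor relative to the cold request."""
    cold = latencies["cold"]
    return {name: cold / value for name, value in latencies.items()}


def reduction_vs_cold(latencies):
    """Return each case's fractional latency reduction relative to cold."""
    cold = latencies["cold"]
    return {name: 1.0 - value / cold for name, value in latencies.items()}


for name, factor in speedup_vs_cold(latency_s).items():
    print(f"{name}: {factor:.2f}x relative to cold")
```

Because the source figures are approximate single values, these ratios are only indicative of the spread an admission policy could exploit.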

Research Question

Can admission decisions improve latency and throughput by incorporating cache-awareness?

Proposed Mechanism

Admission decisions consider:

  • KV cache occupancy
  • Prefix reuse probability
  • Session affinity
  • GPU memory pressure
  • Queue depth

Each request receives an admission score.

Higher scores indicate higher expected cache efficiency.
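One possible shape for the score is a weighted sum over the five signals listed above. The sketch below is an assumption, not the RFC's definition: the `AdmissionSignals` type, the normalization of every signal to [0, 1], the weight values, and the choice to treat occupancy, memory pressure, and queue depth as penalties are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdmissionSignals:
    """Per-request inputs, each assumed to be normalized to [0, 1]."""
    kv_cache_occupancy: float
    prefix_reuse_probability: float
    session_affinity: float
    gpu_memory_pressure: float
    queue_depth: float


# Hypothetical weights: positive terms reward expected cache reuse,
# negative terms penalize resource pressure. Real values would need tuning.
WEIGHTS = {
    "prefix_reuse_probability": 0.40,
    "session_affinity": 0.20,
    "kv_cache_occupancy": -0.15,
    "gpu_memory_pressure": -0.15,
    "queue_depth": -0.10,
}


def admission_score(signals, weights=WEIGHTS):
    """Weighted sum of the signals; a higher score means higher expected cache efficiency."""
    return sum(weight * getattr(signals, name) for name, weight in weights.items())


cache_friendly = AdmissionSignals(0.5, 0.9, 1.0, 0.5, 0.5)
cache_unfriendly = AdmissionSignals(0.5, 0.0, 0.0, 0.5, 0.5)
print(admission_score(cache_friendly) > admission_score(cache_unfriendly))
```

A linear form is chosen here only because it is easy to inspect and tune; the evaluation could equally compare other scoring functions.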

Example

Request A

  • Existing prefix cache
  • High reuse probability
  • Low incremental memory cost

Request B

  • No cache reuse
  • Large memory footprint
  • New context allocation required

Under cache-aware admission, Request A would receive the higher score and may be admitted ahead of Request B.
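The ordering in this example can be sketched with a priority queue keyed on a toy score. Everything concrete here is an assumption made for illustration: the `PendingRequest` fields, the 16 GB memory budget, the scoring formula, and the numbers assigned to Requests A and B.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingRequest:
    name: str
    reuse_probability: float      # 0.0 to 1.0
    incremental_memory_gb: float  # extra KV cache memory needed if admitted


def score(request, memory_budget_gb=16.0):
    """Toy score: expected reuse minus the fraction of the budget consumed."""
    memory_fraction = min(request.incremental_memory_gb / memory_budget_gb, 1.0)
    return request.reuse_probability - memory_fraction


def admission_order(requests):
    """Return request names from highest to lowest score."""
    # heapq is a min-heap, so the score is negated; the arrival index
    # breaks ties in favor of the earlier request.
    heap = [(-score(r), i, r.name) for i, r in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical values matching the qualitative description above.
request_a = PendingRequest("A", reuse_probability=0.9, incremental_memory_gb=0.5)
request_b = PendingRequest("B", reuse_probability=0.0, incremental_memory_gb=8.0)

print(admission_order([request_b, request_a]))  # ['A', 'B']
```

Request B arrives first in this usage example, yet A is admitted first because its score is higher. A real policy would likely also need a bound on how long a low-scoring request can wait.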

Evaluation

Baseline:

  • FIFO admission
  • Queue-length admission

Proposed:

  • Cache-aware admission

Measure:

  • p50 latency
  • p95 latency
  • Throughput
  • Cache hit ratio
  • GPU memory utilization
  • Request rejection rate
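Most of the metrics above can be computed from a per-request log. The sketch below assumes a hypothetical `RequestRecord` with a latency (or `None` when the request was rejected) and a cache-hit flag, and uses the nearest-rank percentile definition. GPU memory utilization is left out because it would come from device telemetry rather than from the request log.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestRecord:
    latency_s: Optional[float]  # None if the request was rejected
    cache_hit: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, wall_clock_s):
    """Aggregate one run; raises ValueError if no request was admitted."""
    admitted = [r for r in records if r.latency_s is not None]
    latencies = [r.latency_s for r in admitted]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "throughput_rps": len(admitted) / wall_clock_s,
        "cache_hit_ratio": sum(r.cache_hit for r in admitted) / len(admitted),
        "rejection_rate": (len(records) - len(admitted)) / len(records),
    }
```

Computing the same summary for each admission policy under identical load keeps the comparison between baselines and the proposed policy like-for-like.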

Experimental Variables

Concurrency:

  • 1
  • 5
  • 10
  • 20
  • 50

Cache pressure:

  • Low
  • Medium
  • High

Reuse probability:

  • 0%
  • 25%
  • 50%
  • 75%
  • 100%
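If the variables are crossed in a full-factorial design (an assumption; the RFC does not state the design), the run matrix can be enumerated as below. The policy names and dictionary keys are hypothetical labels for the baselines and the proposed policy listed under Evaluation.

```python
from itertools import product

CONCURRENCY = [1, 5, 10, 20, 50]
CACHE_PRESSURE = ["low", "medium", "high"]
REUSE_PROBABILITY = [0.0, 0.25, 0.5, 0.75, 1.0]
# Hypothetical identifiers for the two baselines and the proposed policy.
POLICIES = ["fifo", "queue_length", "cache_aware"]


def experiment_grid():
    """Return one configuration dict per combination of policy and variables."""
    return [
        {
            "policy": policy,
            "concurrency": concurrency,
            "cache_pressure": pressure,
            "reuse_probability": reuse,
        }
        for policy, concurrency, pressure, reuse in product(
            POLICIES, CONCURRENCY, CACHE_PRESSURE, REUSE_PROBABILITY
        )
    ]


grid = experiment_grid()
print(len(grid))  # 225 = 3 policies x 5 concurrency x 3 pressure x 5 reuse levels
```

If 225 runs per repetition is too costly, a subset of the grid could be sampled instead, at the price of weaker coverage of interactions between variables.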

Success Criteria

Demonstrate lower p50 and p95 latency and a higher cache hit ratio than queue-length-only admission control under the same concurrency, cache pressure, and reuse probability settings.

Future Extensions

  • Multi-GPU cache placement
  • Distributed KV cache sharing
  • Prefix-aware routing
  • Cache-aware scheduling
  • Cost-aware inference routing

Labels: enhancement
