Problem
Current admission control mechanisms primarily rely on queue length, request count, or static concurrency limits.
These approaches treat all requests equally despite significant differences in cache reuse opportunities.
As a result, cache-friendly requests may be delayed while cache-unfriendly requests consume GPU memory and increase latency.
Observation
Experimental results from the LLM Serving Cache prototype showed:
- Cold request latency: ~8.5s
- Warm request latency: ~5.5s
- Prefix-reuse latency: ~1.9s
These results suggest that cache state strongly influences request cost: the prefix-reuse request finished in under a quarter of the cold request's time.
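To make the gap concrete, the approximate latencies listed above can be turned into relative speedups. This is a minimal sketch: the dictionary keys and the helper name `speedup_vs_cold` are labels chosen here for illustration, and the inputs are the rounded figures from the list, so the outputs are only as precise as those.

```python
# Approximate latencies in seconds, copied from the observations above.
latency_s = {"cold": 8.5, "warm": 5.5, "prefix_reuse": 1.9}


def speedup_vs_cold(latencies):
    """Return how many times faster each case is than the cold case."""
    cold = latencies["cold"]
    return {name: cold / value for name, value in latencies.items()}


speedups = speedup_vs_cold(latency_s)
for name, factor in speedups.items():
    # Fraction of the cold latency that is saved in this case.
    saved = 1.0 - latency_s[name] / latency_s["cold"]
    print(f"{name:>12}: {factor:.2f}x faster than cold, {saved:.0%} lower latency")
```

With these inputs the warm case comes out at about 1.55x and the prefix-reuse case at about 4.47x relative to cold.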
Research Question
Can incorporating cache awareness into admission decisions improve latency and throughput?
Proposed Mechanism
Admission decisions consider:
- KV cache occupancy
- Prefix reuse probability
- Session affinity
- GPU memory pressure
- Queue depth
Each request receives an admission score.
Higher scores indicate higher expected cache efficiency.
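The notes do not say how the five signals are combined into a score. One possible reading is a weighted sum, sketched below. Everything in it is an assumption made for illustration: the `RequestSignals` type, the normalization of every signal to [0, 1], the weight values, and the choice to let occupancy, memory pressure, and queue depth lower the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSignals:
    """Per-request inputs, each assumed to be normalized to [0, 1]."""
    kv_cache_occupancy: float        # fraction of the KV cache already in use
    prefix_reuse_probability: float  # estimated chance the prefix is cached
    session_affinity: float          # 1.0 if the session's state is resident
    gpu_memory_pressure: float       # fraction of GPU memory in use
    queue_depth: float               # queue length divided by a chosen maximum


# Hypothetical weights: reuse signals raise the score, pressure signals lower it.
WEIGHTS = {
    "prefix_reuse_probability": 0.40,
    "session_affinity": 0.20,
    "kv_cache_occupancy": -0.15,
    "gpu_memory_pressure": -0.15,
    "queue_depth": -0.10,
}


def admission_score(signals: RequestSignals) -> float:
    """Weighted sum of the signals; higher means higher expected cache efficiency."""
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())
```

Under identical system load, a request with high reuse probability and session affinity scores above one with neither, which matches the intent that higher scores indicate higher expected cache efficiency.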
Example
Request A
- Existing prefix cache
- High reuse probability
- Low incremental memory cost
Request B
- No cache reuse
- Large memory footprint
- New context allocation required
Cache-aware admission may prioritize Request A.
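The example can be played out with a small priority queue that admits the highest-scoring pending request first. This is a sketch under stated assumptions: `toy_score`, the 8 GB memory budget, and the reuse and memory numbers given to Request A and Request B are all made up for illustration and do not come from the prototype.

```python
import heapq
from itertools import count


def toy_score(reuse_probability: float, incremental_memory_gb: float,
              memory_budget_gb: float = 8.0) -> float:
    """Hypothetical score: expected reuse minus the share of memory budget used."""
    return reuse_probability - incremental_memory_gb / memory_budget_gb


pending = []       # heap of (negated score, arrival index, request name)
arrival = count()  # tie-breaker so equal scores keep arrival order


def enqueue(name: str, reuse_probability: float, incremental_memory_gb: float) -> None:
    score = toy_score(reuse_probability, incremental_memory_gb)
    # heapq is a min-heap, so the score is negated to pop the highest first.
    heapq.heappush(pending, (-score, next(arrival), name))


# Request B arrives first, Request A second (illustrative numbers only).
enqueue("Request B", reuse_probability=0.0, incremental_memory_gb=6.0)
enqueue("Request A", reuse_probability=0.9, incremental_memory_gb=0.5)

admission_order = [heapq.heappop(pending)[2] for _ in range(len(pending))]
print(admission_order)  # ['Request A', 'Request B']
```

Although Request B arrived first, Request A is admitted first because its expected reuse is high and its incremental memory cost is low.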
Evaluation
Baseline:
- FIFO admission
- Queue-length admission
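The notes do not define the baselines precisely. One plausible reading, sketched here, is a controller that serves requests in arrival order and rejects new ones once the queue reaches a fixed length. The class name `QueueLengthAdmission` and its methods are hypothetical. Note that nothing in it inspects cache state, which is the limitation the proposal targets.

```python
from collections import deque


class QueueLengthAdmission:
    """Baseline sketch: FIFO service with a fixed queue-length limit."""

    def __init__(self, max_queue_length: int):
        self.max_queue_length = max_queue_length
        self.queue = deque()

    def offer(self, request_id: str) -> bool:
        """Queue the request and return True, or return False if the queue is full."""
        if len(self.queue) >= self.max_queue_length:
            return False
        self.queue.append(request_id)
        return True

    def next_request(self):
        """Pop the oldest queued request (FIFO), or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

With `max_queue_length=2`, offering three requests in a row accepts the first two and rejects the third, regardless of how cache-friendly the third request is.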
Proposed:
- Cache-aware admission (score-based, as described under Proposed Mechanism)
Measure:
- p50 latency
- p95 latency
- Throughput
- Cache hit ratio
- GPU memory utilization
- Request rejection rate
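Most of the measures above can be computed from a per-request log. The sketch below assumes a hypothetical `RequestRecord` with latency, cache-hit, and rejection fields, and uses the nearest-rank definition of a percentile; other percentile definitions would give slightly different values. GPU memory utilization is left out because it has to be sampled from the device rather than derived from request records.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestRecord:
    latency_s: Optional[float]  # None when the request was rejected
    cache_hit: bool
    rejected: bool


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty list, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[RequestRecord], wall_clock_s: float) -> dict:
    """Summarize a run; assumes at least one request was served."""
    served = [r for r in records if not r.rejected]
    latencies = [r.latency_s for r in served]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "throughput_rps": len(served) / wall_clock_s,
        "cache_hit_ratio": sum(r.cache_hit for r in served) / len(served),
        "rejection_rate": (len(records) - len(served)) / len(records),
    }
```

Latency and cache hit ratio are computed over served requests only, so a controller cannot improve them simply by rejecting more traffic without that showing up in the rejection rate.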
Experimental Variables
Concurrency:
Cache pressure:
Reuse probability:
Success Criteria
Demonstrate lower p50 and p95 latency and a higher cache hit ratio than queue-length-only admission control.
Future Extensions
- Multi-GPU cache placement
- Distributed KV cache sharing
- Prefix-aware routing
- Cache-aware scheduling
- Cost-aware inference routing