
Commit 40e0639

Interleave chunked prefills with single decoding steps (#558)
# Description

* Deprioritize chunked prefill when the previous step was already a prefill and requests are decoding
* Evaluate the blocks condition at the time of the first chunked prefill
* Evaluate the remaining conditions at the time of the last chunked prefill
* Tests will be implemented in other PRs:
  * interleaving logic correctness: e.g. that after step X, which was a prefill, step X+1 is a decode
  * ✅ DONE constraint correctness: verify that we cannot schedule the first chunked prefill when there aren't enough blocks

---------

Signed-off-by: Sophie du Couédic <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
1 parent bf07727 commit 40e0639
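
The interleaving rule from the description can be sketched as a small decision function. This is a minimal illustration, not the scheduler code from this commit: `SchedulerState`, its fields, and `next_step` are hypothetical names, and the block-availability checks mentioned in the description are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class SchedulerState:
    """Hypothetical scheduler snapshot; not the vllm-spyre data model."""
    prev_step_was_prefill: bool
    num_decoding_requests: int
    has_pending_prefill_chunk: bool


def next_step(state: SchedulerState, interleave: bool = True) -> str:
    """Return "prefill", "decode", or "idle" for the upcoming step."""
    if not state.has_pending_prefill_chunk:
        # Nothing left to prefill: decode if any request is decoding.
        return "decode" if state.num_decoding_requests > 0 else "idle"
    if (interleave and state.prev_step_was_prefill
            and state.num_decoding_requests > 0):
        # Deprioritize the next chunk so decoding requests get one step.
        return "decode"
    return "prefill"
```

Under these assumptions, a prefill chunk is followed by exactly one decode step whenever requests are decoding; with no decoding requests, or with `interleave=False`, chunks are scheduled back to back.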


3 files changed, +217 -77 lines changed


vllm_spyre/envs.py

Lines changed: 7 additions & 0 deletions
@@ -13,6 +13,7 @@
 VLLM_SPYRE_PERF_METRIC_LOGGING_DIR: str = "/tmp"
 VLLM_SPYRE_OVERRIDE_SIGNALS_HANDLER: bool = False
 VLLM_SPYRE_USE_CHUNKED_PREFILL: bool = False
+VLLM_SPYRE_CP_INTERLEAVE_STEPS: bool = True
 # Prompt logprobs are behind a flag because they're only supported for
 # static batching and require passing back the hidden states for the full
 # prefill on every request. This could incur a heavy performance penalty in
@@ -172,6 +173,12 @@ def _backend_backwards_compat() -> str:
 # single prefill is used.
 "VLLM_SPYRE_USE_CHUNKED_PREFILL":
 lambda: bool(int(os.getenv("VLLM_SPYRE_USE_CHUNKED_PREFILL", "0"))),
+
+# Feature Flag
+# Works only with chunked prefill enabled. If set, prefill steps are
+# interleaved with a decode step
+"VLLM_SPYRE_CP_INTERLEAVE_STEPS":
+lambda: bool(int(os.getenv("VLLM_SPYRE_CP_INTERLEAVE_STEPS", "1"))),
 }
 # --8<-- [end:env-vars-definition]
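
The `bool(int(os.getenv(...)))` pattern in the hunk above accepts only integer strings. The sketch below reproduces that parsing in isolation to show how the new flag's default of `"1"` behaves. `parse_int_flag` is a hypothetical helper, not part of `vllm_spyre/envs.py`, and it takes a mapping instead of reading `os.environ` directly so the behavior can be checked without touching the process environment.

```python
import os
from typing import Mapping


def parse_int_flag(name: str, default: str,
                   environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper mirroring bool(int(os.getenv(name, default)))."""
    return bool(int(environ.get(name, default)))


# Unset: the default "1" applies, so interleaving is on.
unset = parse_int_flag("VLLM_SPYRE_CP_INTERLEAVE_STEPS", "1", {})

# Explicit "0" turns interleaving off.
disabled = parse_int_flag("VLLM_SPYRE_CP_INTERLEAVE_STEPS", "1",
                          {"VLLM_SPYRE_CP_INTERLEAVE_STEPS": "0"})

# Non-integer strings such as "false" are rejected by int().
try:
    parse_int_flag("VLLM_SPYRE_CP_INTERLEAVE_STEPS", "1",
                   {"VLLM_SPYRE_CP_INTERLEAVE_STEPS": "false"})
    rejected = False
except ValueError:
    rejected = True
```

Because `VLLM_SPYRE_USE_CHUNKED_PREFILL` defaults to `"0"` and the new flag defaults to `"1"`, interleaving applies as soon as chunked prefill is switched on, unless `VLLM_SPYRE_CP_INTERLEAVE_STEPS=0` is set explicitly.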
