Enable chunked prefill on aice 1.22 #2070
base: aice/v1.22.0
Conversation
Force-pushed from 6f24ce2 to da62a3d
Force-pushed from 7a60181 to dd5eb1f
vllm/worker/hpu_model_runner.py
Outdated
if any(context_lens):
    assert not self.scheduler_config.chunked_prefill_enabled
    # assert not self.scheduler_config.chunked_prefill_enabled
    # prefix caching
Comment is out of date. Remove the deprecated (commented out) assert and replace the comment with something like:
# prefix caching or chunked prefill
Done
Recommend adjustments be made to
Yes. Currently this patch is doing chunked prefill with the prompt length (aligned to block size). I'm thinking whether we can change this to follow the chunk size provided by the scheduler, but we may need to consider the padding and warmup combinations. Do you have any suggestions on this?
Another question: when there are prefill and decode in one batch, are the decode tokens not padded? The tokens for prefill and decode will be concatenated and sent to model.forward(); will this cause dynamic shapes?
Force-pushed from ff0da82 to 0f10ea6
Co-authored-by: Jiang, Zhoulong <[email protected]>
Signed-off-by: jkyu <[email protected]>
Force-pushed from aaa66c1 to f087809
I have added the logic to pad the decode tokens during warmup, thanks.
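For reference, a minimal sketch of that idea, with a hypothetical helper and bucket names (not the PR's actual code): the decode tokens in a mixed prefill+decode batch are padded up to a warmed-up bucket size so the concatenated input keeps a static shape on HPU.

```python
import torch

def pad_decode_tokens(decode_tokens: torch.Tensor,
                      bucket_sizes: list,
                      pad_id: int = 0) -> torch.Tensor:
    """Pad the decode part of a mixed batch to the nearest warmed-up bucket.

    Hypothetical helper for illustration only; the point is that the
    prefill + decode concatenation keeps a static, pre-warmed shape.
    """
    num_decode = decode_tokens.shape[0]
    # Pick the smallest bucket that fits, falling back to the largest one.
    target = next((b for b in sorted(bucket_sizes) if b >= num_decode),
                  max(bucket_sizes))
    padding = target - num_decode
    if padding > 0:
        pad = torch.full((padding,), pad_id, dtype=decode_tokens.dtype)
        decode_tokens = torch.cat([decode_tokens, pad])
    return decode_tokens

# Example: 3 decode tokens padded to a 4-token bucket before concatenation.
print(pad_decode_tokens(torch.tensor([11, 12, 13]), bucket_sizes=[4, 8, 16]))
```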
Signed-off-by: jkyu <[email protected]>
@YuJiankang Please fix all the pre-commit issues.
@YuJiankang, please update scripts/README.md with how to enable chunked prefill and the recommended scenarios.
| os.environ["VLLM_SKIP_WARMUP"] = "true" | ||
| os.environ['VLLM_CONTIGUOUS_PA'] = 'false' | ||
| os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1' | ||
| os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true' |
Do we need the env vars below for the aice/v1.22.0 branch?
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
Don't need, removed.
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
os.environ['PT_HPU_LAZY_MODE']='1'
os.environ['VLLM_DELAYED_SAMPLING']='false'
does chunked prefill conflict with delayed sampling?
Yes. With chunked prefill, not every forward pass generates a token, which is incompatible with the current delayed sampling.
value: torch.Tensor, kv_cache: torch.Tensor,
attn_metadata: HPUAttentionMetadata,
is_prefill: bool) -> HPUAttentionData:
attn_data: HPUAttentionData = HPUAttentionData()
It would be good to add a description of the preprocess_forward API, including the purpose of the API, its arguments, and its return values.
Done, added the description.
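For context, a sketch of what such a description might look like. This is illustrative only: the diff fragment above shows just the tail of the signature, so any parameters before value are omitted here, and the wording is not the docstring that was actually added.

```python
# Illustrative skeleton only -- not the actual code added in the PR.
def preprocess_forward(self, value, kv_cache, attn_metadata, is_prefill):
    """Prepare attention inputs before running the HPU attention kernels.

    Args:
        value: value tensor for the tokens in the current batch.
        kv_cache: paged KV cache that new key/value entries are written into.
        attn_metadata: HPUAttentionMetadata describing the slot mapping,
            block lists and the prefill/decode split of the batch.
        is_prefill: whether this call handles the prefill portion.

    Returns:
        HPUAttentionData holding the reshaped tensors and cache references
        consumed by the prefill/decode attention paths.
    """
    ...
```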
vllm/attention/backends/hpu_attn.py
Outdated
slot_mapping = attn_metadata.slot_mapping.flatten(
) if attn_metadata.slot_mapping is not None else None
batch_size = attn_metadata.num_prefills
# Convert Flat inputs into 2D Inputs
Wrong comment? It should be a 3D input, i.e. [batch_size, seq_len, hidden_size].
Done, fixed the comment.
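As a small illustration of the corrected comment, assuming a flat prefill input of total_tokens x hidden_size (the shapes and names below are made up for the example):

```python
import torch

# Hypothetical shapes for the prefill part of a batch.
batch_size, seq_len, hidden_size = 2, 128, 4096
flat_input = torch.randn(batch_size * seq_len, hidden_size)

# Convert flat inputs into 3D inputs: [batch_size, seq_len, hidden_size].
prefill_input = flat_input.view(batch_size, seq_len, hidden_size)
assert prefill_input.shape == (batch_size, seq_len, hidden_size)
```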
vllm/attention/backends/hpu_attn.py
Outdated
attn_metadata: HPUAttentionMetadata,
output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""Forward pass with xFormers and PagedAttention.
wrong comment?
Done, fixed the comment.
| "VLLM_SLEEP_WHEN_IDLE": | ||
| lambda: bool(int(os.getenv("VLLM_SLEEP_WHEN_IDLE", "0"))), | ||
|
|
||
| # Use chunked prefill with dynamic input shapes for HPU backend. |
What's the meaning of VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT? When should it be set?
VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT is used to optimize performance for prefill with batch size 1. It uses the chunk size as the query length to reduce padding.
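For illustration, reading the flag presumably follows the same pattern as the VLLM_SLEEP_WHEN_IDLE entry quoted above; the default value in this sketch is an assumption, not taken from the PR.

```python
import os

# Assumed registration sketch mirroring the entry shown in the diff above;
# the default of "0" is a guess, not the value defined in this PR.
environment_variables = {
    "VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT":
    lambda: bool(int(os.getenv("VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT", "0"))),
}

# Set VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT=1 to size the prefill query by
# the chunk length (reduces padding for batch-size-1 prefill).
use_dynamic_input = environment_variables[
    "VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT"]()
```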
vllm/worker/hpu_model_runner.py
Outdated
paddings = [max_len - q for q in temp_query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))
for i, seq_group_metadata in enumerate(seq_group_metadata_list):
Why do we need to add these lines?
selected_token_indices contains only the sequences that require sampling. The paddings list must match this set one-to-one, so when chunked prefill is enabled, prompt sequences that don't need sampling (no token is output because their prefill is not finished) must be removed from paddings; otherwise there is a mismatch that causes an accuracy issue.
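A rough sketch of the alignment being described, with made-up lengths (not the PR's exact code): cumulative padding offsets are built per sequence, and sequences whose prefill chunk is unfinished (so nothing is sampled) are dropped so the offsets line up one-to-one with selected_token_indices.

```python
import itertools

# Hypothetical batch: per-sequence query lengths padded up to max_len.
query_lens = [48, 64, 64]             # first entry is an unfinished prefill chunk
needs_sampling = [False, True, True]  # no token is sampled for that chunk
max_len = 64

# Cumulative padding offsets, following the lines quoted in the diff above.
paddings = [max_len - q for q in query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))   # [0, 16, 16]

# Keep offsets only for sequences that actually sample a token.
paddings_for_sampling = [
    p for p, sample in zip(paddings, needs_sampling) if sample
]
print(paddings_for_sampling)  # [16, 16]
```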
vllm/worker/hpu_model_runner.py
Outdated
align_worker=align_worker)

selected_token_indices = None
temp_query_lens = query_lens.copy()
Suggest renaming temp_query_lens to a more meaningful name.
Done
vllm/worker/hpu_model_runner.py
Outdated
| logger_msg = "Multimodal bucket : " + str(self.multimodal_buckets) | ||
| logger.info(logger_msg) | ||
|
|
||
| if max_batch_size < 1: |
When will max_batch_size < 1? Should we print a warning message or raise an exception if this case is not expected?
For chunked prefill, self.max_num_batched_tokens is set to the chunk size and may be less than max_seq_len. In the previous code logic, max_batch_size = min(self.max_num_seqs, self.max_num_batched_tokens // max_seq_len), which caused max_batch_size = 0. I have updated the code in the HPU extension to get the correct max_seq_len for the chunked prefill case.
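A quick numeric illustration of the failure mode, using assumed values (the chunk size and sequence length below are examples, not the configuration from the PR):

```python
# Example only: with chunked prefill, max_num_batched_tokens is the chunk size
# and can be smaller than max_seq_len.
max_num_seqs = 128
max_num_batched_tokens = 2048   # chunk size
max_seq_len = 4096

# Previous logic: integer division truncates to zero.
max_batch_size = min(max_num_seqs, max_num_batched_tokens // max_seq_len)
print(max_batch_size)  # 0 -> warmup would see an invalid batch size
```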
num_iters=3,
align_worker=False,
is_dummy_run=False) -> None:
phase = 'mix'
Please add a description of the purpose of warmup_scenario_mix and what it does.
Can you re-use the current warmup_scenario function? There seems to be a lot of common code there.
Done
if attn_metadata is None or attn_metadata.block_list is None:

block_list = attn_metadata.block_list if attn_metadata \
    and attn_metadata.block_list is not None else None
It looks like block_list is always None in this case (attn_metadata is None or attn_metadata.block_list is None).
For the first chunk of a sequence, block_list is always None, so we enter this branch.
vllm/attention/backends/hpu_attn.py
Outdated
prompt_output = out.reshape(prefill_batch_size, prefill_seq_len,
                            prefill_hidden_size)
htorch.core.mark_step()
Does mark_step need to be inside the prefill logic? Otherwise there is no HPU operation before executing mark_step.
It was added for debugging; it is not needed in the final code and has been removed.
vllm/attention/backends/hpu_attn.py
Outdated
**self.common_attention_args(attn_metadata.decode_block_list, attn_data.key_cache,
                             attn_data.value_cache,
                             attn_metadata.block_size))
htorch.core.mark_step()
Does mark_step need to be inside the decode logic? Otherwise there is no HPU operation before executing mark_step.
It was added for debugging; it is not needed in the final code and has been removed.
Done, updated the README.md.
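For readers without the updated scripts/README.md at hand, a generic sketch of how chunked prefill is usually enabled through standard vLLM engine arguments; the model and chunk size below are placeholders, not the values recommended in this PR.

```python
from vllm import LLM

# Placeholder settings for illustration; see scripts/README.md in this branch
# for the HPU-recommended configuration and scenarios.
llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # prefill chunk size per scheduler step
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```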
Force-pushed from 7815110 to 9d42fa5

This PR ports the chunked prefill related patches from deepseek_r1 to aice 1.22, and works together with HabanaAI/vllm-hpu-extension#381.