Enable chunked prefill on aice 1.22 #2070
base: aice/v1.22.0
Conversation
Force-pushed from 6f24ce2 to da62a3d
Force-pushed from 7a60181 to dd5eb1f
vllm/worker/hpu_model_runner.py
Outdated
if any(context_lens):
    assert not self.scheduler_config.chunked_prefill_enabled
    # assert not self.scheduler_config.chunked_prefill_enabled
    # prefix caching
Comment is out of date. Remove the deprecated (commented out) assert and replace the comment with something like:
# prefix caching or chunked prefill
Done
Recommend adjustments be made to
Yes. Currently this patch is doing chunked prefill with the prompt length (aligned to block size). I'm thinking whether we can change this to follow the chunk size provided by the scheduler, but we may need to consider the padding and warmup combinations. Do you have any suggestions on this?
Another question: when there are prefill and decode in one batch, are the decode tokens not padded? The tokens for prefill and decode will be concatenated and sent to model.forward(); will this cause dynamic shapes?
Force-pushed from ff0da82 to 0f10ea6
Co-authored-by: Jiang, Zhoulong <[email protected]>
Signed-off-by: jkyu <[email protected]>
Force-pushed from aaa66c1 to f087809
I have added the logic to pad the decode tokens during warmup, thanks.
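For reference, a minimal sketch of that idea, with a hypothetical helper and bucket names (not the PR's actual code): the decode tokens in a mixed prefill+decode batch are padded up to a warmed-up bucket size so the concatenated input keeps a static shape on HPU.

```python
import torch

def pad_decode_tokens(decode_tokens: torch.Tensor,
                      bucket_sizes: list,
                      pad_id: int = 0) -> torch.Tensor:
    """Pad the decode part of a mixed batch to the nearest warmed-up bucket.

    Hypothetical helper for illustration only; the point is that the
    prefill + decode concatenation keeps a static, pre-warmed shape.
    """
    num_decode = decode_tokens.shape[0]
    # Pick the smallest bucket that fits, falling back to the largest one.
    target = next((b for b in sorted(bucket_sizes) if b >= num_decode),
                  max(bucket_sizes))
    padding = target - num_decode
    if padding > 0:
        pad = torch.full((padding,), pad_id, dtype=decode_tokens.dtype)
        decode_tokens = torch.cat([decode_tokens, pad])
    return decode_tokens

# Example: 3 decode tokens padded to a 4-token bucket before concatenation.
print(pad_decode_tokens(torch.tensor([11, 12, 13]), bucket_sizes=[4, 8, 16]))
```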
Signed-off-by: jkyu <[email protected]>
@YuJiankang Please fix all the pre-commit issues.
@YuJiankang, please update scripts/README.md with how to enable chunked prefill and the recommended scenarios.
| os.environ["VLLM_SKIP_WARMUP"] = "true" | ||
| os.environ['VLLM_CONTIGUOUS_PA'] = 'false' | ||
| os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1' | ||
| os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true' |
Do we need the env vars below for the aice/v1.22.0 branch?
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
Don't need, removed.
os.environ['VLLM_MLA_PERFORM_MATRIX_ABSORPTION']='0'
os.environ['VLLM_MTP_PRINT_ACCPET_RATE']='0'
os.environ['PT_HPU_LAZY_MODE']='1'
os.environ['VLLM_DELAYED_SAMPLING']='false'
does chunked prefill conflict with delayed sampling?
Yes. With chunked prefill, not every forward pass generates a token, which is incompatible with the current delayed sampling.
value: torch.Tensor, kv_cache: torch.Tensor,
attn_metadata: HPUAttentionMetadata,
is_prefill: bool) -> HPUAttentionData:
attn_data: HPUAttentionData = HPUAttentionData()
It would be good to add a description of the preprocess_forward API, including the purpose of the API, its arguments, and its return values.
Done, added the description.
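For context, a sketch of what such a description might look like. This is illustrative only: the diff fragment above shows just the tail of the signature, so any parameters before value are omitted here, and the wording is not the docstring that was actually added.

```python
# Illustrative skeleton only -- not the actual code added in the PR.
def preprocess_forward(self, value, kv_cache, attn_metadata, is_prefill):
    """Prepare attention inputs before running the HPU attention kernels.

    Args:
        value: value tensor for the tokens in the current batch.
        kv_cache: paged KV cache that new key/value entries are written into.
        attn_metadata: HPUAttentionMetadata describing the slot mapping,
            block lists and the prefill/decode split of the batch.
        is_prefill: whether this call handles the prefill portion.

    Returns:
        HPUAttentionData holding the reshaped tensors and cache references
        consumed by the prefill/decode attention paths.
    """
    ...
```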
vllm/attention/backends/hpu_attn.py
Outdated
slot_mapping = attn_metadata.slot_mapping.flatten(
) if attn_metadata.slot_mapping is not None else None
batch_size = attn_metadata.num_prefills
# Convert Flat inputs into 2D Inputs
Wrong comment? It should be a 3D input, i.e. [batch_size, seq_len, hidden_size].
Done, fixed the comment.
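As a small illustration of the corrected comment, assuming a flat prefill input of total_tokens x hidden_size (the shapes and names below are made up for the example):

```python
import torch

# Hypothetical shapes for the prefill part of a batch.
batch_size, seq_len, hidden_size = 2, 128, 4096
flat_input = torch.randn(batch_size * seq_len, hidden_size)

# Convert flat inputs into 3D inputs: [batch_size, seq_len, hidden_size].
prefill_input = flat_input.view(batch_size, seq_len, hidden_size)
assert prefill_input.shape == (batch_size, seq_len, hidden_size)
```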
vllm/attention/backends/hpu_attn.py
Outdated
attn_metadata: HPUAttentionMetadata,
output: Optional[torch.Tensor] = None,
) -> torch.Tensor:
"""Forward pass with xFormers and PagedAttention.
wrong comment?
Done, fixed the comment.
| "VLLM_SLEEP_WHEN_IDLE": | ||
| lambda: bool(int(os.getenv("VLLM_SLEEP_WHEN_IDLE", "0"))), | ||
|
|
||
| # Use chunked prefill with dynamic input shapes for HPU backend. |
What's the meaning of VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT? When should it be set?
VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT is used to optimize performance for prefill with batch size 1. It uses the chunk size as the query length to reduce padding.
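For illustration, reading the flag presumably follows the same pattern as the VLLM_SLEEP_WHEN_IDLE entry quoted above; the default value in this sketch is an assumption, not taken from the PR.

```python
import os

# Assumed registration sketch mirroring the entry shown in the diff above;
# the default of "0" is a guess, not the value defined in this PR.
environment_variables = {
    "VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT":
    lambda: bool(int(os.getenv("VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT", "0"))),
}

# Set VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT=1 to size the prefill query by
# the chunk length (reduces padding for batch-size-1 prefill).
use_dynamic_input = environment_variables[
    "VLLM_HPU_CHUNKED_PREFILL_DYNAMIC_INPUT"]()
```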
vllm/worker/hpu_model_runner.py
Outdated
paddings = [max_len - q for q in temp_query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))
for i, seq_group_metadata in enumerate(seq_group_metadata_list):
Why do we need to add these lines?
selected_token_indices contains only the sequences that require sampling. The paddings list must match this set one-to-one, so when chunked prefill is enabled, prompt sequences that don't need sampling (no token is output because their prefill is not finished) must be removed from paddings; otherwise there is a mismatch that causes an accuracy issue.
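A rough sketch of the alignment being described, with made-up lengths (not the PR's exact code): cumulative padding offsets are built per sequence, and sequences whose prefill chunk is unfinished (so nothing is sampled) are dropped so the offsets line up one-to-one with selected_token_indices.

```python
import itertools

# Hypothetical batch: per-sequence query lengths padded up to max_len.
query_lens = [48, 64, 64]             # first entry is an unfinished prefill chunk
needs_sampling = [False, True, True]  # no token is sampled for that chunk
max_len = 64

# Cumulative padding offsets, following the lines quoted in the diff above.
paddings = [max_len - q for q in query_lens]
paddings = [0] + paddings[:-1]
paddings = list(itertools.accumulate(paddings))   # [0, 16, 16]

# Keep offsets only for sequences that actually sample a token.
paddings_for_sampling = [
    p for p, sample in zip(paddings, needs_sampling) if sample
]
print(paddings_for_sampling)  # [16, 16]
```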
vllm/worker/hpu_model_runner.py
Outdated
align_worker=align_worker)

selected_token_indices = None
temp_query_lens = query_lens.copy()
Suggest renaming temp_query_lens to a more meaningful name.
Done
vllm/worker/hpu_model_runner.py
Outdated
| logger_msg = "Multimodal bucket : " + str(self.multimodal_buckets) | ||
| logger.info(logger_msg) | ||
|
|
||
| if max_batch_size < 1: |
When will max_batch_size < 1? Should we print a warning message or raise an exception if this case is not expected?
For chunked prefill, self.max_num_batched_tokens is set to the chunk size and may be less than max_seq_len. In the previous code logic, max_batch_size = min(self.max_num_seqs, self.max_num_batched_tokens // max_seq_len), which caused max_batch_size = 0. I have updated the code in the HPU extension to get the correct max_seq_len for the chunked prefill case.
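A quick numeric illustration of the failure mode, using assumed values (the chunk size and sequence length below are examples, not the configuration from the PR):

```python
# Example only: with chunked prefill, max_num_batched_tokens is the chunk size
# and can be smaller than max_seq_len.
max_num_seqs = 128
max_num_batched_tokens = 2048   # chunk size
max_seq_len = 4096

# Previous logic: integer division truncates to zero.
max_batch_size = min(max_num_seqs, max_num_batched_tokens // max_seq_len)
print(max_batch_size)  # 0 -> warmup would see an invalid batch size
```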
num_iters=3,
align_worker=False,
is_dummy_run=False) -> None:
phase = 'mix'
Please add a description of the purpose of warmup_scenario_mix and what it does.
Can you re-use the current warmup_scenario function? There seems to be a lot of common code there.
Done
if attn_metadata is None or attn_metadata.block_list is None:

block_list = attn_metadata.block_list if attn_metadata \
    and attn_metadata.block_list is not None else None
It looks like block_list is always None in this case (attn_metadata is None or attn_metadata.block_list is None).
For the first chunk of a sequence, block_list is always None, so we enter this branch.
vllm/attention/backends/hpu_attn.py
Outdated
prompt_output = out.reshape(prefill_batch_size, prefill_seq_len,
                            prefill_hidden_size)
htorch.core.mark_step()
Does mark_step need to be inside the prefill logic? Otherwise there is no HPU operation before executing mark_step.
It was added for debugging; it is not needed in the final code and has been removed.
vllm/attention/backends/hpu_attn.py
Outdated
**self.common_attention_args(attn_metadata.decode_block_list, attn_data.key_cache,
                             attn_data.value_cache,
                             attn_metadata.block_size))
htorch.core.mark_step()
Does mark_step need to be inside the decode logic? Otherwise there is no HPU operation before executing mark_step.
It was added for debugging; it is not needed in the final code and has been removed.
Done, updated the README.md.
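For readers without the updated scripts/README.md at hand, a generic sketch of how chunked prefill is usually enabled through standard vLLM engine arguments; the model and chunk size below are placeholders, not the values recommended in this PR.

```python
from vllm import LLM

# Placeholder settings for illustration; see scripts/README.md in this branch
# for the HPU-recommended configuration and scenarios.
llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # prefill chunk size per scheduler step
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```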
Force-pushed from 7815110 to 9d42fa5

This PR ports the chunked prefill related patches from deepseek_r1 to aice 1.22, and works together with HabanaAI/vllm-hpu-extension#381.