
Commit 2bc6529

irexyc, lvhan028, and CyCle1024 authored
support context parallel (#3951)
* use driver flag
* update
* accurate mask iter
* use fast divmod
* remove cp_O
* remove unused
* return the last token's logprobs if include_stop_str_in_output is requested (#4000)
* [Fix] device args in chat cli when using pytorch engine (#3999)
* [Fix] device args in chat cli when using pytorch engine
* [Fix] change device into device_type in chat cli
* fix NULL raw data
* add attn_cp_size to cli
* build cutlass::FastDivmod on host
* use single buffer
* update comm
* use two stage reduce
* remove unused
* better AllreduceResidualRMSnorm
* fix max_session_len
* update docs
* fix embedding/lm_head split
* use same split_k on different cp_rank
* always use separate reduce for cp
* add cp configuration parameter
* remove redundant parameters
* remove redundant parameters
* fix build
* fix xgrammar build
* update docs
* remove unused
* fix test_attention
* unify attn split_k reduction w/ w/o cp
* fix nccl found
* update reduce
* fix windows build
* remove print
* revert is_driver_
* prevent create new allocator
* use Store to write partial_ML
* use expressive names
* use cdiv
* remove separate_reduce
* apply attention sink on cp_rank0
* move cp_utils.* to kernels/attention
* update cli description

---------

Co-authored-by: Lyu Han <[email protected]>
Co-authored-by: CyCle1024 <[email protected]>
1 parent efbba83 commit 2bc6529
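The squashed notes above mention writing partial `M`/`L` values from the attention split-k path and merging them with a two-stage reduce, unified with and without cp. For orientation only, here is a minimal NumPy sketch of the standard log-sum-exp merge that such reductions perform; it illustrates the technique, is not the repository's CUDA kernel, and all names are made up.

```python
import numpy as np

def merge_partials(parts):
    """Merge per-split partial attention results (O_i, M_i, L_i).

    Each split contributes an un-normalized output O_i, the row max M_i of the
    attention logits it saw, and the normalizer L_i = sum(exp(logits - M_i)).
    The merge rescales every partial by exp(M_i - M) against the global max M,
    the usual log-sum-exp trick behind split-k / context-parallel reductions.
    """
    M = max(m for _, m, _ in parts)                   # global row max
    L = sum(l * np.exp(m - M) for _, m, l in parts)   # global normalizer
    O = sum(o * np.exp(m - M) for o, m, _ in parts)   # rescaled partial outputs
    return O / L

# tiny self-check: splitting the keys must match a single-pass softmax attention
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 4))
s = k @ q                                             # logits for one query row
p = np.exp(s - s.max())
ref = (p / p.sum()) @ v

parts = []
for idx in (slice(0, 9), slice(9, 16)):               # uneven split, like two cp ranks
    si = s[idx]
    mi = si.max()
    parts.append((np.exp(si - mi) @ v[idx], mi, np.exp(si - mi).sum()))

assert np.allclose(merge_partials(parts), ref)
```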


45 files changed: +690 additions, -489 deletions

CMakeLists.txt

Lines changed: 5 additions & 0 deletions
@@ -19,6 +19,7 @@ project(TurboMind LANGUAGES CXX CUDA)
 if (MSVC)
     # use standard conformant preprocessor
     add_compile_options($<$<COMPILE_LANGUAGE:CXX>:/Zc:preprocessor>)
+    add_compile_options($<$<COMPILE_LANGUAGE:CXX>:/Zc:__cplusplus>)
     set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler=/Zc:preprocessor -Xcompiler=/Zc:__cplusplus")
 endif ()

@@ -101,6 +102,10 @@ if(NOT xgrammar_POPULATED)

     # Bring the populated content into the build
     add_subdirectory(${xgrammar_SOURCE_DIR} ${xgrammar_BINARY_DIR})
+    if(TARGET xgrammar)
+        target_compile_options(xgrammar PRIVATE $<$<CXX_COMPILER_ID:MSVC>:/utf-8>)
+        target_compile_options(xgrammar PRIVATE $<$<C_COMPILER_ID:MSVC>:/utf-8>)
+    endif()
 endif()

 # the environment variable

benchmark/profile_throughput.py

Lines changed: 2 additions & 0 deletions
@@ -327,6 +327,7 @@ def parse_args():
     tb_group._group_actions.append(dtype_act)

     ArgumentHelper.dp(tb_group)
+    ArgumentHelper.cp(tb_group)
     ArgumentHelper.model_format(tb_group, default='hf')
     ArgumentHelper.num_tokens_per_iter(tb_group)
     ArgumentHelper.max_prefill_iters(tb_group)

@@ -344,6 +345,7 @@ def main():
         max_batch_size=args.concurrency // args.dp,
         tp=args.tp,
         dp=args.dp,
+        cp=args.cp,
         cache_max_entry_count=args.cache_max_entry_count,
         cache_block_seq_len=args.cache_block_seq_len,
         model_format=args.model_format,

builder/windows/generate.ps1

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-cmake .. -A x64 -T "v142,cuda=$env:CUDA_PATH" `
+cmake .. -A x64 -T "v143,cuda=$env:CUDA_PATH" `
   -DCMAKE_BUILD_TYPE=Release `
   -DCMAKE_INSTALL_PREFIX=install `
   -DBUILD_PY_FFI=ON `
docs/en/advance/context_parallel.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Context Parallel
+
+When the memory on a single GPU is insufficient to deploy a model, it is often deployed using tensor parallelism (TP), which generally requires `num_key_value_heads` to be divisible by `TP`. If you want to deploy with `TP > num_key_value_heads`, the kv-heads have to be duplicated to meet the divisibility requirement. However, this has two disadvantages:
+
+1. The amount of available kv_cache is halved, which reduces the maximum supported session length.
+2. The maximum inference batch size is reduced, leading to lower throughput.
+
+To address this issue, the TurboMind inference backend supports setting `attn_dp_size`, which avoids creating copies of the kv-heads, but this introduces data imbalance. To eliminate the data imbalance, TurboMind supports sequence parallelism, which allows the kv_cache to be stored interleaved across different cp_ranks. See the example below:
+
+```
+cp_rank=2, prompt_len=5, generation_len=4
+kv_cache stored on cp_rank0: 0, 2, 4, 6, 8
+kv_cache stored on cp_rank1: 1, 3, 5, 7
+```
+
+## Usage
+
+Taking Intern-S1 / Qwen3-235B-A22B as an example, their `num_key_value_heads` is 4. If you want to deploy with `TP=8` and avoid duplicating the kv_cache, you can deploy in the following way:
+
+```
+lmdeploy serve api_server internlm/Intern-S1 --tp 8 --cp 2
+
+lmdeploy serve api_server Qwen/Qwen3-235B-A22B --tp 8 --cp 2
+```
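The interleaved layout described in the new doc can be expressed in a few lines. The sketch below only illustrates the round-robin mapping from the example above; the helper names are made up and it is not code from TurboMind.

```python
def cp_rank_of(token_idx: int, cp_size: int) -> int:
    """Round-robin owner of a token's kv_cache slot (illustrative only)."""
    return token_idx % cp_size

def tokens_on_rank(total_len: int, cp_size: int, rank: int) -> list[int]:
    """Token positions whose kv_cache lives on `rank` under interleaved storage."""
    return [t for t in range(total_len) if cp_rank_of(t, cp_size) == rank]

# Reproduces the doc's example: cp=2, prompt_len=5, generation_len=4
total = 5 + 4
print(tokens_on_rank(total, 2, 0))  # [0, 2, 4, 6, 8]
print(tokens_on_rank(total, 2, 1))  # [1, 3, 5, 7]
```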

docs/en/index.rst

Lines changed: 1 addition & 0 deletions
@@ -103,6 +103,7 @@ Documentation
    advance/pytorch_multinodes.md
    advance/pytorch_profiling.md
    advance/metrics.md
+   advance/context_parallel.md

 .. toctree::
    :maxdepth: 1
docs/zh_cn/advance/context_parallel.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Sequence Parallelism
+
+When a single GPU does not have enough memory to deploy a model, it is usually deployed with `TP`, which generally requires `num_key_value_heads` to be divisible by `TP`. To deploy with `TP > num_key_value_heads`, copies of the kv-heads have to be created to satisfy the divisibility requirement. This has two drawbacks:
+
+1. The amount of available kv_cache is halved, which reduces the maximum inference length of a request.
+2. The maximum inference batch size is reduced, lowering throughput.
+
+To solve this problem, the TurboMind inference backend supports setting `attn_dp_size`, which avoids creating copies of the kv-heads but introduces data imbalance. To eliminate the imbalance, TurboMind supports sequence parallelism, which stores the kv_cache interleaved across different cp_ranks, for example:
+
+```
+cp_rank=2, prompt_len=5, generation_len=4
+kv_cache stored on cp_rank0: 0, 2, 4, 6, 8
+kv_cache stored on cp_rank1: 1, 3, 5, 7
+```
+
+## Usage
+
+Taking `Intern-S1` / `Qwen3-235B-A22B` as an example, their `num_key_value_heads` is 4. To deploy with `TP=8` while avoiding kv_cache duplication, deploy as follows:
+
+```
+lmdeploy serve api_server internlm/Intern-S1 --tp 8 --cp 2
+
+lmdeploy serve api_server Qwen/Qwen3-235B-A22B --tp 8 --cp 2
+```

docs/zh_cn/index.rst

Lines changed: 1 addition & 0 deletions
@@ -104,6 +104,7 @@ LMDeploy 工具箱提供以下核心功能:
    advance/pytorch_multinodes.md
    advance/pytorch_profiling.md
    advance/metrics.md
+   advance/context_parallel.md

 .. toctree::
    :maxdepth: 1

lmdeploy/cli/cli.py

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ def add_parser_chat():
         ArgumentHelper.model_format(tb_group)
         ArgumentHelper.rope_scaling_factor(tb_group)
         ArgumentHelper.communicator(tb_group)
+        ArgumentHelper.cp(tb_group)

     @staticmethod
     def add_parser_checkenv():
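With `--cp` now registered on the chat parser as well, the same context-parallel setting can be tried interactively. The command below is a plausible invocation based only on the flags this commit registers alongside the existing `--tp`; treat it as a sketch rather than documented usage.

```
lmdeploy chat internlm/Intern-S1 --tp 8 --cp 2
```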

lmdeploy/cli/serve.py

Lines changed: 6 additions & 2 deletions
@@ -112,7 +112,7 @@ def add_parser_api_server():
         model_format = ArgumentHelper.model_format(pt_group)
         hf_overrides = ArgumentHelper.hf_overrides(pt_group)
         disable_metrics = ArgumentHelper.disable_metrics(pt_group)
-        ArgumentHelper.dp(pt_group)
+        dp = ArgumentHelper.dp(pt_group)
         ArgumentHelper.ep(pt_group)
         ArgumentHelper.enable_microbatch(pt_group)
         ArgumentHelper.enable_eplb(pt_group)

@@ -137,6 +137,8 @@ def add_parser_api_server():
         tb_group._group_actions.append(model_format)
         tb_group._group_actions.append(hf_overrides)
         tb_group._group_actions.append(disable_metrics)
+        tb_group._group_actions.append(dp)
+        ArgumentHelper.cp(tb_group)
         ArgumentHelper.rope_scaling_factor(tb_group)
         ArgumentHelper.num_tokens_per_iter(tb_group)
         ArgumentHelper.max_prefill_iters(tb_group)

@@ -235,6 +237,8 @@ def api_server(args):
         from lmdeploy.messages import TurbomindEngineConfig
         backend_config = TurbomindEngineConfig(dtype=args.dtype,
                                                tp=args.tp,
+                                               dp=args.dp,
+                                               cp=args.cp,
                                                max_batch_size=max_batch_size,
                                                session_len=args.session_len,
                                                model_format=args.model_format,

@@ -253,7 +257,7 @@ def api_server(args):

         from lmdeploy.messages import VisionConfig
         vision_config = VisionConfig(args.vision_max_batch_size)
-        if args.dp == 1:
+        if args.dp == 1 or backend == 'turbomind':
             from lmdeploy.serve.openai.api_server import serve as run_api_server

             run_api_server(args.model_path,
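The same `cp` knob that the server wires into `TurbomindEngineConfig` above can also be set when building the config from Python. A minimal sketch, assuming the `lmdeploy.pipeline` API and that `TurbomindEngineConfig` accepts the new `cp` field exactly as the CLI passes it:

```python
from lmdeploy import pipeline
from lmdeploy.messages import TurbomindEngineConfig

# tp=8 shards the model over 8 GPUs; cp=2 splits attention over the kv_cache
# so kv-heads are not duplicated when num_key_value_heads < tp.
backend_config = TurbomindEngineConfig(tp=8, cp=2)

pipe = pipeline('internlm/Intern-S1', backend_config=backend_config)
print(pipe(['Hello, introduce yourself.']))
```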

lmdeploy/cli/utils.py

Lines changed: 10 additions & 0 deletions
@@ -188,6 +188,16 @@ def ep(parser):
             default=1,
             help='expert parallelism. dp is required when pytorch engine is used.')

+    @staticmethod
+    def cp(parser):
+        """Add argument cp to parser."""
+
+        return parser.add_argument(
+            '--cp',
+            type=int,
+            default=1,
+            help='context parallelism size in attention for turbomind backend, tp % cp should be 0.')
+
     @staticmethod
     def dp_rank(parser):
         """Add argument dp_rank to parser."""
