support context parallel #3951

irexyc · 2025-09-09T11:46:20Z

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and seek help from the maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, such as docstrings or example tutorials.

lvhan028 · 2025-11-03T09:20:18Z

This may resolve the build error on the Windows platform.

lmdeploy/cli/utils.py

lmdeploy/messages.py

lvhan028 · 2025-11-03T09:28:39Z

src/turbomind/models/llama/LlamaBatch.h

    const int      tp_rank_;
    const DataType data_type_;
    const bool     debug_;
+    const bool     is_driver_;


Consider renaming is_driver_ to be more specific. The current name is vague - what exactly is it driving or controlling?

lzhangzz · 2025-11-12T09:19:44Z

src/turbomind/models/llama/LlamaBatch.cc

    tp_rank_(model->tp_rank_),
    data_type_(data_type),
    debug_(isDebug()),
+    is_driver_(param.attn_tp_rank == 0 && param.attn_cp_rank == 0),


With the current settings, this is the same as tp_rank_ == 0, so is_driver_ is not needed.

lzhangzz · 2025-11-13T14:12:21Z

lmdeploy/turbomind/turbomind.py


        self._postprocess_config(tm_model.tm_config, engine_config)

+        print(yaml.safe_dump(self.config_dict))


Control this output with the log level instead of printing it unconditionally.
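
A minimal sketch of what gating this dump behind a log level could look like, using only the standard `logging` module. The logger name and the `dump_config` helper are hypothetical, and `json.dumps` stands in for `yaml.safe_dump` so the snippet has no third-party dependency; the PR itself later simply removed the print (commit "remove print").

```python
import json
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("turbomind_sketch")


def dump_config(config_dict: dict) -> None:
    """Emit the config only when DEBUG logging is enabled."""
    if logger.isEnabledFor(logging.DEBUG):
        # json.dumps stands in for yaml.safe_dump to keep the sketch stdlib-only.
        logger.debug("engine config:\n%s", json.dumps(config_dict, indent=2))
```

With the logger at INFO or above, the call does nothing and skips the serialization cost; at DEBUG, the config is written through whatever handlers are configured.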

lzhangzz · 2025-11-15T05:02:59Z

src/turbomind/models/llama/LlamaBatch.cc

+    // for context parallel, we use symm_alloc_ and both prefill and decode stage have reduce process
+    // w/o context parallel, we use common alloc and only decode stage has reduce process
+    // perhaps it would be more appropriate to put this buffer in the unified_attention_layer.
+    Allocator     alloc          = param_.attn_cp_size > 1 ? symm_alloc_ : core::Context::alloc(kDEVICE);


This creates a new allocator, which is not needed in this case. Use core::Context::device_alloc() to get the device allocator of the current context.

lzhangzz · 2025-11-17T05:54:07Z

src/turbomind/kernels/attention/attention_universal.h

            if (qi_begin + qi < qi_end && ri == 0 && check_h(hi)) {
-                params.partial_M[index] = M;
-                params.partial_L[index] = L;
+                params.partial_ML[index * 2]     = M;


Make partial_ML a pointer to float2 so that loads and stores can be vectorized.

lzhangzz · 2025-11-17T05:55:49Z

src/turbomind/models/llama/cp_utils.cu

@@ -0,0 +1,20 @@
+// Copyright (c) OpenMMLab. All rights reserved.
+
+#include "src/turbomind/models/llama/cp_utils.h"


Move cp_utils.* to kernels/attention.

lzhangzz · 2025-11-17T08:44:49Z

src/turbomind/kernels/attention/kv_cache_utils_v2.cu

        }
    }

+    int cp_quo, cp_rem;


Use expressive names, e.g. local_ti and local_ti_rank.
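
The suggested names hint at what the quotient and remainder represent. The sketch below assumes, without confirmation from this excerpt, that tokens are distributed round-robin across context-parallel ranks, so that dividing a global token index by the cp size yields the token's position on its rank (quotient) and the rank that holds it (remainder). The helper name `split_token_index` is hypothetical.

```python
from typing import Tuple


def split_token_index(ti: int, cp_size: int) -> Tuple[int, int]:
    """Map a global token index to (local index, owning cp rank).

    Assumes a round-robin token layout across cp ranks, which is a guess
    based on the quotient/remainder variables in the diff.
    """
    local_ti, local_ti_rank = divmod(ti, cp_size)
    return local_ti, local_ti_rank
```

Under that assumption, with `cp_size = 4`, global token 10 is the third token (local index 2) stored on rank 2.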

lzhangzz · 2025-11-17T08:47:51Z

src/turbomind/kernels/attention/decoding_template.h

    }

-    const int tile_count      = cdiv(std::min(params.max_k_len, params.window_size), Kernel::CTA_S);
+    const int max_cp_k_len    = (params.max_k_len + params.cp_size - 1) / params.cp_size;


lzhangzz · 2025-11-17T08:48:41Z

src/turbomind/kernels/attention/attention_template.h

    }();

-    const int tile_count      = cdiv(std::min(params.max_k_len, params.window_size), Kernel::CTA_S);
+    const int max_cp_k_len    = (params.max_k_len + params.cp_size - 1) / params.cp_size;
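
The added line in both hunks uses the `(a + b - 1) / b` idiom, which is a ceiling division; the later commit titled "use cdiv" suggests it was rewritten with the `cdiv` helper already visible in the removed line. A small Python sketch of the equivalence follows. The numbers are made up, and reading `max_cp_k_len` as the longest per-rank share of the keys is an interpretation, not something the excerpt states.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


# Hypothetical values, chosen only for illustration.
max_k_len, cp_size = 1000, 3
max_cp_k_len = cdiv(max_k_len, cp_size)  # 1000 keys over 3 ranks -> at most 334 each
```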


lzhangzz · 2025-11-17T09:03:13Z

src/turbomind/kernels/attention/attention_universal.h

            const int qi = offset.y / CTA_H;
            const int ti = history_len;

+            int cp_quo, cp_rem;


Use expressive names.

lzhangzz · 2025-11-17T14:12:00Z

src/turbomind/kernels/attention/attention_universal.h

            });
        }

        const bool separate_reduce = need_separate_reduce(cta_map.split_count());


This code path can be removed.

lzhangzz · 2025-11-17T14:13:15Z

src/turbomind/kernels/attention/attention_universal.h


        Impl::Merge(frag_O, frag_M, frag_L, params.inv_sqrt_dh, storage);

        if (params.sinks && iter_end == tile_count) {


The attention sink should be applied on cp rank 0 only.
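
A sketch of why the sink must be applied on a single rank, assuming the sink is an extra logit that joins the softmax maximum and denominator without contributing a value vector, and that per-rank partial results are merged by rescaling each rank's sum `L` by `exp(M_rank - M_global)`. All function names and numbers here are illustrative and not taken from the PR.

```python
import math


def partial_stats(scores, sink=None):
    """Return (M, L) for one rank: max logit and sum of exp(logit - M)."""
    logits = list(scores) + ([sink] if sink is not None else [])
    m = max(logits)
    l = sum(math.exp(x - m) for x in logits)
    return m, l


def merge(stats):
    """Combine per-rank (M, L) pairs into a global (M, L)."""
    m = max(s[0] for s in stats)
    l = sum(s[1] * math.exp(s[0] - m) for s in stats)
    return m, l


shards = [[0.1, 1.2], [0.7, -0.3]]  # hypothetical per-rank attention scores
sink = 0.5

# Single-device reference: all scores plus the sink counted once.
reference = partial_stats(shards[0] + shards[1], sink)
# Sink applied on rank 0 only.
rank0_only = merge([partial_stats(s, sink if r == 0 else None)
                    for r, s in enumerate(shards)])
# Sink applied on every rank.
every_rank = merge([partial_stats(s, sink) for s in shards])
```

Here `rank0_only` reproduces the reference denominator, while `every_rank` is larger because the sink term is counted once per rank.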

lmdeploy/cli/utils.py

irexyc and others added 8 commits September 22, 2025 13:49

use driver flag

c1dae3a

update

bb27b62

accurate mask iter

0fe88bc

use fast divmod

5c02779

remove cp_O

53654ad

remove unused

e3dd4f7

return the last token's logprobs if include_stop_str_in_output is req…

1f75dd6

…uested (InternLM#4000)

[Fix] device args in chat cli when using pytorch engine (InternLM#3999)

be504d3

* [Fix] device args in chat cli when using pytorch engine * [Fix] change device into device_type in chat cli

irexyc force-pushed the cp branch from 240e668 to be504d3 September 23, 2025 02:26

Merge remote-tracking branch 'origin/main' into cp2

25a8fb8

lzhangzz self-requested a review September 23, 2025 05:46

irexyc added 12 commits September 23, 2025 11:19

fix NULL raw data

77ef52a

add attn_cp_size to cli

29cf813

build cutlass::FastDivmod on host

0044d4f

use single buffer

e4050a4

update comm

f44ef96

use two stage reduce

a329b29

Merge remote-tracking branch 'github/main' into cp2

dafcd64

remove unused

c9649c0

better AllreduceResidualRMSnorm

52766d2

fix max_session_len

b783d5c

Merge remote-tracking branch 'github/main' into cp

c39373a

update docs

47a349b

lvhan028 marked this pull request as ready for review October 30, 2025 13:38

lvhan028 changed the title from "[WIP] support context parallel" to "support context parallel" Oct 30, 2025

lvhan028 added the enhancement New feature or request label Oct 30, 2025

lvhan028 reviewed Nov 3, 2025

lmdeploy/cli/utils.py (outdated, resolved)

lvhan028 reviewed Nov 3, 2025

lmdeploy/messages.py (resolved)

lvhan028 reviewed Nov 3, 2025

irexyc added 14 commits November 5, 2025 04:10

always use separate reduce for cp

8c5b289

add cp configuration parameter

4005547

remove redundant parameters

1d2b098

remove redundant parameters

77920f8

fix build

f54ca43

fix xgrammar build

1ac3080

update docs

7872225

remove unused

0f82ef1

fix test_attention

1b3bb9c

unify attn split_k reduction w/ w/o cp

56b9e27

fix nccl found

4211c3c

Merge remote-tracking branch 'github/main' into cp

9e94de6

update reduce

7099940

fix windows build

4807303

lzhangzz reviewed Nov 17, 2025

irexyc added 3 commits November 17, 2025 13:17

remove print

4277f7b

revert is_driver_

68d6756

prevent create new allocator

e4f310a

lzhangzz reviewed Nov 17, 2025

irexyc added 6 commits November 18, 2025 05:46

use Store to write partial_ML

bc7b84f

use expressive names

cee4114

use cdiv

f6c68bf

remove separate_reduce

bd137b1

apply attention sink on cp_rank0

1fd6c54

move cp_utils.* to kernels/attention

1da69bf

lvhan028 reviewed Nov 19, 2025

lmdeploy/cli/utils.py (outdated, resolved)

lzhangzz approved these changes Nov 19, 2025

update cli description

37ac180

lvhan028 approved these changes Nov 19, 2025

lvhan028 merged commit 2bc6529 into InternLM:main Nov 19, 2025
9 checks passed


4 participants