Fix cuda::memcpy async edge cases and add more tests
#6608
base: main
Conversation
Title changed: "cuda::memcpy async edge cases" → "cuda::memcpy async edge cases and add more tests"
/ok to test cca4271
```cpp
const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
const unsigned int warp_id = tid / 32;
const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
return uniform_warp_id == 0 && ::cuda::ptx::elect_sync(0xFFFFFFFF); // elect a leader thread among warp 0
```
The old logic is wrong for any _Group that is not a full thread block.
```diff
 [[nodiscard]] _CCCL_DEVICE _CCCL_FORCEINLINE bool
 __elect_from_group(const cooperative_groups::thread_block& __g) noexcept
 {
-  // cooperative groups maps a multidimensional thread id into the thread rank the same way as warps do
-  const unsigned int tid = __g.thread_rank();
+  // Cannot call __g.thread_rank(), because we only forward declared the thread_block type
+  // cooperative groups (and we here) maps a multidimensional thread id into the thread rank the same way as warps do
+  const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
```
@pciolkosz if we had a cooperative_groups::thread_block<1> or some other way to detect that the block is 1D, we could save a lot of special register reads here!
Alternatively, we could just add a cuda::thread_block_group<1> which would fulfill the Group concept and give us efficient codegen here. @miscco and @pciolkosz what do you think?
Force-pushed 9ee0408 to ce7f528
/ok to test ce7f528
Resolved review threads (outdated):

- libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (4 threads)
- libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async.h
```cpp
// use 2 groups of 4 threads to copy 8 items each, but spread them 16 bytes
auto tiled_groups = cg::tiled_partition<4>(cg::this_thread_block());
if (threadIdx.x < 8)
{
  static_assert(thread_block_size >= 8);
  printf("%u copying 8 items at meta group rank %u\n", threadIdx.x, tiled_groups.meta_group_rank());
  cuda::memcpy_async(
    tiled_groups,
    &dest->data[tiled_groups.meta_group_rank() * 16],
    &source->data[tiled_groups.meta_group_rank() * 16],
    sizeof(T) * 8,
    *bar);
}
```
Remark: the possibility of this is incredibly clever and unholy at the same time.
Resolved review thread: libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async_16b.pass.cpp
> This trait is ``true`` if ``T`` represents a CUDA thread block.
> For example, ``cooperative_groups::thread_block`` satisfies this trait.
> Users are encouraged to specialize this trait for their own groups.
I believe we should make clear that this talks about a full thread group and not just a single thread. This was the original bug, wasn't it?
Yes. How would you like to improve the documentation? I am already happy with it. What's missing or unclear?
Something like "full CUDA thread block", or something else that indicates that we need all dimensions?
I updated the wording to spell out that ``Group`` must represent the full CUDA thread block. It does not matter what dimensionality the thread block has; that is abstracted by ``__g.thread_rank()``.
Resolved review threads (outdated): libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (2 threads)
Force-pushed c4a1509 to c23d96d
Force-pushed 97cddd0 to 3099002
> Additionally:
>
> - If *Shape* is :ref:`cuda::aligned_size_t <libcudacxx-extended-api-memory-aligned-size>`, ``source``
Question: are these constraints evaluated in assertions?
We already assert that the pointers are aligned. I have now also added an assertion that the pipeline has not been quit.

I cannot easily check whether the parameters are the same across all threads of a group, or whether all threads of that group also called the API. It may be possible with some block-wide operations, but that seems a bit much for an assertion.
error: A __device__ variable template cannot have a const qualified type on Windows
Force-pushed 3099002 to f13080a
😬 CI Workflow Results: 🟥 Finished in 3h 55m. Pass: 97%/88 | Total: 21h 20m | Max: 3h 48m | Hits: 99%/213035
I am afraid some tests are timing out with NVRTC on Turing :S

Turing + NVRTC looks more like an edge case
- `cuda::memcpy_async` hangs in some examples (#6601): does not hang anymore
- `cuda::memcpy_async` with `cuda::barrier` implementation is inefficient on sm90+ (#5995): is still optimal, we just have more code now for computing the thread rank of the CG group
- `is_thread_block_group_v`: optimal

Fixes: #6601