Conversation

@bernhardmgruber commented Nov 12, 2025

Fixes: #6601

copy-pr-bot bot commented Nov 12, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@bernhardmgruber changed the title from "Fix cuda::memcpy async edge cases" to "Fix cuda::memcpy async edge cases and add more tests" on Nov 12, 2025
@cccl-authenticator-app bot moved this from Todo to In Progress in CCCL on Nov 12, 2025
@bernhardmgruber
Contributor Author

/ok to test cca4271

const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
const unsigned int warp_id = tid / 32;
const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
return uniform_warp_id == 0 && ::cuda::ptx::elect_sync(0xFFFFFFFF); // elect a leader thread among warp 0
Contributor Author

The old logic is wrong for any _Group that is not a full thread block.
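For illustration, here is a hypothetical repro sketch (my own, not code from this PR) of how a warp-0-based election goes wrong for a group that is not the full block: a 32-thread tile living entirely in warp 1 never satisfies uniform_warp_id == 0, so no leader is elected and the copy is never issued, consistent with the hang reported in #6601.

#include <cooperative_groups.h>
#include <cuda/barrier>

namespace cg = cooperative_groups;

// Hypothetical repro (assumption, not from the PR): every thread of this
// tile has warp_id == 1, so a warp-0-based election elects nobody and the
// async copy below is never submitted.
__global__ void repro(int* dest, const int* src, cuda::barrier<cuda::thread_scope_block>* bar)
{
  auto tile = cg::tiled_partition<32>(cg::this_thread_block());
  if (tile.meta_group_rank() == 1) // only the second warp of the block participates
  {
    cuda::memcpy_async(tile, dest, src, sizeof(int) * 32, *bar);
    bar->arrive_and_wait(); // under the old election: no copy was ever issued
  }
}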

Comment on lines 61 to 66
 [[nodiscard]] _CCCL_DEVICE _CCCL_FORCEINLINE bool
 __elect_from_group(const cooperative_groups::thread_block& __g) noexcept
 {
-  // cooperative groups maps a multidimensional thread id into the thread rank the same way as warps do
-  const unsigned int tid = __g.thread_rank();
+  // Cannot call __g.thread_rank(), because we only forward declared the thread_block type
+  // cooperative groups (and we here) maps a multidimensional thread id into the thread rank the same way as warps do
+  const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
Contributor Author

@pciolkosz if we had a cooperative_groups::thread_block<1> or some other way to detect that the block is 1D, we could save a lot of special register reads here!

Contributor Author

Alternatively, we could just add a cuda::thread_block_group<1> which would fulfill the Group concept and give us efficient codegen here. @miscco and @pciolkosz what do you think?
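
To make this concrete, here is a minimal sketch of what such a 1D group could look like (hypothetical; neither the name nor the type exists in CCCL today):

// Hypothetical sketch of the proposed 1D group (a suggestion in this
// thread, not an existing CCCL API). When the block is known to be
// one-dimensional, thread_rank() needs only threadIdx.x, avoiding the
// reads of threadIdx.y/z and blockDim.x/y in the generic path above.
struct thread_block_group_1d
{
  __device__ unsigned int thread_rank() const { return threadIdx.x; }
  __device__ unsigned int size() const { return blockDim.x; }
  __device__ void sync() const { __syncthreads(); }
};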

@bernhardmgruber force-pushed the fix_memcpy_async branch 2 times, most recently from 9ee0408 to ce7f528 on November 13, 2025 09:38
@bernhardmgruber
Contributor Author

/ok to test ce7f528

Comment on lines +132 to +144
// use 2 groups of 4 threads to copy 8 items each, but spread them 16 bytes
auto tiled_groups = cg::tiled_partition<4>(cg::this_thread_block());
if (threadIdx.x < 8)
{
static_assert(thread_block_size >= 8);
printf("%u copying 8 items at meta group rank %u\n", threadIdx.x, tiled_groups.meta_group_rank());
cuda::memcpy_async(
tiled_groups,
&dest->data[tiled_groups.meta_group_rank() * 16],
&source->data[tiled_groups.meta_group_rank() * 16],
sizeof(T) * 8,
*bar);
Contributor Author

Remark: the possibility of this is incredibly clever and unholy at the same time.

This trait is ``true`` if ``T`` represents a CUDA thread block.
For example, ``cooperative_groups::thread_block`` satisfies this trait.
Users are encouraged to specialize this trait for their own groups.
Contributor

I believe we should make clear that this talks about a full thread group and not just a single thread?

This was the original bug, wasn't it?

Contributor Author

Yes. How would you like to improve the documentation? I am already happy with it. What's missing or unclear?

Contributor

Something like "full CUDA thread block", or something else that indicates that we need all dimensions?

Contributor Author

I updated the wording to spell out that Group represents the full CUDA thread block. It does not matter which dimensionality the thread block has; this is abstracted by __g.thread_rank().
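
For reference, the equivalence is easy to check with a small kernel (my sketch, not part of the PR):

#include <cooperative_groups.h>
#include <cassert>

// Illustrative check: the manual linearization used in the PR matches
// cooperative_groups::thread_block::thread_rank() for 1D, 2D, and 3D
// blocks alike, which is why the dimensionality does not matter.
__global__ void check_rank()
{
  const unsigned int manual =
    threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
  assert(manual == cooperative_groups::this_thread_block().thread_rank());
}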

@bernhardmgruber marked this pull request as ready for review on November 13, 2025 12:17
@bernhardmgruber requested review from a team as code owners on November 13, 2025 12:17
@cccl-authenticator-app bot moved this from In Progress to In Review in CCCL on Nov 13, 2025

Additionally:

- If *Shape* is :ref:`cuda::aligned_size_t <libcudacxx-extended-api-memory-aligned-size>`, ``source``
Contributor

question. Are these constraints evaluated in assertions?

Contributor Author

We already assert that the pointers are aligned. I have now also added an assertion that the pipeline has not been quit.

I cannot easily check whether the parameters are the same across all threads of a group, or whether all threads of that group also called the API. It may be possible with some block-wide operations, but it seems a bit much for an assertion.
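
For what it is worth, the warp-scope part of such a check would be cheap; it is the block-wide generalization that gets heavy. A sketch of the warp-scope part (an assumption, not something this PR adds; requires sm_70+):

#include <cstdint>

// Sketch of a warp-scope uniformity check (not added by the PR).
// __match_all_sync sets pred to true only if every lane in the mask
// passes the same value. Extending this to a whole block would
// additionally need shared memory or atomics plus a block-wide sync.
__device__ bool warp_uniform_pointer(const void* p)
{
  int pred;
  __match_all_sync(0xFFFFFFFF, reinterpret_cast<std::uintptr_t>(p), &pred);
  return pred != 0;
}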


@bernhardmgruber requested a review from a team as a code owner on November 17, 2025 15:27
@github-actions

😬 CI Workflow Results

🟥 Finished in 3h 55m: Pass: 97%/88 | Total: 21h 20m | Max: 3h 48m | Hits: 99%/213035


@bernhardmgruber
Contributor Author

I am afraid some tests are timing out with NVRTC on Turing :S

fbusato commented Nov 17, 2025

> I am afraid some tests are timing out with NVRTC on Turing :S

Turing + NVRTC looks more like an edge case.

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[BUG] cuda::memcpy_async hangs in some examples
