Fix cuda::memcpy async edge cases and add more tests
#6608
base: main
Conversation
Title changed: "cuda::memcpy async edge cases" → "cuda::memcpy async edge cases and add more tests"
/ok to test cca4271
```cpp
const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
const unsigned int warp_id = tid / 32;
const unsigned int uniform_warp_id = __shfl_sync(0xFFFFFFFF, warp_id, 0); // broadcast from lane 0
return uniform_warp_id == 0 && ::cuda::ptx::elect_sync(0xFFFFFFFF); // elect a leader thread among warp 0
```
The old logic is wrong for any _Group that is not a full thread block.
```diff
 [[nodiscard]] _CCCL_DEVICE _CCCL_FORCEINLINE bool
 __elect_from_group(const cooperative_groups::thread_block& __g) noexcept
 {
-  // cooperative groups maps a multidimensional thread id into the thread rank the same way as warps do
-  const unsigned int tid = __g.thread_rank();
+  // Cannot call __g.thread_rank(), because we only forward declared the thread_block type
+  // cooperative groups (and we here) maps a multidimensional thread id into the thread rank the same way as warps do
+  const unsigned int tid = threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
```
@pciolkosz if we had a cooperative_groups::thread_block<1> or some other way to detect that the block is 1D, we could save a lot of special register reads here!
Alternatively, we could just add a cuda::thread_block_group<1> which would fulfill the Group concept and give us efficient codegen here. @miscco and @pciolkosz what do you think?
Force-pushed 9ee0408 to ce7f528
/ok to test ce7f528
Resolved review threads (outdated):

- libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (4 threads)
- libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async.h
```cpp
// use 2 groups of 4 threads to copy 8 items each, but spread them 16 bytes
auto tiled_groups = cg::tiled_partition<4>(cg::this_thread_block());
if (threadIdx.x < 8)
{
  static_assert(thread_block_size >= 8);
  printf("%u copying 8 items at meta group rank %u\n", threadIdx.x, tiled_groups.meta_group_rank());
  cuda::memcpy_async(
    tiled_groups,
    &dest->data[tiled_groups.meta_group_rank() * 16],
    &source->data[tiled_groups.meta_group_rank() * 16],
    sizeof(T) * 8,
    *bar);
}
```
Remark: the possibility of this is incredibly clever and unholy at the same time.
Resolved review thread: libcudacxx/test/libcudacxx/cuda/memcpy_async/group_memcpy_async_16b.pass.cpp
> This trait is ``true`` if ``T`` represents a CUDA thread block.
> For example, ``cooperative_groups::thread_block`` satisfies this trait.
> Users are encouraged to specialize this trait for their own groups.
I believe we should make clear that this talks about a full thread group and not just a single thread. This was the original bug, wasn't it?
Yes. How would you like to improve the documentation? I am already happy with it. What's missing or unclear?
Something like "full CUDA thread block", or something else that indicates that we need all dimensions?
I updated the wording to spell out that ``Group`` must represent the full CUDA thread block. It does not matter what dimensionality the thread block has; that is abstracted by ``__g.thread_rank()``.
Resolved review threads (outdated): libcudacxx/include/cuda/__memcpy_async/cp_async_bulk_shared_global.h (2 threads)
Force-pushed c4a1509 to c23d96d
Force-pushed 97cddd0 to 3099002
> Additionally:
>
> - If *Shape* is :ref:`cuda::aligned_size_t <libcudacxx-extended-api-memory-aligned-size>`, ``source``
Question: are these constraints evaluated in assertions?
We already assert that the pointers are aligned. I have now also added an assertion that the pipeline has not been quit.

I cannot easily check whether the parameters are the same across all threads of a group, or whether all threads of that group also called the API. It may be possible with some block-wide operations, but that seems a bit much for an assertion.
error: A __device__ variable template cannot have a const qualified type on Windows
Force-pushed 3099002 to f13080a
😬 CI Workflow Results: 🟥 Finished in 3h 55m. Pass: 97%/88 | Total: 21h 20m | Max: 3h 48m | Hits: 99%/213035
I am afraid some tests are timing out with NVRTC on Turing :S

Turing + NVRTC looks more like an edge case
- `cuda::memcpy_async` hangs in some examples (#6601): does not hang anymore
- `cuda::memcpy_async` with `cuda::barrier` implementation is inefficient on sm90+ (#5995): is still optimal, we just have more code now for computing the thread rank of the CG group
- `is_thread_block_group_v`: optimal

Fixes: #6601