[BUG] `cuda::memcpy_async` hangs in some examples

Some (internal) `cuda::memcpy_async` examples began to hang after #5996.

Example: https://godbolt.org/z/W5KvMh36e
```c++
__global__ void kernel_pipeline(double* __restrict__ ary) {
    __shared__ alignas(16) double val[16];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0) {
        init(&bar, 32);
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        cuda::memcpy_async(val, ary,
            cuda::aligned_size_t<16>(16 * sizeof(double)),
            bar);
    }

    bar.arrive_and_wait();
}
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] `cuda::memcpy_async` hangs in some examples #6601

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] cuda::memcpy_async hangs in some examples #6601

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[BUG] `cuda::memcpy_async` hangs in some examples #6601