`hipMallocFromPoolAsync` broken on ROCm 7.2 #901

AMDGPU.jl appears to fail to allocate any device memory on ROCm 7.2, as reported in #869 and seen more recently on a CI runner using a ROCm 7.2 container.

### Description

On ROCm 7.2 (AMD Radeon RX 7800 XT / gfx1101), any attempt to allocate device memory via
AMDGPU.jl fails immediately. Even the simplest operation crashes:

```julia
julia> using AMDGPU
julia> AMDGPU.ROCArray([1, 2, 3])
# → error / silent NULL pointer
```

### Possible cause

AMDGPU.jl currently uses `hipMallocFromPoolAsync` exclusively for all `HIPBuffer` allocations.
On ROCm 7.2 this function appears to be broken (returns a NULL pointer or errors) for at least
some GPU/driver combinations, with no fallback path.

This can be checked with the following reproducer:

```julia
using AMDGPU, AMDGPU.HIP

dev          = AMDGPU.device()
stream       = AMDGPU.HIPStream(); # pretty printing fix in #900 
default_pool = HIP.default_memory_pool(dev)

ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMallocFromPoolAsync(ptr_ref, 24, default_pool, stream)
HIP.wait(stream)
println("ptr: $(ptr_ref[])")   # returns NULL on affected systems
```

By contrast, the synchronous `hipMalloc` path works correctly on the same system:

```julia
ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMalloc(ptr_ref, 24)
println("ptr: $(ptr_ref[])")   # non-NULL, works fine
```

### Environment

| Component | Version / details |
|---|---|
| ROCm version | 7.2 (`rocm/dev-ubuntu-24.04:7.2-complete`) |
| GPU | AMD Radeon RX 7800 XT (DID `0x747e`, gfx1101) |
| Julia | 1.12 |
| AMDGPU.jl | master |

### Possible fix

Branch [`lr/rocm7`](https://github.com/JuliaGPU/AMDGPU.jl/tree/lr/rocm7) adds a per-device "broken" flag with an automatic fallback to `hipMalloc`:

- A global `POOL_ALLOC_BROKEN::LockedObject(Set{Int64})` tracks device IDs where pool
  allocation has failed.
- On first allocation, `hipMallocFromPoolAsync` is attempted. If it returns NULL (or exhausts
  all retry phases), the device is added to `POOL_ALLOC_BROKEN` and all subsequent allocations
  on that device use synchronous `hipMalloc`.
- `HIPBuffer` gains a `pool_alloc::Bool` field so the correct free function is used
  (`hipFreeAsync` for pool buffers, `hipFree` otherwise) — mixing them causes memory corruption.

The change is in
[`src/runtime/memory/hip.jl`](https://github.com/JuliaGPU/AMDGPU.jl/blob/lr/rocm7/src/runtime/memory/hip.jl).
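
The fallback logic described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code on the branch: `MockBuffer`, `mock_pool_alloc`, `mock_sync_alloc`, `alloc_buffer`, and `free_kind` are hypothetical names, the allocators are mocks so the snippet runs without a GPU, a plain `Set` guarded by a `ReentrantLock` stands in for the `LockedObject`, and the retry phases are omitted.

```julia
# Hypothetical sketch: the allocators below are mocks, not the real HIP bindings.

const POOL_ALLOC_BROKEN = Set{Int}()      # device IDs where pool allocation failed
const POOL_ALLOC_LOCK = ReentrantLock()   # guards POOL_ALLOC_BROKEN

struct MockBuffer
    ptr::Ptr{Cvoid}
    device_id::Int
    pool_alloc::Bool   # true => must be released with the async (pool) free
end

# The mock pool path always returns NULL, imitating the failure reported here.
mock_pool_alloc(device_id::Int, bytesize::Int) = C_NULL
# The mock synchronous path returns a fake non-NULL address.
mock_sync_alloc(device_id::Int, bytesize::Int) = Ptr{Cvoid}(UInt(0x1000))

is_pool_broken(device_id::Int) =
    lock(() -> device_id in POOL_ALLOC_BROKEN, POOL_ALLOC_LOCK)
mark_pool_broken!(device_id::Int) =
    lock(() -> push!(POOL_ALLOC_BROKEN, device_id), POOL_ALLOC_LOCK)

function alloc_buffer(device_id::Int, bytesize::Int)
    if !is_pool_broken(device_id)
        # Try the pool allocator first; a NULL result marks the device as broken.
        ptr = mock_pool_alloc(device_id, bytesize)
        ptr != C_NULL && return MockBuffer(ptr, device_id, true)
        mark_pool_broken!(device_id)
    end
    # Fallback: synchronous allocation for this and all later requests.
    ptr = mock_sync_alloc(device_id, bytesize)
    ptr == C_NULL && error("allocation of $bytesize bytes failed on device $device_id")
    return MockBuffer(ptr, device_id, false)
end

# Pick the free function matching how the buffer was allocated.
free_kind(buf::MockBuffer) = buf.pool_alloc ? :hipFreeAsync : :hipFree

buf = alloc_buffer(0, 24)
println("pool_alloc = ", buf.pool_alloc, ", free with ", free_kind(buf))
```

Recording `pool_alloc` on the buffer itself, rather than consulting the global set at free time, means a buffer allocated from the pool before the device was flagged would still be released with the matching free function.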

I would be happy to get any insights, and ideally to have others with access to AMD GPUs test this. I will try it on an MI300A as soon as I finalise the setup there.

_Disclaimer: I iterated on the issue with Claude Code. The main purpose is debugging, and ideally we can work towards a proper fix after input from others._

