AMDGPU.jl appears unable to allocate any device memory, as reported in #869 and seen more recently on a CI runner using a ROCm 7.2 container.
Description
On ROCm 7.2 (AMD Radeon RX 7800 XT / gfx1101), any attempt to allocate device memory via
AMDGPU.jl fails immediately. Even the simplest operation crashes:
julia> using AMDGPU
julia> AMDGPU.ROCArray([1, 2, 3])
# → error / silent NULL pointer
Possible cause
AMDGPU.jl currently uses hipMallocFromPoolAsync exclusively for all HIPBuffer allocations.
On ROCm 7.2 this function appears to be broken (returns a NULL pointer or errors) for at least
some GPU/driver combinations, with no fallback path.
This can be checked with:
using AMDGPU, AMDGPU.HIP
dev = AMDGPU.device()
stream = AMDGPU.HIPStream(); # pretty printing fix in #900
default_pool = HIP.default_memory_pool(dev)
ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMallocFromPoolAsync(ptr_ref, 24, default_pool, stream)
HIP.wait(stream)
println("ptr: $(ptr_ref[])") # returns NULL on affected systems
The synchronous fallback, by contrast, works correctly:
ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMalloc(ptr_ref, 24)
println("ptr: $(ptr_ref[])") # non-NULL, works fine
Environment
| | |
|---|---|
| ROCm version | 7.2 (rocm/dev-ubuntu-24.04:7.2-complete) |
| GPU | AMD Radeon RX 7800 XT (DID 0x747e, gfx1101) |
| Julia | 1.12 |
| AMDGPU.jl | master |
Possible fix
Branch lr/rocm7 adds a per-device "broken" flag with an automatic fallback to hipMalloc:
- A global POOL_ALLOC_BROKEN::LockedObject(Set{Int64}) tracks device IDs where pool allocation has failed.
- On first allocation, hipMallocFromPoolAsync is attempted. If it returns NULL (or exhausts all retry phases), the device is added to POOL_ALLOC_BROKEN and all subsequent allocations on that device use synchronous hipMalloc.
- HIPBuffer gains a pool_alloc::Bool field so the correct free function is used (hipFreeAsync for pool buffers, hipFree otherwise); mixing them causes memory corruption.

The change is in src/runtime/memory/hip.jl.
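To make the intended control flow easier to discuss, here is a minimal GPU-free sketch of the fallback logic described above. It is not the code from the branch: MockBuffer, allocate, free_kind, is_broken, mark_broken! and the two simulated allocators are hypothetical names, a plain ReentrantLock stands in for LockedObject, and the retry phases are left out. The pool_alloc and sync_alloc callbacks play the roles of hipMallocFromPoolAsync and hipMalloc.

```julia
# Hypothetical model of the per-device fallback; no AMDGPU.jl calls are made.
const POOL_ALLOC_BROKEN = Set{Int}()      # device IDs where pool allocation failed
const POOL_ALLOC_LOCK = ReentrantLock()   # stand-in for LockedObject

struct MockBuffer
    ptr::Ptr{Cvoid}
    device::Int
    pool_alloc::Bool   # true if `ptr` came from the pool allocator
end

is_broken(dev::Int) = lock(() -> dev in POOL_ALLOC_BROKEN, POOL_ALLOC_LOCK)
mark_broken!(dev::Int) = lock(() -> push!(POOL_ALLOC_BROKEN, dev), POOL_ALLOC_LOCK)

function allocate(dev::Int, bytes::Int; pool_alloc, sync_alloc)
    if !is_broken(dev)
        ptr = pool_alloc(bytes)
        ptr != C_NULL && return MockBuffer(ptr, dev, true)
        mark_broken!(dev)   # remember the failure so later calls skip the pool
    end
    ptr = sync_alloc(bytes)
    ptr == C_NULL && error("allocation of $bytes bytes failed on device $dev")
    return MockBuffer(ptr, dev, false)
end

# The free function must match the allocator that produced the buffer.
free_kind(buf::MockBuffer) = buf.pool_alloc ? :hipFreeAsync : :hipFree

# Simulated allocators: the pool path always returns NULL, the sync path works.
const pool_calls = Ref(0)
broken_pool(bytes) = (pool_calls[] += 1; C_NULL)
working_sync(bytes) = Ptr{Cvoid}(UInt(0x1000))

buf1 = allocate(0, 24; pool_alloc = broken_pool, sync_alloc = working_sync)
buf2 = allocate(0, 24; pool_alloc = broken_pool, sync_alloc = working_sync)
```

In this model the pool allocator is tried only once for device 0. The second call goes straight to the synchronous path, and both buffers record pool_alloc = false, so they would be released with hipFree.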
I would be happy to get any insights, and ideally to have others with access to AMD GPUs test this. I will try it on an MI300A as soon as I finalise the setup there.
Disclaimer: I iterated on the issue with Claude Code. The main purpose is debugging, and ideally we can work towards a proper fix after input from others.