
hipMallocFromPoolAsync broken on ROCm 7.2 #901

@luraess

AMDGPU.jl appears to fail to allocate any device memory, as reported in #869 and, more recently, observed on a CI runner using a ROCm 7.2 container.

Description

On ROCm 7.2 (AMD Radeon RX 7800 XT / gfx1101), any attempt to allocate device memory via
AMDGPU.jl fails immediately. Even the simplest operation crashes:

julia> using AMDGPU
julia> AMDGPU.ROCArray([1, 2, 3])
# → error / silent NULL pointer

Possible cause

AMDGPU.jl currently uses hipMallocFromPoolAsync exclusively for all HIPBuffer allocations.
On ROCm 7.2 this function appears to be broken (returns a NULL pointer or errors) for at least
some GPU/driver combinations, with no fallback path.

This can be checked with:

using AMDGPU, AMDGPU.HIP

dev          = AMDGPU.device()
stream       = AMDGPU.HIPStream(); # pretty printing fix in #900 
default_pool = HIP.default_memory_pool(dev)

ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMallocFromPoolAsync(ptr_ref, 24, default_pool, stream)
HIP.wait(stream)
println("ptr: $(ptr_ref[])")   # returns NULL on affected systems

The synchronous hipMalloc, by contrast, works correctly:

ptr_ref = Ref{Ptr{Cvoid}}()
HIP.hipMalloc(ptr_ref, 24)
println("ptr: $(ptr_ref[])")   # non-NULL, works fine

Environment

ROCm version 7.2 (rocm/dev-ubuntu-24.04:7.2-complete)
GPU AMD Radeon RX 7800 XT (DID 0x747e, gfx1101)
Julia 1.12
AMDGPU.jl master

Possible fix

Branch lr/rocm7 adds a per-device "broken" flag with an automatic fallback to hipMalloc:

  • A global POOL_ALLOC_BROKEN::LockedObject(Set{Int64}) tracks device IDs where pool
    allocation has failed.
  • On first allocation, hipMallocFromPoolAsync is attempted. If it returns NULL (or exhausts
    all retry phases), the device is added to POOL_ALLOC_BROKEN and all subsequent allocations
    on that device use synchronous hipMalloc.
  • HIPBuffer gains a pool_alloc::Bool field so the correct free function is used
    (hipFreeAsync for pool buffers, hipFree otherwise) — mixing them causes memory corruption.
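
The fallback logic in the list above can be sketched in plain Julia without touching a GPU. This is a minimal illustration, not the code on the lr/rocm7 branch: fake_pool_alloc and fake_sync_alloc are hypothetical stand-ins for hipMallocFromPoolAsync and hipMalloc, SketchBuffer stands in for HIPBuffer, a plain Set guarded by a ReentrantLock stands in for the LockedObject, and the retry phases are left out.

```julia
# Sketch only: fake_pool_alloc / fake_sync_alloc are hypothetical stand-ins for
# hipMallocFromPoolAsync / hipMalloc and are NOT AMDGPU.jl functions.

# Device IDs on which pool allocation has failed, guarded by a lock.
const POOL_ALLOC_BROKEN = Set{Int64}()
const POOL_ALLOC_LOCK = ReentrantLock()

# Buffer that remembers which allocator produced it (stand-in for HIPBuffer).
struct SketchBuffer
    ptr::Ptr{Cvoid}
    bytesize::Int
    pool_alloc::Bool
end

# Stand-in pool allocator: pretend device 0 is affected and gets a NULL pointer.
fake_pool_alloc(dev::Int64, bytesize::Int) =
    dev == 0 ? C_NULL : Ptr{Cvoid}(UInt(0x1000))

# Stand-in synchronous allocator: always returns a non-NULL pointer.
fake_sync_alloc(dev::Int64, bytesize::Int) = Ptr{Cvoid}(UInt(0x2000))

is_pool_broken(dev::Int64) = lock(() -> dev in POOL_ALLOC_BROKEN, POOL_ALLOC_LOCK)

function sketch_alloc(dev::Int64, bytesize::Int)
    if !is_pool_broken(dev)
        ptr = fake_pool_alloc(dev, bytesize)
        ptr != C_NULL && return SketchBuffer(ptr, bytesize, true)
        # Pool allocation returned NULL: flag the device so later calls skip the pool.
        lock(() -> push!(POOL_ALLOC_BROKEN, dev), POOL_ALLOC_LOCK)
    end
    # Either the device was already flagged or the pool attempt just failed.
    return SketchBuffer(fake_sync_alloc(dev, bytesize), bytesize, false)
end

# Name of the free function that matches the allocator that produced the buffer.
free_kind(buf::SketchBuffer) = buf.pool_alloc ? :hipFreeAsync : :hipFree
```

In this sketch, calling sketch_alloc(0, 24) takes the synchronous path and flags device 0, so later allocations on that device never retry the pool, while free_kind uses the stored pool_alloc field to select the matching free call.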

The change is in src/runtime/memory/hip.jl.

I would be happy to get any insights, and ideally others with access to AMD GPUs could test this. I will try it on MI300A as soon as I finalise the setup there.

Disclaimer: I iterated on the issue with Claude Code. The main purpose is debugging, and ideally we can work towards a proper fix after input from others.
