[Issue]: TopoSystem::MatchGpuAndNic segfaults on RDNA 4 (gfx1201) + Pollara ionic NICs #374

@jlochhead

Problem Description

I have a small GPU lab with two machines, each with 2 x R9700 PCIe GPUs and 2 x Pollara NICs. I wanted to run disaggregated inference with MoRI instead of RIXL, but hit the segfault described below.

I didn't submit a PR because I'm not sure if you want PCIe GPUs supported.

Hardware:

  • GPU: AMD Radeon AI PRO R9700 (gfx1201, RDNA 4)
  • NIC: AMD Pollara AI NIC (ionic_0, ionic_1)
  • ROCm: 7.2.0
  • MoRI: bundled in vllm/vllm-openai-rocm:nightly (vLLM 0.22.1rc1)

Symptom: MoRIIOConnector segfaults on the first KV transfer request with the following stack trace:

mori::application::CollectAndSortCandidates
mori::application::TopoSystem::MatchAllGpusAndNics
mori::application::TopoSystem::MatchGpuAndNic
mori::io::RdmaManager::Search
mori::io::ControlPlaneServer::BuildRdmaConn
mori::io::RdmaBackend::CreateSession

Root cause:
In src/application/topology/pci.cpp, CreateTopoNodePciFrom() only recognizes PCI class 0x1200 (Processing Accelerator) as a GPU:

} else if (cls == 0x1200) {
    return TopoNodePci::CreateGpu(bus, numa);
}

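RDNA 4 consumer/workstation GPUs (R9700) enumerate as PCI class 0x0300 (VGA compatible controller), so this branch never creates a GPU node for them. Presumably that leaves the GPU/NIC matching code with no GPU entry to work with, which is where the crash occurs.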
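To check how a given device enumerates, the class code can be read from sysfs. The following is a minimal sketch, not MoRI code: `ParsePciClass` and `ReadPciClass` are hypothetical helper names, and the only assumption about the system is the standard Linux `/sys/bus/pci/devices/<address>/class` attribute, which holds a 24-bit value such as `0x030000` (base class, subclass, programming interface).

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse the text of a sysfs PCI `class` attribute
// (e.g. "0x030000\n") and return the 16-bit base-class/subclass code
// (e.g. 0x0300) by dropping the low programming-interface byte.
inline std::optional<uint16_t> ParsePciClass(const std::string& text) {
  try {
    unsigned long value = std::stoul(text, nullptr, 16);
    if (value > 0xFFFFFFul) return std::nullopt;  // not a 24-bit class code
    return static_cast<uint16_t>(value >> 8);
  } catch (const std::exception&) {
    return std::nullopt;  // empty or non-hex input
  }
}

// Hypothetical helper: read the class code for a PCI address such as
// "0000:03:00.0". Returns std::nullopt if the file cannot be read.
inline std::optional<uint16_t> ReadPciClass(const std::string& address) {
  std::ifstream in("/sys/bus/pci/devices/" + address + "/class");
  std::string text;
  if (!in || !std::getline(in, text)) return std::nullopt;
  return ParsePciClass(text);
}
```

On the hardware described here, `ReadPciClass` on an R9700's PCI address would be expected to yield 0x0300 rather than 0x1200, matching the root cause above.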

Proposed fix:

} else if (cls == 0x1200 || cls == 0x0300 || cls == 0x0302) {
    return TopoNodePci::CreateGpu(bus, numa);
}

Here, 0x0302 is the 3D controller class used by some workstation GPUs.
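One possible way to keep the widened check readable is a small predicate with named class codes. This is only a sketch under stated assumptions: the constant and function names are hypothetical and not part of MoRI. It also includes an optional vendor check, because class 0x0300 alone would match any VGA device in the machine (for example a server BMC's display adapter); whether MoRI has the vendor ID available at this point is an assumption. 0x1002 is AMD's PCI vendor ID.

```cpp
#include <cstdint>

// PCI base-class/subclass codes (names are illustrative).
constexpr uint16_t kPciClassVga = 0x0300;    // VGA compatible controller
constexpr uint16_t kPciClass3d = 0x0302;     // 3D controller
constexpr uint16_t kPciClassAccel = 0x1200;  // Processing accelerator

// AMD's PCI vendor ID.
constexpr uint16_t kPciVendorAmd = 0x1002;

// True if the class code is one of the three treated as a GPU in the
// proposed fix.
constexpr bool IsGpuPciClass(uint16_t cls) {
  return cls == kPciClassAccel || cls == kPciClassVga || cls == kPciClass3d;
}

// Stricter variant: additionally require an AMD vendor ID so that
// unrelated VGA devices are not classified as GPUs.
constexpr bool IsAmdGpu(uint16_t vendor, uint16_t cls) {
  return vendor == kPciVendorAmd && IsGpuPciClass(cls);
}
```

With a helper like this, the branch in `CreateTopoNodePciFrom()` could read `else if (IsGpuPciClass(cls))`, which is equivalent to the condition in the proposed fix.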

Verification:
Applied this fix locally, rebuilt MoRI, and confirmed P/D disaggregation works end-to-end on R9700 + Pollara NICs with no segfault.

Operating System

Ubuntu 24.04

CPU

AMD EPYC 9535 64-Core Processor

GPU

AMD R9700

ROCm Version

ROCm 7.2

ROCm Component

No response

Steps to Reproduce

No response

Additional Information

No response
