Problem Description
I have a small GPU lab with two machines, each with 2 x R9700 PCIe GPUs and 2 x Pollara NICs. I wanted to run disaggregated inference with MoRI instead of RIXL, but hit the segfault described below.
I haven't submitted a PR because I'm not sure whether you want to support PCIe GPUs.
Hardware:
- GPU: AMD Radeon AI PRO R9700 (gfx1201, RDNA 4)
- NIC: AMD Pollara AI NIC (ionic_0, ionic_1)
- ROCm: 7.2.0
- MoRI: bundled in vllm/vllm-openai-rocm:nightly (vLLM 0.22.1rc1)
Symptom:
MoRIIOConnector segfaults on the first KV transfer request with the following stack trace:
    mori::application::CollectAndSortCandidates
    mori::application::TopoSystem::MatchAllGpusAndNics
    mori::application::TopoSystem::MatchGpuAndNic
    mori::io::RdmaManager::Search
    mori::io::ControlPlaneServer::BuildRdmaConn
    mori::io::RdmaBackend::CreateSession
Root cause:
In src/application/topology/pci.cpp, CreateTopoNodePciFrom() only recognizes PCI class 0x1200 (Processing Accelerator) as a GPU:
    } else if (cls == 0x1200) {
      return TopoNodePci::CreateGpu(bus, numa);
    }
RDNA 4 consumer/workstation GPUs (R9700) enumerate as PCI class 0x0300 (VGA compatible controller).
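To check how a given card enumerates, Linux exposes the class code in /sys/bus/pci/devices/<BDF>/class as a 24-bit hex string (class, subclass, programming interface), for example 0x030000 for a VGA controller. The sketch below reduces that string to the 16-bit value that the comparison above uses. ParsePciClass is a hypothetical helper for illustration, not a MoRI function, and I am assuming (not confirming) that cls in pci.cpp holds the same class/subclass pair.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Hypothetical helper, not part of MoRI.
// Takes the text of /sys/bus/pci/devices/<BDF>/class (e.g. "0x030000\n")
// and returns the upper 16 bits (class + subclass), dropping the low
// programming-interface byte. Returns nullopt if the text is not a
// valid 24-bit hex value.
std::optional<uint16_t> ParsePciClass(const std::string& text) {
  try {
    std::size_t consumed = 0;
    // Base 16 accepts an optional "0x" prefix and stops at the newline.
    unsigned long value = std::stoul(text, &consumed, 16);
    if (consumed == 0 || value > 0xFFFFFFul) return std::nullopt;
    return static_cast<uint16_t>(value >> 8);
  } catch (const std::exception&) {
    // std::stoul throws on non-numeric or out-of-range input.
    return std::nullopt;
  }
}
```

With this, the string 0x030000 maps to 0x0300 and 0x120000 maps to 0x1200, which is why the existing check never matches the R9700.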
Proposed fix:
    } else if (cls == 0x1200 || cls == 0x0300 || cls == 0x0302) {
      return TopoNodePci::CreateGpu(bus, numa);
    }
Here 0x0302 covers 3D controllers, the class used by some workstation GPUs.
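One possible refinement, offered as a sketch and not as what MoRI does today: server boards may expose a BMC graphics device that also reports class 0x0300, so accepting every VGA-class device could pick up a non-GPU. The standalone predicate below keeps the current 0x1200 behaviour unchanged and only accepts the two display classes when the PCI vendor ID is AMD (0x1002). IsGpuCandidate and the constant names are hypothetical, and whether CreateTopoNodePciFrom() has the vendor ID available at that point is an assumption I have not checked.

```cpp
#include <cstdint>

// Standard PCI class/subclass codes.
constexpr uint16_t kPciClassVga = 0x0300;             // VGA compatible controller
constexpr uint16_t kPciClass3d = 0x0302;              // 3D controller
constexpr uint16_t kPciClassProcessingAccel = 0x1200; // Processing accelerator

// PCI vendor ID assigned to AMD/ATI.
constexpr uint16_t kPciVendorAmd = 0x1002;

// Hypothetical predicate, not part of MoRI.
// Returns true if a PCI function should be treated as a GPU node.
bool IsGpuCandidate(uint16_t cls, uint16_t vendor) {
  // Existing behaviour: any processing accelerator counts as a GPU.
  if (cls == kPciClassProcessingAccel) return true;
  // Proposed extension: display-class devices, restricted to AMD so that
  // other VGA-class devices on the board are not matched.
  if (cls == kPciClassVga || cls == kPciClass3d) return vendor == kPciVendorAmd;
  return false;
}
```

The vendor filter is optional. The plain three-way class check above is what I actually tested.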
Verification:
I applied this fix locally, rebuilt MoRI, and confirmed that P/D (prefill/decode) disaggregation works end-to-end on R9700 + Pollara NICs with no segfault.
Operating System
Ubuntu 24.04
CPU
AMD EPYC 9535 64-Core Processor
GPU
AMD R9700
ROCm Version
ROCm 7.2
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
rocminfo --support output
Additional Information
No response