Add ROCm/HIP compatibility to CuMesh, enabling all 3 extensions #30

Open: ZJLi2013 wants to merge 1 commit into JeffreyXiang:main from PhysicalAI-AIM:rocm

ZJLi2013 commented Apr 8, 2026

Add ROCm/HIP compatibility to CuMesh, enabling all 3 extensions (cumesh._C, cumesh._cubvh, cumesh._cumesh_xatlas) to compile and run on AMD GPUs.

Changes (4 files)

  • src/atlas.cu: Replace cuda::std::plus() with portable cub::Sum() — the CCCL cuda::std header is not available on HIP
  • src/clean_up.cu: Use rocprim::tuple for int3_decomposer on HIP via #ifdef, because cuda::std::tuple and thrust::tuple are both unavailable or broken on ROCm 6.4
  • src/dtypes.cuh: Add __host__ qualifier to all Vec3f and QEM methods — hipCUB's DeviceSegmentedReduce template instantiation requires host-callable constructors
  • setup.py: Guard cubvh-specific nvcc flags (--extended-lambda, -U__CUDA_NO_HALF_*) behind IS_HIP check — these flags are NVIDIA-specific and cause errors with hipcc
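The #ifdef-selected decomposer described for src/clean_up.cu can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's exact code: the `key_tuple` alias, the empty `__host__`/`__device__` macros, and the stand-in `int3` struct are hypothetical helpers added so the sketch also builds with a plain host compiler, and `std::tuple` is only a host-side placeholder for whatever tuple type the existing CUDA path uses.

```cpp
#include <cassert>
#include <tuple>

// Assumption for illustration: let the sketch build without nvcc/hipcc.
#if !defined(__CUDACC__) && !defined(__HIPCC__)
#define __host__
#define __device__
struct int3 { int x, y, z; };  // stand-in for the GPU runtime's int3
#endif

#if defined(__HIP_PLATFORM_AMD__)
// HIP branch: rocPRIM's tuple, as in the PR's review snippet.
#include <rocprim/types/tuple.hpp>
template <class... Ts> using key_tuple = ::rocprim::tuple<Ts...>;
#else
// Placeholder branch (hypothetical): the PR keeps its existing CUDA tuple here.
template <class... Ts> using key_tuple = std::tuple<Ts...>;
#endif

// A decomposer exposes the fields of a composite key as a tuple of references,
// so a radix sort can order keys field by field without copying them.
struct int3_decomposer {
    __host__ __device__ key_tuple<int&, int&, int&> operator()(int3& key) const {
        return key_tuple<int&, int&, int&>{key.x, key.y, key.z};
    }
};
```

Because the tuple holds references, writing through an element changes the original key; that aliasing is what lets the sort treat an int3 as three sortable integer fields.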

What works

  • All CUB device algorithms (RadixSort, Scan, Select, Reduce, ReduceByKey, SegmentedReduce) are automatically converted to hipCUB by PyTorch's hipify
  • All CUDA runtime APIs (cudaMalloc, cudaMemcpy, etc.) are automatically converted to HIP equivalents
  • cubvh submodule (half precision + Eigen) compiles without additional changes
  • xatlas (CPU-only) is unaffected

What is NOT covered

  • No functional changes to CUDA code paths — all changes are additive #ifdef guards or portable replacements
  • No new dependencies introduced

Usage on AMD GPUs

git clone --recursive https://github.com/JeffreyXiang/CuMesh.git
cd CuMesh
GPU_ARCHS=gfx942 pip install . --no-build-isolation

Test Environment

GPU: AMD Instinct MI300X (gfx942)
ROCm: 6.4.3
PyTorch: 2.6.0
Docker: rocm/pytorch:rocm6.4.3_ubuntu24.04_py3.12_pytorch_release_2.6.0

Test Status
Compile all 3 extensions (_C, _cubvh, _cumesh_xatlas)
Import all modules
Mesh init + read (vertex/face roundtrip)
Compute face normals
Compute vertex normals
Mesh simplification (4→2 faces)
Remove duplicate faces

Copilot AI left a comment

Pull request overview

This PR adds ROCm/HIP compatibility to CuMesh so that all three compiled extensions can build and run on AMD GPUs, primarily by replacing CUDA-only types/APIs with portable alternatives and gating NVIDIA-specific build flags.

Changes:

  • Use cub::Sum() in atlas.cu to avoid relying on CCCL cuda::std headers on HIP.
  • Introduce a HIP-specific int3_decomposer implementation using rocprim::tuple for radix sort key decomposition.
  • Make Vec3f/QEM methods host-callable to satisfy hipCUB template instantiation requirements; gate NVIDIA-only nvcc flags for cubvh in setup.py.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File: Description
  • src/dtypes.cuh: Adds __host__ qualifiers to vector/QEM methods to support hipCUB instantiation constraints.
  • src/clean_up.cu: Adds HIP-specific tuple-based decomposer for int3 key decomposition during radix sort.
  • src/atlas.cu: Uses cub::Sum() unconditionally for reduce-by-key operator portability.
  • setup.py: Skips NVIDIA-only nvcc flags when building under HIP/ROCm.
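To make the reduce-by-key operator in src/atlas.cu concrete, here is a host-only reference sketch of what a reduce-by-key with an addition functor computes. `SumOp` and `reduce_by_key_host` are hypothetical names invented for this illustration; they only mimic the semantics (each run of consecutive equal keys collapses to one reduced value) and are not the CUB or hipCUB API.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for an addition functor such as cub::Sum.
struct SumOp {
    template <typename T>
    T operator()(const T& a, const T& b) const { return a + b; }
};

// Reference reduce-by-key: every run of consecutive equal keys becomes one
// (key, reduced value) pair. Keys that repeat non-adjacently form separate runs.
template <typename K, typename V, typename Op>
std::vector<std::pair<K, V>> reduce_by_key_host(const std::vector<K>& keys,
                                                const std::vector<V>& values,
                                                Op op) {
    std::vector<std::pair<K, V>> out;
    for (std::size_t i = 0; i < keys.size() && i < values.size(); ++i) {
        if (!out.empty() && out.back().first == keys[i]) {
            // Same run: fold the value into the running result.
            out.back().second = op(out.back().second, values[i]);
        } else {
            // New run: start a fresh (key, value) pair.
            out.emplace_back(keys[i], values[i]);
        }
    }
    return out;
}
```

Since the operator is just a callable returning a + b, any functor with that shape yields the same result, which is why swapping one addition functor for another does not change the CUDA code path's output.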

Comment thread src/clean_up.cu
Comment on lines +230 to +236
#if defined(__HIP_PLATFORM_AMD__)
#include <rocprim/types/tuple.hpp>
struct int3_decomposer
{
    __host__ __device__ ::rocprim::tuple<int&, int&, int&> operator()(int3& key) const
    {
        return ::rocprim::tuple<int&, int&, int&>{key.x, key.y, key.z};

Comment thread src/dtypes.cuh
Comment on lines +144 to 152
 __host__ __device__ __forceinline__ Vec3f Vec3f::normalized() const {
     float inv_norm = rsqrtf(x * x + y * y + z * z);
     return Vec3f(x * inv_norm, y * inv_norm, z * inv_norm);
 }


-__device__ __forceinline__ void Vec3f::normalize() {
+__host__ __device__ __forceinline__ void Vec3f::normalize() {
     float inv_norm = rsqrtf(x * x + y * y + z * z);
     x *= inv_norm;
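
As a rough illustration of what the added __host__ qualifier buys, the sketch below declares a minimal Vec3f whose methods can be called from ordinary host code. It is a simplified stand-in, not the contents of src/dtypes.cuh: the struct layout is assumed, the empty qualifier macros are hypothetical shims for a plain host compiler, and rsqrtf is swapped for 1.0f / std::sqrt so the code builds without GPU headers.

```cpp
#include <cassert>
#include <cmath>

// Assumption for illustration: map GPU qualifiers away for a plain host build.
#if !defined(__CUDACC__) && !defined(__HIPCC__)
#define __host__
#define __device__
#define __forceinline__ inline
#endif

// Simplified stand-in for the PR's Vec3f (layout assumed).
struct Vec3f {
    float x, y, z;
    __host__ __device__ Vec3f(float x_, float y_, float z_) : x(x_), y(y_), z(z_) {}
    __host__ __device__ __forceinline__ Vec3f normalized() const;
    __host__ __device__ __forceinline__ void normalize();
};

// Returns a unit-length copy; 1/sqrt replaces rsqrtf in this host-friendly sketch.
__host__ __device__ __forceinline__ Vec3f Vec3f::normalized() const {
    float inv_norm = 1.0f / std::sqrt(x * x + y * y + z * z);
    return Vec3f(x * inv_norm, y * inv_norm, z * inv_norm);
}

// Scales the vector in place to unit length (undefined for the zero vector).
__host__ __device__ __forceinline__ void Vec3f::normalize() {
    float inv_norm = 1.0f / std::sqrt(x * x + y * y + z * z);
    x *= inv_norm;
    y *= inv_norm;
    z *= inv_norm;
}
```

With both qualifiers present, the same methods can be instantiated by host-side template code and by device kernels, which is the requirement the PR description attributes to hipCUB's DeviceSegmentedReduce.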