14 changes: 14 additions & 0 deletions docs/src/usage.md
@@ -115,6 +115,20 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
If using OpenMPI, the status of ROCm support can be checked via the
[`MPI.has_rocm()`](@ref) function.
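
For example, a minimal (hypothetical) guard against running a GPU-aware application on top of an MPI build without ROCm support:
```julia
using MPI
MPI.Init()
# Abort early if the underlying MPI library is not ROCm-aware
MPI.has_rocm() || error("the MPI implementation is not ROCm-aware")
```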

### Safe MPI communication
When using GPU-aware MPI (CUDA or ROCm), the (task-local) GPU stream must be synchronized before initiating MPI communication. Because GPU operations are asynchronous with respect to the host, a GPU buffer may not yet be fully updated when an MPI call is issued, which can lead to race conditions and incorrect results.

With CUDA:
```julia
# Wait for all pending operations on the task-local stream to finish
CUDA.synchronize()
MPI.Isend(my_CuArray, mpi_comm; dest, tag)
```
And with AMDGPU:
```julia
# Wait for all pending operations on the task-local stream to finish
AMDGPU.synchronize()
MPI.Allreduce!(my_ROCArray, +, mpi_comm)
```
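
For non-blocking point-to-point communication, the same rule applies to both the send and the receive buffers. A minimal two-rank sketch, assuming CUDA and the keyword-argument `Isend`/`Irecv!` API (the buffer names are illustrative only):
```julia
using MPI, CUDA

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
other = 1 - rank                  # assumes exactly two ranks

send_buf = CUDA.fill(Float64(rank), 1024)
recv_buf = CUDA.zeros(Float64, 1024)

# Ensure the device buffers are fully written before handing them to MPI
CUDA.synchronize()
sreq = MPI.Isend(send_buf, comm; dest=other, tag=0)
rreq = MPI.Irecv!(recv_buf, comm; source=other, tag=0)
MPI.Waitall([sreq, rreq])
```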

### Multiple GPUs per node

In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly (e.g. via `CUDA_VISIBLE_DEVICES` or `ROCR_VISIBLE_DEVICES`); a sketch of option (1) follows below.
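
A minimal sketch of option (1), assuming CUDA (the same pattern applies with AMDGPU's device-selection functions):
```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# Split the global communicator into per-node communicators
comm_node = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
rank_node = MPI.Comm_rank(comm_node)

# Use the node-local rank to select a device (device ordinals are 0-based)
CUDA.device!(rank_node % length(CUDA.devices()))
```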