diff --git a/docs/src/usage.md b/docs/src/usage.md
index ee14cc31e..3c87d1916 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -115,6 +115,37 @@
 your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
+### Safe MPI communication
+When using GPU-aware MPI (CUDA or ROCm), the task-local GPU stream must be
+synchronized before initiating MPI communication. Because GPU operations are
+asynchronous with respect to the host, GPU buffers may not yet be fully updated
+when an MPI call is issued, which can lead to race conditions and incorrect results.
+
+With CUDA:
+```julia
+CUDA.synchronize()
+MPI.Isend(my_CuArray, mpi_comm; dest=dest, tag=tag)
+```
+And with AMDGPU:
+```julia
+AMDGPU.synchronize()
+MPI.Allreduce!(my_ROCArray, +, mpi_comm)
+```
+
 ### Multiple GPUs per node
 In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.
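+
+For the application-side approach, a minimal sketch (assuming CUDA, one GPU per
+node-local rank, and illustrative names such as `local_comm`) could look like:
+```julia
+using MPI, CUDA
+
+MPI.Init()
+comm = MPI.COMM_WORLD
+# Split COMM_WORLD into node-local communicators to obtain a node-local rank
+local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
+local_rank = MPI.Comm_rank(local_comm)
+# Select the GPU matching this node-local rank (0-based device index)
+CUDA.device!(local_rank % length(CUDA.devices()))
+```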