14 changes: 14 additions & 0 deletions docs/src/usage.md
@@ -115,6 +115,20 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
If using OpenMPI, the status of ROCm support can be checked via the
[`MPI.has_rocm()`](@ref) function.
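
For example, a minimal (hypothetical) guard against running a GPU-aware application on top of an MPI build without ROCm support:
```julia
using MPI
MPI.Init()
# Abort early if the underlying MPI library is not ROCm-aware
MPI.has_rocm() || error("the MPI implementation is not ROCm-aware")
```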

### Safe MPI communication
When using GPU-aware MPI (CUDA or ROCm), the (task-local) GPU stream must be synchronized before initiating MPI communication. Because GPU operations are asynchronous with respect to the host, a GPU buffer may not yet be fully updated when an MPI call is issued, which can lead to race conditions and incorrect results.

With CUDA:
```julia
# Wait for all pending operations on the task-local stream to finish
CUDA.synchronize()
MPI.Isend(my_CuArray, mpi_comm; dest, tag)
```
And with AMDGPU:
```julia
# Wait for all pending operations on the task-local stream to finish
AMDGPU.synchronize()
MPI.Allreduce!(my_ROCArray, +, mpi_comm)
```
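
For non-blocking point-to-point communication, the same rule applies to both the send and the receive buffers. A minimal two-rank sketch, assuming CUDA and the keyword-argument `Isend`/`Irecv!` API (the buffer names are illustrative only):
```julia
using MPI, CUDA

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
other = 1 - rank                  # assumes exactly two ranks

send_buf = CUDA.fill(Float64(rank), 1024)
recv_buf = CUDA.zeros(Float64, 1024)

# Ensure the device buffers are fully written before handing them to MPI
CUDA.synchronize()
sreq = MPI.Isend(send_buf, comm; dest=other, tag=0)
rreq = MPI.Irecv!(recv_buf, comm; source=other, tag=0)
MPI.Waitall([sreq, rreq])
```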

### Multiple GPUs per node

In a configuration with multiple GPUs per node, mapping GPU IDs to node-local MPI ranks can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly (e.g. via `CUDA_VISIBLE_DEVICES` or `ROCR_VISIBLE_DEVICES`); a sketch of option (1) follows below.
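
A minimal sketch of option (1), assuming CUDA (the same pattern applies with AMDGPU's device-selection functions):
```julia
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# Split the global communicator into per-node communicators
comm_node = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
rank_node = MPI.Comm_rank(comm_node)

# Use the node-local rank to select a device (device ordinals are 0-based)
CUDA.device!(rank_node % length(CUDA.devices()))
```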