Move towards using MPL in the GPU version #335

Merged
samhatfield merged 14 commits into develop from move_towards_gpu_mpl on Feb 25, 2026

Conversation

@samhatfield
Collaborator

This is a slightly less brute-force alternative to PR #334 which also lays the groundwork for eventually relying entirely on MPL in the GPU code path. Let me explain...

With this branch, if you disable GPU_AWARE_MPI, an MPI library is not required by ecTrans. No such library will be linked against and there will be no calls to MPI in any compiled object code. Whether MPI is called "under the hood" of MPL depends entirely on whether you compiled FIAT with or without MPI. In the latter case, the MPI serial fallback will be used. This means you can test on GPU platforms without an MPI installation by simply building FIAT without MPI and disabling GPU_AWARE_MPI.

For now, GPU_AWARE_MPI requires direct calls to MPI, hence only for that configuration do we need to link against MPI::MPI_Fortran explicitly. Eventually we should have support to pass GPU buffers to MPL, and when that happens we can finally delete all references to MPI from ecTrans and rely entirely on MPL, much as we already do for the CPU version.
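
A rough CMake sketch of the linking behaviour described above. The target name `trans_gpu`, the `fiat` link target, and the variable `HAVE_GPU_AWARE_MPI` are assumptions for illustration rather than lines from this branch; only `MPI::MPI_Fortran` and the GPU_AWARE_MPI condition come from the description.

```cmake
# Sketch only: `trans_gpu`, `fiat` and HAVE_GPU_AWARE_MPI are assumed names.

# MPL comes from FIAT and is always linked. Whether MPL itself calls MPI
# or the serial fallback is decided when FIAT is built, not here.
target_link_libraries( trans_gpu PUBLIC fiat )

if( HAVE_GPU_AWARE_MPI )
  # GPU-aware communication still calls MPI directly, so MPI is searched
  # for and linked only in this configuration.
  find_package( MPI REQUIRED COMPONENTS Fortran )
  target_link_libraries( trans_gpu PRIVATE MPI::MPI_Fortran )
endif()
```

With GPU_AWARE_MPI disabled, the `find_package` call is never reached, so configuring would not need an MPI installation at all.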

@wdeconinck
Collaborator

This is a great step!

  1. I would keep using HAVE_MPI. It is customary and shorter

  2. Since you're working on this now, I was thinking we could immediately take advantage of MPL with the MPI_F08 backend as part of this PR.
    I have created a fiat PR, "Export availability of MPL_F08 to downstream packages" (fiat#74), to be merged as soon as possible. It exports a variable that you can query to see whether fiat was compiled with the MPI_F08 API. You can already use the variable fiat_HAVE_MPL_F08 even before that PR is merged, as it evaluates to FALSE when not defined.

    • If fiat has MPI_F08, then we can use MPL directly even for GPU-aware MPI, and we can already test this.
    • If fiat was not compiled with MPI_F08 (so previous releases or MPL_F77_DEPRECATED=ON), we need to keep using MPI_F08 explicitly for now.

So the logic needs to be a bit different for this to work.
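
A minimal sketch of what that different logic could look like in CMake. `fiat_HAVE_MPL_F08` is the variable named above; `HAVE_GPU_AWARE_MPI` and `USE_RAW_MPI` are hypothetical names chosen for illustration, not the ones used in this PR.

```cmake
# Sketch only: HAVE_GPU_AWARE_MPI and USE_RAW_MPI are hypothetical names.

# An undefined fiat_HAVE_MPL_F08 (older fiat releases) evaluates to FALSE
# in an if() condition, so no separate existence check is needed.
if( HAVE_GPU_AWARE_MPI AND NOT fiat_HAVE_MPL_F08 )
  # MPL has no MPI_F08 backend here: keep calling MPI_F08 explicitly.
  set( USE_RAW_MPI ON )
  find_package( MPI REQUIRED COMPONENTS Fortran )
else()
  # Either GPU-aware MPI is off, or MPL can be used directly.
  set( USE_RAW_MPI OFF )
endif()
```

Under this arrangement, direct MPI calls would be needed only in the single configuration where GPU-aware MPI is on and fiat lacks the MPI_F08 backend.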

@samhatfield
Collaborator Author

samhatfield commented Nov 24, 2025

> 1. I would keep using HAVE_MPI. It is customary and shorter

There are preexisting references to ectrans_HAVE_MPI, e.g. in transi and in ectrans-import.cmake.in. Is it the case that this variable is automatically set by ecbuild_add_option( FEATURE MPI ... )? If so, now that that option doesn't exist anymore, I would have to replace those instances with HAVE_MPI (and set( HAVE_MPI ${fiat_HAVE_MPI} )). Not a problem, but then I wonder if it's better simply to delete the line from ectrans-import.cmake.in, as this is not a feature of ecTrans anymore.

@samhatfield
Collaborator Author

samhatfield commented Nov 24, 2025

Following offline discussions with @wdeconinck, I've added support for the MPI_F08 feature (on by default) of FIAT. This further reduces the configurations where it's necessary to call MPI directly (what I call "raw" MPI). The only remaining configuration in fact is when ecTrans is being built against a FIAT version earlier than {next version to be released} (a new release with MPI_F08 compatibility hasn't been made yet).

If in future we made {next FIAT version to be released} the minimum supported FIAT version, we could simply delete all raw MPI calls.

I will do some testing to make sure everything is working, before this can be merged.

@samhatfield
Collaborator Author

Problems on LUMI... I wonder if we have to add an exception for CCE.

@wdeconinck
Collaborator

> Problems on LUMI... I wonder if we have to add an exception for CCE.

I think this is the Cray issue biting us again: #157 (comment)
The MPI_F08 API seems broken, for Cray at least. An exception seems warranted, but we should also see whether this has been fixed in the meantime in a CCE available on LUMI.

@samhatfield
Collaborator Author

Unfortunately I think we will have to enable MPL_F77_DEPRECATED when building FIAT on LUMI. I get numerous MPI errors when testing even ecTrans 1.7.0, when FIAT:develop is used. I'll document and "fix" this in a separate PR.

@samhatfield
Collaborator Author

Wow, what a nightmare. After a lot of tedious debugging, I noticed that I had removed the GPU_AWARE_MPI by accident. That's why the LUMI adjoint test failed (in fact you could argue other tests were failing but silently). When I put this back (correctly), the AC GPU tests started failing. The issue is "cannot find MPL_RECV/SEND", which may indicate an issue with passing GPU buffers to those subroutines. It looks like we may have to fall back on MPI_F77 for NVHPC.

@samhatfield
Collaborator Author

During debugging I noticed some issues with TRMTOLAD and TRLTOMAD which at one point I thought were the culprits, but they turned out to be a red herring. Still, we should fix those, so I've opened another PR (#340) and rebased this branch against that one.

@samhatfield
Collaborator Author

The plot thickens: TRGTOL builds fine on AC GPU with MPL_F08. TRLTOG does not, even though in both cases MPL_RECV and MPL_SEND are called in the same way with the same type of arguments.

Comment thread src/trans/gpu/CMakeLists.txt Outdated
Comment thread src/trans/gpu/internal/trmtolad_mod.F90
@samhatfield
Collaborator Author

I completely forgot about this PR. The non-GPU-aware MPI functionality is currently broken, so it would be good to get these changes in so it's fixed.

To remind you: with this PR, when GPU-aware MPI is disabled we fall back on MPL. This means that we don't need to search for an MPI library when GPU-aware MPI is disabled. That search is currently missing, which is why configuring currently fails when GPU-aware MPI is disabled.

Based on my experiments above, it seems that we can't yet rely on MPL for direct GPU-GPU communication, so I suggest that for now we continue to rely on raw MPI calls.

Happy to merge this @wdeconinck?

@wdeconinck
Collaborator

OK for me; but can we verify it works on lumi-g?

@samhatfield
Collaborator Author

> OK for me; but can we verify it works on lumi-g?

I'll take a look.

@samhatfield
Collaborator Author

LUMI seems to be a bit messed up at the moment. We use CCE 17 in the CI. Well, this isn't available anymore, only CCE 19, and I'm not even able to build FIAT with that version (internal compiler error).

@samhatfield samhatfield added this to the 1.8.0 milestone Feb 17, 2026
samhatfield and others added 9 commits February 24, 2026 17:49
This is set when GPU-aware communication is enabled and FIAT doesn't
support MPI_F08 (either because it's disabled, or because we're using an
older version of FIAT which doesn't have MPI_F08 support at all).
Co-authored-by: Willem Deconinck <willem.deconinck@ecmwf.int>
@wdeconinck
Collaborator

Just adding a link to ecmwf-ifs/fiat#90 here. We should revisit the use of raw MPI when MPL supports device-resident arrays.

@samhatfield samhatfield merged commit 7dbe7a5 into develop Feb 25, 2026
47 of 49 checks passed
@samhatfield samhatfield deleted the move_towards_gpu_mpl branch February 25, 2026 10:28

Labels

enhancement, gpu
