Move towards using MPL in the GPU version by samhatfield · Pull Request #335 · ecmwf-ifs/ectrans

samhatfield · 2025-11-14T15:15:20Z

This is a slightly less brute-force alternative to PR #334 which also lays the groundwork for eventually relying entirely on MPL in the GPU code path. Let me explain...

With this branch, if you disable GPU_AWARE_MPI, an MPI library is not required by ecTrans. No such library will be linked against and there will be no calls to MPI in any compiled object code. Whether MPI is called "under the hood" of MPL depends entirely on whether you compiled FIAT with or without MPI. In the latter case, the MPI serial fallback will be used. This means you can test on GPU platforms without an MPI installation by simply building FIAT without MPI and disabling GPU_AWARE_MPI.

For now, GPU_AWARE_MPI requires direct calls to MPI, hence only for that configuration do we need to link against MPI::MPI_Fortran explicitly. Eventually we should have support to pass GPU buffers to MPL, and when that happens we can finally delete all references to MPI from ecTrans and rely entirely on MPL, much as we already do for the CPU version.
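
A minimal CMake sketch of that linking rule, for illustration only. The project name, the `example_trans_gpu` target, and the plain `option()` standing in for the GPU_AWARE_MPI feature flag are assumptions, not the actual ecTrans build files.

```cmake
cmake_minimum_required( VERSION 3.12 )
project( gpu_mpi_link_sketch LANGUAGES Fortran )

# Hypothetical stand-in for the GPU_AWARE_MPI feature flag.
option( HAVE_GPU_AWARE_MPI "Pass device buffers directly to MPI" OFF )

# Hypothetical stand-in for the GPU transform library target.
add_library( example_trans_gpu INTERFACE )

if( HAVE_GPU_AWARE_MPI )
  # Only this configuration calls MPI directly, so only here is an
  # MPI installation searched for and linked.
  find_package( MPI REQUIRED COMPONENTS Fortran )
  target_link_libraries( example_trans_gpu INTERFACE MPI::MPI_Fortran )
endif()

# With the flag off, no MPI search happens at all; communication would go
# through MPL, which is provided by FIAT (linking to FIAT is omitted here).
```

With the flag left at its default in this sketch, configuring succeeds on a machine without any MPI installation.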

wdeconinck · 2025-11-24T11:12:11Z

This is a great step!

I would keep using HAVE_MPI. It is customary and shorter.
Since you're on this now, I was thinking we could immediately take advantage of MPL with the MPI_F08 backend as part of this PR.
I have created a fiat PR, "Export availability of MPL_F08 to downstream packages" (fiat#74), to be merged a.s.a.p., which lets you query whether fiat was compiled with the MPI_F08 API. You can already use the variable fiat_HAVE_MPL_F08 even before that PR is merged, as it will evaluate to FALSE when not defined.
- If fiat has MPI_F08, then we can use MPL directly even for GPU-aware MPI, and we can already test this.
- If fiat was not compiled with MPI_F08 (so previous releases or MPL_F77_DEPRECATED=ON), we need to keep using MPI_F08 explicitly for now.

So the logic needs to be a bit different for this to work.
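
The decision described in the two bullets could be sketched in CMake roughly as follows. This is only an illustration: `HAVE_GPU_AWARE_MPI` is assumed to be the flag set by the GPU_AWARE_MPI option, and `EXAMPLE_USE_RAW_MPI` is a hypothetical name. Only `fiat_HAVE_MPL_F08` comes from the fiat PR mentioned above.

```cmake
# Sketch only. It uses no targets, so it can be tried in script mode, e.g.:
#   cmake -DHAVE_GPU_AWARE_MPI=ON -Dfiat_HAVE_MPL_F08=OFF -P sketch.cmake

# An undefined fiat_HAVE_MPL_F08 (older fiat) evaluates to FALSE here,
# which selects the direct-MPI path whenever GPU-aware MPI is on.
if( HAVE_GPU_AWARE_MPI AND NOT fiat_HAVE_MPL_F08 )
  set( EXAMPLE_USE_RAW_MPI ON )   # hypothetical variable name
else()
  set( EXAMPLE_USE_RAW_MPI OFF )
endif()

message( STATUS "Direct MPI calls needed: ${EXAMPLE_USE_RAW_MPI}" )
```

Direct MPI calls are then only needed in the one combination where GPU-aware MPI is requested and fiat cannot provide the MPI_F08 backend.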

samhatfield · 2025-11-24T12:20:16Z

> I would keep using HAVE_MPI. It is customary and shorter

There are preexisting references to ectrans_HAVE_MPI, e.g. in transi and in ectrans-import.cmake.in. Is it the case that this variable is automatically set by ecbuild_add_option( FEATURE MPI ... )? If so, now that that option doesn't exist anymore, I would have to replace those instances with HAVE_MPI (and set( HAVE_MPI ${fiat_HAVE_MPI} )). Not a problem, but then I wonder if it's better simply to delete the line from ectrans-import.cmake.in, as this is not a feature of ecTrans anymore.

samhatfield · 2025-11-24T16:09:50Z

Following offline discussions with @wdeconinck, I've added support for the MPI_F08 feature (on by default) of FIAT. This further reduces the configurations where it's necessary to call MPI directly (what I call "raw" MPI). The only remaining configuration in fact is when ecTrans is being built against a FIAT version earlier than {next version to be released} (a new release with MPI_F08 compatibility hasn't been made yet).

If in future we made {next FIAT version to be released} the minimum supported FIAT version, we could simply delete all raw MPI calls.

I will do some testing to make sure everything is working, before this can be merged.

samhatfield · 2025-11-24T16:46:54Z

Problems on LUMI... I wonder if we have to add an exception for CCE.

wdeconinck · 2025-11-25T01:31:51Z

> Problems on LUMI... I wonder if we have to add an exception for CCE.

I think this is again this Cray issue biting us: #157 (comment)
The MPI_F08 API seems broken, at least for Cray... An exception seems warranted, but we should also check whether this has been fixed in the meantime on LUMI with CCE.

samhatfield · 2025-11-25T11:46:45Z

Unfortunately I think we will have to enable MPL_F77_DEPRECATED when building FIAT on LUMI. I get numerous MPI errors when testing even ecTrans 1.7.0, when FIAT:develop is used. I'll document and "fix" this in a separate PR.

samhatfield · 2025-11-26T16:40:52Z

Wow, what a nightmare. After a lot of tedious debugging, I noticed that I had removed the GPU_AWARE_MPI by accident. That's why the LUMI adjoint test failed (in fact you could argue other tests were failing but silently). When I put this back (correctly), the AC GPU tests started failing. The issue is "cannot find MPL_RECV/SEND", which may indicate an issue with passing GPU buffers to those subroutines. It looks like we may have to fall back on MPI_F77 for NVHPC.

samhatfield · 2025-11-26T16:47:30Z

During debugging I noticed some issues with TRMTOLAD and TRLTOMAD which at one point I thought were the culprits, but they turned out to be a red herring. Still, we should fix those, so I've opened another PR (#340) and rebased this branch against that one.

samhatfield · 2025-11-26T17:13:34Z

The plot thickens: TRGTOL builds fine on AC GPU with MPL_F08. TRLTOG does not, even though in both cases MPL_RECV and MPL_SEND are called in the same way with the same type of arguments.

samhatfield · 2026-02-10T14:50:33Z

I completely forgot about this PR. The non-GPU-aware MPI functionality is currently broken, so it would be good to get these changes in so it's fixed.

To remind you: with this PR, when GPU-aware MPI is disabled we fall back on MPL. This means that we don't need to search for an MPI library when GPU-aware MPI is disabled. That search is currently missing, which is why configuring currently fails when GPU-aware MPI is disabled.

Based on my experiments above, it seems that we can't yet rely on MPL for direct GPU-GPU communication, so I suggest that for now we continue to rely on raw MPI calls.

Happy to merge this @wdeconinck?

wdeconinck · 2026-02-11T09:46:29Z

OK for me; but can we verify it works on lumi-g?

samhatfield · 2026-02-11T09:49:52Z

> OK for me; but can we verify it works on lumi-g?

I'll take a look.

samhatfield · 2026-02-11T17:08:19Z

LUMI seems to be a bit messed up at the moment. We use CCE 17 in the CI. Well, this isn't available anymore, only CCE 19, and I'm not even able to build FIAT with that version (internal compiler error).

wdeconinck · 2026-02-25T09:18:23Z

Just adding a link to ecmwf-ifs/fiat#90 here. We should revisit the use of raw MPI when MPL supports device resident arrays.

samhatfield added the enhancement and gpu labels Nov 14, 2025

samhatfield mentioned this pull request Nov 14, 2025

Make sure MPI is found when GPU and MPI enabled #334

Closed

samhatfield requested a review from wdeconinck November 24, 2025 09:15

samhatfield force-pushed the move_towards_gpu_mpl branch from 8f77c9d to d9dba97 Compare November 26, 2025 15:51

wdeconinck reviewed Nov 27, 2025

Comment thread src/trans/gpu/CMakeLists.txt Outdated

wdeconinck reviewed Nov 27, 2025

Comment thread src/trans/gpu/internal/trmtolad_mod.F90

samhatfield force-pushed the move_towards_gpu_mpl branch from b40c54a to 61593ab Compare February 10, 2026 14:18

samhatfield requested a review from wdeconinck February 10, 2026 14:47

wdeconinck approved these changes Feb 11, 2026

samhatfield added this to the 1.8.0 milestone Feb 17, 2026

samhatfield force-pushed the move_towards_gpu_mpl branch from 1cc80b1 to 727f8f2 Compare February 24, 2026 16:23

samhatfield added 5 commits February 24, 2026 17:49

Fix references to TRLTOMAD and TRMTOLAD

e0974ad

Add missing directives to TRMTOLAD

f94e63e

Make TRMTOLAD and TRLTOMAD more symmetric

4f4bb20

Fall back to MPL when GPU-aware MPI is disabled

72f00dc

Remove duplicate HAVE_MPI variable

9b2adfc

samhatfield and others added 9 commits February 24, 2026 17:49

Fix 1-indexing bug

00ad660

Remove redundant GPU-aware MPI definition

cd96e44

Fix reference to deprecated HAVE_MPI

0cbcdaf

Rename ectrans_HAVE_MPI to HAVE_MPI

cac89a7

Introduce USE_RAW_MPI definition

9e64734

This is set when GPU-aware communication is enabled and FIAT doesn't support MPI_F08 (either because it's disabled, or because we're using an older version of FIAT which doesn't have any MPI_F08 at all).
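
As a rough sketch of how such a definition might be passed to the compiler (assumptions: the project name, the `example_trans_gpu` target, and the `HAVE_GPU_AWARE_MPI` flag name are illustrative; only `USE_RAW_MPI` and the condition come from the commit message):

```cmake
cmake_minimum_required( VERSION 3.12 )
project( raw_mpi_definition_sketch LANGUAGES NONE )

# Hypothetical stand-in for the GPU transform library target.
add_library( example_trans_gpu INTERFACE )

# Condition taken from the commit message: GPU-aware communication is on
# and FIAT offers no MPI_F08 support (disabled, or too old to define it).
if( HAVE_GPU_AWARE_MPI AND NOT fiat_HAVE_MPL_F08 )
  # Everything that uses this target is compiled with -DUSE_RAW_MPI.
  target_compile_definitions( example_trans_gpu INTERFACE USE_RAW_MPI )
endif()
```

Preprocessed sources could then branch on `USE_RAW_MPI` to choose between direct MPI calls and MPL calls.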

Add else statement to handle non raw MPI case

5b1cf97

Add back missing definition

7442848

Correct condition for linking against MPI::MPI_Fortran

4a764fe

Co-authored-by: Willem Deconinck <willem.deconinck@ecmwf.int>

Forget about fiat_MPL_F08 for now

499a4c1

samhatfield force-pushed the move_towards_gpu_mpl branch from 727f8f2 to 499a4c1 Compare February 24, 2026 17:49

samhatfield merged commit 7dbe7a5 into develop Feb 25, 2026
47 of 49 checks passed

samhatfield deleted the move_towards_gpu_mpl branch February 25, 2026 10:28
