Conversation

@cpersson-amd

This PR implements the following:

  • TransformerEngine flash attention for WAN training and inference.
  • A new fsdp sharding parallelism optimized for use on GPUs.
  • Some minor changes to allow for training on flax version 0.11.2.

The code has been tested on WAN 2.1 (training and inference) and Flux (training only) on GPUs.

@google-cla

google-cla bot commented Dec 16, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@cpersson-amd cpersson-amd marked this pull request as draft December 17, 2025 00:18
@cpersson-amd cpersson-amd marked this pull request as ready for review December 17, 2025 10:21
@cpersson-amd cpersson-amd reopened this Dec 17, 2025
@entrpn
Copy link
Collaborator

entrpn commented Dec 30, 2025

@cpersson-amd I've been out on PTO for a month. I'll take a closer look at this next week. Meanwhile, can you update your branch with the latest in main? Thanks.


# Parallelism
- mesh_axes: ['data', 'fsdp', 'tensor']
+ mesh_axes: ['data', 'fsdp_tpu', 'tensor']
Collaborator
why rename this to fsdp_tpu?

Author
Some of the latest changes in pyconfig.py hardcoded the "fsdp" name into the code, which necessitated a rename in all configs. fsdp_tpu was chosen for clarity; let me know if you prefer this be reverted.

Collaborator
I would prefer this to be reverted. Can you explain why these changes with the hardcoded fsdp do not apply to GPUs?

Author
Alright, I have reverted the rename. The additional sharding rules in the pyconfig file are not used by the cudnn_te_flash attention function and do not need to be added. The change was initially made to prevent a mismatch between the "fsdp_tpu" and hardcoded "fsdp" names, which would add duplicate rules for e.g. "activation_length". This change can also be reverted without issue.
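For context, a minimal hypothetical sketch of the mismatch described above, assuming logical-axis rules are stored as (logical_name, mesh_axis) pairs; the names and structure are illustrative, not the actual pyconfig.py contents:

```python
# Illustrative only: a made-up rule list, not the repository's real pyconfig.py.
logical_axis_rules = [
    ("activation_length", "fsdp_tpu"),  # rule written against the renamed mesh axis
]

# Hypothetical hardcoded append, mirroring the behaviour described in the comment:
logical_axis_rules.append(("activation_length", "fsdp"))

# "activation_length" now has two rules, and one of them points at a mesh axis name
# that does not exist in the renamed configs -- hence reverting the rename.
```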

Collaborator

@entrpn entrpn left a comment
In general the PR looks good, but I'm still unsure if adding another axis, fsdp_batch, is really necessary. I would prefer not to add it. The other major thing is switching the mesh_axes from data, fsdp, tensor to data, tensor, fsdp.


# Parallelism
- mesh_axes: ['data', 'fsdp', 'tensor']
+ mesh_axes: ['data', 'tensor', 'fsdp', 'fsdp_batch']
Collaborator
I'm worried changing the axis order will introduce a big change to our performance in general. Is there a reason for changing the order of fsdp and tensor?

Also, what does fsdp_batch do? Is it really necessary to introduce a new axis?

Author
The only reason for changing the order was to place the somewhat similar fsdp and fsdp_batch axes side by side. I have now changed the order to [data, fsdp_batch, fsdp, tensor] to align with how it is done in maxtext. I did not realize this could affect performance, my bad.

The current implementation of "fsdp" is more similar to sequence/context parallelism. The new "fsdp_batch" implementation is a more classic fsdp scheme, where the input is sharded across the batch dimension instead of the sequence dimension. The "fsdp_batch" parallelism is substantially faster on GPUs (~10.5% faster when training on PusaV1) compared to the current "fsdp" parallelism, so I think it is important to include it in this PR.
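To make the distinction concrete, here is a minimal JAX sketch, not the PR's actual code, contrasting the two schemes for a hypothetical (batch, sequence, embed) activation on a one-dimensional "fsdp" mesh axis:

```python
# Hedged illustration only: array shape, axis name, and variable names are made up.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-dimensional mesh over all local devices, named "fsdp" for this sketch.
mesh = Mesh(mesh_utils.create_device_mesh((jax.device_count(),)), ("fsdp",))

x = jnp.zeros((8, 1024, 512))  # hypothetical (batch, sequence, embed) activation

# Current "fsdp" behaviour as described above: shard the sequence dimension,
# which is closer to sequence/context parallelism.
x_seq = jax.device_put(x, NamedSharding(mesh, P(None, "fsdp", None)))

# Proposed "fsdp_batch": shard the batch dimension instead (a classic data-style split).
x_batch = jax.device_put(x, NamedSharding(mesh, P("fsdp", None, None)))
```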

For the WAN2.1 model I would suggest renaming the parallelisms to more closely reflect how the actual sharding is done, for example:
fsdp -> context (alternatively sequence)
fsdp_batch -> fsdp

What do you think?

Collaborator
Yes, I agree they should be renamed. Let me test this branch on TPU next week to make sure it doesn't affect the current performance, and then we can do the name change.
