
Conversation

@xsank commented Nov 13, 2025

The SDPA performance improvement is approximately 50%, and flash attention nearly 100%, depending on the data and the batch size.
The greater the difference in audio lengths within a batch, the better the optimization effect. With batch size = 1 there is no effect.
@kaituoxu
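The batching effect described above can be sketched with PyTorch's built-in SDPA entry point. This is a minimal, hypothetical illustration (the shapes and mask construction are illustrative, not FireRedASR's actual code): a boolean key-padding mask keeps padded frames of shorter audio from contributing, and PyTorch dispatches to an efficient backend such as flash attention when the inputs are eligible.

```python
import torch
import torch.nn.functional as F

# Toy batch: two "audio" sequences of different lengths, padded to max length.
B, H, L, D = 2, 4, 6, 8
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Key-padding mask: sequence 0 uses all 6 frames, sequence 1 only 3.
lengths = torch.tensor([6, 3])
key_valid = torch.arange(L)[None, :] < lengths[:, None]  # (B, L), True = real frame
attn_mask = key_valid[:, None, None, :]                  # broadcasts to (B, H, L, L)

# SDPA masks out padded keys; an efficient backend is chosen automatically.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

With a boolean `attn_mask`, `True` means "attend"; padded positions are excluded without any per-sequence Python loop, which is where the batched speedup comes from.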

@Xujianzhong

@xsank The test did not show any performance improvement.

is_finished_n = is_finished.sum().item()
active_mask = ~is_finished.squeeze()
#active_indices = self.filter_indexes[M][active_mask]
active_indices = torch.nonzero_static(active_mask, size=M - int(is_finished_n)).squeeze(1)

It's weird to get an error here. The environment is torch==2.4.0+cu121, A100, flash_attn-2.8.3-cp310-cp310-linux_x86_64.whl.

Traceback (most recent call last):
File "/data/user/lxp/tools/python/speech/batch_fireredaed.py", line 184, in
main(args)
File "/data/user/lxp/tools/python/speech/batch_fireredaed.py", line 144, in main
texts = model.transcribe_aed(
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/data/user/lxp/asr/FireRedASR/fireredasr/models/fireredasr.py", line 82, in transcribe_aed
hyps = self.model.transcribe(
File "/data/user/lxp/asr/FireRedASR/fireredasr/models/fireredasr_aed.py", line 33, in transcribe
nbest_hyps = self.decoder.batch_beam_search(
File "/data/user/lxp/asr/FireRedASR/fireredasr/models/module/transformer_decoder.py", line 216, in batch_beam_search
active_indices = torch.nonzero_static(active_mask, size=M - int(is_finished_n)).squeeze(1)
NotImplementedError: Could not run 'aten::nonzero_static' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::nonzero_static' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMTIA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradMeta, AutogradNestedTensor, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

Author (@xsank):

I tested it on torch 2.7.1 + Python 3.12.


Sonnet 4.5 replaced torch.nonzero_static with torch.nonzero, which solved the problem.

# Update finished state
is_finished = t_ys.eq(self.eos_id)
is_finished_n = is_finished.sum().item()
active_mask = ~is_finished.squeeze()
active_indices = torch.nonzero(active_mask, as_tuple=False).squeeze(1)

Thanks for the flash attention support, it's really fast!

Author (@xsank), replying to the comment above:

> Sonnet 4.5 replaced torch.nonzero_static with torch.nonzero, which solved the problem. Thanks for the flash attention support, it's really fast!

There is a small problem: torch.nonzero needs one more GPU-to-CPU transfer than torch.nonzero_static, which costs about 5% in performance.
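For context on that ~5% note, here is an illustrative sketch (not from the PR): `torch.nonzero` sizes its output from the data, so on CUDA it must transfer the nonzero count to the host before allocating the result, while `torch.nonzero_static` takes the size up front and skips that transfer. When the supplied size equals the true count, the two produce identical indices.

```python
import torch

mask = torch.tensor([True, False, True, False, True])
n = int(mask.sum().item())  # this .item() already forces one device-to-host sync

# torch.nonzero's output size is data-dependent, hence the extra transfer on GPU
dynamic = torch.nonzero(mask, as_tuple=False).squeeze(1)

# torch.nonzero_static takes the size up front, avoiding that transfer
# (guarded because the op only exists in recent PyTorch versions)
if hasattr(torch, "nonzero_static"):
    static = torch.nonzero_static(mask, size=n).squeeze(1)
    assert torch.equal(dynamic, static)
```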


xsank commented Nov 14, 2025

@xsank The test did not show any performance improvement.

@Xujianzhong which test? Let me take a look.

@xsank xsank changed the title Optimize beam search & add flash attention support Optimize beam search & add flash attention+xformers support Nov 18, 2025
@kaituoxu (Collaborator):

Thanks for your PR, we will review.
