

@sanchitintel sanchitintel commented Nov 21, 2025

Summary

Adds MoE GEMM implementation for MXFP4/MXFP8 (FP4/FP8 weights & E8M0 scales, with group-wise quantization) with CuTe interface.
If users don't select copy atoms for loading activations and weights or for storing output, they are chosen automatically: users can pass void for the corresponding copy atom template parameters. The automatically chosen copy atoms may not always attain the best performance, so users can also specify custom copy atoms.

Support for int4 weights with BF16/FP16 scales has also been added.

Weights are in plain format, and have not been prepacked.

Details

BMG doesn't support MXFP4/MXFP8/int4 natively, so the weights are converted to either FP16 or BF16, depending upon the activation data type.

Currently, the implementation assumes WG_K and SG_K are both equal to 32.

Performance

Largely depends upon scaledMM performance in #633

cc @CaoZhongZ @mayuyuace @pengzhao-intel

@sanchitintel sanchitintel changed the title Add MXFP4 MoE GEMM with CuTe interface MXFP4 MoE GEMM example with CuTe interface Nov 21, 2025
@sanchitintel sanchitintel changed the title MXFP4 MoE GEMM example with CuTe interface MXFP4/MXFP8/int4 MoE GEMM example with CuTe interface Nov 30, 2025
@sanchitintel sanchitintel marked this pull request as ready for review November 30, 2025 21:31
@sanchitintel sanchitintel changed the title MXFP4/MXFP8/int4 MoE GEMM example with CuTe interface MXFP4/MXFP8/int4 weights support in CuTe interface MoE GEMM example Nov 30, 2025

mayuyuace commented Dec 1, 2025

Please note that when the weight is int4, there is a default zero point of 8.
But as I tested, the reorder from u4 to bf16/fp16 does not apply that default zero point.
Does reorder offer a zero-point interface? Otherwise, the speed will be significantly slower.

@mayuyuace

Another question: the mxfp4 scale data type is ue8m0, while its storage data type is uint8.
Did you add the cast when dequantizing in the kernel?


sanchitintel commented Dec 1, 2025

> Please note that when weight=int4, there is a default zero point which is 8

That would depend upon the quantization type.

> Does reorder offer a zero-points interface?

No.

> Did you add the casting when dequantize in kernel?

I reinterpret-casted only because IGC loads more data than necessary (discarding the rest) when I used ue8m0, so I reinterpret-cast the scales to int8 for the loads.
