MXFP4/MXFP8/int4 weights support in CuTe interface MoE GEMM example #640
Summary
Adds a MoE GEMM implementation for MXFP4/MXFP8 (FP4/FP8 weights with E8M0 scales and group-wise quantization) using the CuTe interface.
If users don't select copy atoms for loading activations and weights or for storing output, they are chosen automatically: users can pass void as the corresponding copy atom template parameters. However, the automatically chosen copy atoms may not always attain the best performance, so users can also specify custom copy atoms.
Support for int4 weights with BF16/FP16 scales has also been added.
Weights are stored in plain format and are not prepacked.
Details
BMG doesn't support MXFP4/MXFP8/int4 natively, so weights are converted to either FP16 or BF16, depending on the activation type.
Currently, the implementation assumes WG_K and SG_K are both equal to 32.
Performance
Largely depends on scaledMM performance in #633.
cc @CaoZhongZ @mayuyuace @pengzhao-intel