Conversation

sanchitintel commented Nov 17, 2025

Summary

  • Based on code in cute/tutorial/xe_gemm.hpp.
  • GEMMs with micro-scaling format (MXFP8 & MXFP4) weights & scales.
  • Uses groupwise scaling along K.
  • E8M0 scales sized (K/32, N), as in gpt-oss, are used by default (a reference dequantization sketch follows this list).
  • The OCP MX format specifies a group size of 32 along K for groupwise quantization of B, but larger group sizes are supported to demonstrate creating scaledMMs with custom B groupwise quantization schemes.
  • Activations are BF16/FP16. BMG/PVC don't support MXFP4 or MXFP8 natively, so the MMA compute happens in BF16/FP16 (with FP32 accumulation); weights are converted to BF16 or FP16 to match the activation dtype.
  • Weights are in plain format, and have NOT been prepacked.
  • For converting & shuffling scales, @petercad added an efficient reorder.
  • For int4 weights, BF16/FP16 scales are also supported (which is not the MXFP4 format), but users should select a tiling scheme that wouldn't cause register spills.
  • Only SG_K = WG_K = 32 is currently supported. SG_N must be either 16 or 32.
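
As a reference for the scheme above, here is a minimal host-side sketch of the dequantization the kernel performs on the fly. It is illustrative only; the helper names are mine, and the group size of 32 follows the OCP MX spec rather than the kernel's internal layout.

#include <cmath>
#include <cstdint>

// E2M1 (MXFP4): 1 sign bit, 2 exponent bits, 1 mantissa bit -> 16 code points.
inline float decode_e2m1(uint8_t nibble) {
  static const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
  float mag = lut[nibble & 0x7];
  return (nibble & 0x8) ? -mag : mag;
}

// E8M0 scale: an unsigned 8-bit exponent with bias 127, i.e. 2^(e - 127).
inline float decode_e8m0(uint8_t e) { return std::ldexp(1.0f, int(e) - 127); }

// Dequantized B element: decoded 4-bit weight times the scale of its K-group of 32, i.e.
// B_dequant[k][n] = decode_e2m1(B_q[k][n]) * decode_e8m0(S[k / 32][n])
inline float dequant_b(uint8_t b_nibble, uint8_t scale_e8m0) {
  return decode_e2m1(b_nibble) * decode_e8m0(scale_e8m0);
}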

Caution

When the number of output workgroup tiles isn't a multiple of the number of Xe cores, i.e. when the computation has a tail, scaledMM perf looks worse than when it is a multiple, because the hardware remains underutilized while the output tiles corresponding to the tail are processed. In practice, if these scaledMM mainloops were ported to a Grouped GEMM, the tail-latency issue would dissipate: the tail of individual GEMM problems wouldn't matter, only the tail across all the output tiles would. A small worked example follows.
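
A minimal sketch of that tail arithmetic, assuming a hypothetical 256x256 workgroup tile and the 20 Xe cores of a B580 (both values are assumptions chosen for illustration, not necessarily what this example uses):

#include <cstdio>

int main() {
  const int num_xe_cores = 20;       // assumption: B580 Xe core count
  const int wg_m = 256, wg_n = 256;  // hypothetical workgroup tile shape
  const int M = 1024, N = 2880;      // one of the benchmarked shapes below

  const int tiles = ((M + wg_m - 1) / wg_m) * ((N + wg_n - 1) / wg_n);  // 4 * 12 = 48
  // 48 tiles -> two full waves of 20 tiles plus a tail wave of 8 tiles,
  // during which 12 Xe cores sit idle.
  std::printf("tiles=%d, full waves=%d, tail tiles=%d\n",
              tiles, tiles / num_xe_cores, tiles % num_xe_cores);
  return 0;
}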

Important

Please don't be dissuaded by the N=2880/5760, K=2880 performance below if you intend to use this scaledMM mainloop in a Grouped GEMM; as described above, the tail latency of individual GEMM problems wouldn't matter there.

Details

Known bottlenecks for BF16:

  1. Weights have not been prepacked; the perf hit is roughly 3%-4%. The prepacked weight format XeTLA uses is built around the idea of having the weights already in VNNI-16 format once converted to BF16/FP16, but the thread-value layout in sycl-tla is different, so weights of that format are not already in VNNI-16 once converted to FP16/BF16, and a reorder from that format into VNNI-16 is currently unsupported in sycl-tla.
  2. The BMG hardware natively supports fused elementwise BF16 x FP32 -> FP32 multiplication, i.e. the static_cast of BF16 to FP32 need not be done first. However, IGC is unable to identify this pattern, so the cast is materialized and both inputs to the mul are float. This degrades performance when applying scales with a BF16 activation (because we apply scales to the converted BF16 weights): instead of fusing the BF16 -> FP32 conversion into the mul, IGC emits a separate shl, causing a slowdown.
  3. This code does not use an SLM pipeline, as it was slow due to inefficient SLM <-> register transfers. However, XeTLA GEMMs do not use an SLM pipeline either.

Issue 2 was identified by Peter and is being tracked for IGC. He provided corresponding asm code as a workaround, which I've since added.
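
For reference, a scalar sketch of the conversion involved in bottleneck 2 (not the kernel's actual code path): BF16 is the upper 16 bits of an FP32 value, so the standalone conversion is a 16-bit left shift, which is the shl IGC currently emits instead of fusing the widening into the mul.

#include <cstdint>
#include <cstring>

// Standalone bf16 -> f32 conversion: place the bf16 payload in the high half.
inline float bf16_bits_to_f32(uint16_t bf16_bits) {
  uint32_t u = uint32_t(bf16_bits) << 16;  // the `shl` mentioned above
  float f;
  std::memcpy(&f, &u, sizeof(f));          // bit-cast; no rounding is involved
  return f;
}

// Applying a scale to a converted bf16 weight: today this compiles to the shift
// above plus an f32 x f32 mul, rather than a single fused bf16 x f32 mul.
inline float scale_bf16_weight(uint16_t w_bf16_bits, float scale) {
  return bf16_bits_to_f32(w_bf16_bits) * scale;
}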

Performance on BMG B580

B quantization group size of 32 (as in the OCP-defined MX formats)

| Activation dtype | Weights dtype | Output dtype | B layout (R = RowMajor, C = ColumnMajor) | M | N | K | Throughput |
|---|---|---|---|---|---|---|---|
| bf16 | MXFP4 e2m1 | bf16 | R | 512 | 5120 | 8192 | 95.610 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | C | 512 | 5120 | 8192 | 94.456 TF/s |
| half | MXFP4 e2m1 | half | R | 512 | 5120 | 8192 | 94.322 TF/s |
| half | MXFP4 e2m1 | half | C | 512 | 5120 | 8192 | 98.667 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | R | 512 | 5120 | 8192 | 84.259 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | C | 512 | 5120 | 8192 | 86.077 TF/s |
| half | MXFP8 e4m3 | half | R | 512 | 5120 | 8192 | 83.488 TF/s |
| half | MXFP8 e4m3 | half | C | 512 | 5120 | 8192 | 92.294 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | R | 512 | 5120 | 8192 | 92.034 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | C | 512 | 5120 | 8192 | 93.410 TF/s |
| half | MXFP8 e5m2 | half | R | 512 | 5120 | 8192 | 105.179 TF/s |
| half | MXFP8 e5m2 | half | C | 512 | 5120 | 8192 | 101.152 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | R | 1024 | 5760 | 2880 | 83.970 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | C | 1024 | 5760 | 2880 | 82.836 TF/s |
| half | MXFP4 e2m1 | half | R | 1024 | 5760 | 2880 | 83.800 TF/s |
| half | MXFP4 e2m1 | half | C | 1024 | 5760 | 2880 | 88.413 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | R | 1024 | 5760 | 2880 | 73.095 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | C | 1024 | 5760 | 2880 | 74.334 TF/s |
| half | MXFP8 e4m3 | half | R | 1024 | 5760 | 2880 | 74.184 TF/s |
| half | MXFP8 e4m3 | half | C | 1024 | 5760 | 2880 | 81.370 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | R | 1024 | 5760 | 2880 | 79.910 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | C | 1024 | 5760 | 2880 | 80.088 TF/s |
| half | MXFP8 e5m2 | half | R | 1024 | 5760 | 2880 | 92.397 TF/s |
| half | MXFP8 e5m2 | half | C | 1024 | 5760 | 2880 | 90.915 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | R | 1024 | 2880 | 2880 | 72.697 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | C | 1024 | 2880 | 2880 | 72.057 TF/s |
| half | MXFP4 e2m1 | half | R | 1024 | 2880 | 2880 | 70.301 TF/s |
| half | MXFP4 e2m1 | half | C | 1024 | 2880 | 2880 | 74.313 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | R | 1024 | 2880 | 2880 | 63.283 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | C | 1024 | 2880 | 2880 | 64.257 TF/s |
| half | MXFP8 e4m3 | half | R | 1024 | 2880 | 2880 | 62.185 TF/s |
| half | MXFP8 e4m3 | half | C | 1024 | 2880 | 2880 | 68.923 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | R | 1024 | 2880 | 2880 | 69.001 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | C | 1024 | 2880 | 2880 | 70.065 TF/s |
| half | MXFP8 e5m2 | half | R | 1024 | 2880 | 2880 | 76.565 TF/s |
| half | MXFP8 e5m2 | half | C | 1024 | 2880 | 2880 | 75.611 TF/s |

Further tuning is possible on a case-by-case basis; e.g., bf16 x fp8_e4m3 can be tuned further for ColumnMajor B, if required.
In DL models, weights are usually in ColumnMajor format.

Benchmarking instructions

# Set GPUID to the ID of the GPU whose min frequency you wish to raise to its max frequency
export GPUID=0

sudo sh -c "cat /sys/class/drm/card$GPUID/device/tile0/gt0/freq0/max_freq > /sys/class/drm/card$GPUID/device/tile0/gt0/freq0/min_freq"
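
# Optionally verify the change took effect (reads back the same sysfs node written above)
cat /sys/class/drm/card$GPUID/device/tile0/gt0/freq0/min_freq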

Build instructions

Please do not use -DDPCPP_HOST_COMPILER=g++-13 for now; I'll later revise the code to make it compatible with g++ (the issue is related to the SYCL kernel launch).

source /opt/intel/oneapi/setvars.sh
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
export CMAKE_BUILD_TYPE=Release
export IGC_VISAOptions="-perfmodel"
export IGC_VectorAliasBBThreshold=100000000000
export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
export CC=icx
export CXX=icpx
mkdir build; cd build
cmake .. -GNinja -DCUTLASS_ENABLE_EXAMPLES=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCUTLASS_ENABLE_SYCL=ON -DCUTLASS_SYCL_PROFILING_ENABLED=ON -DCUTLASS_ENABLE_BENCHMARKS=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-ftemplate-backtrace-limit=0 -fdiagnostics-color=always" -DDPCPP_SYCL_TARGET=intel_gpu_bmg_g21
ninja examples/cute/tutorial/cute_tutorial_xe_gemm_microscaling
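
If the build succeeds, the example can be run directly from the build directory; the path below is an assumption based on the ninja target name above.

# path assumed from the target name; run from the build directory
./examples/cute/tutorial/cute_tutorial_xe_gemm_microscaling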

cc @pengzhao-intel @EikanWang @CaoZhongZ

sanchitintel force-pushed the gemm_microscaling_weights branch from a55d296 to e53e636 on November 19, 2025 00:39
sanchitintel changed the title from "Add GEMMs with microscaling format weights" to "Xe2 GEMMs with microscaling format weights as well as int4 weights with FP16/BF16 scales" on Nov 30, 2025
sanchitintel force-pushed the gemm_microscaling_weights branch from e53e636 to b0e9006 on November 30, 2025 08:04
sanchitintel marked this pull request as ready for review on November 30, 2025 08:28
sanchitintel force-pushed the gemm_microscaling_weights branch from 336b849 to 15ead41 on November 30, 2025 08:38
sanchitintel (Author) commented Nov 30, 2025

The CI failures are unrelated, and have been affecting all recent PRs. Thanks!

pengzhao-intel commented

Thanks for the PR. How do you calculate the throughput? What's the peak of 4-bit computation on the B580?

sanchitintel (Author) commented Nov 30, 2025

Hi @pengzhao-intel,

How do you calculate the throughput?

Only the compute corresponding to the MMA is considered when calculating the throughput, so the actual throughput measured with a profiler would be higher, since we don't currently account for the conversion of 4-bit weights, the scaling, or the FP32 -> BF16 conversion (the last is just a mov instruction, so it doesn't matter much, but the former two have substantial overhead).

Throughput is computed as (2 * M * N * K) / latency.

Another caveat is that this PR uses USM tensors (for simplicity) instead of device-only tensors, so the throughput with device-only tensors would be higher, as we also observed for the GEMMs in examples/cute/tutorial/xe_gemm.cpp.
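
For concreteness, a minimal sketch of that computation (the latency value below is hypothetical, chosen only for illustration):

#include <cstdio>

int main() {
  const double M = 512, N = 5120, K = 8192;  // one of the benchmarked shapes
  const double latency_s = 450e-6;           // hypothetical measured kernel latency
  const double tf_per_s = (2.0 * M * N * K) / latency_s / 1e12;
  // Only MMA FLOPs are counted; weight conversion and scaling are not.
  std::printf("%.3f TF/s\n", tf_per_s);      // ~95.4 TF/s for these numbers
  return 0;
}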

What's the peak of 4-bit computation on the B580?

Please go through the description of the PR - BMG doesn't support 4-bit MMA natively, so 4-bit weights are converted to FP16 or BF16, depending upon the activation. For BF16/FP16, the peak throughput of the B580 is ~117 TFLOP/s.

@pengzhao-intel, given the bottlenecks specific to BF16, please advise whether we can compare against XeTLA performance with FP16 activation instead. Thanks!

petercad commented Dec 3, 2025

@sanchitintel -- on the use of shl for bf16->f32 conversion: this comes from sycl::bfloat16_t::operator float() -- it is fast and perfectly good for standalone conversion. The only problem is that IGC doesn't recognize it as a bf16->f32 conversion and therefore isn't able to fuse that conversion into the mul.

The hardware supports fusing BF16/FP16 to FP32 conversion for one of the inputs to mul. Based on some reference code provided by Peter Caday.
sanchitintel (Author) commented

Thanks for your input, @petercad! I revised the description and also added the scaling code in assembly that you provided.

I also added some clarifying details on observed performance.

sanchitintel changed the title from "Xe2 GEMMs with microscaling format weights as well as int4 weights with FP16/BF16 scales" to "Xe2 GEMMs with microscaling format weights" on Dec 4, 2025
sanchitintel changed the title from "Xe2 GEMMs with microscaling format weights" to "Xe2 scaledMMs with MX format weights" on Dec 4, 2025