Conversation

sanchitintel commented Nov 17, 2025

Summary

  • Based on code in cute/tutorial/xe_gemm.hpp.
  • GEMMs with micro-scaling format (MXFP8 & MXFP4) weights & scales.
  • Uses groupwise scaling along K.
  • E8M0 scales sized (K/32, N), as in gpt-oss, are used by default (a reference dequantization sketch follows this list).
  • The OCP MX format specifies a group size of 32 along K for groupwise quantization of B, but larger group sizes are supported to demonstrate creating scaledMMs with custom B groupwise quantization schemes.
  • Activations are BF16/FP16. BMG/PVC don't support MXFP4 or MXFP8 natively, so the MMA compute happens in BF16/FP16 (with FP32 accumulation); weights are converted to BF16 or FP16 to match the activation dtype.
  • Weights are in plain format, and have NOT been prepacked.
  • For converting & shuffling scales, @petercad added an efficient reorder.
  • For int4 weights, BF16/FP16 scales are also supported (which is not the MXFP4 format), but users should select a tiling scheme that wouldn't cause register spills.
  • Only SG_K = WG_K = 32 is currently supported. SG_N must be either 16 or 32.
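
As a reference for the scheme above, here is a minimal host-side sketch of the dequantization the kernel performs on the fly. It is illustrative only; the helper names are mine, and the group size of 32 follows the OCP MX spec rather than the kernel's internal layout.

#include <cmath>
#include <cstdint>

// E2M1 (MXFP4): 1 sign bit, 2 exponent bits, 1 mantissa bit -> 16 code points.
inline float decode_e2m1(uint8_t nibble) {
  static const float lut[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
  float mag = lut[nibble & 0x7];
  return (nibble & 0x8) ? -mag : mag;
}

// E8M0 scale: an unsigned 8-bit exponent with bias 127, i.e. 2^(e - 127).
inline float decode_e8m0(uint8_t e) { return std::ldexp(1.0f, int(e) - 127); }

// Dequantized B element: decoded 4-bit weight times the scale of its K-group of 32, i.e.
// B_dequant[k][n] = decode_e2m1(B_q[k][n]) * decode_e8m0(S[k / 32][n])
inline float dequant_b(uint8_t b_nibble, uint8_t scale_e8m0) {
  return decode_e2m1(b_nibble) * decode_e8m0(scale_e8m0);
}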

Caution

When the number of output workgroup tiles isn't a multiple of the number of Xe cores, i.e. when the computation has a tail, scaledMM perf looks worse than when it is a multiple, because the hardware remains underutilized while the output tiles corresponding to the tail are processed. In practice, if these scaledMM mainloops were ported to a Grouped GEMM, the tail-latency issue would dissipate: the tail of individual GEMM problems wouldn't matter, only the tail across all the output tiles would. A small worked example follows.
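
A minimal sketch of that tail arithmetic, assuming a hypothetical 256x256 workgroup tile and the 20 Xe cores of a B580 (both values are assumptions chosen for illustration, not necessarily what this example uses):

#include <cstdio>

int main() {
  const int num_xe_cores = 20;       // assumption: B580 Xe core count
  const int wg_m = 256, wg_n = 256;  // hypothetical workgroup tile shape
  const int M = 1024, N = 2880;      // one of the benchmarked shapes below

  const int tiles = ((M + wg_m - 1) / wg_m) * ((N + wg_n - 1) / wg_n);  // 4 * 12 = 48
  // 48 tiles -> two full waves of 20 tiles plus a tail wave of 8 tiles,
  // during which 12 Xe cores sit idle.
  std::printf("tiles=%d, full waves=%d, tail tiles=%d\n",
              tiles, tiles / num_xe_cores, tiles % num_xe_cores);
  return 0;
}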

Important

Please don't be dissuaded by the N=2880/5760, K=2880 performance below if you intend to use this scaledMM mainloop in a Grouped GEMM; as described above, the tail latency of individual GEMM problems wouldn't matter there.

Details

Known bottlenecks for BF16:

  1. Weights have not been prepacked; the perf hit is roughly 3%-4%. The prepacked weight format XeTLA uses is built around the idea of having the weights already in VNNI-16 format once converted to BF16/FP16, but the thread-value layout in sycl-tla is different, so weights of that format are not already in VNNI-16 once converted to FP16/BF16, and a reorder from that format into VNNI-16 is currently unsupported in sycl-tla.
  2. The BMG hardware natively supports fused elementwise BF16 x FP32 -> FP32 multiplication, i.e. the static_cast of BF16 to FP32 need not be done first. However, IGC is unable to identify this pattern, so the cast is materialized and both inputs to the mul are float. This degrades performance when applying scales with a BF16 activation (because we apply scales to the converted BF16 weights): instead of fusing the BF16 -> FP32 conversion into the mul, IGC emits a separate shl, causing a slowdown.
  3. This code does not use an SLM pipeline, as it was slow due to inefficient SLM <-> register transfers. However, XeTLA GEMMs do not use an SLM pipeline either.

Issue 2 was identified by Peter and is being tracked for IGC. He provided corresponding asm code as a workaround, which I've since added.
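
For reference, a scalar sketch of the conversion involved in bottleneck 2 (not the kernel's actual code path): BF16 is the upper 16 bits of an FP32 value, so the standalone conversion is a 16-bit left shift, which is the shl IGC currently emits instead of fusing the widening into the mul.

#include <cstdint>
#include <cstring>

// Standalone bf16 -> f32 conversion: place the bf16 payload in the high half.
inline float bf16_bits_to_f32(uint16_t bf16_bits) {
  uint32_t u = uint32_t(bf16_bits) << 16;  // the `shl` mentioned above
  float f;
  std::memcpy(&f, &u, sizeof(f));          // bit-cast; no rounding is involved
  return f;
}

// Applying a scale to a converted bf16 weight: today this compiles to the shift
// above plus an f32 x f32 mul, rather than a single fused bf16 x f32 mul.
inline float scale_bf16_weight(uint16_t w_bf16_bits, float scale) {
  return bf16_bits_to_f32(w_bf16_bits) * scale;
}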

Performance on BMG B580

B quantization group size of 32 (as in the OCP-defined MX formats)

| Activation dtype | Weights dtype | Output dtype | B layout (R = RowMajor, C = ColumnMajor) | M | N | K | Throughput |
|---|---|---|---|---|---|---|---|
| bf16 | MXFP4 e2m1 | bf16 | R | 512 | 5120 | 8192 | 95.610 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | C | 512 | 5120 | 8192 | 94.456 TF/s |
| half | MXFP4 e2m1 | half | R | 512 | 5120 | 8192 | 94.322 TF/s |
| half | MXFP4 e2m1 | half | C | 512 | 5120 | 8192 | 98.667 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | R | 512 | 5120 | 8192 | 84.259 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | C | 512 | 5120 | 8192 | 86.077 TF/s |
| half | MXFP8 e4m3 | half | R | 512 | 5120 | 8192 | 83.488 TF/s |
| half | MXFP8 e4m3 | half | C | 512 | 5120 | 8192 | 92.294 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | R | 512 | 5120 | 8192 | 92.034 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | C | 512 | 5120 | 8192 | 93.410 TF/s |
| half | MXFP8 e5m2 | half | R | 512 | 5120 | 8192 | 105.179 TF/s |
| half | MXFP8 e5m2 | half | C | 512 | 5120 | 8192 | 101.152 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | R | 1024 | 5760 | 2880 | 83.970 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | C | 1024 | 5760 | 2880 | 82.836 TF/s |
| half | MXFP4 e2m1 | half | R | 1024 | 5760 | 2880 | 83.800 TF/s |
| half | MXFP4 e2m1 | half | C | 1024 | 5760 | 2880 | 88.413 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | R | 1024 | 5760 | 2880 | 73.095 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | C | 1024 | 5760 | 2880 | 74.334 TF/s |
| half | MXFP8 e4m3 | half | R | 1024 | 5760 | 2880 | 74.184 TF/s |
| half | MXFP8 e4m3 | half | C | 1024 | 5760 | 2880 | 81.370 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | R | 1024 | 5760 | 2880 | 79.910 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | C | 1024 | 5760 | 2880 | 80.088 TF/s |
| half | MXFP8 e5m2 | half | R | 1024 | 5760 | 2880 | 92.397 TF/s |
| half | MXFP8 e5m2 | half | C | 1024 | 5760 | 2880 | 90.915 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | R | 1024 | 2880 | 2880 | 72.697 TF/s |
| bf16 | MXFP4 e2m1 | bf16 | C | 1024 | 2880 | 2880 | 72.057 TF/s |
| half | MXFP4 e2m1 | half | R | 1024 | 2880 | 2880 | 70.301 TF/s |
| half | MXFP4 e2m1 | half | C | 1024 | 2880 | 2880 | 74.313 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | R | 1024 | 2880 | 2880 | 63.283 TF/s |
| bf16 | MXFP8 e4m3 | bf16 | C | 1024 | 2880 | 2880 | 64.257 TF/s |
| half | MXFP8 e4m3 | half | R | 1024 | 2880 | 2880 | 62.185 TF/s |
| half | MXFP8 e4m3 | half | C | 1024 | 2880 | 2880 | 68.923 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | R | 1024 | 2880 | 2880 | 69.001 TF/s |
| bf16 | MXFP8 e5m2 | bf16 | C | 1024 | 2880 | 2880 | 70.065 TF/s |
| half | MXFP8 e5m2 | half | R | 1024 | 2880 | 2880 | 76.565 TF/s |
| half | MXFP8 e5m2 | half | C | 1024 | 2880 | 2880 | 75.611 TF/s |

Further tuning is possible on a case-by-case basis; e.g., bf16 x fp8_e4m3 can be tuned further for ColumnMajor B, if required.
In DL models, weights are usually in ColumnMajor format.

Benchmarking instructions

# Set GPUID to the ID of the GPU whose min frequency you wish to raise to its max frequency
export GPUID=0

sudo sh -c "cat /sys/class/drm/card$GPUID/device/tile0/gt0/freq0/max_freq > /sys/class/drm/card$GPUID/device/tile0/gt0/freq0/min_freq"
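
# Optionally verify the change took effect (reads back the same sysfs node written above)
cat /sys/class/drm/card$GPUID/device/tile0/gt0/freq0/min_freq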

Build instructions

Please do not use -DDPCPP_HOST_COMPILER=g++-13 for now; I'll later revise the code to make it compatible with g++ (the issue is related to the SYCL kernel launch).

source /opt/intel/oneapi/setvars.sh
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
export CMAKE_BUILD_TYPE=Release
export IGC_VISAOptions="-perfmodel"
export IGC_VectorAliasBBThreshold=100000000000
export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
export CC=icx
export CXX=icpx
mkdir build; cd build
cmake .. -GNinja -DCUTLASS_ENABLE_EXAMPLES=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCUTLASS_ENABLE_SYCL=ON -DCUTLASS_SYCL_PROFILING_ENABLED=ON -DCUTLASS_ENABLE_BENCHMARKS=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-ftemplate-backtrace-limit=0 -fdiagnostics-color=always" -DDPCPP_SYCL_TARGET=intel_gpu_bmg_g21
ninja examples/cute/tutorial/cute_tutorial_xe_gemm_microscaling
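
If the build succeeds, the example can be run directly from the build directory; the path below is an assumption based on the ninja target name above.

# path assumed from the target name; run from the build directory
./examples/cute/tutorial/cute_tutorial_xe_gemm_microscaling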

cc @pengzhao-intel @EikanWang @CaoZhongZ

sanchitintel force-pushed the gemm_microscaling_weights branch from a55d296 to e53e636 on November 19, 2025 00:39
sanchitintel changed the title from "Add GEMMs with microscaling format weights" to "Xe2 GEMMs with microscaling format weights as well as int4 weights with FP16/BF16 scales" on Nov 30, 2025
sanchitintel force-pushed the gemm_microscaling_weights branch from e53e636 to b0e9006 on November 30, 2025 08:04
sanchitintel marked this pull request as ready for review on November 30, 2025 08:28
sanchitintel force-pushed the gemm_microscaling_weights branch from 336b849 to 15ead41 on November 30, 2025 08:38
sanchitintel (Author) commented Nov 30, 2025

The CI failures are unrelated, and have been affecting all recent PRs. Thanks!

pengzhao-intel commented

Thanks for the PR. How do you calculate the throughput? What's the peak of 4-bit computation on the B580?

sanchitintel (Author) commented Nov 30, 2025

Hi @pengzhao-intel,

How do you calculate the throughput?

Only the compute corresponding to the MMA is considered when calculating the throughput, so the actual throughput measured with a profiler would be higher, since we don't currently account for the conversion of 4-bit weights, the scaling, or the FP32 -> BF16 conversion (the last is just a mov instruction, so it doesn't matter much, but the former two have substantial overhead).

Throughput is computed as (2 * M * N * K) / latency.

Another caveat is that this PR uses USM tensors (for simplicity) instead of device-only tensors, so the throughput with device-only tensors would be higher, as we also observed for the GEMMs in examples/cute/tutorial/xe_gemm.cpp.
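
For concreteness, a minimal sketch of that computation (the latency value below is hypothetical, chosen only for illustration):

#include <cstdio>

int main() {
  const double M = 512, N = 5120, K = 8192;  // one of the benchmarked shapes
  const double latency_s = 450e-6;           // hypothetical measured kernel latency
  const double tf_per_s = (2.0 * M * N * K) / latency_s / 1e12;
  // Only MMA FLOPs are counted; weight conversion and scaling are not.
  std::printf("%.3f TF/s\n", tf_per_s);      // ~95.4 TF/s for these numbers
  return 0;
}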

What's the peak of 4-bit computation on the B580?

Please go through the description of the PR - BMG doesn't support 4-bit MMA natively, so 4-bit weights are converted to FP16 or BF16, depending upon the activation. For BF16/FP16, the peak throughput of the B580 is ~117 TFLOP/s.

@pengzhao-intel, given the bottlenecks specific to BF16, please advise whether we can compare against XeTLA performance with FP16 activation instead. Thanks!

petercad commented Dec 3, 2025

@sanchitintel -- on the use of shl for bf16->f32 conversion: this comes from sycl::bfloat16_t::operator float() -- it is fast and perfectly good for standalone conversion. The only problem is that IGC doesn't recognize it as a bf16->f32 conversion and therefore isn't able to fuse that conversion into the mul.

The hardware supports fusing BF16/FP16 to FP32 conversion for one of the inputs to mul. Based on some reference code provided by Peter Caday.
sanchitintel (Author) commented

Thanks for your input, @petercad! I revised the description and also added the scaling code in assembly that you provided.

I also added some clarifying details on observed performance.

sanchitintel changed the title from "Xe2 GEMMs with microscaling format weights as well as int4 weights with FP16/BF16 scales" to "Xe2 GEMMs with microscaling format weights" on Dec 4, 2025
sanchitintel changed the title from "Xe2 GEMMs with microscaling format weights" to "Xe2 scaledMMs with MX format weights" on Dec 4, 2025