am17an commented Dec 10, 2025

Currently WIP: this adds native FP4 support for Blackwell and beyond. To compile, -DCMAKE_CUDA_ARCHITECTURES="120a" is required.

Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4, and int4) which advertises 2x the throughput of the int8 tensor cores. However, at the moment this PR is only ~25% faster than master on PP, well short of that. The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; PPL tests are okay with this change, though correctness issues are not ruled out.
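
For context, MXFP4 stores each block of 32 values as 4-bit E2M1 codes plus one shared power-of-two (E8M0) scale. Below is a minimal host-side sketch of that quantization step, illustrative only: the function names, packing order, and scale choice are assumptions, not the PR's CUDA path or ggml's exact block_mxfp4 layout.

```cpp
// Illustrative host-side MXFP4 quantization of one 32-value block: a shared
// E8M0 (power-of-two) scale plus 32 E2M1 codes packed two per byte.
// NOT the PR's device code; packing order and scale choice are assumptions.
#include <math.h>
#include <stdint.h>

static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// 4-bit code (sign in bit 3) of the E2M1 value nearest to x.
static uint8_t nearest_e2m1(float x) {
    const uint8_t sign = x < 0.0f ? 0x8 : 0x0;
    const float   ax   = fabsf(x);
    int   best     = 0;
    float best_err = fabsf(ax - kE2M1[0]);
    for (int i = 1; i < 8; ++i) {
        const float err = fabsf(ax - kE2M1[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return sign | (uint8_t) best;
}

// x: 32 floats in; qs: 16 packed bytes out; e: biased E8M0 exponent out.
static void quantize_block_mxfp4(const float * x, uint8_t * qs, uint8_t * e) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = fmaxf(amax, fabsf(x[i]));
    int exp = 0;
    if (amax > 0.0f) frexpf(amax / 6.0f, &exp); // amax/6 = m * 2^exp, 0.5 <= m < 1
    const float scale = ldexpf(1.0f, exp);      // power-of-two scale, amax/scale <= 6
    *e = (uint8_t) (exp + 127);                 // E8M0 stores only a biased exponent
    for (int i = 0; i < 16; ++i) {
        qs[i] = (uint8_t) (nearest_e2m1(x[2*i + 1] / scale) << 4
                         | nearest_e2m1(x[2*i + 0] / scale));
    }
}
```

Dequantization is just the E2M1 value times 2^(e - 127); encoding activations at 4 bits this way is much coarser than q8_0, which is consistent with the NMSE difference called out in the TODO below.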

TODO:

  • Figure out why we don't see better results
  • Address the NMSE error between q8_0 and mxfp4
| Model | Test | t/s c6f6e4f | t/s mxfp4 | Speedup |
| --- | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | pp512 | 10560.64 | 13304.80 | 1.26 |
| gpt-oss 20B MXFP4 MoE | pp1024 | 10659.15 | 13515.51 | 1.27 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 10801.35 | 13715.10 | 1.27 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 10854.04 | 13806.59 | 1.27 |
| gpt-oss 20B MXFP4 MoE | pp8192 | 10688.23 | 13525.14 | 1.27 |
| gpt-oss 20B MXFP4 MoE | pp16384 | 10140.17 | 11587.72 | 1.14 |

Note: This PR was developed on @JohannesGaessler's server with a 5090 provided by NVIDIA. So thanks to them both!

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 10, 2025
am17an marked this pull request as draft December 10, 2025 10:59
easyfab commented Dec 10, 2025

Nice speedup!

Master:

Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp512 | 5614.78 ± 40.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp2048 | 4729.89 ± 10.28 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | tg128 | 204.28 ± 0.53 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 6460.61 ± 65.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 6624.29 ± 24.83 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 221.47 ± 0.25 |

PR:

Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp512 | 6473.65 ± 37.97 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | pp2048 | 5346.78 ± 4.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 0 | tg128 | 205.29 ± 0.30 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp512 | 7754.67 ± 53.15 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | pp2048 | 7917.86 ± 20.30 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 221.23 ± 0.21 |

Comment on lines 740 to 744
if (sign > 0.0f) {
    return static_cast<uint8_t>(best_i); // 0..7
} else {
    return static_cast<uint8_t>(best_i | 0x8); // 8..15
}
Collaborator

I think it would be slightly more optimal to extract the sign bit from x, do a bit shift, and a bitwise AND.

More generally, there are FP4 conversion intrinsics in the CUDA math API but I'm not sure whether they would be of use.

Comment on lines 824 to 827
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 0] = compress(aux_q4[1]) << 16 | compress(aux_q4[0]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 1] = compress(aux_q4[3]) << 16 | compress(aux_q4[2]);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 2] = compress(aux_q4[1] >> 4) << 16 | compress(aux_q4[0] >> 4);
x_qs[i * MMQ_MMA_TILE_X_K_FP4 + k0 + 3] = compress(aux_q4[3] >> 4) << 16 | compress(aux_q4[2] >> 4);
Collaborator

At this point in the code you should be suffering from a 4-way shared memory bank conflict.
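
For readers unfamiliar with the term, a generic illustration of why such a conflict happens (this is not the PR's actual indexing): shared memory is split into 32 banks of 4-byte words, so stores from one warp that hit word indices congruent modulo 32 in the same instruction are serialized.

```cpp
// Generic illustration only, not the PR's indexing: the bank of a 32-bit
// shared-memory word is its word index mod 32. If 4 lanes of a warp store to
// word indices that differ by a multiple of 32 in one instruction, those
// stores serialize -> a 4-way bank conflict.
__device__ __forceinline__ int smem_bank_of(const int word_index) {
    return word_index & 31;
}
```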

return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;
Collaborator

I don't know if the compiler is smart enough to do this optimization, but I meant to transplant the sign bit directly without the use of conditional statements at all. So reinterpret the float's bits as an unsigned integer, shift 28 bits to the right, and apply & 0x8.
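
A minimal sketch of that branchless version, assuming CUDA's __float_as_uint bit-reinterpret intrinsic (the function name here is illustrative, not from the PR):

```cpp
#include <cstdint>

// Branchless sign transplant as described above (illustrative sketch).
// The IEEE-754 sign is bit 31; shifting right by 28 moves it into bit 3 (0x8).
__device__ __forceinline__ uint8_t fp4_sign_bit(const float x) {
    return (uint8_t) ((__float_as_uint(x) >> 28) & 0x8);
}
```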

}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)
Collaborator

The resulting value is correct, but I don't think you should calculate it like this since it's confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
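
For illustration, the suggested spelling might look like this (assuming the + 4 padding term follows the same convention as the Q8_0 define above):

```cpp
// Same value as before, but written directly instead of being derived from
// MMQ_TILE_NE_K / QI8_0 (illustrative; the exact padding constant is assumed).
#define MMQ_MMA_TILE_X_K_FP4 (MMQ_TILE_NE_K + 4)
```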

Comment on lines +127 to +150
const uint8_t q_lo_0 = __shfl_sync(0xFFFFFFFF, q_val, base, WARP_SIZE);
const uint8_t q_lo_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 1, WARP_SIZE);
const uint8_t q_hi_0 = __shfl_sync(0xFFFFFFFF, q_val, base + 16, WARP_SIZE);
const uint8_t q_hi_1 = __shfl_sync(0xFFFFFFFF, q_val, base + 17, WARP_SIZE);
Collaborator

This needs a comment to explain the permutation.

Collaborator Author

I added a comment on top of this function.

am17an commented Dec 12, 2025

I used 512 as MMQ_ITER_K so that all tile sizes remain the same, and it seems to be faster than the previous version.
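
Back-of-the-envelope for why 512 keeps the tiles the same size, assuming the existing MMQ path iterates over K in chunks of 256 8-bit quantized values (an assumption, not verified against the diff):

```cpp
// 512 4-bit values occupy the same number of bytes per tile row as 256 8-bit
// values, so doubling the K iteration for FP4 leaves the shared-memory tile
// footprint unchanged (illustrative arithmetic only).
static_assert(512 * 4 / 8 == 256 * 8 / 8, "FP4 at K=512 matches int8 at K=256 in bytes");
```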
