@Fridge003 Fridge003 commented Nov 17, 2025

Motivation

When launching dpsk-r1-fp4 (DeepSeek-R1 FP4) with MTP and TP4, the draft model uses the bfloat16 fused MoE Triton kernels, so a tuned kernel configuration is needed for good performance on B200.
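
For context (not the exact file or values added in this PR): the tuned fused MoE Triton configurations in SGLang are JSON files keyed by batch size, mapping each size to Triton launch parameters, with the filename encoding the expert count, shard intermediate size, device name, and dtype. A minimal sketch of the format, with purely illustrative values:

{
    "1": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3
    },
    "16": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 8,
        "num_stages": 4
    }
}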

Modifications

Accuracy Tests

Benchmarking and Profiling

# Launch
export SGLANG_ENABLE_SPEC_V2=1
python3 -m sglang.launch_server \
    --model-path nvidia/DeepSeek-R1-0528-FP4-v2 \
    --trust-remote-code \
    --attention-backend trtllm_mla \
    --moe-runner-backend flashinfer_trtllm \
    --quantization modelopt_fp4 \
    --tp 4 \
    --speculative-algorithm EAGLE \
    --kv-cache-dtype fp8_e4m3

# Profile
python3 -m sglang.bench_one_batch_server \
    --model nvidia/DeepSeek-R1-0528-FP4-v2 \
    --base-url http://localhost:30000 \
    --batch-size 16 \
    --input-len 1024 \
    --output-len 20 \
    --skip-warmup \
    --profile \
    --profile-steps 10
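
# Optional sanity check (not part of the original benchmark commands): confirm the
# server responds before profiling, via SGLang's native /generate endpoint. The prompt
# and sampling parameters below are illustrative.
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'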

Main: (profiler screenshot, 2025-11-17 16:06:15)

This PR: (profiler screenshot, 2025-11-17 16:05:47)

Checklist

@Fridge003 Fridge003 removed the run-ci label Nov 17, 2025
@Fridge003 Fridge003 marked this pull request as ready for review November 18, 2025 00:07
@Fridge003 Fridge003 changed the title Add bfloat16 tuned fused moe config for B200 Add bfloat16 tuned fused moe config for Dpsk-MTP layer on B200 Nov 18, 2025
@Fridge003 Fridge003 merged commit 85ae508 into main Nov 18, 2025
87 of 120 checks passed
@Fridge003 Fridge003 deleted the baizhou/fused-moe branch November 18, 2025 01:44