Normalize gates on expert dim before calculating seq_aux_loss #11160
PR types
Bug fixes
PR changes
Models
Description
Normalize the input gates along the expert dimension (axis=-1) before computing seq_aux_loss, so that the result aligns with Megatron.
Corresponding Megatron code
The seq_aux_loss computation is in this block: https://github.com/NVIDIA/Megatron-LM/blob/d3f1af4ed0f9e0cb8bc97cc6f6e288c2d096b443/megatron/core/transformer/moe/router.py#L514-L525
Note that before seq_aux_loss is computed, the logits are first normalized: https://github.com/NVIDIA/Megatron-LM/blob/d3f1af4ed0f9e0cb8bc97cc6f6e288c2d096b443/megatron/core/transformer/moe/moe_utils.py#L647
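To make the intended computation concrete, here is a minimal NumPy sketch of a DeepSeek-style sequence-wise aux loss with the gates normalized along the expert dimension first. The function name, signature, and shapes (seq_aux_loss_sketch, gates of shape [seq_len, num_experts]) are assumptions for illustration only, not the actual _cal_seq_aux_loss or the Megatron implementation.

```python
import numpy as np

def seq_aux_loss_sketch(gates, top_k, alpha):
    """Illustrative sequence-wise aux loss; gates: [seq_len, num_experts] raw scores."""
    seq_len, num_experts = gates.shape

    # The fix in this PR: normalize over the expert dim (axis=-1) first,
    # mirroring the normalization Megatron applies before its seq aux loss.
    gates = gates / gates.sum(axis=-1, keepdims=True)

    # f_i: fraction of tokens routed to expert i, scaled by N_r / K_r.
    top_idx = np.argsort(-gates, axis=-1)[:, :top_k]            # [seq_len, top_k]
    dispatch = np.zeros_like(gates)
    np.put_along_axis(dispatch, top_idx, 1.0, axis=-1)          # one-hot routing mask
    f = dispatch.sum(axis=0) * num_experts / (top_k * seq_len)  # [num_experts]

    # P_i: mean normalized score assigned to expert i over the sequence.
    p = gates.mean(axis=0)                                      # [num_experts]

    return alpha * float(np.sum(f * p))
```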
Corresponding DeepseekV3 paper
The normalization corresponds to the equation shown in the red box of the paper:

That is, before computing seq_aux_loss, the scores must first be normalized along the Nr (number of routed experts) dimension.
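For reference, the normalization and the terms it feeds into can be written in the DeepSeek-V3 notation roughly as follows (reconstructed from the paper's sequence-wise balance loss, since the screenshot is not embedded here):

$$
s'_{i,t} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}, \qquad
P_i = \frac{1}{T}\sum_{t=1}^{T} s'_{i,t}, \qquad
\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i\, P_i
$$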
Effect of the change
After this change, our _cal_seq_aux_loss matches Megatron bit for bit (exact numerical alignment).
Without the change, the aux_loss we compute is roughly 20x larger than Megatron's, and the factor is not fixed: it depends on the per-token sum over the expert dimension that the normalization divides by. Within a small number of steps this PR therefore behaves much like shrinking aux_loss_alpha by 20x, but over longer training that workaround still cannot match the accuracy of normalizing directly.
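Rough intuition for the ~20x factor, as a hypothetical NumPy check with fake random scores (not real model gates): per-token normalization is a positive rescaling, so it leaves the top-k selection and hence f_i unchanged, and only the P_i term shrinks by the per-token sum of gates over the expert dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_gates = rng.uniform(0.0, 0.4, size=(128, 64))     # fake scores; per-token sum over experts ≈ 12.8
norm_gates = raw_gates / raw_gates.sum(axis=-1, keepdims=True)

# f_i is identical in both cases (per-token rescaling keeps the top-k ordering),
# so the aux loss ratio is driven by the P_i term alone.
p_raw = raw_gates.mean(axis=0)
p_norm = norm_gates.mean(axis=0)
print(p_raw.sum() / p_norm.sum())                      # ≈ 12.8: the average per-token gate sum
```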