Normalize gates on expert dim before calculating seq_aux_loss #11160
PR types
Bug fixes
PR changes
Models
Description
Normalize the input gates along the expert dimension (axis=-1) before computing seq_aux_loss, so that the result aligns with Megatron.
Corresponding Megatron code
The seq_aux_loss computation is in this block: https://github.com/NVIDIA/Megatron-LM/blob/d3f1af4ed0f9e0cb8bc97cc6f6e288c2d096b443/megatron/core/transformer/moe/router.py#L514-L525
Note that before seq_aux_loss is computed, the logits are first normalized: https://github.com/NVIDIA/Megatron-LM/blob/d3f1af4ed0f9e0cb8bc97cc6f6e288c2d096b443/megatron/core/transformer/moe/moe_utils.py#L647
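To make the intended computation concrete, here is a minimal NumPy sketch of a DeepSeek-style sequence-wise aux loss with the gates normalized along the expert dimension first. The function name, signature, and shapes (seq_aux_loss_sketch, gates of shape [seq_len, num_experts]) are assumptions for illustration only, not the actual _cal_seq_aux_loss or the Megatron implementation.

```python
import numpy as np

def seq_aux_loss_sketch(gates, top_k, alpha):
    """Illustrative sequence-wise aux loss; gates: [seq_len, num_experts] raw scores."""
    seq_len, num_experts = gates.shape

    # The fix in this PR: normalize over the expert dim (axis=-1) first,
    # mirroring the normalization Megatron applies before its seq aux loss.
    gates = gates / gates.sum(axis=-1, keepdims=True)

    # f_i: fraction of tokens routed to expert i, scaled by N_r / K_r.
    top_idx = np.argsort(-gates, axis=-1)[:, :top_k]            # [seq_len, top_k]
    dispatch = np.zeros_like(gates)
    np.put_along_axis(dispatch, top_idx, 1.0, axis=-1)          # one-hot routing mask
    f = dispatch.sum(axis=0) * num_experts / (top_k * seq_len)  # [num_experts]

    # P_i: mean normalized score assigned to expert i over the sequence.
    p = gates.mean(axis=0)                                      # [num_experts]

    return alpha * float(np.sum(f * p))
```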
Corresponding DeepseekV3 paper
The normalization corresponds to the equation shown in the red box of the paper:

That is, before computing seq_aux_loss, the scores must first be normalized along the Nr (number of routed experts) dimension.
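For reference, the normalization and the terms it feeds into can be written in the DeepSeek-V3 notation roughly as follows (reconstructed from the paper's sequence-wise balance loss, since the screenshot is not embedded here):

$$
s'_{i,t} = \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}, \qquad
P_i = \frac{1}{T}\sum_{t=1}^{T} s'_{i,t}, \qquad
\mathcal{L}_{\mathrm{Bal}} = \alpha \sum_{i=1}^{N_r} f_i\, P_i
$$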
Effect of the change
After this change, our _cal_seq_aux_loss matches Megatron bit for bit (exact numerical alignment).
Without the change, the aux_loss we compute is roughly 20x larger than Megatron's, and the factor is not fixed: it depends on the per-token sum over the expert dimension that the normalization divides by. Within a small number of steps this PR therefore behaves much like shrinking aux_loss_alpha by 20x, but over longer training that workaround still cannot match the accuracy of normalizing directly.
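Rough intuition for the ~20x factor, as a hypothetical NumPy check with fake random scores (not real model gates): per-token normalization is a positive rescaling, so it leaves the top-k selection and hence f_i unchanged, and only the P_i term shrinks by the per-token sum of gates over the expert dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_gates = rng.uniform(0.0, 0.4, size=(128, 64))     # fake scores; per-token sum over experts ≈ 12.8
norm_gates = raw_gates / raw_gates.sum(axis=-1, keepdims=True)

# f_i is identical in both cases (per-token rescaling keeps the top-k ordering),
# so the aux loss ratio is driven by the P_i term alone.
p_raw = raw_gates.mean(axis=0)
p_norm = norm_gates.mean(axis=0)
print(p_raw.sum() / p_norm.sum())                      # ≈ 12.8: the average per-token gate sum
```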