[Enhance] add model internal metrics #1271
Conversation
```diff
         device=hidden_states.device)
-        attn_output: torch.Tensor = self.attn_impl_func(  # type: ignore
+        attn_output, extra_info = self.attn_impl_func(  # type: ignore
```
It's not a good idea to change the attention function signature like this. Instead, we should define an AttnOutput type (using TypedDict or namedtuple) to represent the attention result.
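A minimal sketch of that suggestion, assuming a NamedTuple with an optional lse field (the names AttnOutput, attn_output, and lse are illustrative, not the repo's actual API):

```python
from typing import NamedTuple, Optional

import torch


class AttnOutput(NamedTuple):
    """Result of an attention implementation."""
    attn_output: torch.Tensor
    # Log-sum-exp of the attention logits; only populated when internal
    # metrics are requested, None otherwise.
    lse: Optional[torch.Tensor] = None
```

A NamedTuple keeps positional unpacking working for existing callers while giving new code named access to the diagnostic fields.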
```diff
         ctx.cu_seqlen = cu_seqlen
-        return o
+        return o, lse
```
The lse should only be returned when specific flags are passed in. Otherwise, only the attention output should be returned. This makes for a cleaner attention interface.
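A sketch of what that flag-gated interface could look like, using a toy dense-attention stand-in (the return_lse flag name and the dense math are assumptions; a real kernel would compute lse inside the fused implementation):

```python
import torch


def attn_forward(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                 return_lse: bool = False):
    """Toy attention that only exposes lse when explicitly requested."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    o = torch.softmax(scores, dim=-1) @ v
    if return_lse:
        # Log-sum-exp of the attention logits per query position.
        return o, torch.logsumexp(scores, dim=-1)
    return o
```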
```diff
+    def register_attn_extra_info_hook(self, module, layer_name=None):
+        def hook(module, input, output):
+            extra_info = output[1]
```
Suggest refactoring this with the AttnOutput type as well.
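For illustration, the hook could match on the AttnOutput type sketched above instead of indexing output[1] (the metrics dict and key format here are hypothetical):

```python
import torch


def register_attn_extra_info_hook(module: torch.nn.Module, metrics: dict,
                                  layer_name: str = ""):
    def hook(mod, inputs, output):
        # Named access instead of output[1]: the hook no longer depends
        # on the positional layout of the attention result.
        if isinstance(output, AttnOutput) and output.lse is not None:
            metrics[f"{layer_name}/max_lse"] = output.lse.max().item()

    return module.register_forward_hook(hook)
```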
```diff
+        # do dummy forward to get metrics
+        for i in range(0, len(data_batches), self.intra_layer_micro_batch):
+            data_batch = data_batches[i : i + self.intra_layer_micro_batch]
```
Is it necessary to consider intra_layer_micro_batch?
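If it is not, the metrics-only pass could iterate the batches directly; a sketch of that simplification (assuming micro-batch grouping only matters for the real, gradient-carrying step):

```python
import torch


def dummy_forward_for_metrics(model: torch.nn.Module, data_batches: list):
    # No micro-batch grouping: the pass needs no gradients or pipeline
    # scheduling, so plain iteration may suffice.
    with torch.no_grad():
        for data_batch in data_batches:
            model(**data_batch)  # hooks collect metrics as a side effect
```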
force-pushed from f4014fc to 97f1564
force-pushed from e557bee to 39f7de0
1. attn logits metrics (max lse / max logits)
2. moe metrics (mean & max router logits, maxvio & drop ratio)
3. model layerwise weight rms norm
4. est. global batch tokens
…ce it is no longer needed
[Enhance] refactor internal metrics to use TypedDict
[Fix] fix rms_norm no_grad
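For context on that fix, a minimal sketch of a layerwise weight RMS computation wrapped in no_grad (function and key names are assumptions, not the PR's exact code):

```python
import torch


@torch.no_grad()  # metrics only; must not build an autograd graph
def layerwise_weight_rms(model: torch.nn.Module) -> dict:
    """Per-parameter RMS: sqrt(mean(w ** 2))."""
    return {
        name: param.float().pow(2).mean().sqrt().item()
        for name, param in model.named_parameters()
    }
```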
force-pushed from 39f7de0 to 0a5f008