[Enhance] add model internal metrics #1271
Conversation
```diff
         device=hidden_states.device)
-        attn_output: torch.Tensor = self.attn_impl_func(  # type: ignore
+        attn_output, extra_info = self.attn_impl_func(  # type: ignore
```
It's not a good idea to change the attention function signature like this. Instead, we should define an AttnOutput type (using TypedDict or namedtuple) to represent the attention result.
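A minimal sketch of that suggestion, assuming a NamedTuple with an optional lse field (the names AttnOutput, attn_output, and lse are illustrative, not the repo's actual API):

```python
from typing import NamedTuple, Optional

import torch


class AttnOutput(NamedTuple):
    """Result of an attention implementation."""
    attn_output: torch.Tensor
    # Log-sum-exp of the attention logits; only populated when internal
    # metrics are requested, None otherwise.
    lse: Optional[torch.Tensor] = None
```

A NamedTuple keeps positional unpacking working for existing callers while giving new code named access to the diagnostic fields.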
```diff
         ctx.cu_seqlen = cu_seqlen
-        return o
+        return o, lse
```
The lse should only be returned when specific flags are passed in. Otherwise, only the attention output should be returned. This makes for a cleaner attention interface.
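A sketch of what that flag-gated interface could look like, using a toy dense-attention stand-in (the return_lse flag name and the dense math are assumptions; a real kernel would compute lse inside the fused implementation):

```python
import torch


def attn_forward(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                 return_lse: bool = False):
    """Toy attention that only exposes lse when explicitly requested."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    o = torch.softmax(scores, dim=-1) @ v
    if return_lse:
        # Log-sum-exp of the attention logits per query position.
        return o, torch.logsumexp(scores, dim=-1)
    return o
```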
```diff
+    def register_attn_extra_info_hook(self, module, layer_name=None):
+        def hook(module, input, output):
+            extra_info = output[1]
```
Suggest refactoring this with the AttnOutput type as well.
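For illustration, the hook could match on the AttnOutput type sketched above instead of indexing output[1] (the metrics dict and key format here are hypothetical):

```python
import torch


def register_attn_extra_info_hook(module: torch.nn.Module, metrics: dict,
                                  layer_name: str = ""):
    def hook(mod, inputs, output):
        # Named access instead of output[1]: the hook no longer depends
        # on the positional layout of the attention result.
        if isinstance(output, AttnOutput) and output.lse is not None:
            metrics[f"{layer_name}/max_lse"] = output.lse.max().item()

    return module.register_forward_hook(hook)
```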
```diff
+        # do dummy forward to get metrics
+        for i in range(0, len(data_batches), self.intra_layer_micro_batch):
+            data_batch = data_batches[i : i + self.intra_layer_micro_batch]
```
Is it necessary to consider intra_layer_micro_batch?
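If it is not, the metrics-only pass could iterate the batches directly; a sketch of that simplification (assuming micro-batch grouping only matters for the real, gradient-carrying step):

```python
import torch


def dummy_forward_for_metrics(model: torch.nn.Module, data_batches: list):
    # No micro-batch grouping: the pass needs no gradients or pipeline
    # scheduling, so plain iteration may suffice.
    with torch.no_grad():
        for data_batch in data_batches:
            model(**data_batch)  # hooks collect metrics as a side effect
```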
force-pushed from f4014fc to 97f1564
force-pushed from e557bee to 39f7de0
1. attn logits metrics (max lse / max logits)
2. moe metrics (mean & max router logits, maxvio & drop ratio)
3. model layerwise weight rms norm
4. est. global batch tokens
…ce it is no longer needed
[Enhance] refactor internal metrics to use TypedDict
[Fix] fix rms_norm no_grad
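For context on that fix, a minimal sketch of a layerwise weight RMS computation wrapped in no_grad (function and key names are assumptions, not the PR's exact code):

```python
import torch


@torch.no_grad()  # metrics only; must not build an autograd graph
def layerwise_weight_rms(model: torch.nn.Module) -> dict:
    """Per-parameter RMS: sqrt(mean(w ** 2))."""
    return {
        name: param.float().pow(2).mean().sqrt().item()
        for name, param in model.named_parameters()
    }
```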
force-pushed from 39f7de0 to 0a5f008