-
Notifications
You must be signed in to change notification settings - Fork 42
apriel2 modeling bug #450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
apriel2 modeling bug #450
Conversation
tscholak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important fixes!
| initial_state=recurrent_state, | ||
| output_final_state=past_key_values is not None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great catch!
| 4. Update state: S = S + k β delta | ||
| 5. Output: o = S @ q (scaled) | ||
| """ | ||
| input_dtype = query.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the torch implementation, right? under normal circumstances we wouldn't hit this code path because we're using the FLA kernel, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The torch fallback is torch_chunk_gated_delta_rule. The _recurrent_gated_delta_rule is used at decoding time in recurrent mode (i.e. after prefill)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's an FLA kernel for chunk_gated_delta_rule that we can use. I think you saw that too now. thanks for approving!
|
when I looked at these changes, I realized that it doesn't make sense to have a torch fallback. I decided to just remove that code. The outcome is #451 |
Co-authored-by: Claude Opus 4.5 <[email protected]>
tscholak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good stuff
β¨ Description
Bugs in the recurrent step path used in generation.
This makes GSM8k generations actually sensical.
π Type of change
Select all that apply:
π Changes
List the key changes introduced in this PR:
β Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
π Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
ποΈ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.