add option to cache TMA loaded buffer in registers #5811
!test
Review updated until commit 5c08e12
Description
|
| Relevant files | |||||
|---|---|---|---|---|---|
| Enhancement |
|
PR Reviewer Guide
Here are some key observations to aid the review process:
| 🧪 No relevant tests |
| ⚡ Recommended focus areas for review: Missing performance validation |
Greptile Overview
Greptile Summary
This PR adds register caching support for TMA-loaded buffers in the inner persistent normalization scheduler. When warp specialization is enabled, TMA-loaded shared memory buffers are cached to registers so that the shared memory barriers are released immediately, allowing the next TMA load to proceed without waiting for computation to complete.
Key Implementation Details:
Changes:
Minor Issue:
Confidence Score: 4/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant Heuristics as getInnerPersistentHeuristics
participant Setup as setupPersistentSchedule
participant Scheduler as scheduleInnerPersistentWarpSpecialized
participant TMA as TMA Load
Note over Heuristics: Check if warp specialization<br/>conditions met (n_stages >= 2, bdimx == 128)
alt Warp Specialization Enabled
Heuristics->>Heuristics: Set circular_buffer_options
Heuristics->>Heuristics: Set is_circular_buffer_regs_cached = true
else No Warp Specialization
Heuristics->>Heuristics: is_circular_buffer_regs_cached = false (default)
end
Heuristics->>Setup: Pass params with flag
Setup->>Setup: Cache inputs (cacheInputs)
Setup->>Setup: Create TMA loads to shared memory
Setup->>Setup: Create register cache (cacheAfter)
alt is_circular_buffer_regs_cached = true
Note over Setup: Skip recomputation logic<br/>to keep all data in registers
Setup->>Setup: continue (no recompute)
else is_circular_buffer_regs_cached = false
Setup->>Setup: Recompute from smem for each consumer
end
Setup->>Scheduler: Return setup with smem2reg_tvs
alt is_circular_buffer_regs_cached = true
Scheduler->>Scheduler: Filter smem2reg_tvs with BIDx
Scheduler->>Scheduler: Inline at pos_after_bidx
Note over Scheduler,TMA: Cached registers immediately<br/>release shared memory barrier
TMA->>TMA: Next TMA load can proceed
end
Note over Scheduler: Result: Improved TMA pipelining<br/>at cost of increased register usage
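The control flow in the diagram above can be sketched in plain C++. This is only an illustrative model, not the nvFuser implementation: `SketchParams`, `sketchHeuristics`, `sketchSetup`, and the step strings are hypothetical names invented here. Only the flag name `is_circular_buffer_regs_cached` and the conditions `n_stages >= 2` and `bdimx == 128` are taken from the summary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the scheduler parameters.
// Only the flag name is_circular_buffer_regs_cached comes from the PR summary.
struct SketchParams {
  int n_stages = 1;
  int bdimx = 0;
  bool warp_specialized = false;
  bool is_circular_buffer_regs_cached = false;  // defaults to false
};

// Models the "Heuristics" lane: the flag is set only when the
// warp specialization conditions from the diagram are met.
SketchParams sketchHeuristics(int n_stages, int bdimx) {
  SketchParams p;
  p.n_stages = n_stages;
  p.bdimx = bdimx;
  p.warp_specialized = (n_stages >= 2 && bdimx == 128);
  p.is_circular_buffer_regs_cached = p.warp_specialized;
  return p;
}

// Models the "Setup" lane: returns the ordered list of steps taken.
// The step names are labels for this sketch, not real function calls.
std::vector<std::string> sketchSetup(const SketchParams& p) {
  std::vector<std::string> steps = {
      "cache_inputs", "tma_load_to_smem", "cache_smem_to_regs"};
  if (!p.is_circular_buffer_regs_cached) {
    // Without register caching, each consumer recomputes from shared memory.
    steps.push_back("recompute_from_smem_per_consumer");
  }
  // With register caching, recomputation is skipped so that all data stays
  // in registers and the shared memory barrier can be released early.
  return steps;
}
```

Under these assumptions, `sketchSetup(sketchHeuristics(2, 128))` omits the recompute step, while any configuration that fails the warp specialization check keeps it. The trade-off noted in the diagram still applies: skipping recomputation holds more values in registers.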
!test
No files reviewed, no comments
1 file reviewed, 1 comment
// If regs cache is enabled, no need to further recompute from smem as
// we want to cache all tma loaded buffers to regs to immediately release
// the shared memory barrier to launch the next TMA load. Note that, this
grammar: "increased" should be "increases"
// the shared memory barrier to launch the next TMA load. Note that, this
// increased register usage.
No description provided.