Conversation
They are only used by canonicalDimReduction
!test
Review updated until commit fafcea9
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 No relevant tests

⚡ Recommended focus areas for review
Lambda capture safety
merge_all captures tv and num_ordered_dims by reference. While tv remains constant, num_ordered_dims is modified in the scope the lambda captures from (lines 1200, 1206), which could lead to unexpected behavior. Consider capturing by value or restructuring to avoid potential issues.
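To make the concern concrete, here is a minimal standalone sketch of how the two capture modes behave; num_ordered_dims and the increment are placeholders mirroring the review comment, not the actual nvFuser code:

```cpp
#include <iostream>

int main() {
  int num_ordered_dims = 0;

  // By-reference capture: the lambda sees whatever value num_ordered_dims
  // holds at the moment it is invoked, including later updates.
  auto by_ref = [&num_ordered_dims]() { return num_ordered_dims; };

  // By-value capture: the lambda keeps a snapshot taken at capture time.
  auto by_value = [num_ordered_dims]() { return num_ordered_dims; };

  num_ordered_dims += 2;  // placeholder for the increments at lines 1200/1206

  std::cout << by_ref() << "\n";    // prints 2
  std::cout << by_value() << "\n";  // prints 0
  return 0;
}
```

If the second invocation of merge_all is supposed to observe the incremented value, the by-reference capture is intentional and worth a clarifying comment; if not, capturing by value removes the hazard.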
Test failures

- (High, 95) CUDA driver version too old for runtime – nvFuser matmul/top-k test suites on dlcluster_h100

  | Test Name | H100 | Source |
  |---|---|---|
  | Ampere/MmaTest.SingleTile/Ampere_16_8_16__bfloat | ❌ | Link |
  | ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/2048_1_1_0 | ❌ | Link |
  | BlockSizeAndItemsPerThread/ArgSortComprehensiveTest.ComprehensiveValidation/BlockSize32_ItemsPerThread4 | ❌ | Link |
  | ClusterReductionTest.SimpleFusionNotAllReduce/cluster_15_dtype_double | ❌ | Link |
  | ClusterReductionTest.SimpleFusionNotAllReduce/cluster_4_dtype_double | ❌ | Link |
  | CutlassExecutorTest.Nvfp4Matmul_BiasEpilogue | ❌ | Link |
  | General/HopperPlusMatmulSchedulerTest.FusedMultiplySum/KK_512_256_128_MmaMacro_m64_n128_k16_splitk_2 | ❌ | Link |
  | General/HopperPlusMatmulSchedulerTest.FusedMultiplySum/MK_512_256_128_MmaMacro_m128_n128_k16_tma_store | ❌ | Link |
  | General/HopperPlusMatmulSchedulerTest.FusedMultiplySumBiasNeg/MN_512_256_128_MmaMacro_m64_n128_k16_tma_store_splitk_2 | ❌ | Link |
  | GreedySchedulerTest.ScanNonLocalOutput | ❌ | Link |

  ... with 85 more test failures omitted. Check internal logs.

- (High, 1) Outdated NVIDIA driver on dlcluster_h100 causes CUDA initialization failure in RNG tests

  | Test Name | H100 | Source |
  |---|---|---|
  | RNGTest.BroadcastingRNG | ❌ | Link |
Greptile Overview

Greptile Summary

This PR successfully refactors …

Key Improvements

1. Bug Fix - Stream Handling
The old code only checked …

2. Multi-Dimensional Sharding Support
The change in …

3. Code Simplification
The refactored implementation introduces a reusable …

Implementation Details

The new … The final layout is: …

Edge Cases Verified

The implementation correctly handles: …

Confidence Score: 5/5

Important Files Changed

File Analysis

Sequence Diagram

sequenceDiagram
participant Caller
participant canonicalizeReduction
participant reorderParallelizedToFront
participant merge_all
participant TensorView
Caller->>canonicalizeReduction: canonicalizeReduction(fusion, tv, schedule_3d)
alt schedule_3d == true
canonicalizeReduction->>TensorView: merge_3d(tv)
TensorView-->>canonicalizeReduction: 3D merged result
canonicalizeReduction->>TensorView: handle broadcast reordering
canonicalizeReduction-->>Caller: {true, true}
else schedule_3d == false
canonicalizeReduction->>reorderParallelizedToFront: reorderParallelizedToFront(tv)
Note over reorderParallelizedToFront: Orders dims by rank:<br/>Stream(0) < DIDz(1) < DIDy(2) < DIDx(3)
reorderParallelizedToFront->>TensorView: reorder parallel dims to front
reorderParallelizedToFront-->>canonicalizeReduction: reorder_map (num_ordered_dims)
Note over canonicalizeReduction: First merge_all: non-reductions
canonicalizeReduction->>merge_all: merge_all(pred: !isReduction)
loop For each unordered dim (reverse)
merge_all->>TensorView: Check if !isReduction()
alt Satisfies predicate
merge_all->>TensorView: merge(i, merged)
end
end
merge_all-->>canonicalizeReduction: merged_non_reduction index
alt merged_non_reduction >= 0
canonicalizeReduction->>TensorView: reorder to num_ordered_dims
Note over canonicalizeReduction: Increment num_ordered_dims
end
Note over canonicalizeReduction: Second merge_all: all remaining
canonicalizeReduction->>merge_all: merge_all(pred: always true)
loop For each remaining dim (reverse)
merge_all->>TensorView: merge(i, merged)
end
merge_all-->>canonicalizeReduction: merged_reduction index
alt merged_reduction >= 0
canonicalizeReduction->>TensorView: reorder to num_ordered_dims
Note over canonicalizeReduction: Increment num_ordered_dims
end
canonicalizeReduction->>canonicalizeReduction: Assert num_ordered_dims == nDims()
canonicalizeReduction-->>Caller: {has_iter, has_reduction}
end
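
A toy, self-contained model of the two-pass flow in the diagram above. Dim, the erase-based "merge", and the swap-based "reorder" are illustrative stand-ins, not nvFuser's TensorView::merge/reorder API:

```cpp
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Toy stand-in for a tensor dimension; nvFuser tracks this on a TensorView.
struct Dim {
  bool is_reduction;
};

int main() {
  // State after reorderParallelizedToFront: the first num_ordered_dims
  // positions hold parallelized dims (Stream/DID) and are never merged.
  std::vector<Dim> dims = {{false}, {false}, {true}, {false}, {true}};
  int num_ordered_dims = 1;

  // merge_all: walk the unordered dims from the back and fold every dim that
  // satisfies the predicate into a single surviving dim; return its index,
  // or -1 if nothing matched. "Merging" is modeled by erasing one element.
  auto merge_all = [&dims, &num_ordered_dims](
                       const std::function<bool(const Dim&)>& pred) -> int {
    int merged = -1;
    for (int i = static_cast<int>(dims.size()) - 1; i >= num_ordered_dims;
         --i) {
      if (!pred(dims[i])) {
        continue;
      }
      if (merged != -1) {
        dims.erase(dims.begin() + merged);  // fold previous survivor into i
      }
      merged = i;
    }
    return merged;
  };

  // Pass 1: merge the non-reduction dims, move the result next to the
  // ordered dims (modeled by a swap), and grow num_ordered_dims.
  int merged_iter = merge_all([](const Dim& d) { return !d.is_reduction; });
  if (merged_iter >= 0) {
    std::swap(dims[num_ordered_dims], dims[merged_iter]);
    ++num_ordered_dims;
  }

  // Pass 2: merge everything that is left (the reduction dims).
  int merged_red = merge_all([](const Dim&) { return true; });
  if (merged_red >= 0) {
    ++num_ordered_dims;
  }

  // Layout is now: [parallel][merged iteration][merged reduction].
  std::cout << "ndims=" << dims.size() << " ordered=" << num_ordered_dims
            << "\n";  // prints: ndims=3 ordered=3
  return 0;
}
```

Note that this toy merge_all captures num_ordered_dims by reference and relies on its value changing between the two passes, which is exactly the coupling flagged in the reviewer guide above.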
!test
liqiangxl left a comment
Please fix failed tests.
!test

!test

!test

!test
// We coalesce all reduction axes to the right;
bool has_red_axis = mergeReduction(tv) > 0;

bool has_iter_axis = mergeNonReduction(tv) > 0;
There’s temporal coupling between mergeReduction and mergeNonReduction. For example:
- mergeReduction must be called before mergeNonReduction.
- mergeNonReduction needs to reorder DIDs, while mergeReduction does not.
This kind of coupling is much easier to reason about if both steps are inlined into a single function.
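
For illustration, a hypothetical, self-contained sketch of that suggestion; Tensor and the function bodies are placeholders, not the nvFuser implementations:

```cpp
#include <cassert>
#include <utility>

// Placeholder for a TensorView; only the state that creates the coupling.
struct Tensor {
  bool reductions_merged = false;
};

// Step 1: leaves behind state that step 2 silently depends on.
bool mergeReduction(Tensor& tv) {
  tv.reductions_merged = true;
  return true;  // pretend a reduction axis was found and merged
}

// Step 2: only valid after mergeReduction; the ordering requirement is
// invisible at the call site and enforced (at best) by an assert.
bool mergeNonReduction(Tensor& tv) {
  assert(tv.reductions_merged && "mergeReduction must run first");
  return true;  // pretend an iteration axis was found and merged
}

// Inlining both steps into one function removes the temporal coupling:
// callers cannot get the order wrong, and the DID-reordering special case
// would have a single home here as well.
std::pair<bool, bool> canonicalizeReduction(Tensor& tv) {
  bool has_red_axis = mergeReduction(tv);
  bool has_iter_axis = mergeNonReduction(tv);
  return {has_iter_axis, has_red_axis};
}

int main() {
  Tensor tv;
  auto [has_iter, has_red] = canonicalizeReduction(tv);
  return (has_iter && has_red) ? 0 : 1;
}
```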
!test
More thoroughly tested by #5806