
Conversation

@zasdfgbnm (Collaborator) commented Aug 27, 2025

Fixes #4888

In FusionExecutorCache, we were assuming that fusion segments always generate contiguous tensors. This is not true for ExpressionEvaluator segments: for such segments, we need to actually execute the segment to know the shape and stride we will get. This execution happens on the meta device type instead of using a real CUDA tensor. According to https://docs.pytorch.org/docs/stable/meta.html:

In some cases, not all device types (e.g., CPU and CUDA) have exactly the same output metadata for an operation; we typically prefer representing the CUDA behavior faithfully in this situation.

This execution should lead to the same shape and stride as we will see when using real tensors.
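As a standalone illustration (not code from this PR), the idea can be sketched with plain ATen: build meta-device tensors that mirror the real inputs' sizes and strides, run the op, and read the output's metadata without allocating device memory or computing any values. The sizes and strides below are hypothetical.

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Meta tensors carry only metadata; no storage is allocated.
  at::Tensor a = at::empty_strided(
      {4, 8}, {8, 1}, at::TensorOptions().device(at::kMeta).dtype(at::kFloat));
  at::Tensor b = at::empty_strided(
      {8, 2}, {1, 8}, at::TensorOptions().device(at::kMeta).dtype(at::kFloat));
  // Running the op on meta produces the shape/strides the CUDA kernel would.
  at::Tensor out = at::matmul(a, b);
  std::cout << out.sizes() << " " << out.strides() << "\n";
  return 0;
}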

@github-actions bot commented Aug 27, 2025

Review updated until commit d558cca

Description

  • Use meta tensors and ExpressionEvaluator to compute output shapes/strides for ExprEval segments

  • Add resetAllocationDomainAndContiguity function to update tensor metadata based on PyTorch results

  • Modify MatmulOp and CutlassNvfp4GroupedMmaOp evaluate methods to handle meta device tensors

  • Update FusionKernelRuntime to distinguish between ExprEval and codegen fusion segments

  • Add test case for issue #4888 (FusionKernelRuntime::getMaybeHeuristicsFor computes the wrong strides) to validate the fix

Changes walkthrough

Relevant files

Bug fix (1 file)
  internal_nodes.cpp: Update MatmulOp and CutlassNvfp4GroupedMmaOp evaluate methods for meta tensors (+40/-11)

Enhancement (3 files)
  fusion_cache_utils.cpp: Add resetAllocationDomainAndContiguity function and update argument manager (+73/-1)
  fusion_kernel_runtime.cpp: Modify fusion runtime to handle ExprEval segments with meta tensors (+87/-13)
  fusion_cache_utils.h: Update function signature for updateWithSegmentOutputs (+2/-1)

Tests (5 files)
  test_allocation_domain.cpp: Simplify test expectation for allocation domain test (+1/-4)
  test_matmul_aten_evaluation.cpp: Remove MatmulNodeTest output strides test case (+0/-33)
  test_segmentation.cpp: Update test expectation for upcast ops count (+4/-1)
  opinfo_input_generators.py: Update error message regex pattern for topk generator (+1/-1)
  test_python_frontend.py: Add test case for issue #4888 (+99/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Complex contiguity logic

The new resetAllocationDomainAndContiguity function contains complex logic for computing contiguity from tensor strides. This logic handles reductions, broadcasts, and multi-dimensional tensors. The reviewer should verify that this algorithm correctly handles all edge cases and matches PyTorch's contiguity semantics.

void resetAllocationDomainAndContiguity(
    TensorView* tv,
    const at::Tensor& tensor) {
  if (!tensor.defined()) {
    return;
  }
  const auto [sizes, strides] =
      inferAllocationSizesAndStrides(tensor, tv, ExpressionEvaluator());

  const auto& alloc = tv->getMaybeAllocationDomain();

  // Custom contiguity inference that considers IterDomain information
  std::vector<std::optional<bool>> contiguity(alloc.size(), std::nullopt);

  // Single pass from right to left with two dynamic indices:
  // - alloc_idx: iterates through allocation domain
  // - sizes_idx: tracks position in sizes/strides (excludes reductions)
  int64_t sizes_idx = (int64_t)sizes.size() - 1;
  int64_t prev_non_skipped_sizes_idx = -1;

  for (int64_t alloc_idx = (int64_t)alloc.size() - 1; alloc_idx >= 0; --alloc_idx) {
    auto id = alloc[alloc_idx];

    // Reduction dimensions: nullopt contiguity (already set), no entry in sizes/strides
    if (id->isReduction()) {
      // Don't decrement sizes_idx since reductions have no entry
      continue;
    }

    // This dimension has an entry in sizes/strides
    NVF_CHECK(sizes_idx >= 0, "Sizes index out of bounds");

    // Broadcast dimensions: nullopt contiguity (already set), but has entry in sizes/strides
    if (id->isBroadcast()) {
      sizes_idx--;  // Move to next dimension in sizes/strides
      continue;
    }

    // Non-broadcast, non-reduction dimension
    if (prev_non_skipped_sizes_idx == -1) {
      // This is the rightmost (innermost) non-skipped dimension
      // It's contiguous if stride == 1
      contiguity[alloc_idx] = (strides[sizes_idx] == 1);
    } else {
      // A dimension is contiguous if its stride equals the stride of the
      // next dimension multiplied by that dimension's size
      contiguity[alloc_idx] = (strides[sizes_idx] == 
                               strides[prev_non_skipped_sizes_idx] * sizes[prev_non_skipped_sizes_idx]);
    }

    prev_non_skipped_sizes_idx = sizes_idx;
    sizes_idx--;  // Move to next dimension in sizes/strides
  }

  NVF_CHECK(sizes_idx == -1, "Not all sizes/strides were consumed");

  tv->setContiguity(contiguity);
}
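For intuition, a worked trace of the algorithm above on hypothetical inputs with no broadcast or reduction dimensions:

// sizes = {8, 4}, strides = {4, 1} (plain contiguous tensor):
//   alloc_idx = 1: innermost dim, strides[1] == 1              -> contiguity[1] = true
//   alloc_idx = 0: strides[0] == strides[1] * sizes[1] == 4    -> contiguity[0] = true
// sizes = {8, 4}, strides = {1, 8} (transposed view):
//   alloc_idx = 1: strides[1] == 8 != 1                        -> contiguity[1] = false
//   alloc_idx = 0: strides[0] == 1 != strides[1] * sizes[1]    -> contiguity[0] = false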
Meta tensor handling

The code creates meta tensors using at::empty_strided with original tensor sizes and strides, then binds them to the ExpressionEvaluator. This approach assumes that meta tensors will have the same metadata as real tensors, which should be verified for all supported operations.

// For expr evaluated fusion, the striding rules follow that of ATen.
ExpressionEvaluator eval_fusion;
for (auto [i, v] : enumerate(group_to_run->inputs())) {
  auto tensor_pv = args_manager.checkTensorMap(v);
  if (tensor_pv.is<at::Tensor>()) {
    const auto t = tensor_pv.as<at::Tensor>();
    if (t.defined()) {
      const auto meta_t = at::empty_strided(
          t.sizes(),
          t.strides(),
          at::TensorOptions().device(at::kMeta).dtype(t.dtype()));
      eval_fusion.bind(fusion_to_run->inputs()[i], meta_t);
    } else {
      eval_fusion.bind(fusion_to_run->inputs()[i], t);
    }
  } else {
    eval_fusion.bind(fusion_to_run->inputs()[i], tensor_pv);
  }
}
for (auto v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
  group_runtime_outputs.push(result);
}
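One way to verify that assumption is an op-by-op parity check of CUDA versus meta output metadata. A test sketch along those lines (not part of the PR; the choice of matmul and the transposed input is arbitrary):

#include <ATen/ATen.h>

void check_meta_matches_cuda() {
  // Deliberately non-contiguous CUDA input: sizes {8, 4}, strides {1, 4}.
  at::Tensor cuda_in = at::randn({4, 8}, at::kCUDA).t();
  at::Tensor meta_in = at::empty_strided(
      cuda_in.sizes(),
      cuda_in.strides(),
      at::TensorOptions().device(at::kMeta).dtype(cuda_in.dtype()));
  at::Tensor cuda_out = at::matmul(cuda_in, cuda_in.t());
  at::Tensor meta_out = at::matmul(meta_in, meta_in.t());
  // The meta result should carry exactly the CUDA result's metadata.
  TORCH_CHECK(cuda_out.sizes() == meta_out.sizes());
  TORCH_CHECK(cuda_out.strides() == meta_out.strides());
}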
Simplified MatmulOp evaluation

The MatmulOp evaluation was significantly simplified by removing manual contiguity checks and stride handling. The reviewer should verify that this simplification doesn't break existing functionality and that the ExpressionEvaluator will handle contiguity correctly.

  }

  if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
      rfactor_did_idx != -1) {
    matmul_out = matmul_out.unsqueeze(rfactor_did_idx);
  }
  return {matmul_out};
}

Test failures

  • (Medium, 12) NVFuser evaluator assertion failures in stream/multidevice matmul & linear tests (A100, GB200, H100, including distributed runners):
    - tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=eager]
    - tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=lru_cache]
    - tests.python.multidevice.test_overlap.test_row_parallel_linear_forward

  • (Medium, 6) Profiler event count mismatch in test_stream.test_matmul (nvFuser stream scheduling; A100, GB200, H100):
    - tests.python.direct.test_stream.test_matmul[nvfuser_direct_test=eager]
    - tests.python.direct.test_stream.test_matmul[nvfuser_direct_test=lru_cache]

  • (Medium, 1) CUDA out-of-memory in nvFuser TmaPointwiseTestF.SplitGridDim2D (runner: H100):
    - TmaPointwiseTestF.SplitGridDim2D

zasdfgbnm added a commit that referenced this pull request Sep 2, 2025
In preparation for the fix of #4888,
I am working on #5082, which
requires the use of `Expr::evaluate` on meta tensors to infer the shapes and
strides of fusion segments selected to be scheduled by the `ExprEval`
scheduler. As a consequence of this change, all `Expr::evaluate`
functions must support the meta device, and the returned output tensors'
shapes and strides must match those on device type CUDA.

According to https://docs.pytorch.org/docs/stable/meta.html
> In some cases, not all device types (e.g., CPU and CUDA) have exactly
the same output metadata for an operation; we typically prefer
representing the CUDA behavior faithfully in this situation.

It is generally safe to assume that we can use the meta device type to infer
the shapes and strides of device type CUDA. Unfortunately, not all
operators implement the meta device, and
`at::_scaled_dot_product_flash_attention` is one such example.

In this PR, I am adding my own `at::_scaled_dot_product_flash_attention`
implementation for the meta device.
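
A minimal sketch of the kind of meta implementation described (the output shape follows the SDPA contract of matching the query; the helper name and the contiguous-stride choice are assumptions, not the PR's actual code):

#include <ATen/ATen.h>

// Flash attention's primary output has the same [N, H, L, E] shape as the
// query. On the meta device only the metadata matters, so it can be built
// directly instead of dispatching to a (missing) meta kernel.
at::Tensor sdpa_flash_meta_output(const at::Tensor& query) {
  TORCH_CHECK(query.device().is_meta(), "expected a meta tensor");
  // Assumption: contiguous output strides. A faithful implementation must
  // mirror whatever layout the real CUDA kernel produces.
  return at::empty(query.sizes(), query.options());
}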
@greptile-apps bot (Contributor) left a comment

13 files reviewed, 1 comment

Comment on lines +370 to +393
if (is_expr_eval) {
  // For expr evaluated fusion, the striding rules follow that of ATen.
  ExpressionEvaluator eval_fusion;
  for (auto [i, v] : enumerate(group_to_run->inputs())) {
    auto tensor_pv = args_manager.checkTensorMap(v);
    if (tensor_pv.is<at::Tensor>()) {
      const auto t = tensor_pv.as<at::Tensor>();
      if (t.defined()) {
        const auto meta_t = at::empty_strided(
            t.sizes(),
            t.strides(),
            at::TensorOptions().device(at::kMeta).dtype(t.dtype()));
        eval_fusion.bind(fusion_to_run->inputs()[i], meta_t);
      } else {
        eval_fusion.bind(fusion_to_run->inputs()[i], t);
      }
    } else {
      eval_fusion.bind(fusion_to_run->inputs()[i], tensor_pv);
    }
  }
  for (auto v : fusion_to_run->outputs()) {
    auto result = eval_fusion.evaluate(v);
    group_runtime_outputs.push(result);
  }

style: duplicated meta tensor evaluation logic between prepareInputs (lines 370-393) and getMaybeHeuristicsFor (lines 640-663). extract to helper like evaluateExprEvalSegmentOutputs(fusion_to_run, group_inputs, args_manager) to reduce duplication
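
A sketch of the suggested extraction, with the body lifted from the quoted code; the helper name, return type, and parameter types are assumptions:

// Hypothetical helper consolidating the duplicated binding/evaluation logic
// from prepareInputs and getMaybeHeuristicsFor.
KernelArgumentHolder evaluateExprEvalSegmentOutputs(
    Fusion* fusion_to_run,
    const std::vector<Val*>& group_inputs,
    ArgumentManager& args_manager) {
  ExpressionEvaluator eval_fusion;
  for (auto [i, v] : enumerate(group_inputs)) {
    auto tensor_pv = args_manager.checkTensorMap(v);
    if (tensor_pv.is<at::Tensor>()) {
      const auto t = tensor_pv.as<at::Tensor>();
      if (t.defined()) {
        // Rebind the input as a meta tensor with identical metadata.
        const auto meta_t = at::empty_strided(
            t.sizes(),
            t.strides(),
            at::TensorOptions().device(at::kMeta).dtype(t.dtype()));
        eval_fusion.bind(fusion_to_run->inputs()[i], meta_t);
      } else {
        eval_fusion.bind(fusion_to_run->inputs()[i], t);
      }
    } else {
      eval_fusion.bind(fusion_to_run->inputs()[i], tensor_pv);
    }
  }
  KernelArgumentHolder outputs;
  for (auto v : fusion_to_run->outputs()) {
    outputs.push(eval_fusion.evaluate(v));
  }
  return outputs;
}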

@greptile-apps bot (Contributor) left a comment

13 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

11 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, 1 comment

} else {
  // TODO: inferOutputSizes doesn't seem to strictly require a Fusion for
  // each segment. Consider using the complete fusion instead.
  auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;

style: variable shadowing: fusion_to_run declared on line 365 as Fusion*, then redeclared here as std::unique_ptr<Fusion>. while technically valid (different scopes), this is confusing

Suggested change
- auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;
+ auto fusion_unique_ptr = segmented_fusion_->makeFusion(group_to_run).second;
+ group_runtime_outputs =
+     inferOutputSizes(fusion_unique_ptr.get(), group_runtime_inputs);


@greptile-apps bot (Contributor) left a comment

9 files reviewed, 2 comments

Comment on lines +370 to +393
(The same snippet from lines +370 to +393 is quoted again here.)

style: duplicated between prepareInputs and getMaybeHeuristicsFor. extract to helper like evaluateExprEvalSegmentOnMeta(fusion, inputs, args_manager) to eliminate duplication


- auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;
- auto group_runtime_outputs =
-     inferOutputSizes(fusion_to_run.get(), group_runtime_inputs);
+ Fusion* fusion_to_run = group_to_run->getFusion();

style: variable fusion_to_run shadows line 397 where it's redeclared as std::unique_ptr<Fusion>. rename one (e.g., this to fusion_ptr or line 397 to fusion_to_run_unique) to avoid confusion

@greptile-apps bot (Contributor) left a comment

9 files reviewed, 1 comment

} else {
  // TODO: inferOutputSizes doesn't seem to strictly require a Fusion for
  // each segment. Consider using the complete fusion instead.
  auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;

style: fusion_to_run redeclared here as std::unique_ptr<Fusion> shadows the Fusion* on line 365. rename to avoid confusion (e.g., fusion_to_run_ptr)


@zasdfgbnm (Collaborator Author)

!test

@zasdfgbnm (Collaborator Author)

!test

zasdfgbnm added a commit that referenced this pull request Jan 23, 2026

Fixes #4888

Stacked on #5766

I used to work on #5082 for the fix,
but I hit too many blockers, because this PR could interact with many
new assumptions/hacks/unfinalized designs on things like allocation
domain, stream-sharded tensor, multidevice, etc., and we keep having new
things committed to the main branch that break #5082. This situation
delayed the PR for a very long time. So I recreated this PR that is more
friendly to incremental development.

Today, on the main branch, `FusionExecutorCache` assumes that
fusion segments always generate contiguous tensors. This is not true for
`ExpressionEvaluator` segments. For example, ATen's slice op returns
non-contiguous tensors. It is worth mentioning that, because
segmentation and scheduler selection depend on inputs, the contiguity of
intermediate results also depends on inputs.

This PR adds `FusionKernelRuntime::inferOutputMetaTensor`, which
replaces `inferOutputShapeAndContiguousStrides` to infer the output
shape and stride of each segment. Both
`FusionKernelRuntime::inferOutputMetaTensor` and
`inferOutputShapeAndContiguousStrides` store their result as a tensor on
the meta device. The difference is that
`FusionKernelRuntime::inferOutputMetaTensor` will actually run the
segment on the meta device if the segment is scheduled to run by
`ExpressionEvaluator`, while `inferOutputShapeAndContiguousStrides` just
assumes the output to be contiguous.
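
A standalone illustration of the difference (hypothetical sizes; not the PR's code): evaluating a slice on the meta device surfaces the non-contiguous strides that the contiguous-output assumption would miss.

#include <ATen/ATen.h>

void illustrate_noncontiguous_segment_output() {
  at::Tensor in = at::empty_strided(
      {4, 8}, {8, 1}, at::TensorOptions().device(at::kMeta).dtype(at::kFloat));
  // Slicing half of the innermost dimension keeps the original strides.
  at::Tensor out = in.slice(/*dim=*/1, /*start=*/0, /*end=*/4);
  // out.sizes() == {4, 4}, out.strides() == {8, 1}: non-contiguous.
  // Assuming contiguity would instead report strides {4, 1}.
}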

Because `FusionKernelRuntime::inferOutputMetaTensor` will run the
segment on the meta device, the related ops' `MyOp::evaluate` must work
for the meta device. There is good news and bad news for this design. The
good news is that most `MyOp::evaluate` implementations just call `at::` ops,
which usually already support the meta device, and [PyTorch designed the meta
device to try to make its behavior on par with
CUDA](https://docs.pytorch.org/docs/stable/meta.html). The bad news is that,
because many ops' meta-device implementations are in Python, running
`at::op` for these kinds of ops would hang due to the inability to grab
Python's GIL (thanks @naoyam for help debugging!). If this is the case,
the corresponding `MyOp::evaluate` must manually compute the shape and
stride and use `at::empty_strided(device=meta)` to create the result.

Besides `FusionKernelRuntime::inferOutputMetaTensor`, this PR also adds
`FusionKernelRuntime::updateContiguityOfSegmentOutputs`, which updates
the segment output `TensorView`s' contiguity based on the inferred shape
and stride.

This PR adds an enable option "infer-contiguity" to incrementally enable
this feature. When "infer-contiguity" is disabled,
`FusionKernelRuntime::inferOutputMetaTensor` falls back to the
behavior of `inferOutputShapeAndContiguousStrides`, and
`FusionKernelRuntime::updateContiguityOfSegmentOutputs` is a no-op.
The plan is to merge this PR without enabling "infer-contiguity" for the
currently failing tests; I will write new PRs fixing the failing tests one
by one.

---------

Co-authored-by: Jingyue Wu <[email protected]>
@zasdfgbnm (Collaborator Author)

Superseded by #5772

@zasdfgbnm zasdfgbnm closed this Jan 26, 2026