
Conversation

@zasdfgbnm (Collaborator) commented Aug 27, 2025

Fixes #4888

In FusionExecutorCache, we were assuming that fusion segments always generate contiguous tensors. This is not true for ExpressionEvaluator segments: for such segments, we need to actually execute the segment to know the shape and stride we will get. This execution happens on the meta device type instead of using a real CUDA tensor. According to https://docs.pytorch.org/docs/stable/meta.html:

In some cases, not all device types (e.g., CPU and CUDA) have exactly the same output metadata for an operation; we typically prefer representing the CUDA behavior faithfully in this situation.

This execution should lead to the same shape and stride as we will see when using real tensors.
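As a standalone illustration (not code from this PR), the idea can be sketched with plain ATen: build meta-device tensors that mirror the real inputs' sizes and strides, run the op, and read the output's metadata without allocating device memory or computing any values. The sizes and strides below are hypothetical.

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Meta tensors carry only metadata; no storage is allocated.
  at::Tensor a = at::empty_strided(
      {4, 8}, {8, 1}, at::TensorOptions().device(at::kMeta).dtype(at::kFloat));
  at::Tensor b = at::empty_strided(
      {8, 2}, {1, 8}, at::TensorOptions().device(at::kMeta).dtype(at::kFloat));
  // Running the op on meta produces the shape/strides the CUDA kernel would.
  at::Tensor out = at::matmul(a, b);
  std::cout << out.sizes() << " " << out.strides() << "\n";
  return 0;
}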

@github-actions bot commented Aug 27, 2025

Review updated until commit d558cca

Description

  • Use meta tensors and ExpressionEvaluator to compute output shapes/strides for ExprEval segments

  • Add resetAllocationDomainAndContiguity function to update tensor metadata based on PyTorch results

  • Modify MatmulOp and CutlassNvfp4GroupedMmaOp evaluate methods to handle meta device tensors

  • Update FusionKernelRuntime to distinguish between ExprEval and codegen fusion segments

  • Add test case for issue #4888 (FusionKernelRuntime::getMaybeHeuristicsFor computes the wrong strides) to validate the fix

Changes walkthrough

Relevant files

Bug fix (1 file)
  internal_nodes.cpp: Update MatmulOp and CutlassNvfp4GroupedMmaOp evaluate methods for meta tensors (+40/-11)

Enhancement (3 files)
  fusion_cache_utils.cpp: Add resetAllocationDomainAndContiguity function and update argument manager (+73/-1)
  fusion_kernel_runtime.cpp: Modify fusion runtime to handle ExprEval segments with meta tensors (+87/-13)
  fusion_cache_utils.h: Update function signature for updateWithSegmentOutputs (+2/-1)

Tests (5 files)
  test_allocation_domain.cpp: Simplify test expectation for allocation domain test (+1/-4)
  test_matmul_aten_evaluation.cpp: Remove MatmulNodeTest output strides test case (+0/-33)
  test_segmentation.cpp: Update test expectation for upcast ops count (+4/-1)
  opinfo_input_generators.py: Update error message regex pattern for topk generator (+1/-1)
  test_python_frontend.py: Add test case for issue #4888 (+99/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Complex contiguity logic

The new resetAllocationDomainAndContiguity function contains complex logic for computing contiguity from tensor strides. This logic handles reductions, broadcasts, and multi-dimensional tensors. The reviewer should verify that this algorithm correctly handles all edge cases and matches PyTorch's contiguity semantics.

void resetAllocationDomainAndContiguity(
    TensorView* tv,
    const at::Tensor& tensor) {
  if (!tensor.defined()) {
    return;
  }
  const auto [sizes, strides] =
      inferAllocationSizesAndStrides(tensor, tv, ExpressionEvaluator());

  const auto& alloc = tv->getMaybeAllocationDomain();

  // Custom contiguity inference that considers IterDomain information
  std::vector<std::optional<bool>> contiguity(alloc.size(), std::nullopt);

  // Single pass from right to left with two dynamic indices:
  // - alloc_idx: iterates through allocation domain
  // - sizes_idx: tracks position in sizes/strides (excludes reductions)
  int64_t sizes_idx = (int64_t)sizes.size() - 1;
  int64_t prev_non_skipped_sizes_idx = -1;

  for (int64_t alloc_idx = (int64_t)alloc.size() - 1; alloc_idx >= 0; --alloc_idx) {
    auto id = alloc[alloc_idx];

    // Reduction dimensions: nullopt contiguity (already set), no entry in sizes/strides
    if (id->isReduction()) {
      // Don't decrement sizes_idx since reductions have no entry
      continue;
    }

    // This dimension has an entry in sizes/strides
    NVF_CHECK(sizes_idx >= 0, "Sizes index out of bounds");

    // Broadcast dimensions: nullopt contiguity (already set), but has entry in sizes/strides
    if (id->isBroadcast()) {
      sizes_idx--;  // Move to next dimension in sizes/strides
      continue;
    }

    // Non-broadcast, non-reduction dimension
    if (prev_non_skipped_sizes_idx == -1) {
      // This is the rightmost (innermost) non-skipped dimension
      // It's contiguous if stride == 1
      contiguity[alloc_idx] = (strides[sizes_idx] == 1);
    } else {
      // A dimension is contiguous if its stride equals the stride of the
      // next dimension multiplied by that dimension's size
      contiguity[alloc_idx] = (strides[sizes_idx] == 
                               strides[prev_non_skipped_sizes_idx] * sizes[prev_non_skipped_sizes_idx]);
    }

    prev_non_skipped_sizes_idx = sizes_idx;
    sizes_idx--;  // Move to next dimension in sizes/strides
  }

  NVF_CHECK(sizes_idx == -1, "Not all sizes/strides were consumed");

  tv->setContiguity(contiguity);
}
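For intuition, a worked trace of the algorithm above on hypothetical inputs with no broadcast or reduction dimensions:

// sizes = {8, 4}, strides = {4, 1} (plain contiguous tensor):
//   alloc_idx = 1: innermost dim, strides[1] == 1              -> contiguity[1] = true
//   alloc_idx = 0: strides[0] == strides[1] * sizes[1] == 4    -> contiguity[0] = true
// sizes = {8, 4}, strides = {1, 8} (transposed view):
//   alloc_idx = 1: strides[1] == 8 != 1                        -> contiguity[1] = false
//   alloc_idx = 0: strides[0] == 1 != strides[1] * sizes[1]    -> contiguity[0] = false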
Meta tensor handling

The code creates meta tensors using at::empty_strided with original tensor sizes and strides, then binds them to the ExpressionEvaluator. This approach assumes that meta tensors will have the same metadata as real tensors, which should be verified for all supported operations.

// For expr evaluated fusion, the striding rules follow that of ATen.
ExpressionEvaluator eval_fusion;
for (auto [i, v] : enumerate(group_to_run->inputs())) {
  auto tensor_pv = args_manager.checkTensorMap(v);
  if (tensor_pv.is<at::Tensor>()) {
    const auto t = tensor_pv.as<at::Tensor>();
    if (t.defined()) {
      const auto meta_t = at::empty_strided(
          t.sizes(),
          t.strides(),
          at::TensorOptions().device(at::kMeta).dtype(t.dtype()));
      eval_fusion.bind(fusion_to_run->inputs()[i], meta_t);
    } else {
      eval_fusion.bind(fusion_to_run->inputs()[i], t);
    }
  } else {
    eval_fusion.bind(fusion_to_run->inputs()[i], tensor_pv);
  }
}
for (auto v : fusion_to_run->outputs()) {
  auto result = eval_fusion.evaluate(v);
  group_runtime_outputs.push(result);
}
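One way to verify that assumption is an op-by-op parity check of CUDA versus meta output metadata. A test sketch along those lines (not part of the PR; the choice of matmul and the transposed input is arbitrary):

#include <ATen/ATen.h>

void check_meta_matches_cuda() {
  // Deliberately non-contiguous CUDA input: sizes {8, 4}, strides {1, 4}.
  at::Tensor cuda_in = at::randn({4, 8}, at::kCUDA).t();
  at::Tensor meta_in = at::empty_strided(
      cuda_in.sizes(),
      cuda_in.strides(),
      at::TensorOptions().device(at::kMeta).dtype(cuda_in.dtype()));
  at::Tensor cuda_out = at::matmul(cuda_in, cuda_in.t());
  at::Tensor meta_out = at::matmul(meta_in, meta_in.t());
  // The meta result should carry exactly the CUDA result's metadata.
  TORCH_CHECK(cuda_out.sizes() == meta_out.sizes());
  TORCH_CHECK(cuda_out.strides() == meta_out.strides());
}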
Simplified MatmulOp evaluation

The MatmulOp evaluation was significantly simplified by removing manual contiguity checks and stride handling. The reviewer should verify that this simplification doesn't break existing functionality and that the ExpressionEvaluator will handle contiguity correctly.

  }

  if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
      rfactor_did_idx != -1) {
    matmul_out = matmul_out.unsqueeze(rfactor_did_idx);
  }
  return {matmul_out};
}

Test failures

  • (Medium, 12) NVFuser evaluator assertion failures in stream/multidevice matmul & linear tests (A100, GB200, H100, including distributed runners):
    - tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=eager]
    - tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=lru_cache]
    - tests.python.multidevice.test_overlap.test_row_parallel_linear_forward

  • (Medium, 6) Profiler event count mismatch in test_stream.test_matmul (nvFuser stream scheduling; A100, GB200, H100):
    - tests.python.direct.test_stream.test_matmul[nvfuser_direct_test=eager]
    - tests.python.direct.test_stream.test_matmul[nvfuser_direct_test=lru_cache]

  • (Medium, 1) CUDA out-of-memory in nvFuser TmaPointwiseTestF.SplitGridDim2D (runner: H100):
    - TmaPointwiseTestF.SplitGridDim2D

zasdfgbnm added a commit that referenced this pull request Sep 2, 2025
In preparation for the fix of #4888,
I am working on #5082, which
requires the use of `Expr::evaluate` on meta tensors to infer the shapes and
strides of fusion segments selected to be scheduled by the `ExprEval`
scheduler. As a consequence of this change, all `Expr::evaluate`
functions must support the meta device, and the returned output tensors'
shapes and strides must match those on device type CUDA.

According to https://docs.pytorch.org/docs/stable/meta.html
> In some cases, not all device types (e.g., CPU and CUDA) have exactly
the same output metadata for an operation; we typically prefer
representing the CUDA behavior faithfully in this situation.

It is generally safe to assume that we can use the meta device type to infer
the shapes and strides of device type CUDA. Unfortunately, not all
operators implement the meta device, and
`at::_scaled_dot_product_flash_attention` is one such example.

In this PR, I am adding my own `at::_scaled_dot_product_flash_attention`
implementation for the meta device.
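
A minimal sketch of the kind of meta implementation described (the output shape follows the SDPA contract of matching the query; the helper name and the contiguous-stride choice are assumptions, not the PR's actual code):

#include <ATen/ATen.h>

// Flash attention's primary output has the same [N, H, L, E] shape as the
// query. On the meta device only the metadata matters, so it can be built
// directly instead of dispatching to a (missing) meta kernel.
at::Tensor sdpa_flash_meta_output(const at::Tensor& query) {
  TORCH_CHECK(query.device().is_meta(), "expected a meta tensor");
  // Assumption: contiguous output strides. A faithful implementation must
  // mirror whatever layout the real CUDA kernel produces.
  return at::empty(query.sizes(), query.options());
}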
@greptile-apps bot (Contributor) left a comment

13 files reviewed, 1 comment

Comment on lines +370 to +393
if (is_expr_eval) {
  // For expr evaluated fusion, the striding rules follow that of ATen.
  ExpressionEvaluator eval_fusion;
  for (auto [i, v] : enumerate(group_to_run->inputs())) {
    auto tensor_pv = args_manager.checkTensorMap(v);
    if (tensor_pv.is<at::Tensor>()) {
      const auto t = tensor_pv.as<at::Tensor>();
      if (t.defined()) {
        const auto meta_t = at::empty_strided(
            t.sizes(),
            t.strides(),
            at::TensorOptions().device(at::kMeta).dtype(t.dtype()));
        eval_fusion.bind(fusion_to_run->inputs()[i], meta_t);
      } else {
        eval_fusion.bind(fusion_to_run->inputs()[i], t);
      }
    } else {
      eval_fusion.bind(fusion_to_run->inputs()[i], tensor_pv);
    }
  }
  for (auto v : fusion_to_run->outputs()) {
    auto result = eval_fusion.evaluate(v);
    group_runtime_outputs.push(result);
  }

style: duplicated meta tensor evaluation logic between prepareInputs (lines 370-393) and getMaybeHeuristicsFor (lines 640-663). extract to helper like evaluateExprEvalSegmentOutputs(fusion_to_run, group_inputs, args_manager) to reduce duplication
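
A sketch of the suggested extraction, with the body lifted from the quoted code; the helper name, return type, and parameter types are assumptions:

// Hypothetical helper consolidating the duplicated binding/evaluation logic
// from prepareInputs and getMaybeHeuristicsFor.
KernelArgumentHolder evaluateExprEvalSegmentOutputs(
    Fusion* fusion_to_run,
    const std::vector<Val*>& group_inputs,
    ArgumentManager& args_manager) {
  ExpressionEvaluator eval_fusion;
  for (auto [i, v] : enumerate(group_inputs)) {
    auto tensor_pv = args_manager.checkTensorMap(v);
    if (tensor_pv.is<at::Tensor>()) {
      const auto t = tensor_pv.as<at::Tensor>();
      if (t.defined()) {
        // Rebind the input as a meta tensor with identical metadata.
        const auto meta_t = at::empty_strided(
            t.sizes(),
            t.strides(),
            at::TensorOptions().device(at::kMeta).dtype(t.dtype()));
        eval_fusion.bind(fusion_to_run->inputs()[i], meta_t);
      } else {
        eval_fusion.bind(fusion_to_run->inputs()[i], t);
      }
    } else {
      eval_fusion.bind(fusion_to_run->inputs()[i], tensor_pv);
    }
  }
  KernelArgumentHolder outputs;
  for (auto v : fusion_to_run->outputs()) {
    outputs.push(eval_fusion.evaluate(v));
  }
  return outputs;
}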

@greptile-apps bot (Contributor) left a comment

13 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

11 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, no comments

@zasdfgbnm (Collaborator Author)

!test

@greptile-apps bot (Contributor) left a comment

8 files reviewed, 1 comment

} else {
  // TODO: inferOutputSizes doesn't seem to strictly require a Fusion for
  // each segment. Consider using the complete fusion instead.
  auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;

style: variable shadowing: fusion_to_run declared on line 365 as Fusion*, then redeclared here as std::unique_ptr<Fusion>. while technically valid (different scopes), this is confusing

Suggested change
- auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;
+ auto fusion_unique_ptr = segmented_fusion_->makeFusion(group_to_run).second;
+ group_runtime_outputs =
+     inferOutputSizes(fusion_unique_ptr.get(), group_runtime_inputs);


@greptile-apps bot (Contributor) left a comment

9 files reviewed, 2 comments

Comment on lines +370 to +393
(The same snippet from lines +370 to +393 is quoted again here.)

style: duplicated between prepareInputs and getMaybeHeuristicsFor. extract to helper like evaluateExprEvalSegmentOnMeta(fusion, inputs, args_manager) to eliminate duplication


- auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;
- auto group_runtime_outputs =
-     inferOutputSizes(fusion_to_run.get(), group_runtime_inputs);
+ Fusion* fusion_to_run = group_to_run->getFusion();

style: variable fusion_to_run shadows line 397 where it's redeclared as std::unique_ptr<Fusion>. rename one (e.g., this to fusion_ptr or line 397 to fusion_to_run_unique) to avoid confusion

@greptile-apps bot (Contributor) left a comment

9 files reviewed, 1 comment

} else {
  // TODO: inferOutputSizes doesn't seem to strictly require a Fusion for
  // each segment. Consider using the complete fusion instead.
  auto fusion_to_run = segmented_fusion_->makeFusion(group_to_run).second;

style: fusion_to_run redeclared here as std::unique_ptr<Fusion> shadows the Fusion* on line 365. rename to avoid confusion (e.g., fusion_to_run_ptr)


@zasdfgbnm (Collaborator Author)

!test

@zasdfgbnm (Collaborator Author)

!test

zasdfgbnm added a commit that referenced this pull request Jan 23, 2026

Fixes #4888

Stacked on #5766

I used to work on #5082 for the fix,
but I hit too many blockers, because this PR could interact with many
new assumptions/hacks/unfinalized designs on things like allocation
domain, stream-sharded tensor, multidevice, etc., and we keep having new
things committed to the main branch that break #5082. This situation
delayed the PR for a very long time. So I recreated this PR that is more
friendly to incremental development.

Today, on the main branch, `FusionExecutorCache` assumes that
fusion segments always generate contiguous tensors. This is not true for
`ExpressionEvaluator` segments. For example, ATen's slice op returns
non-contiguous tensors. It is worth mentioning that, because
segmentation and scheduler selection depend on inputs, the contiguity of
intermediate results also depends on inputs.

This PR adds `FusionKernelRuntime::inferOutputMetaTensor`, which
replaces `inferOutputShapeAndContiguousStrides` to infer the output
shape and stride of each segment. Both
`FusionKernelRuntime::inferOutputMetaTensor` and
`inferOutputShapeAndContiguousStrides` store their result as a tensor on
the meta device. The difference is that
`FusionKernelRuntime::inferOutputMetaTensor` will actually run the
segment on the meta device if the segment is scheduled to run by
`ExpressionEvaluator`, while `inferOutputShapeAndContiguousStrides` just
assumes the output to be contiguous.
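
A standalone illustration of the difference (hypothetical sizes; not the PR's code): evaluating a slice on the meta device surfaces the non-contiguous strides that the contiguous-output assumption would miss.

#include <ATen/ATen.h>

void illustrate_noncontiguous_segment_output() {
  at::Tensor in = at::empty_strided(
      {4, 8}, {8, 1}, at::TensorOptions().device(at::kMeta).dtype(at::kFloat));
  // Slicing half of the innermost dimension keeps the original strides.
  at::Tensor out = in.slice(/*dim=*/1, /*start=*/0, /*end=*/4);
  // out.sizes() == {4, 4}, out.strides() == {8, 1}: non-contiguous.
  // Assuming contiguity would instead report strides {4, 1}.
}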

Because `FusionKernelRuntime::inferOutputMetaTensor` will run the
segment on the meta device, the related ops' `MyOp::evaluate` must work
for the meta device. There is good news and bad news for this design. The
good news is that most `MyOp::evaluate` implementations just call `at::` ops,
which usually already support the meta device, and [PyTorch designed the meta
device to try to make its behavior on par with
CUDA](https://docs.pytorch.org/docs/stable/meta.html). The bad news is that,
because many ops' meta-device implementations are in Python, running
`at::op` for these kinds of ops would hang due to the inability to grab
Python's GIL (thanks @naoyam for help debugging!). If this is the case,
the corresponding `MyOp::evaluate` must manually compute the shape and
stride and use `at::empty_strided(device=meta)` to create the result.

Besides `FusionKernelRuntime::inferOutputMetaTensor`, this PR also adds
`FusionKernelRuntime::updateContiguityOfSegmentOutputs`, which updates
the segment output `TensorView`s' contiguity based on the inferred shape
and stride.

This PR adds an enable option "infer-contiguity" to incrementally enable
this feature. When "infer-contiguity" is disabled,
`FusionKernelRuntime::inferOutputMetaTensor` falls back to the
behavior of `inferOutputShapeAndContiguousStrides`, and
`FusionKernelRuntime::updateContiguityOfSegmentOutputs` is a no-op.
The plan is to merge this PR without enabling "infer-contiguity" for the
currently failing tests; I will write new PRs fixing the failing tests one
by one.

---------

Co-authored-by: Jingyue Wu <[email protected]>
@zasdfgbnm (Collaborator Author)

Superseded by #5772

@zasdfgbnm zasdfgbnm closed this Jan 26, 2026