
[BUG][Deepcompile] reduce_grad returns undefined tensor -> Inductor compilation fails (expected a proper tensor but got None) #7682

@zuoyanzhang


Describe the bug
During AOTAutograd backward compilation, DeepSpeed’s reduce_grad op returns an undefined tensor, but the graph rewrite pass rewires all downstream gradient usages to this output.
As a result, Inductor/FakeTensor sees None as input to ops like aten.sum or reshape, causing compilation failure.

Error

torch._inductor.exc.InductorError: RuntimeError:
Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'

Trigger path

  1. Backward graph: each parameter-grad node is rewritten to torch.ops.dc.reduce_grad.default(grad).
  2. All uses of the original grad are replaced by the output of this op.
  3. The FX trace shows downstream ops (e.g., aten.sum(...,[0,1]), reshape) consuming the output of reduce_grad.
  4. The C++ implementation returns at::Tensor() (undefined) in both:
     • reduce_grad()
     • reduce_grad_meta()
     The undefined output breaks FakeTensor propagation and Inductor lowering.
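
To make the trigger path concrete, here is a toy model of it in plain Python. It is only a sketch: lists stand in for tensors, None stands in for an undefined at::Tensor(), and the names reduce_grad_undefined, downstream_sum, and run_rewritten_backward are hypothetical, not DeepSpeed or PyTorch APIs.

```python
# Toy model only: plain Python lists stand in for tensors, and None stands in
# for an undefined at::Tensor(). None of these names are DeepSpeed APIs.


def reduce_grad_undefined(grad):
    """Mimics the reported kernels: does its work but returns nothing usable."""
    return None


def downstream_sum(grad):
    """Stands in for a downstream consumer such as aten.sum."""
    if grad is None:
        raise RuntimeError("Expected a proper Tensor but got None")
    return sum(grad)


def run_rewritten_backward(grad, reduce_op):
    """Models steps 1-3: wrap the grad in reduce_op, then feed its output on."""
    reduced = reduce_op(grad)       # step 1: grad node rewritten to reduce_op(grad)
    return downstream_sum(reduced)  # steps 2-3: every use now reads the op's output


try:
    run_rewritten_backward([1.0, 2.0, 3.0], reduce_grad_undefined)
    failure = None
except RuntimeError as exc:
    failure = str(exc)

print(failure)
```

The consumer fails only because every use of the gradient was redirected to the op's output. If the uses had stayed on the original grad, the undefined return value would never have been read.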

Root Cause
reduce_grad is treated as a functional node in the graph, but its C++ kernel and meta kernel both return an undefined tensor, which cannot be consumed by downstream ops.

Since the compiler rewrites all gradient uses to this output, the output must be a valid Tensor.

Question for maintainers
In DeepSpeed/csrc/compile/deepcompile.cpp, both reduce_grad(...) and reduce_grad_meta(...) currently return an undefined tensor (at::Tensor()).
Given that the graph rewrite redirects all downstream gradient uses to the output of this op, should these two functions instead return the input grad_tensor?

This would allow downstream ops (e.g., aten.sum, reshape) to receive a valid tensor and avoid FakeTensor/Inductor errors during compilation. Is returning grad_tensor the correct fix here, or is the intended semantics different?
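
For reference, the proposed pass-through behavior looks like this in a toy Python model. It is a sketch under the same assumptions as the bug description, not the actual kernel: lists stand in for tensors, reduction_log is a hypothetical stand-in for whatever side effect the real op performs, and whether pass-through matches the intended semantics is exactly the open question.

```python
# Toy model only: lists stand in for tensors. reduce_grad_passthrough and
# reduction_log are hypothetical names, not DeepSpeed APIs.


def reduce_grad_passthrough(grad, reduction_log):
    """Performs the side effect, then returns the input so consumers get a value."""
    reduction_log.append(list(grad))  # stand-in for scheduling the reduction
    return grad


def downstream_sum(grad):
    """Stands in for a downstream consumer such as aten.sum."""
    if grad is None:
        raise RuntimeError("Expected a proper Tensor but got None")
    return sum(grad)


reduction_log = []
total = downstream_sum(reduce_grad_passthrough([1.0, 2.0, 3.0], reduction_log))
print(total, reduction_log)
```

In this model the downstream consumer receives a valid value and the side effect still happens. In the real op, the meta kernel would presumably also need to return something with the same shape and dtype as grad_tensor so that FakeTensor propagation sees consistent metadata.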

Labels: bug, training
