Description
Describe the bug
During AOTAutograd backward compilation, DeepSpeed’s reduce_grad op returns an undefined tensor, but the graph rewrite pass rewires all downstream gradient usages to this output.
As a result, Inductor/FakeTensor sees None as input to ops like aten.sum or reshape, causing compilation failure.
Error
torch._inductor.exc.InductorError: RuntimeError:
Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'
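For context, the same class of failure can be reproduced outside the compiler by feeding an undefined at::Tensor into an aten op. The snippet below is a minimal standalone sketch (not taken from DeepSpeed), and the exact error message may vary across PyTorch versions.

```cpp
// Minimal standalone repro sketch: consuming an undefined (default-constructed)
// at::Tensor in an aten op throws, which is the same failure mode FakeTensor
// hits when reduce_grad's output is undefined. Not DeepSpeed code.
#include <torch/torch.h>
#include <iostream>

int main() {
    at::Tensor undefined;  // default-constructed -> undefined tensor ("None" on the Python side)
    std::cout << "defined? " << undefined.defined() << std::endl;  // prints 0
    try {
        at::sum(undefined);  // downstream op consuming the undefined tensor
    } catch (const c10::Error& e) {
        // Exact wording differs by PyTorch version, but it is the same
        // "undefined tensor / None" complaint seen in the Inductor error above.
        std::cout << e.what() << std::endl;
    }
    return 0;
}
```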
Trigger path
- Backward graph: each parameter-grad node is rewritten to torch.ops.dc.reduce_grad.default(grad)
- All uses of the original grad are replaced by the output of this op
- The FX trace shows downstream ops (e.g., aten.sum(..., [0, 1]), reshape) consuming the output of reduce_grad.
- The C++ implementation returns at::Tensor() (an undefined tensor) in both (see the sketch below):
- reduce_grad()
- reduce_grad_meta()
This breaks FakeTensor propagation and Inductor lowering.
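For reference, the pattern described above looks roughly like the following. This is an illustrative sketch only; the argument list is an assumption and is not copied from DeepSpeed/csrc/compile/deepcompile.cpp.

```cpp
// Illustrative sketch of the current behavior (assumed signatures, not the
// actual deepcompile.cpp code): both the real kernel and the meta kernel
// return a default-constructed (undefined) at::Tensor.
#include <torch/extension.h>

// Hypothetical stand-in for the real kernel: the gradient reduction runs as a
// side effect, and the returned tensor is undefined.
at::Tensor reduce_grad(at::Tensor grad_tensor, long graph_id, long param_id) {
    // ... reduction / bucketing work happens here as a side effect ...
    return at::Tensor();  // undefined -> shows up as None to downstream ops
}

// Meta kernel used for FakeTensor/shape propagation during compilation.
at::Tensor reduce_grad_meta(at::Tensor grad_tensor, long graph_id, long param_id) {
    return at::Tensor();  // also undefined, so FakeTensor propagation breaks
}
```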
Root Cause
reduce_grad is treated as a functional node in the graph, but its C++ kernel and meta kernel return an undefined tensor, which cannot be consumed by downstream ops.
Since the compiler rewrites all gradient uses to this output, the output must be a valid Tensor.
Question for maintainers
In DeepSpeed/csrc/compile/deepcompile.cpp, both reduce_grad(...) and reduce_grad_meta(...) currently return an undefined tensor (at::Tensor()).
Given that the graph rewrite redirects all downstream gradient uses to the output of this op, should these two functions instead return the input grad_tensor?
This would allow downstream ops (e.g., aten.sum, reshape) to receive a valid tensor and avoid FakeTensor/Inductor errors during compilation. Is returning grad_tensor the correct fix here, or are the intended semantics different?
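For clarity, a sketch of the change the question is asking about is shown below, using the same assumed signatures as the earlier sketch. Whether the meta kernel should alias its input or instead return something like at::empty_like(grad_tensor) (to avoid an aliasing output) is part of what needs maintainer confirmation.

```cpp
// Sketch of the proposed change (assumed signatures; not a patch against the
// actual deepcompile.cpp): return the input gradient instead of an undefined
// tensor so downstream ops receive a valid at::Tensor.
#include <torch/extension.h>

at::Tensor reduce_grad(at::Tensor grad_tensor, long graph_id, long param_id) {
    // ... reduction work unchanged, still runs as a side effect ...
    return grad_tensor;  // pass the gradient through instead of at::Tensor()
}

at::Tensor reduce_grad_meta(at::Tensor grad_tensor, long graph_id, long param_id) {
    // The meta kernel must produce matching metadata so FakeTensor propagation
    // and Inductor lowering see a valid shape/dtype. Returning the input aliases
    // it; at::empty_like(grad_tensor) would avoid that if aliasing is undesired.
    return grad_tensor;
}
```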