Describe the bug
The reproducer:
nspills.py
Without generate_native_code=True:
nspills = 0
With generate_native_code=True:
nspills = 2304
Inductor uses the reported spill count (nspills) when deciding whether two Triton kernels can be fused. An incorrect value can therefore lead to wrong fusion decisions and hurt performance.
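To illustrate why the value matters, here is a minimal sketch of a spill-based fusion gate. It is not Inductor's actual code: `KernelStats`, `fusion_allowed`, and the threshold of 16 are hypothetical names and values chosen for illustration. Only the two spill counts come from the runs above.

```python
from dataclasses import dataclass

# Hypothetical threshold, assumed for illustration only.
SPILL_THRESHOLD = 16


@dataclass
class KernelStats:
    """Hypothetical container for what the compiler reports about a kernel."""
    name: str
    n_spills: int


def fusion_allowed(fused: KernelStats, threshold: int = SPILL_THRESHOLD) -> bool:
    """Reject a fused kernel whose reported register spills exceed the threshold."""
    return fused.n_spills <= threshold


# Spill counts reported by the two runs of the reproducer.
without_native = KernelStats("fused_kernel", n_spills=0)
with_native = KernelStats("fused_kernel", n_spills=2304)

print(fusion_allowed(without_native))
print(fusion_allowed(with_native))
```

Under these assumptions, the same kernel would be accepted for fusion in one configuration and rejected in the other, purely because of the reported spill count.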
Environment details
PyTorch: pytorch/pytorch#167972 (enable generate_native_code in Inductor) or use this branch: https://github.com/pytorch/pytorch/tree/gh/etaf/173/head
Triton: release/3.6.x
GPU: PVC