Conversation

@kdrozd-dev
Contributor

Added -fno-reciprocal-math to SYCL_KERNEL_OPTIONS to eliminate the issue where dividing a number by itself returns a non-1 result, which caused #1895.

Copilot AI review requested due to automatic review settings November 26, 2025 08:28
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a division accuracy issue in SYCL kernels where dividing a number by itself could return non-1 results. The fix adds the -fno-reciprocal-math compiler flag to disable reciprocal math optimizations that can introduce numerical inaccuracies.

Key Changes:

  • Added -fno-reciprocal-math to SYCL_KERNEL_OPTIONS to prevent compiler optimizations that replace division operations with multiplication by reciprocals
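
To make the effect concrete, here is a small standalone C++ sketch (an editorial illustration, not code from this PR or from the SYCL compiler) of the rewrite that reciprocal-math permits: a / b becomes a * (1.0 / b), and the extra rounding step means x / x is no longer guaranteed to be exactly 1.0.

```cpp
#include <cstdio>

// What a compiler is allowed to emit under reciprocal-math: divide by
// computing a rounded reciprocal first, then multiplying.
double divide_via_reciprocal(double a, double b) {
    double r = 1.0 / b;   // rounded once
    return a * r;         // rounded again; a == b may not give exactly 1.0
}

int main() {
    double x = 49.0;  // one value where the double rounding is visible
    std::printf("exact division: %.17g\n", x / x);
    std::printf("via reciprocal: %.17g\n", divide_via_reciprocal(x, x));
    // Typical output:
    //   exact division: 1
    //   via reciprocal: 0.99999999999999989
    return 0;
}
```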


@riverliuintel
Contributor

@CuiYifeng

@kdrozd-dev
Contributor Author

Currently evaluating a more localized approach using #pragma clang fp reciprocal(off) to minimize the performance impact. Do not merge for now.

```cmake
# gcc -shared host.o kernel.o device-code.o -o libxxx.so
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -fno-sycl-unnamed-lambda)
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -sycl-std=2020)
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -fno-reciprocal-math)
```
Contributor

Let's study the NVCC behavior first. I think reciprocal-math is a general optimization, and most compilers enable it by default.

Contributor Author

@kdrozd-dev Dec 1, 2025

@EikanWang I believe --prec-div (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=reciprocal#prec-div-true-false-prec-div) is the corresponding NVCC flag; it defaults to true unless --use_fast_math (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=reciprocal#use-fast-math-use-fast-math) is set. I do not believe PyTorch is built with --use_fast_math; correct me if I'm wrong.

Contributor

Thanks for the information. Some HW platforms, like CPU, turn on this option by default. For CUDA, the default is kept, meaning --use_fast_math is off. So it makes sense to add -fno-reciprocal-math. I'm curious why SYCL turns on this option by default.

@fengyuan14 do you know the background?

```cmake
# The fast-math will be enabled by default in SYCL compiler.
```

Contributor

Per my understanding, we should disable fast-math to align with CUDA.

@kdrozd-dev
Contributor Author

kdrozd-dev commented Dec 1, 2025

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags), the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth sacrificing some performance for accuracy in this oddly specific use case (calling div first and then trunc instead of div_trunc), for which we have no known business need outside the failing UT. Both the div and div_trunc results are within the needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?
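
For reference, a minimal sketch of that most localized option. The functor name and code shape are illustrative only and do not reproduce the actual torch-xpu-ops kernel; the point is that the Clang pragma scopes the change to one compound statement, so the rest of the build keeps the default SYCL_KERNEL_OPTIONS.

```cpp
// Hypothetical shape of a division functor in BinaryDivTrueKernel.cpp;
// only the placement of the pragma is the point of this sketch.
template <typename scalar_t>
struct DivTrueFunctor {
  scalar_t operator()(scalar_t a, scalar_t b) const {
    // Disable the a / b -> a * (1 / b) rewrite for this block only,
    // instead of passing -fno-reciprocal-math to every kernel.
    #pragma clang fp reciprocal(off)
    return a / b;
  }
};
```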

@EikanWang
Contributor

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags), the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth sacrificing some performance for accuracy in this oddly specific use case (calling div first and then trunc instead of div_trunc), for which we have no known business need outside the failing UT. Both the div and div_trunc results are within the needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?

@SanityRemnants, although performance is king, I'd prefer to keep a high bar for maintainability when we have no concrete performance impact.

  • @SanityRemnants, since it is close to the PyTorch 2.10 branch cut date, I'd like to defer landing this PR.
  • @weishi-deng, @chuanqi129, could you help collect the performance data with this PR? My expectation is that this PR only impacts a few micro-benchmarks and there should be no impact on E2E models. Priority-wise, this collection is lower than PT 2.10.

@NeoZhangJianyu

NeoZhangJianyu commented Dec 4, 2025

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags), the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth sacrificing some performance for accuracy in this oddly specific use case (calling div first and then trunc instead of div_trunc), for which we have no known business need outside the failing UT. Both the div and div_trunc results are within the needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?

@SanityRemnants, although performance is king, I'd prefer to keep a high bar for maintainability when we have no concrete performance impact.

  • @SanityRemnants, since it is close to the PyTorch 2.10 branch cut date, I'd like to defer landing this PR.
  • @weishi-deng, @chuanqi129, could you help collect the performance data with this PR? My expectation is that this PR only impacts a few micro-benchmarks and there should be no impact on E2E models. Priority-wise, this collection is lower than PT 2.10.

There is no such issue on CUDA.
IPEX has added some similar parameters to fix the accuracy issue.
It looks like they were not upstreamed to PyTorch.

This parameter is in neither IPEX nor PyTorch.
It only impacts the ops that include divide().

We are waiting for this PR to fix the related issue.
Please speed up the review!

Thank you!

@weishi-deng
Contributor

weishi-deng commented Dec 5, 2025

Hi Eikan, we collected the op benchmark performance before and after this PR. For inference with fp16 dtype, across 22873 op cases, we found about 154 cases with a ratio < 0.9, 455 cases with a ratio < 0.95, and the rest achieved a ratio of 95%-100%. Some of the affected cases are element-wise operations and softmax cases. These are the lowest cases:

| op | shape | stride | main (us) | with PR (us) | ratio (main / with PR) |
| --- | --- | --- | --- | --- | --- |
| cat | [[tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4))], 1] | ([(4000, 4, 1), (4000, 4, 1), (4000, 4, 1), (4000, 4, 1), (4000, 4, 1)], []) | 363.945961 | 565.9461021 | 0.643075 |
| eq | [tensor(..., device='meta', size=(44, 4)), tensor(..., device='meta', size=(44, 4))] | ((4, 1), (4, 1)) | 22.59016037 | 34.70182419 | 0.650979 |
| ne | [tensor(..., device='meta', size=(80000, 4)), inf] | ((4, 1), []) | 21.88682556 | 32.28187561 | 0.677991 |
| ne | [tensor(..., device='meta', size=(1000, 320)), inf] | ((320, 1), []) | 21.6126442 | 29.8500061 | 0.724042 |
| mul | [tensor(..., device='meta', size=(87,)), tensor(..., device='meta', size=())] | ((1,), ()) | 21.43383026 | 29.56390381 | 0.725 |
| repeat_interleave | [tensor(..., device='meta', size=(8,), dtype=torch.int64)] | ((1,),) | 847.8164673 | 1135.122776 | 0.746894 |
| div | [tensor(..., device='meta', size=(203, 204, 26), dtype=torch.float64), tensor(..., device='meta', size=(203, 204, 1), dtype=torch.float64)] | ((5304, 26, 1), (204, 1, 1)) | 46.86117172 | 61.50007248 | 0.761969 |
| div | [tensor(..., device='meta', size=(204, 204, 25), dtype=torch.float64), tensor(..., device='meta', size=(1, 1, 25), dtype=torch.float64)] | ((5100, 25, 1), (26, 26, 1)) | 43.40410233 | 56.64825439 | 0.766204 |
| div | [tensor(..., device='meta', size=(201, 200, 26), dtype=torch.float64), tensor(..., device='meta', size=(1, 1, 26), dtype=torch.float64)] | ((5304, 26, 1), (26, 26, 1)) | 46.31280899 | 61.41662598 | 0.754076 |
| div | [tensor(..., device='meta', size=(200, 200, 25), dtype=torch.float64), tensor(..., device='meta', size=(200, 1, 1), dtype=torch.float64)] | ((5304, 26, 1), (1, 1, 1)) | 47.88637161 | 62.14380264 | 0.770574 |
| div | [tensor(..., device='meta', size=(204, 203, 26), dtype=torch.float64), tensor(..., device='meta', size=(1, 203, 1), dtype=torch.float64)] | ((5278, 26, 1), (204, 1, 1)) | 47.1830368 | 63.16900253 | 0.746933 |
| div | [tensor(..., device='meta', size=(200, 201, 26), dtype=torch.float64), tensor(..., device='meta', size=(1, 1, 26), dtype=torch.float64)] | ((5304, 26, 1), (26, 26, 1)) | 50.21095276 | 65.82736969 | 0.762767 |
| div | [tensor(..., device='meta', size=(200, 200, 25), dtype=torch.float64), tensor(..., device='meta', size=(1, 200, 1), dtype=torch.float64)] | ((5304, 26, 1), (200, 1, 1)) | 43.04647446 | 56.99396133 | 0.755281 |

@weishi-deng requested a review from EikanWang December 5, 2025 02:32
@EikanWang
Contributor

@weishi-deng, I'm curious why the option impacts the other operations besides div. Let's understand why.

@EikanWang
Contributor

EikanWang commented Dec 8, 2025

IPEX has added some similar parameters to fix the accuracy issue ... We are waiting for this PR to fix the related issue.

@NeoZhangJianyu, any E2E accuracy issues due to this option?

@EikanWang
Contributor

EikanWang commented Dec 11, 2025

IPEX has added some similar parameters to fix the accuracy issue ... We are waiting for this PR to fix the related issue.

@NeoZhangJianyu, any E2E accuracy issues due to this option?

No! It was found by a test case in the torch-xpu-ops repo; refer to internal Jira 7524. It impacts the fp64 div OPs only.

The PyTorch test cases for the related OPs pass on PVC.

@NeoZhangJianyu, please keep my comment as it is and reply to the questions by quoting my comment.

@EikanWang
Contributor

@weishi-deng, for cat, eq, ne, ne, mul, repeat_interleave - have you checked why this PR impacts these operations?

@EikanWang
Contributor

@SanityRemnants, the PR looks good to me. However, the performance of cat, eq, ne, ne, mul, repeat_interleave is suspicious. Could you please check with @weishi-deng to address my performance comments?
