Conversation

@kdrozd-dev
Contributor

Added -fno-reciprocal-math to SYCL_KERNEL_OPTIONS to eliminate the issue where dividing a number by itself returns a non-1 result, which caused #1895.

Copilot AI review requested due to automatic review settings November 26, 2025 08:28
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses a division accuracy issue in SYCL kernels where dividing a number by itself could return non-1 results. The fix adds the -fno-reciprocal-math compiler flag to disable reciprocal math optimizations that can introduce numerical inaccuracies.

Key Changes:

  • Added -fno-reciprocal-math to SYCL_KERNEL_OPTIONS to prevent compiler optimizations that replace division operations with multiplication by reciprocals
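
To make the effect concrete, here is a small standalone C++ sketch (an editorial illustration, not code from this PR or from the SYCL compiler) of the rewrite that reciprocal-math permits: a / b becomes a * (1.0 / b), and the extra rounding step means x / x is no longer guaranteed to be exactly 1.0.

```cpp
#include <cstdio>

// What a compiler is allowed to emit under reciprocal-math: divide by
// computing a rounded reciprocal first, then multiplying.
double divide_via_reciprocal(double a, double b) {
    double r = 1.0 / b;   // rounded once
    return a * r;         // rounded again; a == b may not give exactly 1.0
}

int main() {
    double x = 49.0;  // one value where the double rounding is visible
    std::printf("exact division: %.17g\n", x / x);
    std::printf("via reciprocal: %.17g\n", divide_via_reciprocal(x, x));
    // Typical output:
    //   exact division: 1
    //   via reciprocal: 0.99999999999999989
    return 0;
}
```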


@riverliuintel
Contributor

@CuiYifeng

@kdrozd-dev
Contributor Author

Currently evaluating a more localized approach using #pragma clang fp reciprocal(off) to minimize the performance impact. Do not merge for now.

```cmake
# gcc -shared host.o kernel.o device-code.o -o libxxx.so
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -fno-sycl-unnamed-lambda)
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -sycl-std=2020)
set(SYCL_KERNEL_OPTIONS ${SYCL_KERNEL_OPTIONS} -fno-reciprocal-math)
```
Contributor

Let's study the NVCC behavior first. I think reciprocal-math is a general optimization, and most compilers enable it by default.

Contributor Author

@kdrozd-dev Dec 1, 2025

@EikanWang I believe --prec-div (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=reciprocal#prec-div-true-false-prec-div) is the corresponding NVCC flag; it defaults to true unless --use_fast_math (https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html?highlight=reciprocal#use-fast-math-use-fast-math) is set. I do not believe PyTorch is built with --use_fast_math; correct me if I'm wrong.

Contributor

Thanks for the information. Some HW platforms, like CPU, turn on this option by default. For CUDA, the default is kept, meaning --use_fast_math is off. So it makes sense to add -fno-reciprocal-math. I'm curious why SYCL turns on this option by default.

@fengyuan14 do you know the background?

```cmake
# The fast-math will be enabled by default in SYCL compiler.
```

Contributor

Per my understanding, we should disable fast-math to align with CUDA.

@kdrozd-dev
Contributor Author

kdrozd-dev commented Dec 1, 2025

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags), the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth sacrificing some performance for accuracy in this oddly specific use case (calling div first and then trunc instead of div_trunc), for which we have no known business need outside the failing UT. Both the div and div_trunc results are within the needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?
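
For reference, a minimal sketch of that most localized option. The functor name and code shape are illustrative only and do not reproduce the actual torch-xpu-ops kernel; the point is that the Clang pragma scopes the change to one compound statement, so the rest of the build keeps the default SYCL_KERNEL_OPTIONS.

```cpp
// Hypothetical shape of a division functor in BinaryDivTrueKernel.cpp;
// only the placement of the pragma is the point of this sketch.
template <typename scalar_t>
struct DivTrueFunctor {
  scalar_t operator()(scalar_t a, scalar_t b) const {
    // Disable the a / b -> a * (1 / b) rewrite for this block only,
    // instead of passing -fno-reciprocal-math to every kernel.
    #pragma clang fp reciprocal(off)
    return a / b;
  }
};
```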

@EikanWang
Contributor

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags), the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth sacrificing some performance for accuracy in this oddly specific use case (calling div first and then trunc instead of div_trunc), for which we have no known business need outside the failing UT. Both the div and div_trunc results are within the needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?

@SanityRemnants, although performance is king, I'd prefer to keep a high bar for maintainability when we have no concrete performance impact.

  • @SanityRemnants, since it is close to the PyTorch 2.10 branch cut date, I'd like to defer landing this PR.
  • @weishi-deng, @chuanqi129, could you help collect the performance data with this PR? My expectation is that this PR only impacts a few micro-benchmarks and there should be no impact on E2E models. Priority-wise, this collection is lower than PT 2.10.

@NeoZhangJianyu

NeoZhangJianyu commented Dec 4, 2025

There are a few options regarding #pragma clang fp reciprocal(off) (https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags), the least invasive being adding it only to the BinaryDivTrueKernel.cpp file. However, I'm not sure whether it's worth sacrificing some performance for accuracy in this oddly specific use case (calling div first and then trunc instead of div_trunc), for which we have no known business need outside the failing UT. Both the div and div_trunc results are within the needed norms. It might be better to modify the UT. What is your opinion on that @EikanWang?

@SanityRemnants, although performance is king, I'd prefer to keep a high bar for maintainability when we have no concrete performance impact.

  • @SanityRemnants, since it is close to the PyTorch 2.10 branch cut date, I'd like to defer landing this PR.
  • @weishi-deng, @chuanqi129, could you help collect the performance data with this PR? My expectation is that this PR only impacts a few micro-benchmarks and there should be no impact on E2E models. Priority-wise, this collection is lower than PT 2.10.

There is no such issue on CUDA.
IPEX has added some similar parameters to fix the accuracy issue.
It looks like they were not upstreamed to PyTorch.

This parameter is in neither IPEX nor PyTorch.
It only impacts the ops that include divide().

We are waiting for this PR to fix the related issue.
Please speed up the review!

Thank you!

@weishi-deng
Contributor

weishi-deng commented Dec 5, 2025

Hi Eikan, we collected the op benchmark performance before and after this PR. For inference with fp16 dtype, across 22873 op cases, we found about 154 cases with a ratio < 0.9, 455 cases with a ratio < 0.95, and the rest achieved a ratio of 95%-100%. Some of the affected cases are element-wise operations and softmax cases. These are the lowest cases:

| op | shape | stride | main (us) | with PR (us) | ratio (main / with PR) |
| --- | --- | --- | --- | --- | --- |
| cat | [[tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4)), tensor(..., device='meta', size=(16, 1000, 4))], 1] | ([(4000, 4, 1), (4000, 4, 1), (4000, 4, 1), (4000, 4, 1), (4000, 4, 1)], []) | 363.945961 | 565.9461021 | 0.643075 |
| eq | [tensor(..., device='meta', size=(44, 4)), tensor(..., device='meta', size=(44, 4))] | ((4, 1), (4, 1)) | 22.59016037 | 34.70182419 | 0.650979 |
| ne | [tensor(..., device='meta', size=(80000, 4)), inf] | ((4, 1), []) | 21.88682556 | 32.28187561 | 0.677991 |
| ne | [tensor(..., device='meta', size=(1000, 320)), inf] | ((320, 1), []) | 21.6126442 | 29.8500061 | 0.724042 |
| mul | [tensor(..., device='meta', size=(87,)), tensor(..., device='meta', size=())] | ((1,), ()) | 21.43383026 | 29.56390381 | 0.725 |
| repeat_interleave | [tensor(..., device='meta', size=(8,), dtype=torch.int64)] | ((1,),) | 847.8164673 | 1135.122776 | 0.746894 |
| div | [tensor(..., device='meta', size=(203, 204, 26), dtype=torch.float64), tensor(..., device='meta', size=(203, 204, 1), dtype=torch.float64)] | ((5304, 26, 1), (204, 1, 1)) | 46.86117172 | 61.50007248 | 0.761969 |
| div | [tensor(..., device='meta', size=(204, 204, 25), dtype=torch.float64), tensor(..., device='meta', size=(1, 1, 25), dtype=torch.float64)] | ((5100, 25, 1), (26, 26, 1)) | 43.40410233 | 56.64825439 | 0.766204 |
| div | [tensor(..., device='meta', size=(201, 200, 26), dtype=torch.float64), tensor(..., device='meta', size=(1, 1, 26), dtype=torch.float64)] | ((5304, 26, 1), (26, 26, 1)) | 46.31280899 | 61.41662598 | 0.754076 |
| div | [tensor(..., device='meta', size=(200, 200, 25), dtype=torch.float64), tensor(..., device='meta', size=(200, 1, 1), dtype=torch.float64)] | ((5304, 26, 1), (1, 1, 1)) | 47.88637161 | 62.14380264 | 0.770574 |
| div | [tensor(..., device='meta', size=(204, 203, 26), dtype=torch.float64), tensor(..., device='meta', size=(1, 203, 1), dtype=torch.float64)] | ((5278, 26, 1), (204, 1, 1)) | 47.1830368 | 63.16900253 | 0.746933 |
| div | [tensor(..., device='meta', size=(200, 201, 26), dtype=torch.float64), tensor(..., device='meta', size=(1, 1, 26), dtype=torch.float64)] | ((5304, 26, 1), (26, 26, 1)) | 50.21095276 | 65.82736969 | 0.762767 |
| div | [tensor(..., device='meta', size=(200, 200, 25), dtype=torch.float64), tensor(..., device='meta', size=(1, 200, 1), dtype=torch.float64)] | ((5304, 26, 1), (200, 1, 1)) | 43.04647446 | 56.99396133 | 0.755281 |

@weishi-deng requested a review from EikanWang December 5, 2025 02:32
@EikanWang
Contributor

@weishi-deng, I'm curious why the option impacts the other operations besides div. Let's understand why.

@EikanWang
Contributor

EikanWang commented Dec 8, 2025

IPEX has added some similar parameters to fix the accuracy issue ... We are waiting for this PR to fix the related issue.

@NeoZhangJianyu, any E2E accuracy issues due to this option?

@EikanWang
Contributor

EikanWang commented Dec 11, 2025

IPEX has added some similar parameters to fix the accuracy issue ... We are waiting for this PR to fix the related issue.

@NeoZhangJianyu, any E2E accuracy issues due to this option?

No! It was found by a test case in the torch-xpu-ops repo; refer to internal Jira 7524. It impacts the fp64 div OPs only.

The PyTorch test cases for the related OPs pass on PVC.

@NeoZhangJianyu, please keep my comment as it is and reply to the questions by quoting my comment.

@EikanWang
Contributor

@weishi-deng, for cat, eq, ne, ne, mul, repeat_interleave - have you checked why this PR impacts these operations?

@EikanWang
Contributor

@SanityRemnants, the PR looks good to me. However, the performance of cat, eq, ne, ne, mul, repeat_interleave is suspicious. Could you please check with @weishi-deng to address my performance comments?
