Source builds on macOS/Linux missing OpenMP flags — at::parallel_for silently falls back to sequential #9455

@developer0hye

Bug

When building torchvision from source, setup.py does not pass OpenMP compile/link flags (-fopenmp, -lomp/-lgomp) to the C++ extension build. As a result, any torchvision C++ kernel that calls at::parallel_for silently falls back to sequential execution. The reason is that at::parallel_for is a header-only template (ATen/Parallel.h), so its #pragma omp parallel directives are compiled into the calling translation unit (_C.so), not into libtorch_cpu, and they take effect only if that translation unit is itself compiled with OpenMP enabled.

Why this hasn't been a problem until now

I checked every file under torchvision/csrc/ops/cpu/ on the current main branch:

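File                          at::parallel_for    #pragma omp
-------------------------------------------------------------
deform_conv2d_kernel.cpp      no                  no
nms_kernel.cpp                no                  no
roi_align_kernel.cpp          no                  commented out
roi_pool_kernel.cpp           no                  no
ps_roi_align_kernel.cpp       no                  no
ps_roi_pool_kernel.cpp        no                  no
box_iou_rotated_kernel.cpp    no                  no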

No existing torchvision C++ code directly uses OpenMP parallelism, so the missing flags had no observable effect. The pre-built pip/conda wheels are built via CI scripts that handle OpenMP separately.
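For anyone who wants to repeat the check above, here is a minimal sketch of how the scan could be scripted. The helper names (`scan_kernel`, `scan_dir`) are hypothetical and not part of torchvision, and the matching is deliberately simple: it only recognizes `//` line comments, so a pragma inside a `/* ... */` block would be misreported as live.

```python
import re
from pathlib import Path

# The two ways a kernel could pick up OpenMP parallelism.
PARALLEL_FOR = re.compile(r"\bat::parallel_for\b")
# A pragma line, optionally preceded by a "//" line comment marker.
PRAGMA_OMP = re.compile(r"^\s*(//)?\s*#\s*pragma\s+omp\b")


def scan_kernel(text):
    """Return (uses_parallel_for, pragma_status) for one source file.

    pragma_status is "yes", "commented out", or "no".
    """
    uses_parallel_for = False
    pragma = "no"
    for line in text.splitlines():
        # Ignore anything after a // comment when looking for calls.
        code = line.split("//", 1)[0]
        if PARALLEL_FOR.search(code):
            uses_parallel_for = True
        match = PRAGMA_OMP.match(line)
        if match:
            if match.group(1) is None:
                pragma = "yes"
            elif pragma == "no":
                pragma = "commented out"
    return uses_parallel_for, pragma


def scan_dir(cpu_dir):
    """Scan every .cpp file directly under cpu_dir (hypothetical helper)."""
    return {
        path.name: scan_kernel(path.read_text())
        for path in sorted(Path(cpu_dir).glob("*.cpp"))
    }
```

Running `scan_dir("torchvision/csrc/ops/cpu")` from a torchvision checkout should give one entry per kernel file, which can be compared against the table above.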

Why it matters now

PR #9442 introduces at::parallel_for to the deform_conv2d CPU forward kernel, the first direct usage in torchvision's codebase. Without the compile/link flags, source builds see no speedup from the parallelization, although the change is designed to deliver 2.5–3.0×.

I confirmed this on Apple M2 (macOS ARM, 4 threads). Thread scaling with and without OpenMP flags:

Without -fopenmp (current setup.py):

Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.99      16.76      78.12     324.52
4              2.65      16.14      75.25     313.23   ← no scaling

With -fopenmp + -lomp:

Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.91      15.71      75.60     310.34
4              1.07       5.36      30.33     121.75   ← scales as expected
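To make the scaling easier to compare, the two tables above can be reduced to 1-thread/4-thread ratios. This is only a small sketch over the numbers already posted; the function name `speedups` is made up for illustration, and since the ratio is unitless it does not depend on the timing unit.

```python
# Timings copied from the two tables above, keyed by thread count.
CONFIGS = ["s32-b1", "s32-b4", "s64-b1", "s64-b4"]
WITHOUT_OMP = {1: [2.99, 16.76, 78.12, 324.52], 4: [2.65, 16.14, 75.25, 313.23]}
WITH_OMP = {1: [2.91, 15.71, 75.60, 310.34], 4: [1.07, 5.36, 30.33, 121.75]}


def speedups(table):
    """Ratio of 1-thread time to 4-thread time for each configuration."""
    return [round(t1 / t4, 2) for t1, t4 in zip(table[1], table[4])]


if __name__ == "__main__":
    for label, table in (("without", WITHOUT_OMP), ("with", WITH_OMP)):
        row = ", ".join(f"{c}: {s}x" for c, s in zip(CONFIGS, speedups(table)))
        print(f"{label} -fopenmp -> {row}")
```

Without the flags, every ratio stays close to 1×. With them, the ratios land at roughly 2.5–2.9×, in line with what the PR is designed to deliver.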

Proposed fix

Add OpenMP flags to setup.py:

--- a/setup.py
+++ b/setup.py
@@ -130,6 +130,12 @@ def get_macros_and_flags():
         if sysconfig.get_config_var("Py_GIL_DISABLED"):
             extra_compile_args["cxx"].append("-DPy_GIL_DISABLED")
 
+    if sys.platform == "darwin":
+        extra_compile_args["cxx"].append("-Xpreprocessor")
+        extra_compile_args["cxx"].append("-fopenmp")
+    elif sys.platform != "win32":
+        extra_compile_args["cxx"].append("-fopenmp")
+
     if DEBUG:
         extra_compile_args["cxx"].append("-g")
         extra_compile_args["cxx"].append("-O0")
@@ -183,12 +189,22 @@ def make_C_extension():
             sources += mps_sources
 
     define_macros, extra_compile_args = get_macros_and_flags()
+
+    extra_link_args = []
+    if sys.platform == "darwin":
+        # Link against libomp shipped with PyTorch for at::parallel_for support
+        torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
+        extra_link_args = [f"-L{torch_lib_dir}", "-lomp"]
+    elif sys.platform != "win32":
+        extra_link_args = ["-lgomp"]
+
     return Extension(
         name="torchvision._C",
         sources=sorted(str(s) for s in sources),
         include_dirs=[CSRS_DIR],
         define_macros=define_macros,
         extra_compile_args=extra_compile_args,
+        extra_link_args=extra_link_args,
     )

This also unblocks future parallelization of other CPU kernels (roi_align, nms, etc.) as originally proposed in #6619.
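The platform logic in the diff could also be factored into one small function, which makes it easy to unit-test outside of a real build. This is a sketch only: `openmp_flags` is a hypothetical helper, not something that exists in torchvision's setup.py, and like the diff it leaves Windows untouched.

```python
def openmp_flags(platform, torch_lib_dir=None):
    """Return (compile_args, link_args) mirroring the proposed diff.

    platform is a sys.platform-style string; torch_lib_dir is the
    directory containing the libomp shipped with PyTorch (macOS only).
    """
    if platform == "darwin":
        # Apple clang needs -Xpreprocessor to accept -fopenmp.
        compile_args = ["-Xpreprocessor", "-fopenmp"]
        link_args = ["-lomp"]
        if torch_lib_dir is not None:
            link_args.insert(0, f"-L{torch_lib_dir}")
        return compile_args, link_args
    if platform != "win32":
        # Linux and other non-Windows platforms: GCC-style flags.
        return ["-fopenmp"], ["-lgomp"]
    # Windows: no flags added, same as the diff.
    return [], []
```

In setup.py this would be called as `openmp_flags(sys.platform, os.path.join(os.path.dirname(torch.__file__), "lib"))`, with the two returned lists appended to `extra_compile_args["cxx"]` and passed as `extra_link_args` respectively.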

Versions

  • PyTorch: 2.10.0
  • torchvision: 0.26.0 (source build)
  • macOS 26.3.1, Apple M2, ARM64
  • Python 3.12

cc @NicolasHug
