Source builds on macOS/Linux missing OpenMP flags: at::parallel_for silently falls back to sequential #9455

## Bug

When building torchvision from source, `setup.py` does not pass OpenMP compile/link flags (`-fopenmp`, `-lomp`/`-lgomp`) to the C++ extension build. As a result, **any torchvision C++ kernel that calls `at::parallel_for` silently falls back to sequential execution**. The reason is that `at::parallel_for` is a header-only template (`ATen/Parallel.h`), so its `#pragma omp parallel` directives are compiled into the calling translation units, which end up in `_C.so`, not in `libtorch_cpu`.
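One rough way to see whether a given build picked up an OpenMP runtime is to inspect the direct dynamic dependencies of the built extension. The sketch below is only a diagnostic hint under stated assumptions: the helper names are hypothetical, it assumes `otool` (macOS) or `readelf` (Linux) is on `PATH`, and a runtime appearing in the listing does not by itself prove the pragmas were compiled in.

```python
import subprocess
import sys

# Substrings identifying common OpenMP runtimes (LLVM, GNU, Intel).
OPENMP_RUNTIMES = ("libomp", "libgomp", "libiomp5")


def mentions_openmp(listing: str) -> bool:
    """Return True if a dependency listing names a known OpenMP runtime."""
    return any(name in listing for name in OPENMP_RUNTIMES)


def dependency_listing(library_path: str) -> str:
    """Return the direct dynamic dependencies of a shared library as text.

    Uses `otool -L` on macOS and `readelf -d` elsewhere.
    """
    tool = ["otool", "-L"] if sys.platform == "darwin" else ["readelf", "-d"]
    result = subprocess.run(
        tool + [library_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def check_extension(library_path: str) -> bool:
    """Report whether the built extension directly links an OpenMP runtime."""
    return mentions_openmp(dependency_listing(library_path))
```

Calling `check_extension(...)` with the path to the built `_C.so` (its location depends on the build) would be expected to return `False` for a build without the link flags.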

### Why this hasn't been a problem until now

I checked every file under `torchvision/csrc/ops/cpu/` on the current `main` branch:

| File | `at::parallel_for` | `#pragma omp` |
|------|-------------------|---------------|
| `deform_conv2d_kernel.cpp` | ❌ | ❌ |
| `nms_kernel.cpp` | ❌ | ❌ |
| `roi_align_kernel.cpp` | ❌ | commented out |
| `roi_pool_kernel.cpp` | ❌ | ❌ |
| `ps_roi_align_kernel.cpp` | ❌ | ❌ |
| `ps_roi_pool_kernel.cpp` | ❌ | ❌ |
| `box_iou_rotated_kernel.cpp` | ❌ | ❌ |

None of these CPU kernels uses OpenMP parallelism directly, so the missing flags have had no observable effect so far. The pre-built pip/conda wheels are built via CI scripts that handle OpenMP separately.
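The check summarized in the table can be repeated with a small script. This is a minimal sketch, not the procedure actually used for the table: the function name `scan_kernels` is hypothetical, and the matching is purely textual (it would also count `at::parallel_for` inside a comment, and it only recognizes pragmas commented out with `//`).

```python
import re
from pathlib import Path

# An active (uncommented) OpenMP pragma at the start of a line.
ACTIVE_PRAGMA = re.compile(r"^\s*#\s*pragma\s+omp\b", re.MULTILINE)
# An OpenMP pragma hidden behind a `//` line comment.
COMMENTED_PRAGMA = re.compile(r"^\s*//.*#\s*pragma\s+omp\b", re.MULTILINE)


def scan_kernels(cpu_dir):
    """Map each .cpp file name to its parallel_for / pragma status."""
    report = {}
    for path in sorted(Path(cpu_dir).glob("*.cpp")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if ACTIVE_PRAGMA.search(text):
            pragma = "active"
        elif COMMENTED_PRAGMA.search(text):
            pragma = "commented out"
        else:
            pragma = "absent"
        report[path.name] = {
            "parallel_for": "at::parallel_for" in text,
            "pragma_omp": pragma,
        }
    return report
```

Running `scan_kernels("torchvision/csrc/ops/cpu")` from a checkout of the repository returns one entry per kernel file, which can be compared against the table.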

### Why it matters now

PR #9442 introduces `at::parallel_for` to the `deform_conv2d` CPU forward kernel, the first direct usage in torchvision's codebase. Without the compile/link flags, source builds get **essentially no speedup** from the parallelization (about 1.04–1.13× at 4 threads in the measurements below), whereas the change is designed to deliver **2.5–3.0×**.

I confirmed this on Apple M2 (macOS ARM, 4 threads). Thread scaling with and without OpenMP flags:

**Without `-fopenmp` (current `setup.py`):**
```
Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.99      16.76      78.12     324.52
4              2.65      16.14      75.25     313.23   ← no scaling
```

**With `-fopenmp` + `-lomp`:**
```
Threads      s32-b1     s32-b4     s64-b1     s64-b4
----------------------------------------------------
1              2.91      15.71      75.60     310.34
4              1.07       5.36      30.33     121.75   ← scales as expected
```
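To make the scaling explicit, the speedup from 1 to 4 threads can be computed directly from the two tables. The snippet below only does that arithmetic on the reported numbers; the variable and function names are illustrative.

```python
# Timings copied from the two tables above (units as reported there).
CONFIGS = ("s32-b1", "s32-b4", "s64-b1", "s64-b4")
WITHOUT_OPENMP = {1: (2.99, 16.76, 78.12, 324.52), 4: (2.65, 16.14, 75.25, 313.23)}
WITH_OPENMP = {1: (2.91, 15.71, 75.60, 310.34), 4: (1.07, 5.36, 30.33, 121.75)}


def speedups(timings):
    """Return the 1-thread / 4-thread time ratio for each configuration."""
    return {
        name: round(one / four, 2)
        for name, one, four in zip(CONFIGS, timings[1], timings[4])
    }


print(speedups(WITHOUT_OPENMP))
print(speedups(WITH_OPENMP))
```

Without the flags the ratios are 1.13, 1.04, 1.04 and 1.04; with them they are 2.72, 2.93, 2.49 and 2.55.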

### Proposed fix

Add OpenMP flags to `setup.py`:

```diff
--- a/setup.py
+++ b/setup.py
@@ -130,6 +130,12 @@ def get_macros_and_flags():
         if sysconfig.get_config_var("Py_GIL_DISABLED"):
             extra_compile_args["cxx"].append("-DPy_GIL_DISABLED")
 
+    if sys.platform == "darwin":
+        extra_compile_args["cxx"].append("-Xpreprocessor")
+        extra_compile_args["cxx"].append("-fopenmp")
+    elif sys.platform != "win32":
+        extra_compile_args["cxx"].append("-fopenmp")
+
     if DEBUG:
         extra_compile_args["cxx"].append("-g")
         extra_compile_args["cxx"].append("-O0")
@@ -183,10 +189,20 @@ def make_C_extension():
             sources += mps_sources
 
     define_macros, extra_compile_args = get_macros_and_flags()
+
+    extra_link_args = []
+    if sys.platform == "darwin":
+        # Link against libomp shipped with PyTorch for at::parallel_for support
+        torch_lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
+        extra_link_args = [f"-L{torch_lib_dir}", "-lomp"]
+    elif sys.platform != "win32":
+        extra_link_args = ["-lgomp"]
+
     return Extension(
         name="torchvision._C",
         sources=sorted(str(s) for s in sources),
         include_dirs=[CSRS_DIR],
         define_macros=define_macros,
         extra_compile_args=extra_compile_args,
+        extra_link_args=extra_link_args,
     )
```
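The platform logic in the diff could also be factored into one pure function, which makes it easy to unit-test without a compiler. This is only a sketch of the same branching: `openmp_flags` is a hypothetical helper, not an existing function in `setup.py`.

```python
def openmp_flags(platform, torch_lib_dir):
    """Return (compile_args, link_args) mirroring the proposed setup.py change.

    `platform` is a value such as sys.platform; `torch_lib_dir` is the
    directory holding PyTorch's bundled libraries (only used on macOS).
    """
    if platform == "darwin":
        # Apple clang accepts -fopenmp only when passed via -Xpreprocessor.
        return ["-Xpreprocessor", "-fopenmp"], [f"-L{torch_lib_dir}", "-lomp"]
    if platform != "win32":
        # Other non-Windows platforms: plain -fopenmp and the GNU runtime.
        return ["-fopenmp"], ["-lgomp"]
    # Windows is left untouched, as in the diff.
    return [], []
```

In `setup.py` this would be called as `openmp_flags(sys.platform, os.path.join(os.path.dirname(torch.__file__), "lib"))`, with the two returned lists appended to `extra_compile_args["cxx"]` and passed as `extra_link_args` respectively.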

This also unblocks future parallelization of other CPU kernels (`roi_align`, `nms`, etc.) as originally proposed in #6619.

### Related

- #9442 — PR that introduces `at::parallel_for` to `deform_conv2d` CPU forward kernel
- #6619 — RFC for CPU performance optimization of torchvision ops
- #2783 — `warning: ignoring #pragma omp parallel` reported in 2020 (same root cause, still open)
- #4935 — `roi_align` OpenMP parallelization request, blocked by this same missing flag

### Versions

- PyTorch: 2.10.0
- torchvision: 0.26.0 (source build)
- macOS 26.3.1, Apple M2, ARM64
- Python 3.12

cc @NicolasHug
