
Conversation

@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Nov 7, 2025

Part of #6368, which was design-approved yesterday. The goal is to merge refactorings like the one here continuously, but avoid any public exposure of the tuning APIs for now. We can turn them live once we have completed the rewrite.

  • No SASS difference for cub.bench.reduce.sum.base on sm120

Quick benchmark on my RTX 5090, since the SASS diff would not cover regressions in host code. LGTM:

## [0] NVIDIA GeForce RTX 5090

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |      I32      |      2^16      |   6.127 us |       1.74% |   6.183 us |       3.24% |  0.056 us |   0.91% |   SAME   |
|   I8    |      I32      |      2^20      |   6.325 us |       9.10% |   6.529 us |      12.28% |  0.204 us |   3.23% |   SAME   |
|   I8    |      I32      |      2^24      |  18.513 us |       2.25% |  18.575 us |       2.91% |  0.062 us |   0.34% |   SAME   |
|   I8    |      I32      |      2^28      | 186.257 us |       0.67% | 185.953 us |       0.70% | -0.304 us |  -0.16% |   SAME   |
|   I8    |      I64      |      2^16      |   6.138 us |       2.74% |   6.144 us |       0.38% |  0.006 us |   0.09% |   SAME   |
|   I8    |      I64      |      2^20      |   6.219 us |       6.64% |   6.288 us |       8.30% |  0.069 us |   1.11% |   SAME   |
|   I8    |      I64      |      2^24      |  18.408 us |       0.64% |  18.437 us |       1.12% |  0.029 us |   0.16% |   SAME   |
|   I8    |      I64      |      2^28      | 185.568 us |       0.73% | 186.777 us |       0.68% |  1.209 us |   0.65% |   SAME   |
|   I16   |      I32      |      2^16      |   6.214 us |       6.61% |   6.236 us |       6.93% |  0.022 us |   0.35% |   SAME   |
|   I16   |      I32      |      2^20      |   8.185 us |       0.99% |   8.188 us |       0.58% |  0.002 us |   0.03% |   SAME   |
|   I16   |      I32      |      2^24      |  28.724 us |       1.37% |  28.770 us |       1.52% |  0.047 us |   0.16% |   SAME   |
|   I16   |      I32      |      2^28      | 347.629 us |       0.55% | 347.639 us |       0.55% |  0.009 us |   0.00% |   SAME   |
|   I16   |      I64      |      2^16      |   6.115 us |       2.25% |   6.112 us |       2.62% | -0.003 us |  -0.05% |   SAME   |
|   I16   |      I64      |      2^20      |   8.173 us |       1.21% |   8.188 us |       0.59% |  0.015 us |   0.18% |   SAME   |
|   I16   |      I64      |      2^24      |  29.304 us |       3.49% |  29.576 us |       3.91% |  0.271 us |   0.93% |   SAME   |
|   I16   |      I64      |      2^28      | 347.087 us |       0.56% | 347.213 us |       0.56% |  0.127 us |   0.04% |   SAME   |
|   I32   |      I32      |      2^16      |   6.126 us |       2.07% |   6.132 us |       1.67% |  0.005 us |   0.08% |   SAME   |
|   I32   |      I32      |      2^20      |   8.183 us |       0.83% |   8.186 us |       0.70% |  0.003 us |   0.04% |   SAME   |
|   I32   |      I32      |      2^24      |  51.217 us |       0.53% |  51.215 us |       0.38% | -0.002 us |  -0.00% |   SAME   |
|   I32   |      I32      |      2^28      | 672.272 us |       0.46% | 672.407 us |       0.50% |  0.135 us |   0.02% |   SAME   |
|   I32   |      I64      |      2^16      |   6.121 us |       1.83% |   6.139 us |       1.16% |  0.018 us |   0.29% |   SAME   |
|   I32   |      I64      |      2^20      |   8.190 us |       0.40% |   8.172 us |       1.28% | -0.018 us |  -0.22% |   SAME   |
|   I32   |      I64      |      2^24      |  51.259 us |       0.74% |  51.325 us |       1.02% |  0.066 us |   0.13% |   SAME   |
|   I32   |      I64      |      2^28      | 671.143 us |       0.61% | 670.702 us |       0.46% | -0.441 us |  -0.07% |   SAME   |
|   I64   |      I32      |      2^16      |   6.131 us |       1.63% |   6.135 us |       1.25% |  0.005 us |   0.08% |   SAME   |
|   I64   |      I32      |      2^20      |  12.444 us |       4.49% |  12.376 us |       3.37% | -0.068 us |  -0.55% |   SAME   |
|   I64   |      I32      |      2^24      |  95.237 us |       1.13% |  95.110 us |       1.12% | -0.127 us |  -0.13% |   SAME   |
|   I64   |      I32      |      2^28      |   1.312 ms |       0.14% |   1.312 ms |       0.15% |  0.045 us |   0.00% |   SAME   |
|   I64   |      I64      |      2^16      |   6.135 us |       1.99% |   6.136 us |       2.53% |  0.001 us |   0.02% |   SAME   |
|   I64   |      I64      |      2^20      |  14.277 us |       2.22% |  14.311 us |       1.59% |  0.035 us |   0.24% |   SAME   |
|   I64   |      I64      |      2^24      |  95.506 us |       1.02% |  95.617 us |       1.03% |  0.112 us |   0.12% |   SAME   |
|   I64   |      I64      |      2^28      |   1.313 ms |       0.15% |   1.312 ms |       0.11% | -0.291 us |  -0.02% |   SAME   |
|  I128   |      I32      |      2^16      |   6.565 us |      12.21% |   6.615 us |      12.69% |  0.050 us |   0.76% |   SAME   |
|  I128   |      I32      |      2^20      |  18.460 us |       1.82% |  18.444 us |       1.16% | -0.016 us |  -0.09% |   SAME   |
|  I128   |      I32      |      2^24      | 188.893 us |       1.50% | 189.003 us |       1.52% |  0.110 us |   0.06% |   SAME   |
|  I128   |      I32      |      2^28      |   2.581 ms |       0.14% |   2.581 ms |       0.13% |  0.101 us |   0.00% |   SAME   |
|  I128   |      I64      |      2^16      |   6.165 us |       2.94% |   6.211 us |       5.45% |  0.046 us |   0.75% |   SAME   |
|  I128   |      I64      |      2^20      |  18.637 us |       3.38% |  18.854 us |       4.38% |  0.216 us |   1.16% |   SAME   |
|  I128   |      I64      |      2^24      | 183.319 us |       0.97% | 183.897 us |       1.05% |  0.578 us |   0.32% |   SAME   |
|  I128   |      I64      |      2^28      |   2.573 ms |       0.11% |   2.573 ms |       0.10% |  0.118 us |   0.00% |   SAME   |
|   F32   |      I32      |      2^16      |   6.216 us |       6.39% |   6.144 us |       0.00% | -0.072 us |  -1.17% |   ????   |
|   F32   |      I32      |      2^20      |   8.189 us |       0.40% |   8.188 us |       0.58% | -0.001 us |  -0.01% |   SAME   |
|   F32   |      I32      |      2^24      |  51.843 us |       2.05% |  51.733 us |       1.98% | -0.110 us |  -0.21% |   SAME   |
|   F32   |      I32      |      2^28      | 666.607 us |       0.27% | 666.752 us |       0.25% |  0.145 us |   0.02% |   SAME   |
|   F32   |      I64      |      2^16      |   6.126 us |       1.63% |   6.136 us |       1.10% |  0.010 us |   0.17% |   SAME   |
|   F32   |      I64      |      2^20      |   9.863 us |       7.95% |   9.680 us |       9.39% | -0.183 us |  -1.86% |   SAME   |
|   F32   |      I64      |      2^24      |  51.605 us |       1.66% |  51.636 us |       1.68% |  0.031 us |   0.06% |   SAME   |
|   F32   |      I64      |      2^28      | 668.767 us |       0.43% | 668.438 us |       0.39% | -0.329 us |  -0.05% |   SAME   |
|   F64   |      I32      |      2^16      |  10.204 us |       1.37% |  10.222 us |       1.13% |  0.018 us |   0.18% |   SAME   |
|   F64   |      I32      |      2^20      |  14.522 us |       4.02% |  14.466 us |       3.40% | -0.057 us |  -0.39% |   SAME   |
|   F64   |      I32      |      2^24      |  96.754 us |       1.11% |  96.618 us |       1.11% | -0.137 us |  -0.14% |   SAME   |
|   F64   |      I32      |      2^28      |   1.317 ms |       0.24% |   1.317 ms |       0.24% | -0.008 us |  -0.00% |   SAME   |
|   F64   |      I64      |      2^16      |  10.189 us |       1.81% |  10.233 us |       1.45% |  0.044 us |   0.43% |   SAME   |
|   F64   |      I64      |      2^20      |  14.373 us |       2.37% |  14.337 us |       0.20% | -0.036 us |  -0.25% |   FAST   |
|   F64   |      I64      |      2^24      |  96.030 us |       0.86% |  96.017 us |       0.86% | -0.013 us |  -0.01% |   SAME   |
|   F64   |      I64      |      2^28      |   1.313 ms |       0.18% |   1.313 ms |       0.17% | -0.124 us |  -0.01% |   SAME   |
|   C32   |      I32      |      2^16      |   6.142 us |       1.25% |   6.133 us |       1.57% | -0.009 us |  -0.15% |   SAME   |
|   C32   |      I32      |      2^20      |  12.470 us |       4.63% |  12.505 us |       5.04% |  0.035 us |   0.28% |   SAME   |
|   C32   |      I32      |      2^24      |  95.110 us |       1.09% |  94.948 us |       1.09% | -0.161 us |  -0.17% |   SAME   |
|   C32   |      I32      |      2^28      |   1.311 ms |       0.13% |   1.311 ms |       0.13% | -0.088 us |  -0.01% |   SAME   |
|   C32   |      I64      |      2^16      |   6.125 us |       1.67% |   6.144 us |       0.00% |  0.019 us |   0.31% |   ????   |
|   C32   |      I64      |      2^20      |  14.251 us |       2.79% |  14.206 us |       3.52% | -0.045 us |  -0.31% |   SAME   |
|   C32   |      I64      |      2^24      |  95.332 us |       1.13% |  95.240 us |       1.12% | -0.091 us |  -0.10% |   SAME   |
|   C32   |      I64      |      2^28      |   1.312 ms |       0.13% |   1.312 ms |       0.11% | -0.170 us |  -0.01% |   SAME   |

Fixes: #6565

@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 7, 2025
@bernhardmgruber
Contributor Author

/ok to test ad38ff2


@bernhardmgruber
Contributor Author

Found an issue with the way the accumulator type is specified in the benchmarks, which explains the regressions I currently observe when using the public tuning API: #6576

@miscco miscco left a comment

I like the direction this is going

@griwes griwes left a comment

As said in one of the comments below, I do like the overall structure of this. That said, lack of pattern matching = pain.

Comment on lines 257 to 284
```cpp
// convert type information to CUB arch_policies
using namespace cub::detail::reduce;

auto at = accum_type::other;
if (accum_t.type == CCCL_FLOAT32)
{
  at = accum_type::float32;
}
if (accum_t.type == CCCL_FLOAT64)
{
  at = accum_type::double32;
}

auto ot = op_type::unknown;
switch (op.type)
{
  case CCCL_PLUS:
    ot = op_type::plus;
    break;
  case CCCL_MINIMUM:
  case CCCL_MAXIMUM:
    ot = op_type::min_or_max;
    break;
  default:
    break;
}

using cub::detail::RuntimeReduceAgentPolicy;
auto reduce_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "ReducePolicy");
auto st_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "SingleTilePolicy");
auto os = offset_size::_8; // sizeof(uint64_t)
```

This should be centralized. Not just for c.parallel (so that we can avoid re-stating this over and over again in mimicry of the CUB classify calls), but also for CUB itself so that c.parallel can just do this per category (op_type, accum_type) instead of doing it per algorithm.

Contributor Author

Agreed!

```cpp
  using MaxPolicy = Policy1000;
};

struct arch_policies // equivalent to the policy_hub, holds policies for a bunch of CUDA architectures
```

This is an internal type, but one that still materializes when users invoke the algorithms, right? I wonder if this should turn into a template and its data members should be turned into an environment returning those values by queries, because as is, any change to the layout would be an ABI break...

@bernhardmgruber bernhardmgruber Nov 19, 2025

A very appealing aspect of the current design is that tuning information is expressed very simply as structs with data members, so I would love if we could keep that.

Regarding ABI breaks, we do allow those at every release. This is pointed out in our README:

> Symbols in the thrust:: and cub:: namespaces may break ABI at any time without warning.


I agree; however, it'd be nice to have the ABI break manifest as a linker error instead of being entirely silent.

@jrhemstad
Collaborator

@bernhardmgruber can we see a comparison in compile time between the current approach and the new one for the DeviceReduce tests? I want to see if there is any impact (for better or worse) on compile time with the new tuning machinery.

@bernhardmgruber bernhardmgruber changed the title Design a new tuning API Implement the new tuning API for DeviceReduce Nov 19, 2025
@bernhardmgruber bernhardmgruber force-pushed the tuning branch 2 times, most recently from cada761 to ae3a5aa Compare November 19, 2025 12:00
@bernhardmgruber bernhardmgruber marked this pull request as ready for review November 19, 2025 12:07
@bernhardmgruber bernhardmgruber requested review from a team as code owners November 19, 2025 12:07
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 5h 48m: Pass: 58%/90 | Total: 4d 11h | Max: 5h 47m | Hits: 49%/35107

See results here.

Successfully merging this pull request may close these issues:

Implement a MVP for cub::DeviceReduce using the new tuning API