
Conversation

@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Nov 7, 2025

Part of #6368, which was design-approved yesterday. The goal is to merge refactorings like the one here continuously, but avoid any public exposure of the tuning APIs for now. We can turn them live once we have completed the rewrite.

  • No SASS difference for cub.bench.reduce.sum.base on sm120

Quick benchmark on my RTX 5090, since the SASS diff would not cover regressions in host code. LGTM:

## [0] NVIDIA GeForce RTX 5090

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |      I32      |      2^16      |   6.127 us |       1.74% |   6.183 us |       3.24% |  0.056 us |   0.91% |   SAME   |
|   I8    |      I32      |      2^20      |   6.325 us |       9.10% |   6.529 us |      12.28% |  0.204 us |   3.23% |   SAME   |
|   I8    |      I32      |      2^24      |  18.513 us |       2.25% |  18.575 us |       2.91% |  0.062 us |   0.34% |   SAME   |
|   I8    |      I32      |      2^28      | 186.257 us |       0.67% | 185.953 us |       0.70% | -0.304 us |  -0.16% |   SAME   |
|   I8    |      I64      |      2^16      |   6.138 us |       2.74% |   6.144 us |       0.38% |  0.006 us |   0.09% |   SAME   |
|   I8    |      I64      |      2^20      |   6.219 us |       6.64% |   6.288 us |       8.30% |  0.069 us |   1.11% |   SAME   |
|   I8    |      I64      |      2^24      |  18.408 us |       0.64% |  18.437 us |       1.12% |  0.029 us |   0.16% |   SAME   |
|   I8    |      I64      |      2^28      | 185.568 us |       0.73% | 186.777 us |       0.68% |  1.209 us |   0.65% |   SAME   |
|   I16   |      I32      |      2^16      |   6.214 us |       6.61% |   6.236 us |       6.93% |  0.022 us |   0.35% |   SAME   |
|   I16   |      I32      |      2^20      |   8.185 us |       0.99% |   8.188 us |       0.58% |  0.002 us |   0.03% |   SAME   |
|   I16   |      I32      |      2^24      |  28.724 us |       1.37% |  28.770 us |       1.52% |  0.047 us |   0.16% |   SAME   |
|   I16   |      I32      |      2^28      | 347.629 us |       0.55% | 347.639 us |       0.55% |  0.009 us |   0.00% |   SAME   |
|   I16   |      I64      |      2^16      |   6.115 us |       2.25% |   6.112 us |       2.62% | -0.003 us |  -0.05% |   SAME   |
|   I16   |      I64      |      2^20      |   8.173 us |       1.21% |   8.188 us |       0.59% |  0.015 us |   0.18% |   SAME   |
|   I16   |      I64      |      2^24      |  29.304 us |       3.49% |  29.576 us |       3.91% |  0.271 us |   0.93% |   SAME   |
|   I16   |      I64      |      2^28      | 347.087 us |       0.56% | 347.213 us |       0.56% |  0.127 us |   0.04% |   SAME   |
|   I32   |      I32      |      2^16      |   6.126 us |       2.07% |   6.132 us |       1.67% |  0.005 us |   0.08% |   SAME   |
|   I32   |      I32      |      2^20      |   8.183 us |       0.83% |   8.186 us |       0.70% |  0.003 us |   0.04% |   SAME   |
|   I32   |      I32      |      2^24      |  51.217 us |       0.53% |  51.215 us |       0.38% | -0.002 us |  -0.00% |   SAME   |
|   I32   |      I32      |      2^28      | 672.272 us |       0.46% | 672.407 us |       0.50% |  0.135 us |   0.02% |   SAME   |
|   I32   |      I64      |      2^16      |   6.121 us |       1.83% |   6.139 us |       1.16% |  0.018 us |   0.29% |   SAME   |
|   I32   |      I64      |      2^20      |   8.190 us |       0.40% |   8.172 us |       1.28% | -0.018 us |  -0.22% |   SAME   |
|   I32   |      I64      |      2^24      |  51.259 us |       0.74% |  51.325 us |       1.02% |  0.066 us |   0.13% |   SAME   |
|   I32   |      I64      |      2^28      | 671.143 us |       0.61% | 670.702 us |       0.46% | -0.441 us |  -0.07% |   SAME   |
|   I64   |      I32      |      2^16      |   6.131 us |       1.63% |   6.135 us |       1.25% |  0.005 us |   0.08% |   SAME   |
|   I64   |      I32      |      2^20      |  12.444 us |       4.49% |  12.376 us |       3.37% | -0.068 us |  -0.55% |   SAME   |
|   I64   |      I32      |      2^24      |  95.237 us |       1.13% |  95.110 us |       1.12% | -0.127 us |  -0.13% |   SAME   |
|   I64   |      I32      |      2^28      |   1.312 ms |       0.14% |   1.312 ms |       0.15% |  0.045 us |   0.00% |   SAME   |
|   I64   |      I64      |      2^16      |   6.135 us |       1.99% |   6.136 us |       2.53% |  0.001 us |   0.02% |   SAME   |
|   I64   |      I64      |      2^20      |  14.277 us |       2.22% |  14.311 us |       1.59% |  0.035 us |   0.24% |   SAME   |
|   I64   |      I64      |      2^24      |  95.506 us |       1.02% |  95.617 us |       1.03% |  0.112 us |   0.12% |   SAME   |
|   I64   |      I64      |      2^28      |   1.313 ms |       0.15% |   1.312 ms |       0.11% | -0.291 us |  -0.02% |   SAME   |
|  I128   |      I32      |      2^16      |   6.565 us |      12.21% |   6.615 us |      12.69% |  0.050 us |   0.76% |   SAME   |
|  I128   |      I32      |      2^20      |  18.460 us |       1.82% |  18.444 us |       1.16% | -0.016 us |  -0.09% |   SAME   |
|  I128   |      I32      |      2^24      | 188.893 us |       1.50% | 189.003 us |       1.52% |  0.110 us |   0.06% |   SAME   |
|  I128   |      I32      |      2^28      |   2.581 ms |       0.14% |   2.581 ms |       0.13% |  0.101 us |   0.00% |   SAME   |
|  I128   |      I64      |      2^16      |   6.165 us |       2.94% |   6.211 us |       5.45% |  0.046 us |   0.75% |   SAME   |
|  I128   |      I64      |      2^20      |  18.637 us |       3.38% |  18.854 us |       4.38% |  0.216 us |   1.16% |   SAME   |
|  I128   |      I64      |      2^24      | 183.319 us |       0.97% | 183.897 us |       1.05% |  0.578 us |   0.32% |   SAME   |
|  I128   |      I64      |      2^28      |   2.573 ms |       0.11% |   2.573 ms |       0.10% |  0.118 us |   0.00% |   SAME   |
|   F32   |      I32      |      2^16      |   6.216 us |       6.39% |   6.144 us |       0.00% | -0.072 us |  -1.17% |   ????   |
|   F32   |      I32      |      2^20      |   8.189 us |       0.40% |   8.188 us |       0.58% | -0.001 us |  -0.01% |   SAME   |
|   F32   |      I32      |      2^24      |  51.843 us |       2.05% |  51.733 us |       1.98% | -0.110 us |  -0.21% |   SAME   |
|   F32   |      I32      |      2^28      | 666.607 us |       0.27% | 666.752 us |       0.25% |  0.145 us |   0.02% |   SAME   |
|   F32   |      I64      |      2^16      |   6.126 us |       1.63% |   6.136 us |       1.10% |  0.010 us |   0.17% |   SAME   |
|   F32   |      I64      |      2^20      |   9.863 us |       7.95% |   9.680 us |       9.39% | -0.183 us |  -1.86% |   SAME   |
|   F32   |      I64      |      2^24      |  51.605 us |       1.66% |  51.636 us |       1.68% |  0.031 us |   0.06% |   SAME   |
|   F32   |      I64      |      2^28      | 668.767 us |       0.43% | 668.438 us |       0.39% | -0.329 us |  -0.05% |   SAME   |
|   F64   |      I32      |      2^16      |  10.204 us |       1.37% |  10.222 us |       1.13% |  0.018 us |   0.18% |   SAME   |
|   F64   |      I32      |      2^20      |  14.522 us |       4.02% |  14.466 us |       3.40% | -0.057 us |  -0.39% |   SAME   |
|   F64   |      I32      |      2^24      |  96.754 us |       1.11% |  96.618 us |       1.11% | -0.137 us |  -0.14% |   SAME   |
|   F64   |      I32      |      2^28      |   1.317 ms |       0.24% |   1.317 ms |       0.24% | -0.008 us |  -0.00% |   SAME   |
|   F64   |      I64      |      2^16      |  10.189 us |       1.81% |  10.233 us |       1.45% |  0.044 us |   0.43% |   SAME   |
|   F64   |      I64      |      2^20      |  14.373 us |       2.37% |  14.337 us |       0.20% | -0.036 us |  -0.25% |   FAST   |
|   F64   |      I64      |      2^24      |  96.030 us |       0.86% |  96.017 us |       0.86% | -0.013 us |  -0.01% |   SAME   |
|   F64   |      I64      |      2^28      |   1.313 ms |       0.18% |   1.313 ms |       0.17% | -0.124 us |  -0.01% |   SAME   |
|   C32   |      I32      |      2^16      |   6.142 us |       1.25% |   6.133 us |       1.57% | -0.009 us |  -0.15% |   SAME   |
|   C32   |      I32      |      2^20      |  12.470 us |       4.63% |  12.505 us |       5.04% |  0.035 us |   0.28% |   SAME   |
|   C32   |      I32      |      2^24      |  95.110 us |       1.09% |  94.948 us |       1.09% | -0.161 us |  -0.17% |   SAME   |
|   C32   |      I32      |      2^28      |   1.311 ms |       0.13% |   1.311 ms |       0.13% | -0.088 us |  -0.01% |   SAME   |
|   C32   |      I64      |      2^16      |   6.125 us |       1.67% |   6.144 us |       0.00% |  0.019 us |   0.31% |   ????   |
|   C32   |      I64      |      2^20      |  14.251 us |       2.79% |  14.206 us |       3.52% | -0.045 us |  -0.31% |   SAME   |
|   C32   |      I64      |      2^24      |  95.332 us |       1.13% |  95.240 us |       1.12% | -0.091 us |  -0.10% |   SAME   |
|   C32   |      I64      |      2^28      |   1.312 ms |       0.13% |   1.312 ms |       0.11% | -0.170 us |  -0.01% |   SAME   |

Fixes: #6565

@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 7, 2025
@bernhardmgruber
Contributor Author

/ok to test ad38ff2


@bernhardmgruber
Contributor Author

Found an issue with the way the accumulator type is specified in the benchmarks, which explains the regressions I currently observe when using the public tuning API: #6576

@miscco miscco left a comment

I like the direction this is going

@griwes griwes left a comment

As said in one of the comments below, I do like the overall structure of this. That said, lack of pattern matching = pain.

Comment on lines 257 to 284
```cpp
// convert type information to CUB arch_policies
using namespace cub::detail::reduce;

auto at = accum_type::other;
if (accum_t.type == CCCL_FLOAT32)
{
  at = accum_type::float32;
}
if (accum_t.type == CCCL_FLOAT64)
{
  at = accum_type::double32;
}

auto ot = op_type::unknown;
switch (op.type)
{
  case CCCL_PLUS:
    ot = op_type::plus;
    break;
  case CCCL_MINIMUM:
  case CCCL_MAXIMUM:
    ot = op_type::min_or_max;
    break;
  default:
    break;
}

using cub::detail::RuntimeReduceAgentPolicy;
auto reduce_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "ReducePolicy");
auto st_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "SingleTilePolicy");
auto os = offset_size::_8; // sizeof(uint64_t)
```

This should be centralized. Not just for c.parallel (so that we can avoid re-stating this over and over again in mimicry of the CUB classify calls), but also for CUB itself so that c.parallel can just do this per category (op_type, accum_type) instead of doing it per algorithm.

Contributor Author

Agreed!

```cpp
  using MaxPolicy = Policy1000;
};

struct arch_policies // equivalent to the policy_hub, holds policies for a bunch of CUDA architectures
```

This is an internal type, but one that still materializes when users invoke the algorithms, right? I wonder if this should turn into a template and its data members should be turned into an environment returning those values by queries, because as is, any change to the layout would be an ABI break...

@bernhardmgruber bernhardmgruber Nov 19, 2025

A very appealing aspect of the current design is that tuning information is expressed very simply as structs with data members, so I would love if we could keep that.

Regarding ABI breaks, we do allow those at every release. This is pointed out in our README:

> Symbols in the thrust:: and cub:: namespaces may break ABI at any time without warning.


I agree; however, it'd be nice to have the ABI break manifest as a linker error instead of being entirely silent.

@jrhemstad
Collaborator

@bernhardmgruber can we see a comparison in compile time between the current approach and the new one for the DeviceReduce tests? I want to see if there is any impact (for better or worse) on compile time with the new tuning machinery.

@bernhardmgruber bernhardmgruber changed the title Design a new tuning API Implement the new tuning API for DeviceReduce Nov 19, 2025
@bernhardmgruber bernhardmgruber force-pushed the tuning branch 2 times, most recently from cada761 to ae3a5aa Compare November 19, 2025 12:00
@bernhardmgruber bernhardmgruber marked this pull request as ready for review November 19, 2025 12:07
@bernhardmgruber bernhardmgruber requested review from a team as code owners November 19, 2025 12:07
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 5h 48m: Pass: 58%/90 | Total: 4d 11h | Max: 5h 47m | Hits: 49%/35107

See results here.

Successfully merging this pull request may close these issues:

Implement a MVP for cub::DeviceReduce using the new tuning API