Implement the new tuning API for DeviceReduce
#6544
Conversation
/ok to test ad38ff2
Found an issue with the way the accumulator type is specified in the benchmarks, which explains the regressions I currently observe when using the public tuning API: #6576
miscco left a comment:
I like the direction this is going
griwes left a comment:
As said in one of the comments below - I do like the overall structure of this. That said, lack of pattern matching = pain.
c/parallel/src/reduce.cu:
```cpp
// convert type information to CUB arch_policies
using namespace cub::detail::reduce;

auto at = accum_type::other;
if (accum_t.type == CCCL_FLOAT32)
{
  at = accum_type::float32;
}
if (accum_t.type == CCCL_FLOAT64)
{
  at = accum_type::double32;
}

auto ot = op_type::unknown;
switch (op.type)
{
  case CCCL_PLUS:
    ot = op_type::plus;
    break;
  case CCCL_MINIMUM:
  case CCCL_MAXIMUM:
    ot = op_type::min_or_max;
    break;
  default:
    break;
}

using cub::detail::RuntimeReduceAgentPolicy;
auto reduce_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "ReducePolicy");
auto st_policy     = RuntimeReduceAgentPolicy::from_json(runtime_policy, "SingleTilePolicy");
auto os            = offset_size::_8; // sizeof(uint64_t)
```
This should be centralized. Not just for c.parallel (so that we can avoid re-stating this over and over again in mimicry of the CUB classify calls), but also for CUB itself so that c.parallel can just do this per category (op_type, accum_type) instead of doing it per algorithm.
Agreed!
```cpp
  using MaxPolicy = Policy1000;
};

struct arch_policies // equivalent to the policy_hub, holds policies for a bunch of CUDA architectures
```
This is an internal type, but one that still materializes when users invoke the algorithms, right? I wonder if this should turn into a template and its data members should be turned into an environment returning those values by queries, because as is, any change to the layout would be an ABI break...
A very appealing aspect of the current design is that tuning information is expressed very simply as structs with data members, so I would love if we could keep that.
Regarding ABI breaks, we do allow those at every release. This is pointed out in our README:
"Symbols in the `thrust::` and `cub::` namespaces may break ABI at any time without warning."
I agree; however, it'd be nice to have the ABI break manifest as a linker error instead of being entirely silent.
@bernhardmgruber can we see a comparison in compile time between this approach and the new one for the DeviceReduce tests? I want to see if there is any impact (for better or worse) on compile time with the new tuning machinery.
😬 CI Workflow Results
🟥 Finished in 5h 48m: Pass: 58%/90 | Total: 4d 11h | Max: 5h 47m | Hits: 49%/35107
Part of #6368, which was design-approved yesterday. The goal is to merge refactorings like the one here continuously, but avoid any public exposure of the tuning APIs for now. We can turn them live once we have completed the rewrite.
Quick benchmark of `cub.bench.reduce.sum.base` on sm120 (my RTX 5090), since the SASS diff would not cover regressions in host code. LGTM:
Fixes: #6565