Skip to content

Conversation

@gonidelis
Copy link
Member

fixes #5606

Boilerplate code for extracting types information (stream, mr, tuning_t etc.) is too big and repetitive across the new device environment based interfaces we introduced. This PR extracts the code into a separate function and re-uses it in the existing environment based device APIs that we have (DeviceScan and DeviceReduce).

Some consideration about the design for the reviewers:

  • Each device primitive has its own quirks regarding which deterministm_t is supported. For example DeviceReduce::Reduce can support both gpu_to_gpu and run_to_run determinism, while DeviceReduce::ArgMax/Min or DeviceScan only support run_to_run at the moment. That means the determinism heuristics cannot be incorporated into the boilerplate code. Future environment-based APIs must individually evaluate each algorithm to determine and support the appropriate deterministic types.
  • The existing boilerplate code uses a lambda callable to pass the specific deterministic algorithm implementation by packing the arguments.
      auto reduce_callable = [&](auto tuning, void* storage, size_t& bytes, auto... args) {
        using tuning_t = decltype(tuning);
        return reduce_impl<tuning_t>(storage, bytes, args...);
      };

      // Dispatch with environment - handles all boilerplate
      return detail::dispatch_with_env(
        env, determinism_t{}, reduce_callable, d_in, d_out, num_items, reduction_op, ::cuda::std::identity{}, init);
    }

I need some feedback on whether this interface on the dispatch_with_env() looks sane.

@gonidelis gonidelis requested a review from a team as a code owner November 13, 2025 19:18
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 13, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 13, 2025

// Initial value
OutputExtremumT initial_value{::cuda::std::numeric_limits<InputValueT>::max()};
OutputExtremumT initial_value{::cuda::std::numeric_limits<InputValueT>::lowest()};
Copy link
Member Author

@gonidelis gonidelis Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to figure out why this bug was not caught from the tests

@github-actions

This comment has been minimized.

@gonidelis gonidelis force-pushed the refactor_env_boilerplate branch from c72dd09 to 03c1ee9 Compare November 13, 2025 20:38
@github-actions
Copy link
Contributor

😬 CI Workflow Results

🟥 Finished in 3h 00m: Pass: 28%/81 | Total: 2d 04h | Max: 2h 59m | Hits: 81%/22346

See results here.


return deallocate_error;
// Lambda that calls reduce_impl with the right overload based on determinism
auto reduce_callable = [&](auto tuning, void* storage, size_t& bytes, auto... args) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: it seems that we are not using anything from the outside scope, so we don't need to capture by reference [&]. Comment applies throughout PR


return deallocate_error;
// Lambda that calls reduce_impl with the right overload based on determinism
auto reduce_callable = [&](auto tuning, void* storage, size_t& bytes, auto... args) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: do we need a lambda to make this work? Can we just pass reduce_impl<tuning_t> directly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Refactor and cleanup env-based CUB APIs and underlying implementation

2 participants