Commit a7eb660

Document accum_t benchmark skew
1 parent d2f0578 commit a7eb660

1 file changed: +6 -0 lines changed

cub/benchmarks/bench/reduce/base.cuh

Lines changed: 6 additions & 0 deletions
@@ -122,6 +122,12 @@ void reduce(nvbench::state& state, nvbench::type_list<T, OffsetT>)
   state.add_global_memory_reads<T>(elements, "Size");
   state.add_global_memory_writes<T>(1);
 
+  // FIXME(bgruber): the previous implementation did target cub::DispatchReduce, and provided T as accumulator type.
+  // This is not realistic, since a user cannot set the accumulator type the same way at the public API. For example,
+  // reducing I8 over cuda::std::plus deduces accumulator type I32 at the public API, but the benchmark forces it to I8.
+  // This skews the MemBoundScaling, leading to 20% regression for the same tuning when the public API is called (with
+  // accum_t I32) over the benchmark (forced accum_t of I8).
+
   caching_last_alloc_mr mr;
   state.exec(nvbench::exec_tag::gpu | nvbench::exec_tag::no_batch, [&](nvbench::launch& launch) {
     auto env = ::cuda::std::execution::env{
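
Note on the I8 example in the FIXME: the mismatch it describes can be reproduced on the host with plain standard C++. The sketch below is an illustration only, not CUB's implementation. It assumes the public API deduces the accumulator roughly as the decayed result type of applying the operator to two input values, and it uses `std::plus<>` as a stand-in for `cuda::std::plus`. The names `deduced_accum_t`, `reduce_with_accum` and `check_example` are hypothetical and do not exist in CUB or the benchmark.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper: the decayed type a binary operator yields for two T operands.
template <typename Op, typename T>
using deduced_accum_t =
  std::decay_t<decltype(std::declval<Op>()(std::declval<T>(), std::declval<T>()))>;

// int8_t + int8_t undergoes integral promotion, so the deduced type is int (I32),
// not the I8 that the old benchmark forced.
static_assert(std::is_same<deduced_accum_t<std::plus<>, std::int8_t>, int>::value,
              "I8 over plus deduces int");
// For float there is no promotion, so forced and deduced types coincide.
static_assert(std::is_same<deduced_accum_t<std::plus<>, float>, float>::value,
              "float over plus deduces float");

// Sequential host reduction with an explicit accumulator type (illustration only).
template <typename AccumT, typename T, typename Op>
AccumT reduce_with_accum(const std::vector<T>& in, Op op, AccumT init)
{
  AccumT acc = init;
  for (const T& x : in)
  {
    acc = static_cast<AccumT>(op(acc, x));
  }
  return acc;
}

// Four int8_t values of 100 sum to 400, which only fits in the wider accumulator.
inline int check_example()
{
  const std::vector<std::int8_t> v(4, 100);
  const int sum = reduce_with_accum(v, std::plus<>{}, 0); // AccumT deduced as int from init
  assert(sum == 400);
  return sum;
}
```

Besides correctness for values that overflow I8, the accumulator width changes how much data each thread carries, which is presumably why a tuning measured with a forced I8 accumulator does not transfer to the I32 case.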
