@IlyaGrebnov
Contributor

@IlyaGrebnov IlyaGrebnov commented Nov 15, 2025

On Policy860 (SM 8.6+, Ada/Hopper), MediumSegmentPolicy was using 16 threads per segment, while other architecture policies use a full warp (32 threads). That effectively cuts the “medium” capacity roughly in half:

  • Policy860: 16 threads × 7 items/thread = 112 items
  • Policy800: 32 threads × 7–11 items/thread = 224–352 items

As a result, segments in the [113, 352] range were classified as “large” and routed to the block-level radix sort instead of the faster warp-level merge sort. This PR changes Policy860 to use 32 threads per segment (full warp), aligning its MediumSegmentPolicy with the other architectures and restoring the intended “medium” cutoff on Ada/Hopper.
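
The capacity arithmetic above can be sketched in a few lines of C++. This is only an illustration: `MediumPolicy`, `medium_capacity`, and `routed_as_large` are hypothetical names, not CUB's actual policy types, and the "after" value assumes items per thread stays at 7 on Policy860. The sketch also ignores the separate "small" cutoff.

```cpp
#include <cassert>

// Hypothetical stand-in for a medium-segment policy; not CUB's real type.
struct MediumPolicy {
    int threads_per_segment;
    int items_per_thread;
};

// Largest segment the warp-level ("medium") path can hold in this model.
constexpr int medium_capacity(MediumPolicy p) {
    return p.threads_per_segment * p.items_per_thread;
}

// A segment larger than the medium capacity is routed to the "large" path.
constexpr bool routed_as_large(MediumPolicy p, int segment_size) {
    return segment_size > medium_capacity(p);
}

// Values quoted in the description above.
// Assumption: items per thread stays at 7 after the change.
constexpr MediumPolicy policy860_before{16, 7};
constexpr MediumPolicy policy860_after{32, 7};
constexpr MediumPolicy policy800_max{32, 11};

static_assert(medium_capacity(policy860_before) == 112, "16 x 7");
static_assert(medium_capacity(policy860_after) == 224, "32 x 7");
static_assert(medium_capacity(policy800_max) == 352, "32 x 11");
static_assert(routed_as_large(policy860_before, 113), "large before the change");
static_assert(!routed_as_large(policy860_after, 113), "medium after the change");
```

Under these assumptions, a 113-item segment moves from the "large" path to the "medium" path once the policy uses a full warp.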

On a standard industry benchmark that uses the libcubwt library for Burrows-Wheeler transform construction, throughput improves by ~7% on average, and the most affected inputs (e.g., proteins.001.1, rs.13) speed up by up to ~50%. A few inputs fluctuate within ~2%, which looks like acceptable variation.

| dataset | size (bytes) | before (time, throughput) | after (time, throughput) |
|---------|--------------|---------------------------|--------------------------|
| enwik8 | 100000000 | 0.145 sec (691.84 MB/s) | 0.136 sec (736.81 MB/s) |
| enwik9 | 369098752 | 0.568 sec (649.78 MB/s) | 0.531 sec (694.76 MB/s) |
| chr22.dna | 34553758 | 0.090 sec (382.34 MB/s) | 0.089 sec (386.53 MB/s) |
| etext99 | 105277340 | 0.195 sec (540.69 MB/s) | 0.167 sec (628.72 MB/s) |
| gcc-3.0.tar | 86630400 | 0.215 sec (402.98 MB/s) | 0.215 sec (402.60 MB/s) |
| howto | 39422105 | 0.064 sec (615.49 MB/s) | 0.061 sec (644.09 MB/s) |
| jdk13c | 69728899 | 0.170 sec (410.05 MB/s) | 0.158 sec (441.78 MB/s) |
| linux-2.4.5.tar | 116254720 | 0.246 sec (473.32 MB/s) | 0.246 sec (471.87 MB/s) |
| rctail96 | 114711151 | 0.249 sec (460.40 MB/s) | 0.232 sec (493.80 MB/s) |
| rfc | 116421901 | 0.223 sec (522.74 MB/s) | 0.222 sec (524.79 MB/s) |
| sprot34.dat | 109617186 | 0.207 sec (530.69 MB/s) | 0.192 sec (570.10 MB/s) |
| w3c2 | 104201579 | 0.280 sec (372.51 MB/s) | 0.259 sec (402.34 MB/s) |
| dblp.xml | 296135874 | 0.546 sec (541.99 MB/s) | 0.509 sec (581.30 MB/s) |
| dna | 369098752 | 0.582 sec (634.39 MB/s) | 0.572 sec (644.78 MB/s) |
| english.1024MB | 369098752 | 0.744 sec (496.28 MB/s) | 0.709 sec (520.68 MB/s) |
| pitches | 55832855 | 0.094 sec (591.88 MB/s) | 0.093 sec (598.75 MB/s) |
| proteins | 369098752 | 0.726 sec (508.49 MB/s) | 0.681 sec (542.03 MB/s) |
| sources | 210866607 | 0.383 sec (549.86 MB/s) | 0.357 sec (590.10 MB/s) |
| cere | 369098752 | 1.735 sec (212.68 MB/s) | 1.774 sec (208.05 MB/s) |
| coreutils | 205281778 | 0.873 sec (235.12 MB/s) | 0.754 sec (272.11 MB/s) |
| einstein.de.txt | 92758441 | 0.532 sec (174.40 MB/s) | 0.419 sec (221.18 MB/s) |
| einstein.en.txt | 369098752 | 2.132 sec (173.10 MB/s) | 1.700 sec (217.13 MB/s) |
| Escherichia_Coli | 112689515 | 0.296 sec (380.28 MB/s) | 0.296 sec (380.39 MB/s) |
| influenza | 154808555 | 0.515 sec (300.74 MB/s) | 0.454 sec (341.04 MB/s) |
| kernel | 257961616 | 1.044 sec (247.10 MB/s) | 1.049 sec (245.92 MB/s) |
| para | 369098752 | 1.304 sec (282.95 MB/s) | 1.325 sec (278.48 MB/s) |
| world_leaders | 46968181 | 0.302 sec (155.71 MB/s) | 0.300 sec (156.58 MB/s) |
| dblp.xml.00001.1 | 104857600 | 0.417 sec (251.59 MB/s) | 0.406 sec (258.59 MB/s) |
| dblp.xml.00001.2 | 104857600 | 0.424 sec (247.40 MB/s) | 0.413 sec (253.68 MB/s) |
| dblp.xml.0001.1 | 104857600 | 0.364 sec (288.14 MB/s) | 0.350 sec (299.45 MB/s) |
| dblp.xml.0001.2 | 104857600 | 0.365 sec (287.29 MB/s) | 0.354 sec (296.31 MB/s) |
| dna.001.1 | 104857600 | 0.270 sec (388.41 MB/s) | 0.270 sec (388.90 MB/s) |
| english.001.2 | 104857600 | 0.279 sec (375.51 MB/s) | 0.272 sec (385.10 MB/s) |
| proteins.001.1 | 104857600 | 0.475 sec (220.58 MB/s) | 0.310 sec (338.12 MB/s) |
| sources.001.2 | 104857600 | 0.295 sec (355.87 MB/s) | 0.280 sec (374.89 MB/s) |
| fib41 | 267914296 | 2.421 sec (110.67 MB/s) | 2.259 sec (118.59 MB/s) |
| rs.13 | 216747218 | 2.050 sec (105.72 MB/s) | 1.698 sec (127.66 MB/s) |
| tm29 | 268435456 | 2.292 sec (117.09 MB/s) | 2.095 sec (128.15 MB/s) |
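
For readers who want to check individual rows, the relative throughput change can be computed as sketched below. The helper name `throughput_gain_percent` is hypothetical and introduced only for illustration; the inputs in the comments are taken from the table above.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput change in percent; positive means "after" is faster.
inline double throughput_gain_percent(double before_mb_s, double after_mb_s) {
    return (after_mb_s - before_mb_s) / before_mb_s * 100.0;
}

// Example inputs from the table:
//   proteins.001.1: throughput_gain_percent(220.58, 338.12)  (about +53%)
//   rs.13:          throughput_gain_percent(105.72, 127.66)  (about +21%)
//   cere:           throughput_gain_percent(212.68, 208.05)  (about -2%)
```

The same formula applied to the time columns gives slightly different figures, because the reported times are rounded to milliseconds.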

Description

closes #6173

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@IlyaGrebnov IlyaGrebnov requested a review from a team as a code owner November 15, 2025 20:07
@IlyaGrebnov IlyaGrebnov requested a review from miscco November 15, 2025 20:07
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 15, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 15, 2025
@bernhardmgruber
Contributor

Hi! Thank you for bringing this to our attention! I think we should run some benchmarks ourselves with your proposed change and then come back to you.

@NaderAlAwar
Contributor

@IlyaGrebnov I benchmarked your changes with our cub.bench.segmented_sort.keys.base benchmark target on an H200 GPU. You can reproduce my results by running:

ninja cub.bench.segmented_sort.keys.base
./bin/cub.bench.segmented_sort.keys.base --devices 0 --stopping-criterion entropy --json before.json

then checking out your branch and rerunning the benchmark, saving the output to after.json. I then ran the nvbench_compare.py script like so:

python _deps/nvbench-src/scripts/nvbench_compare.py before.json after.json

The results I obtained are below. Overall, there are significant speedups for segment sizes in the range you mentioned ([113, 352]), but we also see regressions for some segment sizes slightly below that range.

# power

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  Segments{io}  |  Entropy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------|---------------|----------------|----------------|-----------|------------|-------------|------------|-------------|---------------|---------|----------|
|   I8    |      I32      |      2^22      |      2^12      |     1     | 135.154 us |       1.01% | 124.416 us |       1.02% |    -10.738 us |  -7.94% |   FAST   |
|   I8    |      I32      |      2^26      |      2^12      |     1     | 815.618 us |       3.22% | 817.400 us |       3.37% |      1.783 us |   0.22% |   SAME   |
|   I8    |      I32      |      2^30      |      2^12      |     1     |  12.697 ms |       2.79% |  12.716 ms |       2.91% |     18.298 us |   0.14% |   SAME   |
|   I8    |      I32      |      2^22      |      2^16      |     1     | 240.828 us |       0.38% | 159.555 us |       0.56% |    -81.273 us | -33.75% |   FAST   |
|   I8    |      I32      |      2^26      |      2^16      |     1     |   1.319 ms |       0.09% |   1.157 ms |       0.11% |   -162.358 us | -12.31% |   FAST   |
|   I8    |      I32      |      2^30      |      2^16      |     1     |   7.939 ms |       0.20% |   7.939 ms |       0.22% |     -0.446 us |  -0.01% |   SAME   |
|   I8    |      I32      |      2^22      |      2^20      |     1     | 107.855 us |       3.32% | 124.552 us |       2.87% |     16.697 us |  15.48% |   SLOW   |
|   I8    |      I32      |      2^26      |      2^20      |     1     |   3.372 ms |       0.11% |   1.993 ms |       0.18% |  -1378.991 us | -40.89% |   FAST   |
|   I8    |      I32      |      2^30      |      2^20      |     1     |  20.470 ms |       0.03% |  17.801 ms |       0.03% |  -2669.103 us | -13.04% |   FAST   |
|   I8    |      I32      |      2^22      |      2^12      |   0.201   | 134.558 us |       0.97% | 123.910 us |       1.11% |    -10.648 us |  -7.91% |   FAST   |
|   I8    |      I32      |      2^26      |      2^12      |   0.201   | 801.665 us |       3.27% | 803.229 us |       3.40% |      1.564 us |   0.20% |   SAME   |
|   I8    |      I32      |      2^30      |      2^12      |   0.201   |  12.517 ms |       2.57% |  12.522 ms |       2.74% |      4.944 us |   0.04% |   SAME   |
|   I8    |      I32      |      2^22      |      2^16      |   0.201   | 240.502 us |       0.31% | 158.498 us |       0.49% |    -82.004 us | -34.10% |   FAST   |
|   I8    |      I32      |      2^26      |      2^16      |   0.201   |   1.312 ms |       0.09% |   1.150 ms |       0.12% |   -161.751 us | -12.33% |   FAST   |
|   I8    |      I32      |      2^30      |      2^16      |   0.201   |   7.808 ms |       0.22% |   7.808 ms |       0.20% |     -0.832 us |  -0.01% |   SAME   |
|   I8    |      I32      |      2^22      |      2^20      |   0.201   | 108.198 us |       3.35% | 125.030 us |       2.91% |     16.832 us |  15.56% |   SLOW   |
|   I8    |      I32      |      2^26      |      2^20      |   0.201   |   3.370 ms |       0.11% |   1.992 ms |       0.19% |  -1377.797 us | -40.88% |   FAST   |
|   I8    |      I32      |      2^30      |      2^20      |   0.201   |  20.376 ms |       0.03% |  17.711 ms |       0.05% |  -2665.091 us | -13.08% |   FAST   |
|   I16   |      I32      |      2^22      |      2^12      |     1     | 259.749 us |       0.57% | 233.907 us |       0.52% |    -25.842 us |  -9.95% |   FAST   |
|   I16   |      I32      |      2^26      |      2^12      |     1     |   1.710 ms |       0.12% |   1.711 ms |       0.12% |      0.741 us |   0.04% |   SAME   |
|   I16   |      I32      |      2^30      |      2^12      |     1     |  27.545 ms |       0.19% |  27.548 ms |       0.19% |      2.865 us |   0.01% |   SAME   |
|   I16   |      I32      |      2^22      |      2^16      |     1     | 493.428 us |       0.19% | 254.032 us |       0.40% |   -239.396 us | -48.52% |   FAST   |
|   I16   |      I32      |      2^26      |      2^16      |     1     |   3.101 ms |       0.04% |   2.682 ms |       0.04% |   -418.657 us | -13.50% |   FAST   |
|   I16   |      I32      |      2^30      |      2^16      |     1     |  19.635 ms |       0.02% |  19.633 ms |       0.02% |     -1.688 us |  -0.01% |   SAME   |
|   I16   |      I32      |      2^22      |      2^20      |     1     | 118.396 us |       3.03% | 127.923 us |       2.96% |      9.527 us |   8.05% |   SLOW   |
|   I16   |      I32      |      2^26      |      2^20      |     1     |   7.554 ms |       0.05% |   3.509 ms |       0.11% |  -4044.938 us | -53.55% |   FAST   |
|   I16   |      I32      |      2^30      |      2^20      |     1     |  48.893 ms |       0.03% |  42.135 ms |       0.06% |  -6758.233 us | -13.82% |   FAST   |
|   I16   |      I32      |      2^22      |      2^12      |   0.201   | 257.973 us |       0.73% | 232.422 us |       0.39% |    -25.551 us |  -9.90% |   FAST   |
|   I16   |      I32      |      2^26      |      2^12      |   0.201   |   1.683 ms |       0.13% |   1.684 ms |       0.17% |      0.930 us |   0.06% |   SAME   |
|   I16   |      I32      |      2^30      |      2^12      |   0.201   |  27.197 ms |       0.20% |  27.201 ms |       0.18% |      4.305 us |   0.02% |   SAME   |
|   I16   |      I32      |      2^22      |      2^16      |   0.201   | 492.621 us |       0.21% | 253.397 us |       0.33% |   -239.224 us | -48.56% |   FAST   |
|   I16   |      I32      |      2^26      |      2^16      |   0.201   |   3.092 ms |       0.04% |   2.673 ms |       0.05% |   -418.965 us | -13.55% |   FAST   |
|   I16   |      I32      |      2^30      |      2^16      |   0.201   |  19.267 ms |       0.02% |  19.265 ms |       0.02% |     -1.546 us |  -0.01% |   SAME   |
|   I16   |      I32      |      2^22      |      2^20      |   0.201   | 117.649 us |       3.00% | 127.714 us |       2.90% |     10.065 us |   8.56% |   SLOW   |
|   I16   |      I32      |      2^26      |      2^20      |   0.201   |   7.552 ms |       0.05% |   3.506 ms |       0.11% |  -4046.558 us | -53.58% |   FAST   |
|   I16   |      I32      |      2^30      |      2^20      |   0.201   |  48.745 ms |       0.03% |  41.984 ms |       0.05% |  -6761.515 us | -13.87% |   FAST   |
|   I32   |      I32      |      2^22      |      2^12      |     1     | 419.989 us |       0.26% | 373.139 us |       0.24% |    -46.851 us | -11.16% |   FAST   |
|   I32   |      I32      |      2^26      |      2^12      |     1     |   3.372 ms |       0.12% |   3.373 ms |       0.12% |      1.035 us |   0.03% |   SAME   |
|   I32   |      I32      |      2^30      |      2^12      |     1     |  59.603 ms |       0.09% |  59.605 ms |       0.08% |      2.091 us |   0.00% |   SAME   |
|   I32   |      I32      |      2^22      |      2^16      |     1     | 769.322 us |       0.12% | 344.018 us |       0.34% |   -425.304 us | -55.28% |   FAST   |
|   I32   |      I32      |      2^26      |      2^16      |     1     |   5.123 ms |       0.02% |   4.418 ms |       0.02% |   -704.742 us | -13.76% |   FAST   |
|   I32   |      I32      |      2^30      |      2^16      |     1     |  39.262 ms |       0.02% |  39.259 ms |       0.01% |     -2.961 us |  -0.01% |   SAME   |
|   I32   |      I32      |      2^22      |      2^20      |     1     | 122.785 us |       2.90% | 118.855 us |       3.22% |     -3.930 us |  -3.20% |   FAST   |
|   I32   |      I32      |      2^26      |      2^20      |     1     |  11.966 ms |       0.07% |   4.901 ms |       0.10% |  -7064.916 us | -59.04% |   FAST   |
|   I32   |      I32      |      2^30      |      2^20      |     1     |  81.021 ms |       0.04% |  69.633 ms |       0.05% | -11387.947 us | -14.06% |   FAST   |
|   I32   |      I32      |      2^22      |      2^12      |   0.201   | 417.907 us |       0.26% | 370.674 us |       0.30% |    -47.232 us | -11.30% |   FAST   |
|   I32   |      I32      |      2^26      |      2^12      |   0.201   |   3.316 ms |       0.12% |   3.316 ms |       0.13% |      0.401 us |   0.01% |   SAME   |
|   I32   |      I32      |      2^30      |      2^12      |   0.201   |  58.618 ms |       0.08% |  58.620 ms |       0.08% |      2.339 us |   0.00% |   SAME   |
|   I32   |      I32      |      2^22      |      2^16      |   0.201   | 769.651 us |       0.12% | 343.707 us |       0.32% |   -425.944 us | -55.34% |   FAST   |
|   I32   |      I32      |      2^26      |      2^16      |   0.201   |   5.116 ms |       0.04% |   4.410 ms |       0.02% |   -705.863 us | -13.80% |   FAST   |
|   I32   |      I32      |      2^30      |      2^16      |   0.201   |  38.590 ms |       0.02% |  38.586 ms |       0.02% |     -4.046 us |  -0.01% |   SAME   |
|   I32   |      I32      |      2^22      |      2^20      |   0.201   | 122.512 us |       2.99% | 118.698 us |       3.18% |     -3.814 us |  -3.11% |   FAST   |
|   I32   |      I32      |      2^26      |      2^20      |   0.201   |  11.962 ms |       0.07% |   4.896 ms |       0.09% |  -7065.843 us | -59.07% |   FAST   |
|   I32   |      I32      |      2^30      |      2^20      |   0.201   |  80.879 ms |       0.05% |  69.507 ms |       0.05% | -11371.922 us | -14.06% |   FAST   |
|   I64   |      I32      |      2^22      |      2^12      |     1     | 778.664 us |       0.16% | 739.221 us |       0.15% |    -39.443 us |  -5.07% |   FAST   |
|   I64   |      I32      |      2^26      |      2^12      |     1     |  10.241 ms |       0.18% |  10.241 ms |       0.18% |     -0.212 us |  -0.00% |   SAME   |
|   I64   |      I32      |      2^30      |      2^12      |     1     | 186.220 ms |       0.10% | 186.224 ms |       0.08% |      4.369 us |   0.00% |   SAME   |
|   I64   |      I32      |      2^22      |      2^16      |     1     |   1.912 ms |       0.06% | 911.559 us |       0.10% |  -1000.246 us | -52.32% |   FAST   |
|   I64   |      I32      |      2^26      |      2^16      |     1     |   9.290 ms |       0.02% |   8.743 ms |       0.02% |   -546.791 us |  -5.89% |   FAST   |
|   I64   |      I32      |      2^30      |      2^16      |     1     | 120.606 ms |       0.02% | 120.599 ms |       0.02% |     -6.982 us |  -0.01% |   SAME   |
|   I64   |      I32      |      2^22      |      2^20      |     1     | 358.730 us |       1.08% | 210.320 us |       1.82% |   -148.410 us | -41.37% |   FAST   |
|   I64   |      I32      |      2^26      |      2^20      |     1     |  30.096 ms |       0.06% |  14.086 ms |       0.07% | -16010.276 us | -53.20% |   FAST   |
|   I64   |      I32      |      2^30      |      2^20      |     1     | 146.166 ms |       0.03% | 137.221 ms |       0.03% |  -8944.963 us |  -6.12% |   FAST   |
|   I64   |      I32      |      2^22      |      2^12      |   0.201   | 772.326 us |       0.17% | 733.098 us |       0.17% |    -39.228 us |  -5.08% |   FAST   |
|   I64   |      I32      |      2^26      |      2^12      |   0.201   |  10.044 ms |       0.17% |  10.046 ms |       0.17% |      1.630 us |   0.02% |   SAME   |
|   I64   |      I32      |      2^30      |      2^12      |   0.201   | 183.608 ms |       0.09% | 183.638 ms |       0.10% |     30.038 us |   0.02% |   SAME   |
|   I64   |      I32      |      2^22      |      2^16      |   0.201   |   1.910 ms |       0.06% | 911.037 us |       0.13% |   -999.041 us | -52.30% |   FAST   |
|   I64   |      I32      |      2^26      |      2^16      |   0.201   |   9.231 ms |       0.02% |   8.688 ms |       0.03% |   -542.780 us |  -5.88% |   FAST   |
|   I64   |      I32      |      2^30      |      2^16      |   0.201   | 118.352 ms |       0.02% | 118.350 ms |       0.02% |     -1.888 us |  -0.00% |   SAME   |
|   I64   |      I32      |      2^22      |      2^20      |   0.201   | 358.741 us |       1.03% | 210.898 us |       1.75% |   -147.843 us | -41.21% |   FAST   |
|   I64   |      I32      |      2^26      |      2^20      |   0.201   |  30.062 ms |       0.03% |  14.076 ms |       0.06% | -15986.318 us | -53.18% |   FAST   |
|   I64   |      I32      |      2^30      |      2^20      |   0.201   | 145.282 ms |       0.01% | 136.392 ms |       0.03% |  -8889.502 us |  -6.12% |   FAST   |
|  I128   |      I32      |      2^22      |      2^12      |     1     |   1.484 ms |       3.37% |   1.467 ms |       3.64% |    -16.520 us |  -1.11% |   SAME   |
|  I128   |      I32      |      2^26      |      2^12      |     1     |  27.727 ms |       1.84% |  27.719 ms |       1.74% |     -7.604 us |  -0.03% |   SAME   |
|  I128   |      I32      |      2^30      |      2^12      |     1     | 467.864 ms |       1.31% | 466.899 ms |       0.95% |   -964.449 us |  -0.21% |   SAME   |
|  I128   |      I32      |      2^22      |      2^16      |     1     |   4.201 ms |       0.03% |   2.457 ms |       0.21% |  -1744.459 us | -41.52% |   FAST   |
|  I128   |      I32      |      2^26      |      2^16      |     1     |  16.308 ms |       0.20% |  16.033 ms |       0.21% |   -274.640 us |  -1.68% |   FAST   |
|  I128   |      I32      |      2^30      |      2^16      |     1     | 299.337 ms |       0.13% | 299.193 ms |       0.05% |   -144.541 us |  -0.05% |   SAME   |
|  I128   |      I32      |      2^22      |      2^20      |     1     |   1.478 ms |       0.25% | 475.035 us |       0.70% |  -1002.488 us | -67.85% |   FAST   |
|  I128   |      I32      |      2^26      |      2^20      |     1     |  66.814 ms |       0.05% |  38.446 ms |       0.06% | -28368.106 us | -42.46% |   FAST   |
|  I128   |      I32      |      2^30      |      2^20      |     1     | 252.051 ms |       0.02% | 247.814 ms |       0.02% |  -4236.079 us |  -1.68% |   FAST   |
|  I128   |      I32      |      2^22      |      2^12      |   0.201   |   1.469 ms |       3.20% |   1.452 ms |       3.38% |    -17.293 us |  -1.18% |   SAME   |
|  I128   |      I32      |      2^26      |      2^12      |   0.201   |  27.246 ms |       1.92% |  27.239 ms |       1.88% |     -6.875 us |  -0.03% |   SAME   |
|  I128   |      I32      |      2^30      |      2^12      |   0.201   | 460.723 ms |       2.25% | 459.804 ms |       1.24% |   -918.410 us |  -0.20% |   SAME   |
|  I128   |      I32      |      2^22      |      2^16      |   0.201   |   4.200 ms |       0.02% |   2.455 ms |       0.21% |  -1744.819 us | -41.55% |   FAST   |
|  I128   |      I32      |      2^26      |      2^16      |   0.201   |  16.175 ms |       0.20% |  15.896 ms |       0.19% |   -278.328 us |  -1.72% |   FAST   |
|  I128   |      I32      |      2^30      |      2^16      |   0.201   | 293.261 ms |       0.12% | 293.253 ms |       0.11% |     -7.832 us |  -0.00% |   SAME   |
|  I128   |      I32      |      2^22      |      2^20      |   0.201   |   1.479 ms |       0.23% | 474.182 us |       0.68% |  -1004.559 us | -67.93% |   FAST   |
|  I128   |      I32      |      2^26      |      2^20      |   0.201   |  66.766 ms |       0.05% |  38.413 ms |       0.06% | -28352.497 us | -42.47% |   FAST   |
|  I128   |      I32      |      2^30      |      2^20      |   0.201   | 250.003 ms |       0.01% | 245.763 ms |       0.02% |  -4240.285 us |  -1.70% |   FAST   |
|   F32   |      I32      |      2^22      |      2^12      |     1     | 417.049 us |       0.32% | 371.341 us |       0.33% |    -45.709 us | -10.96% |   FAST   |
|   F32   |      I32      |      2^26      |      2^12      |     1     |   3.245 ms |       0.13% |   3.246 ms |       0.13% |      1.051 us |   0.03% |   SAME   |
|   F32   |      I32      |      2^30      |      2^12      |     1     |  56.189 ms |       0.09% |  56.187 ms |       0.08% |     -2.279 us |  -0.00% |   SAME   |
|   F32   |      I32      |      2^22      |      2^16      |     1     | 776.273 us |       0.17% | 349.413 us |       0.35% |   -426.860 us | -54.99% |   FAST   |
|   F32   |      I32      |      2^26      |      2^16      |     1     |   5.120 ms |       0.02% |   4.413 ms |       0.04% |   -706.739 us | -13.80% |   FAST   |
|   F32   |      I32      |      2^30      |      2^16      |     1     |  37.967 ms |       0.02% |  37.963 ms |       0.02% |     -3.613 us |  -0.01% |   SAME   |
|   F32   |      I32      |      2^22      |      2^20      |     1     | 128.960 us |       2.76% | 124.500 us |       2.83% |     -4.459 us |  -3.46% |   FAST   |
|   F32   |      I32      |      2^26      |      2^20      |     1     |  12.010 ms |       0.07% |   4.935 ms |       0.09% |  -7075.099 us | -58.91% |   FAST   |
|   F32   |      I32      |      2^30      |      2^20      |     1     |  80.957 ms |       0.04% |  69.546 ms |       0.05% | -11411.269 us | -14.10% |   FAST   |
|   F32   |      I32      |      2^22      |      2^12      |   0.201   | 415.545 us |       0.26% | 369.875 us |       0.30% |    -45.669 us | -10.99% |   FAST   |
|   F32   |      I32      |      2^26      |      2^12      |   0.201   |   3.212 ms |       0.12% |   3.213 ms |       0.12% |      1.427 us |   0.04% |   SAME   |
|   F32   |      I32      |      2^30      |      2^12      |   0.201   |  55.742 ms |       0.10% |  55.743 ms |       0.11% |      1.053 us |   0.00% |   SAME   |
|   F32   |      I32      |      2^22      |      2^16      |   0.201   | 775.565 us |       0.15% | 349.153 us |       0.36% |   -426.413 us | -54.98% |   FAST   |
|   F32   |      I32      |      2^26      |      2^16      |   0.201   |   5.116 ms |       0.04% |   4.409 ms |       0.04% |   -706.748 us | -13.82% |   FAST   |
|   F32   |      I32      |      2^30      |      2^16      |   0.201   |  37.576 ms |       0.02% |  37.572 ms |       0.02% |     -3.555 us |  -0.01% |   SAME   |
|   F32   |      I32      |      2^22      |      2^20      |   0.201   | 129.034 us |       2.74% | 124.478 us |       2.91% |     -4.556 us |  -3.53% |   FAST   |
|   F32   |      I32      |      2^26      |      2^20      |   0.201   |  12.001 ms |       0.06% |   4.933 ms |       0.08% |  -7068.589 us | -58.90% |   FAST   |
|   F32   |      I32      |      2^30      |      2^20      |   0.201   |  80.866 ms |       0.04% |  69.472 ms |       0.05% | -11394.144 us | -14.09% |   FAST   |
|   F64   |      I32      |      2^22      |      2^12      |     1     | 676.158 us |       3.42% | 647.289 us |       3.86% |    -28.868 us |  -4.27% |   FAST   |
|   F64   |      I32      |      2^26      |      2^12      |     1     |   9.382 ms |       2.62% |   9.386 ms |       2.75% |      3.958 us |   0.04% |   SAME   |
|   F64   |      I32      |      2^30      |      2^12      |     1     | 157.794 ms |       2.80% | 156.551 ms |       1.64% |  -1243.075 us |  -0.79% |   SAME   |
|   F64   |      I32      |      2^22      |      2^16      |     1     |   1.443 ms |       0.07% | 704.385 us |       0.13% |   -738.896 us | -51.20% |   FAST   |
|   F64   |      I32      |      2^26      |      2^16      |     1     |   7.254 ms |       0.17% |   6.845 ms |       0.18% |   -408.938 us |  -5.64% |   FAST   |
|   F64   |      I32      |      2^30      |      2^16      |     1     |  99.775 ms |       0.09% |  99.802 ms |       0.17% |     26.835 us |   0.03% |   SAME   |
|   F64   |      I32      |      2^22      |      2^20      |     1     | 305.738 us |       1.21% | 199.229 us |       1.47% |   -106.508 us | -34.84% |   FAST   |
|   F64   |      I32      |      2^26      |      2^20      |     1     |  22.515 ms |       0.06% |  10.710 ms |       0.07% | -11804.915 us | -52.43% |   FAST   |
|   F64   |      I32      |      2^30      |      2^20      |     1     | 111.970 ms |       0.03% | 105.382 ms |       0.03% |  -6588.140 us |  -5.88% |   FAST   |
|   F64   |      I32      |      2^22      |      2^12      |   0.201   | 669.280 us |       3.35% | 640.238 us |       3.91% |    -29.043 us |  -4.34% |   FAST   |
|   F64   |      I32      |      2^26      |      2^12      |   0.201   |   9.161 ms |       2.66% |   9.161 ms |       2.63% |      0.133 us |   0.00% |   SAME   |
|   F64   |      I32      |      2^30      |      2^12      |   0.201   | 153.592 ms |       1.75% | 153.894 ms |       1.98% |    301.599 us |   0.20% |   SAME   |
|   F64   |      I32      |      2^22      |      2^16      |   0.201   |   1.442 ms |       0.08% | 703.630 us |       0.13% |   -738.477 us | -51.21% |   FAST   |
|   F64   |      I32      |      2^26      |      2^16      |   0.201   |   7.199 ms |       0.16% |   6.789 ms |       0.16% |   -409.626 us |  -5.69% |   FAST   |
|   F64   |      I32      |      2^30      |      2^16      |   0.201   |  97.568 ms |       0.13% |  97.561 ms |       0.14% |     -7.332 us |  -0.01% |   SAME   |
|   F64   |      I32      |      2^22      |      2^20      |   0.201   | 306.043 us |       1.18% | 197.751 us |       1.88% |   -108.292 us | -35.38% |   FAST   |
|   F64   |      I32      |      2^26      |      2^20      |   0.201   |  22.498 ms |       0.05% |  10.699 ms |       0.06% | -11798.924 us | -52.45% |   FAST   |
|   F64   |      I32      |      2^30      |      2^20      |   0.201   | 111.160 ms |       0.02% | 104.595 ms |       0.04% |  -6565.312 us |  -5.91% |   FAST   |
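
As a reading aid for the Diff, %Diff, and Status columns, here is a minimal sketch of how they relate. The status rule is an assumption inferred from the rows in these tables (a change is flagged only when |%Diff| exceeds the smaller of the two noise values); it is not taken from the nvbench_compare.py source, and `Comparison` and `compare` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>

struct Comparison {
    double diff;          // cmp - ref, in the same unit as the inputs
    double percent_diff;  // diff relative to ref, in percent
    std::string status;   // "FAST", "SLOW", or "SAME"
};

// Assumed rule, inferred from the table rows rather than from nvbench itself:
// report SAME unless |%Diff| exceeds the smaller of the two noise percentages.
inline Comparison compare(double ref_time, double ref_noise_pct,
                          double cmp_time, double cmp_noise_pct) {
    Comparison c;
    c.diff = cmp_time - ref_time;
    c.percent_diff = c.diff / ref_time * 100.0;
    const double threshold = std::min(ref_noise_pct, cmp_noise_pct);
    if (std::fabs(c.percent_diff) <= threshold) {
        c.status = "SAME";
    } else {
        c.status = (c.diff < 0.0) ? "FAST" : "SLOW";
    }
    return c;
}
```

For example, the first row (135.154 us vs. 124.416 us, noise 1.01% and 1.02%) yields a difference of -10.738 us and is flagged FAST under this rule, matching the table.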

# small

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  MaxSegmentSize  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |            Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------|-------------|------------|-------------|-----------------|---------|----------|
|   I8    |      I32      |      2^22      |       2^1        |  65.803 us |       3.78% |  66.725 us |       3.76% |        0.922 us |   1.40% |   SAME   |
|   I8    |      I32      |      2^26      |       2^1        | 740.253 us |       0.33% | 738.627 us |       0.45% |       -1.626 us |  -0.22% |   SAME   |
|   I8    |      I32      |      2^30      |       2^1        |  11.275 ms |       0.08% |  11.276 ms |       0.03% |        1.050 us |   0.01% |   SAME   |
|   I8    |      I32      |      2^22      |       2^2        |  82.771 us |       3.23% |  84.683 us |       4.06% |        1.913 us |   2.31% |   SAME   |
|   I8    |      I32      |      2^26      |       2^2        | 926.261 us |       0.11% | 925.661 us |       0.13% |       -0.600 us |  -0.06% |   SAME   |
|   I8    |      I32      |      2^30      |       2^2        |  14.307 ms |       0.03% |  14.307 ms |       0.01% |        0.238 us |   0.00% |   SAME   |
|   I8    |      I32      |      2^22      |       2^3        |  55.235 us |       1.86% |  54.656 us |       1.90% |       -0.579 us |  -1.05% |   SAME   |
|   I8    |      I32      |      2^26      |       2^3        | 472.518 us |       0.76% | 473.919 us |       0.70% |        1.402 us |   0.30% |   SAME   |
|   I8    |      I32      |      2^30      |       2^3        |   7.054 ms |       0.04% |   7.052 ms |       0.05% |       -1.320 us |  -0.02% |   SAME   |
|   I8    |      I32      |      2^22      |       2^4        |  43.531 us |       1.58% |  43.582 us |       2.25% |        0.051 us |   0.12% |   SAME   |
|   I8    |      I32      |      2^26      |       2^4        | 274.880 us |       0.31% | 274.442 us |       0.40% |       -0.438 us |  -0.16% |   SAME   |
|   I8    |      I32      |      2^30      |       2^4        |   3.879 ms |       0.07% |   3.878 ms |       0.07% |       -0.991 us |  -0.03% |   SAME   |
|   I8    |      I32      |      2^22      |       2^5        | 113.182 us |       0.62% | 209.653 us |       0.32% |       96.471 us |  85.24% |   SLOW   |
|   I8    |      I32      |      2^26      |       2^5        |   1.380 ms |       0.17% |   2.945 ms |       0.12% |        1.565 ms | 113.35% |   SLOW   |
|   I8    |      I32      |      2^30      |       2^5        |  21.707 ms |       0.02% |  46.727 ms |       0.01% |       25.020 ms | 115.27% |   SLOW   |
|   I8    |      I32      |      2^22      |       2^6        |  83.277 us |       0.69% | 146.179 us |       1.32% |       62.903 us |  75.53% |   SLOW   |
|   I8    |      I32      |      2^26      |       2^6        | 877.539 us |       0.29% |   1.893 ms |       0.16% |        1.015 ms | 115.66% |   SLOW   |
|   I8    |      I32      |      2^30      |       2^6        |  13.626 ms |       0.01% |  29.850 ms |       0.01% |       16.224 ms | 119.06% |   SLOW   |
|   I8    |      I32      |      2^22      |       2^7        | 263.814 us |       0.24% |  93.932 us |       0.75% |     -169.882 us | -64.39% |   FAST   |
|   I8    |      I32      |      2^26      |       2^7        |   3.740 ms |       0.04% |   1.046 ms |       0.09% |    -2693.988 us | -72.04% |   FAST   |
|   I8    |      I32      |      2^30      |       2^7        |  59.882 ms |       0.00% |  16.299 ms |       0.02% |   -43583.321 us | -72.78% |   FAST   |
|   I8    |      I32      |      2^22      |       2^8        | 454.705 us |       0.15% | 167.243 us |       0.50% |     -287.462 us | -63.22% |   FAST   |
|   I8    |      I32      |      2^26      |       2^8        |   6.894 ms |       0.03% |   2.154 ms |       0.06% |    -4740.541 us | -68.76% |   FAST   |
|   I8    |      I32      |      2^30      |       2^8        | 110.241 ms |       0.01% |  34.239 ms |       0.01% |   -76002.204 us | -68.94% |   FAST   |
|   I16   |      I32      |      2^22      |       2^1        |  67.112 us |       3.20% |  67.698 us |       3.08% |        0.587 us |   0.87% |   SAME   |
|   I16   |      I32      |      2^26      |       2^1        | 775.704 us |       0.44% | 776.558 us |       0.35% |        0.854 us |   0.11% |   SAME   |
|   I16   |      I32      |      2^30      |       2^1        |  11.885 ms |       0.03% |  11.892 ms |       0.03% |        6.776 us |   0.06% |   SLOW   |
|   I16   |      I32      |      2^22      |       2^2        |  80.719 us |       3.55% |  81.856 us |       4.05% |        1.136 us |   1.41% |   SAME   |
|   I16   |      I32      |      2^26      |       2^2        | 896.007 us |       0.19% | 896.204 us |       0.12% |        0.197 us |   0.02% |   SAME   |
|   I16   |      I32      |      2^30      |       2^2        |  13.821 ms |       0.02% |  13.821 ms |       0.03% |        0.019 us |   0.00% |   SAME   |
|   I16   |      I32      |      2^22      |       2^3        |  54.342 us |       1.37% |  54.199 us |       1.73% |       -0.144 us |  -0.26% |   SAME   |
|   I16   |      I32      |      2^26      |       2^3        | 466.403 us |       0.72% | 466.446 us |       0.77% |        0.043 us |   0.01% |   SAME   |
|   I16   |      I32      |      2^30      |       2^3        |   6.937 ms |       0.05% |   6.939 ms |       0.06% |        1.595 us |   0.02% |   SAME   |
|   I16   |      I32      |      2^22      |       2^4        |  45.221 us |       2.69% |  45.457 us |       5.77% |        0.236 us |   0.52% |   SAME   |
|   I16   |      I32      |      2^26      |       2^4        | 287.673 us |       0.41% | 287.240 us |       0.39% |       -0.433 us |  -0.15% |   SAME   |
|   I16   |      I32      |      2^30      |       2^4        |   4.062 ms |       0.08% |   4.062 ms |       0.09% |        0.514 us |   0.01% |   SAME   |
|   I16   |      I32      |      2^22      |       2^5        | 114.349 us |       0.49% | 211.895 us |       0.34% |       97.546 us |  85.31% |   SLOW   |
|   I16   |      I32      |      2^26      |       2^5        |   1.412 ms |       0.18% |   2.979 ms |       0.09% |        1.568 ms | 111.08% |   SLOW   |
|   I16   |      I32      |      2^30      |       2^5        |  22.210 ms |       0.02% |  47.306 ms |       0.01% |       25.096 ms | 112.99% |   SLOW   |
|   I16   |      I32      |      2^22      |       2^6        |  84.367 us |       1.59% | 148.009 us |       0.43% |       63.642 us |  75.43% |   SLOW   |
|   I16   |      I32      |      2^26      |       2^6        | 886.968 us |       0.31% |   1.914 ms |       0.18% |        1.027 ms | 115.80% |   SLOW   |
|   I16   |      I32      |      2^30      |       2^6        |  13.758 ms |       0.01% |  30.187 ms |       0.01% |       16.428 ms | 119.40% |   SLOW   |
|   I16   |      I32      |      2^22      |       2^7        | 553.671 us |       0.19% |  95.881 us |       0.65% |     -457.790 us | -82.68% |   FAST   |
|   I16   |      I32      |      2^26      |       2^7        |   8.465 ms |       0.06% |   1.070 ms |       0.09% |    -7394.713 us | -87.36% |   FAST   |
|   I16   |      I32      |      2^30      |       2^7        | 135.806 ms |       0.03% |  16.680 ms |       0.02% |  -119126.242 us | -87.72% |   FAST   |
|   I16   |      I32      |      2^22      |       2^8        |   1.045 ms |       0.08% | 317.924 us |       0.29% |     -727.146 us | -69.58% |   FAST   |
|   I16   |      I32      |      2^26      |       2^8        |  16.373 ms |       0.01% |   4.573 ms |       0.06% |   -11799.467 us | -72.07% |   FAST   |
|   I16   |      I32      |      2^30      |       2^8        | 261.787 ms |       0.00% |  73.029 ms |       0.02% |  -188758.167 us | -72.10% |   FAST   |
|   I32   |      I32      |      2^22      |       2^1        |  74.156 us |       3.47% |  74.489 us |       3.00% |        0.334 us |   0.45% |   SAME   |
|   I32   |      I32      |      2^26      |       2^1        | 821.259 us |       0.41% | 822.094 us |       0.36% |        0.835 us |   0.10% |   SAME   |
|   I32   |      I32      |      2^30      |       2^1        |  12.621 ms |       0.03% |  12.621 ms |       0.03% |        0.415 us |   0.00% |   SAME   |
|   I32   |      I32      |      2^22      |       2^2        |  77.127 us |       2.69% |  79.018 us |       4.38% |        1.891 us |   2.45% |   SAME   |
|   I32   |      I32      |      2^26      |       2^2        | 838.596 us |       0.23% | 837.026 us |       0.22% |       -1.570 us |  -0.19% |   SAME   |
|   I32   |      I32      |      2^30      |       2^2        |  12.874 ms |       0.06% |  12.873 ms |       0.06% |       -0.978 us |  -0.01% |   SAME   |
|   I32   |      I32      |      2^22      |       2^3        |  54.050 us |       2.08% |  55.281 us |       1.58% |        1.231 us |   2.28% |   SLOW   |
|   I32   |      I32      |      2^26      |       2^3        | 461.635 us |       0.86% | 460.585 us |       0.78% |       -1.050 us |  -0.23% |   SAME   |
|   I32   |      I32      |      2^30      |       2^3        |   6.853 ms |       0.07% |   6.853 ms |       0.07% |       -0.262 us |  -0.00% |   SAME   |
|   I32   |      I32      |      2^22      |       2^4        |  46.185 us |       1.69% |  45.508 us |       1.91% |       -0.676 us |  -1.46% |   SAME   |
|   I32   |      I32      |      2^26      |       2^4        | 301.435 us |       0.43% | 301.272 us |       0.57% |       -0.162 us |  -0.05% |   SAME   |
|   I32   |      I32      |      2^30      |       2^4        |   4.441 ms |       3.99% |   4.447 ms |       3.61% |        5.685 us |   0.13% |   SAME   |
|   I32   |      I32      |      2^22      |       2^5        |  97.380 us |       0.94% | 171.215 us |       0.97% |       73.835 us |  75.82% |   SLOW   |
|   I32   |      I32      |      2^26      |       2^5        |   1.157 ms |       0.39% |   2.306 ms |       0.30% |        1.149 ms |  99.26% |   SLOW   |
|   I32   |      I32      |      2^30      |       2^5        |  18.188 ms |       0.03% |  36.573 ms |       0.01% |       18.385 ms | 101.08% |   SLOW   |
|   I32   |      I32      |      2^22      |       2^6        |  74.192 us |       2.15% | 121.087 us |       0.71% |       46.895 us |  63.21% |   SLOW   |
|   I32   |      I32      |      2^26      |       2^6        | 718.127 us |       0.39% |   1.467 ms |       0.15% |      748.586 us | 104.24% |   SLOW   |
|   I32   |      I32      |      2^30      |       2^6        |  11.066 ms |       0.07% |  23.053 ms |       0.01% |       11.986 ms | 108.32% |   SLOW   |
|   I32   |      I32      |      2^22      |       2^7        | 871.787 us |       0.12% |  81.434 us |       1.00% |     -790.353 us | -90.66% |   FAST   |
|   I32   |      I32      |      2^26      |       2^7        |  13.545 ms |       0.07% | 832.718 us |       0.13% |   -12712.424 us | -93.85% |   FAST   |
|   I32   |      I32      |      2^30      |       2^7        | 216.457 ms |       0.02% |  12.872 ms |       0.05% |  -203584.992 us | -94.05% |   FAST   |
|   I32   |      I32      |      2^22      |       2^8        |   1.702 ms |       0.06% | 477.079 us |       0.35% |    -1224.875 us | -71.97% |   FAST   |
|   I32   |      I32      |      2^26      |       2^8        |  26.789 ms |       0.01% |   7.075 ms |       0.06% |   -19714.092 us | -73.59% |   FAST   |
|   I32   |      I32      |      2^30      |       2^8        | 428.064 ms |       0.00% | 112.656 ms |       0.04% |  -315407.526 us | -73.68% |   FAST   |
|   I64   |      I32      |      2^22      |       2^1        | 169.574 us |       1.45% | 169.516 us |       2.07% |       -0.058 us |  -0.03% |   SAME   |
|   I64   |      I32      |      2^26      |       2^1        |   2.427 ms |       0.13% |   2.426 ms |       0.14% |       -0.880 us |  -0.04% |   SAME   |
|   I64   |      I32      |      2^30      |       2^1        |  38.300 ms |       0.01% |  38.301 ms |       0.01% |        1.568 us |   0.00% |   SAME   |
|   I64   |      I32      |      2^22      |       2^2        | 184.825 us |       1.54% | 184.525 us |       1.42% |       -0.300 us |  -0.16% |   SAME   |
|   I64   |      I32      |      2^26      |       2^2        |   2.562 ms |       0.09% |   2.561 ms |       0.10% |       -1.141 us |  -0.04% |   SAME   |
|   I64   |      I32      |      2^30      |       2^2        |  40.449 ms |       0.01% |  40.451 ms |       0.02% |        2.472 us |   0.01% |   SAME   |
|   I64   |      I32      |      2^22      |       2^3        | 118.668 us |       0.64% | 117.896 us |       0.80% |       -0.773 us |  -0.65% |   FAST   |
|   I64   |      I32      |      2^26      |       2^3        |   1.480 ms |       0.25% |   1.478 ms |       0.24% |       -1.491 us |  -0.10% |   SAME   |
|   I64   |      I32      |      2^30      |       2^3        |  23.149 ms |       0.05% |  23.147 ms |       0.05% |       -1.409 us |  -0.01% |   SAME   |
|   I64   |      I32      |      2^22      |       2^4        |  84.935 us |       1.01% |  85.521 us |       0.98% |        0.586 us |   0.69% |   SAME   |
|   I64   |      I32      |      2^26      |       2^4        | 930.518 us |       0.13% | 930.103 us |       0.14% |       -0.416 us |  -0.04% |   SAME   |
|   I64   |      I32      |      2^30      |       2^4        |  14.358 ms |       0.08% |  14.353 ms |       0.06% |       -5.482 us |  -0.04% |   SAME   |
|   I64   |      I32      |      2^22      |       2^5        | 103.049 us |       0.82% | 155.060 us |       0.64% |       52.011 us |  50.47% |   SLOW   |
|   I64   |      I32      |      2^26      |       2^5        |   1.203 ms |       0.16% |   2.046 ms |       0.14% |      843.650 us |  70.14% |   SLOW   |
|   I64   |      I32      |      2^30      |       2^5        |  18.803 ms |       0.02% |  32.300 ms |       0.04% |       13.497 ms |  71.78% |   SLOW   |
|   I64   |      I32      |      2^22      |       2^6        | 104.079 us |       0.72% | 163.353 us |       0.55% |       59.274 us |  56.95% |   SLOW   |
|   I64   |      I32      |      2^26      |       2^6        |   1.203 ms |       0.22% |   2.159 ms |       0.11% |      956.246 us |  79.50% |   SLOW   |
|   I64   |      I32      |      2^30      |       2^6        |  18.824 ms |       0.01% |  34.127 ms |       0.01% |       15.304 ms |  81.30% |   SLOW   |
|   I64   |      I32      |      2^22      |       2^7        |   4.451 ms |       0.04% | 119.710 us |       0.70% |    -4331.357 us | -97.31% |   FAST   |
|   I64   |      I32      |      2^26      |       2^7        |  70.698 ms |       0.01% |   1.448 ms |       0.07% |   -69249.575 us | -97.95% |   FAST   |
|   I64   |      I32      |      2^30      |       2^7        |    1.131 s |       0.01% |  22.740 ms |       0.01% | -1107809.339 us | -97.99% |   FAST   |
|   I64   |      I32      |      2^22      |       2^8        |   2.293 ms |       0.05% |   2.282 ms |       0.05% |      -11.394 us |  -0.50% |   FAST   |
|   I64   |      I32      |      2^26      |       2^8        |  36.169 ms |       0.01% |  35.932 ms |       0.01% |     -236.934 us |  -0.66% |   FAST   |
|   I64   |      I32      |      2^30      |       2^8        | 578.102 ms |       0.00% | 574.323 ms |       0.00% |    -3779.169 us |  -0.65% |   FAST   |
|  I128   |      I32      |      2^22      |       2^1        | 196.288 us |       1.85% | 196.660 us |       2.04% |        0.372 us |   0.19% |   SAME   |
|  I128   |      I32      |      2^26      |       2^1        |   2.757 ms |       0.13% |   2.759 ms |       0.07% |        1.624 us |   0.06% |   SAME   |
|  I128   |      I32      |      2^30      |       2^1        |  43.608 ms |       0.01% |  43.610 ms |       0.01% |        1.997 us |   0.00% |   SAME   |
|  I128   |      I32      |      2^22      |       2^2        | 172.989 us |       1.58% | 172.512 us |       1.75% |       -0.477 us |  -0.28% |   SAME   |
|  I128   |      I32      |      2^26      |       2^2        |   2.346 ms |       0.09% |   2.344 ms |       0.05% |       -1.367 us |  -0.06% |   FAST   |
|  I128   |      I32      |      2^30      |       2^2        |  36.986 ms |       0.01% |  36.984 ms |       0.02% |       -2.402 us |  -0.01% |   SAME   |
|  I128   |      I32      |      2^22      |       2^3        | 109.976 us |       0.88% | 109.728 us |       1.92% |       -0.248 us |  -0.23% |   SAME   |
|  I128   |      I32      |      2^26      |       2^3        |   1.339 ms |       0.26% |   1.338 ms |       0.26% |       -0.886 us |  -0.07% |   SAME   |
|  I128   |      I32      |      2^30      |       2^3        |  20.884 ms |       0.05% |  20.882 ms |       0.05% |       -2.101 us |  -0.01% |   SAME   |
|  I128   |      I32      |      2^22      |       2^4        | 164.987 us |       0.57% | 300.489 us |       0.32% |      135.502 us |  82.13% |   SLOW   |
|  I128   |      I32      |      2^26      |       2^4        |   2.212 ms |       0.08% |   4.389 ms |       0.02% |        2.177 ms |  98.40% |   SLOW   |
|  I128   |      I32      |      2^30      |       2^4        |  34.916 ms |       0.01% |  69.741 ms |       0.01% |       34.825 ms |  99.74% |   SLOW   |
|  I128   |      I32      |      2^22      |       2^5        | 131.598 us |       0.62% | 206.203 us |       0.47% |       74.605 us |  56.69% |   SLOW   |
|  I128   |      I32      |      2^26      |       2^5        |   1.647 ms |       0.13% |   2.832 ms |       0.06% |        1.185 ms |  71.94% |   SLOW   |
|  I128   |      I32      |      2^30      |       2^5        |  25.892 ms |       0.01% |  44.843 ms |       0.00% |       18.951 ms |  73.19% |   SLOW   |
|  I128   |      I32      |      2^22      |       2^6        |  11.003 ms |       0.01% | 152.857 us |       0.50% |   -10849.652 us | -98.61% |   FAST   |
|  I128   |      I32      |      2^26      |       2^6        | 175.440 ms |       0.01% |   1.971 ms |       0.16% |  -173469.386 us | -98.88% |   FAST   |
|  I128   |      I32      |      2^30      |       2^6        |    2.806 s |       0.00% |  31.091 ms |       0.01% | -2774724.044 us | -98.89% |   FAST   |
|  I128   |      I32      |      2^22      |       2^7        |   5.712 ms |       0.02% |   5.632 ms |       0.02% |      -79.999 us |  -1.40% |   FAST   |
|  I128   |      I32      |      2^26      |       2^7        |  90.743 ms |       0.01% |  89.371 ms |       0.01% |    -1371.206 us |  -1.51% |   FAST   |
|  I128   |      I32      |      2^30      |       2^7        |    1.451 s |       0.00% |    1.429 s |       0.00% |   -21587.802 us |  -1.49% |   FAST   |
|  I128   |      I32      |      2^22      |       2^8        |   2.896 ms |       0.03% |   2.896 ms |       0.03% |        0.336 us |   0.01% |   SAME   |
|  I128   |      I32      |      2^26      |       2^8        |  45.692 ms |       0.01% |  45.703 ms |       0.01% |       11.310 us |   0.02% |   SLOW   |
|  I128   |      I32      |      2^30      |       2^8        | 730.460 ms |       0.00% | 730.565 ms |       0.00% |      105.498 us |   0.01% |   SLOW   |
|   F32   |      I32      |      2^22      |       2^1        |  75.318 us |       2.21% |  75.708 us |       3.65% |        0.390 us |   0.52% |   SAME   |
|   F32   |      I32      |      2^26      |       2^1        | 820.731 us |       0.39% | 821.568 us |       0.42% |        0.837 us |   0.10% |   SAME   |
|   F32   |      I32      |      2^30      |       2^1        |  12.617 ms |       0.03% |  12.622 ms |       0.03% |        5.004 us |   0.04% |   SLOW   |
|   F32   |      I32      |      2^22      |       2^2        |  82.507 us |       3.26% |  83.701 us |       4.09% |        1.194 us |   1.45% |   SAME   |
|   F32   |      I32      |      2^26      |       2^2        | 914.927 us |       0.14% | 914.255 us |       0.19% |       -0.672 us |  -0.07% |   SAME   |
|   F32   |      I32      |      2^30      |       2^2        |  14.099 ms |       0.05% |  14.101 ms |       0.06% |        1.531 us |   0.01% |   SAME   |
|   F32   |      I32      |      2^22      |       2^3        |  56.501 us |       1.59% |  56.572 us |       2.96% |        0.071 us |   0.13% |   SAME   |
|   F32   |      I32      |      2^26      |       2^3        | 479.259 us |       0.81% | 480.049 us |       0.72% |        0.790 us |   0.16% |   SAME   |
|   F32   |      I32      |      2^30      |       2^3        |   7.145 ms |       0.10% |   7.143 ms |       0.07% |       -2.262 us |  -0.03% |   SAME   |
|   F32   |      I32      |      2^22      |       2^4        |  46.556 us |       2.06% |  45.861 us |       1.88% |       -0.695 us |  -1.49% |   SAME   |
|   F32   |      I32      |      2^26      |       2^4        | 303.846 us |       0.43% | 304.657 us |       0.44% |        0.811 us |   0.27% |   SAME   |
|   F32   |      I32      |      2^30      |       2^4        |   4.487 ms |       3.55% |   4.484 ms |       3.48% |       -3.513 us |  -0.08% |   SAME   |
|   F32   |      I32      |      2^22      |       2^5        | 100.961 us |       1.00% | 175.583 us |       0.61% |       74.622 us |  73.91% |   SLOW   |
|   F32   |      I32      |      2^26      |       2^5        |   1.206 ms |       0.38% |   2.403 ms |       0.25% |        1.197 ms |  99.25% |   SLOW   |
|   F32   |      I32      |      2^30      |       2^5        |  18.963 ms |       0.03% |  38.137 ms |       0.01% |       19.174 ms | 101.12% |   SLOW   |
|   F32   |      I32      |      2^22      |       2^6        |  75.478 us |       1.10% | 123.867 us |       0.62% |       48.389 us |  64.11% |   SLOW   |
|   F32   |      I32      |      2^26      |       2^6        | 744.421 us |       0.42% |   1.525 ms |       0.21% |      780.211 us | 104.81% |   SLOW   |
|   F32   |      I32      |      2^30      |       2^6        |  11.479 ms |       0.04% |  23.957 ms |       0.01% |       12.478 ms | 108.71% |   SLOW   |
|   F32   |      I32      |      2^22      |       2^7        | 876.569 us |       0.17% |  82.987 us |       0.94% |     -793.582 us | -90.53% |   FAST   |
|   F32   |      I32      |      2^26      |       2^7        |  13.574 ms |       0.04% | 857.968 us |       0.15% |   -12715.606 us | -93.68% |   FAST   |
|   F32   |      I32      |      2^30      |       2^7        | 216.996 ms |       0.03% |  13.277 ms |       0.05% |  -203718.320 us | -93.88% |   FAST   |
|   F32   |      I32      |      2^22      |       2^8        |   1.708 ms |       0.08% | 481.958 us |       0.29% |    -1225.883 us | -71.78% |   FAST   |
|   F32   |      I32      |      2^26      |       2^8        |  26.883 ms |       0.01% |   7.107 ms |       0.05% |   -19775.606 us | -73.56% |   FAST   |
|   F32   |      I32      |      2^30      |       2^8        | 429.501 ms |       0.00% | 113.126 ms |       0.02% |  -316375.689 us | -73.66% |   FAST   |
|   F64   |      I32      |      2^22      |       2^1        | 170.279 us |       1.45% | 170.507 us |       1.62% |        0.228 us |   0.13% |   SAME   |
|   F64   |      I32      |      2^26      |       2^1        |   2.425 ms |       0.16% |   2.424 ms |       0.13% |       -0.455 us |  -0.02% |   SAME   |
|   F64   |      I32      |      2^30      |       2^1        |  38.274 ms |       0.01% |  38.265 ms |       0.01% |       -9.421 us |  -0.02% |   FAST   |
|   F64   |      I32      |      2^22      |       2^2        | 177.220 us |       1.93% | 177.551 us |       1.80% |        0.331 us |   0.19% |   SAME   |
|   F64   |      I32      |      2^26      |       2^2        |   2.429 ms |       0.09% |   2.425 ms |       0.09% |       -3.438 us |  -0.14% |   FAST   |
|   F64   |      I32      |      2^30      |       2^2        |  38.325 ms |       0.02% |  38.282 ms |       0.02% |      -43.260 us |  -0.11% |   FAST   |
|   F64   |      I32      |      2^22      |       2^3        | 118.236 us |       0.74% | 117.540 us |       0.99% |       -0.696 us |  -0.59% |   SAME   |
|   F64   |      I32      |      2^26      |       2^3        |   1.477 ms |       0.25% |   1.476 ms |       0.24% |       -1.002 us |  -0.07% |   SAME   |
|   F64   |      I32      |      2^30      |       2^3        |  23.105 ms |       0.05% |  23.104 ms |       0.05% |       -0.675 us |  -0.00% |   SAME   |
|   F64   |      I32      |      2^22      |       2^4        |  85.115 us |       1.13% |  86.109 us |       1.08% |        0.995 us |   1.17% |   SLOW   |
|   F64   |      I32      |      2^26      |       2^4        | 930.663 us |       0.37% | 929.787 us |       0.16% |       -0.877 us |  -0.09% |   SAME   |
|   F64   |      I32      |      2^30      |       2^4        |  14.352 ms |       0.08% |  14.349 ms |       0.07% |       -3.218 us |  -0.02% |   SAME   |
|   F64   |      I32      |      2^22      |       2^5        | 103.023 us |       0.93% | 156.523 us |       0.62% |       53.500 us |  51.93% |   SLOW   |
|   F64   |      I32      |      2^26      |       2^5        |   1.203 ms |       0.19% |   2.048 ms |       0.16% |      844.448 us |  70.17% |   SLOW   |
|   F64   |      I32      |      2^30      |       2^5        |  18.802 ms |       0.03% |  32.310 ms |       0.06% |       13.508 ms |  71.84% |   SLOW   |
|   F64   |      I32      |      2^22      |       2^6        | 104.437 us |       0.77% | 163.515 us |       0.60% |       59.078 us |  56.57% |   SLOW   |
|   F64   |      I32      |      2^26      |       2^6        |   1.203 ms |       0.29% |   2.159 ms |       0.12% |      955.542 us |  79.40% |   SLOW   |
|   F64   |      I32      |      2^30      |       2^6        |  18.809 ms |       0.01% |  34.124 ms |       0.01% |       15.316 ms |  81.43% |   SLOW   |
|   F64   |      I32      |      2^22      |       2^7        |   3.341 ms |       0.03% | 119.538 us |       0.62% |    -3221.160 us | -96.42% |   FAST   |
|   F64   |      I32      |      2^26      |       2^7        |  52.802 ms |       0.01% |   1.448 ms |       0.06% |   -51354.347 us | -97.26% |   FAST   |
|   F64   |      I32      |      2^30      |       2^7        | 844.109 ms |       0.00% |  22.733 ms |       0.01% |  -821375.853 us | -97.31% |   FAST   |
|   F64   |      I32      |      2^22      |       2^8        |   1.723 ms |       0.04% |   1.718 ms |       0.05% |       -4.853 us |  -0.28% |   FAST   |
|   F64   |      I32      |      2^26      |       2^8        |  27.022 ms |       0.00% |  26.823 ms |       0.01% |     -198.459 us |  -0.73% |   FAST   |
|   F64   |      I32      |      2^30      |       2^8        | 431.725 ms |       0.00% | 428.570 ms |       0.00% |    -3154.266 us |  -0.73% |   FAST   |
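
For readers checking the table by hand, here is a minimal sketch of how the `Diff`, `%Diff`, and `Status` columns appear to be derived. The rule below is an assumption inferred from the rows themselves (a row is `SAME` when the relative difference is within the smaller of the two noise figures), not a copy of the comparison script's code; the function name `classify` is hypothetical.

```python
def classify(ref_time, ref_noise, cmp_time, cmp_noise):
    """Return (diff, pct_diff, status) for one benchmark row.

    Times may be in any unit as long as both use the same one.
    Noise values are percentages, as printed in the table (0.24 for 0.24%).
    """
    diff = cmp_time - ref_time
    pct_diff = 100.0 * diff / ref_time
    # Assumption: a change only counts if it exceeds the smaller noise figure.
    min_noise = min(ref_noise, cmp_noise)
    if abs(pct_diff) <= min_noise:
        status = "SAME"
    elif diff < 0:
        status = "FAST"
    else:
        status = "SLOW"
    return diff, pct_diff, status


# Three rows from the table above, times in microseconds.
rows = {
    "I8  2^22 2^7": (263.814, 0.24, 93.932, 0.75),
    "I16 2^22 2^1": (67.112, 3.20, 67.698, 3.08),
    "I32 2^22 2^3": (54.050, 2.08, 55.281, 1.58),
}
for name, row in rows.items():
    diff, pct, status = classify(*row)
    print(f"{name}: diff={diff:.3f} us, pct={pct:.2f}%, status={status}")
```

Because the printed times are already rounded, a recomputed `Diff` can differ from the table in the last digit.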

# large

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  MaxSegmentSize  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^22      |       2^10       | 144.663 us |       0.68% | 144.465 us |       0.70% |  -0.198 us |  -0.14% |   SAME   |
|   I8    |      I32      |      2^26      |       2^10       |   1.774 ms |       0.06% |   1.774 ms |       0.06% |  -0.024 us |  -0.00% |   SAME   |
|   I8    |      I32      |      2^30      |       2^10       |  27.994 ms |       0.02% |  27.994 ms |       0.01% |   0.203 us |   0.00% |   SAME   |
|   I8    |      I32      |      2^22      |       2^12       |  60.835 us |       1.21% |  60.448 us |       1.13% |  -0.387 us |  -0.64% |   SAME   |
|   I8    |      I32      |      2^26      |       2^12       | 497.784 us |       0.15% | 497.785 us |       0.49% |   0.001 us |   0.00% |   SAME   |
|   I8    |      I32      |      2^30      |       2^12       |   7.503 ms |       0.03% |   7.506 ms |       0.03% |   3.425 us |   0.05% |   SLOW   |
|   I8    |      I32      |      2^22      |       2^14       |  58.281 us |       1.68% |  58.199 us |       1.72% |  -0.082 us |  -0.14% |   SAME   |
|   I8    |      I32      |      2^26      |       2^14       | 571.728 us |       0.20% | 571.532 us |       0.48% |  -0.197 us |  -0.03% |   SAME   |
|   I8    |      I32      |      2^30      |       2^14       |   8.482 ms |       0.03% |   8.482 ms |       0.03% |  -0.314 us |  -0.00% |   SAME   |
|   I8    |      I32      |      2^22      |       2^16       |  79.825 us |       0.66% |  79.795 us |       0.74% |  -0.030 us |  -0.04% |   SAME   |
|   I8    |      I32      |      2^26      |       2^16       | 495.848 us |       0.96% | 495.879 us |       0.96% |   0.030 us |   0.01% |   SAME   |
|   I8    |      I32      |      2^30      |       2^16       |   6.800 ms |       0.06% |   6.799 ms |       0.06% |  -1.293 us |  -0.02% |   SAME   |
|   I8    |      I32      |      2^22      |       2^18       | 284.097 us |       0.23% | 284.040 us |       0.24% |  -0.056 us |  -0.02% |   SAME   |
|   I8    |      I32      |      2^26      |       2^18       | 698.833 us |       2.41% | 699.691 us |       2.38% |   0.858 us |   0.12% |   SAME   |
|   I8    |      I32      |      2^30      |       2^18       |   7.283 ms |       0.20% |   7.283 ms |       0.20% |  -0.453 us |  -0.01% |   SAME   |
|   I16   |      I32      |      2^22      |       2^10       | 296.744 us |       0.26% | 296.655 us |       0.25% |  -0.089 us |  -0.03% |   SAME   |
|   I16   |      I32      |      2^26      |       2^10       |   4.204 ms |       0.02% |   4.204 ms |       0.02% |  -0.343 us |  -0.01% |   SAME   |
|   I16   |      I32      |      2^30      |       2^10       |  66.800 ms |       0.01% |  66.800 ms |       0.01% |   0.822 us |   0.00% |   SAME   |
|   I16   |      I32      |      2^22      |       2^12       | 101.603 us |       1.62% | 101.173 us |       0.81% |  -0.430 us |  -0.42% |   SAME   |
|   I16   |      I32      |      2^26      |       2^12       |   1.131 ms |       0.07% |   1.131 ms |       0.07% |  -0.255 us |  -0.02% |   SAME   |
|   I16   |      I32      |      2^30      |       2^12       |  17.615 ms |       0.01% |  17.616 ms |       0.01% |   0.970 us |   0.01% |   SAME   |
|   I16   |      I32      |      2^22      |       2^14       | 110.132 us |       0.40% | 110.095 us |       0.41% |  -0.037 us |  -0.03% |   SAME   |
|   I16   |      I32      |      2^26      |       2^14       |   1.430 ms |       0.10% |   1.430 ms |       0.07% |  -0.110 us |  -0.01% |   SAME   |
|   I16   |      I32      |      2^30      |       2^14       |  22.060 ms |       0.01% |  22.059 ms |       0.01% |  -1.165 us |  -0.01% |   SAME   |
|   I16   |      I32      |      2^22      |       2^16       | 137.936 us |       0.39% | 138.001 us |       0.40% |   0.065 us |   0.05% |   SAME   |
|   I16   |      I32      |      2^26      |       2^16       |   1.166 ms |       0.10% |   1.166 ms |       0.11% |  -0.563 us |  -0.05% |   SAME   |
|   I16   |      I32      |      2^30      |       2^16       |  17.073 ms |       0.01% |  17.070 ms |       0.01% |  -3.184 us |  -0.02% |   FAST   |
|   I16   |      I32      |      2^22      |       2^18       | 486.131 us |       0.21% | 485.994 us |       0.20% |  -0.138 us |  -0.03% |   SAME   |
|   I16   |      I32      |      2^26      |       2^18       |   1.491 ms |       0.20% |   1.491 ms |       0.20% |   0.007 us |   0.00% |   SAME   |
|   I16   |      I32      |      2^30      |       2^18       |  19.316 ms |       0.02% |  19.315 ms |       0.02% |  -0.568 us |  -0.00% |   SAME   |
|   I32   |      I32      |      2^22      |       2^10       | 468.037 us |       0.35% | 466.925 us |       0.18% |  -1.112 us |  -0.24% |   FAST   |
|   I32   |      I32      |      2^26      |       2^10       |   6.889 ms |       0.01% |   6.888 ms |       0.02% |  -1.291 us |  -0.02% |   FAST   |
|   I32   |      I32      |      2^30      |       2^10       | 109.655 ms |       0.00% | 109.643 ms |       0.00% | -12.388 us |  -0.01% |   FAST   |
|   I32   |      I32      |      2^22      |       2^12       | 152.471 us |       0.59% | 151.517 us |       0.59% |  -0.954 us |  -0.63% |   FAST   |
|   I32   |      I32      |      2^26      |       2^12       |   1.871 ms |       0.07% |   1.871 ms |       0.06% |  -0.639 us |  -0.03% |   SAME   |
|   I32   |      I32      |      2^30      |       2^12       |  29.380 ms |       0.01% |  29.379 ms |       0.01% |  -1.325 us |  -0.00% |   SAME   |
|   I32   |      I32      |      2^22      |       2^14       | 196.689 us |       0.32% | 197.003 us |       0.38% |   0.313 us |   0.16% |   SAME   |
|   I32   |      I32      |      2^26      |       2^14       |   2.588 ms |       0.06% |   2.588 ms |       0.05% |  -0.047 us |  -0.00% |   SAME   |
|   I32   |      I32      |      2^30      |       2^14       |  40.312 ms |       0.01% |  40.335 ms |       0.01% |  23.071 us |   0.06% |   SLOW   |
|   I32   |      I32      |      2^22      |       2^16       | 269.103 us |       0.34% | 269.323 us |       0.70% |   0.220 us |   0.08% |   SAME   |
|   I32   |      I32      |      2^26      |       2^16       |   2.522 ms |       0.07% |   2.522 ms |       0.07% |   0.228 us |   0.01% |   SAME   |
|   I32   |      I32      |      2^30      |       2^16       |  37.911 ms |       0.01% |  37.906 ms |       0.01% |  -5.066 us |  -0.01% |   FAST   |
|   I32   |      I32      |      2^22      |       2^18       | 956.919 us |       0.52% | 956.969 us |       0.51% |   0.050 us |   0.01% |   SAME   |
|   I32   |      I32      |      2^26      |       2^18       |   3.254 ms |       0.16% |   3.254 ms |       0.16% |   0.335 us |   0.01% |   SAME   |
|   I32   |      I32      |      2^30      |       2^18       |  42.362 ms |       0.02% |  42.364 ms |       0.02% |   2.075 us |   0.00% |   SAME   |
|   I64   |      I32      |      2^22      |       2^10       | 628.955 us |       0.18% | 628.846 us |       0.15% |  -0.109 us |  -0.02% |   SAME   |
|   I64   |      I32      |      2^26      |       2^10       |   9.404 ms |       0.01% |   9.405 ms |       0.02% |   0.455 us |   0.00% |   SAME   |
|   I64   |      I32      |      2^30      |       2^10       | 149.822 ms |       0.00% | 149.822 ms |       0.00% |  -0.226 us |  -0.00% |   SAME   |
|   I64   |      I32      |      2^22      |       2^12       | 507.727 us |       0.22% | 508.280 us |       0.22% |   0.552 us |   0.11% |   SAME   |
|   I64   |      I32      |      2^26      |       2^12       |   7.185 ms |       0.02% |   7.183 ms |       0.02% |  -2.077 us |  -0.03% |   FAST   |
|   I64   |      I32      |      2^30      |       2^12       | 114.246 ms |       0.00% | 114.250 ms |       0.01% |   3.489 us |   0.00% |   SLOW   |
|   I64   |      I32      |      2^22      |       2^14       | 490.741 us |       0.27% | 491.190 us |       0.30% |   0.449 us |   0.09% |   SAME   |
|   I64   |      I32      |      2^26      |       2^14       |   6.270 ms |       0.04% |   6.272 ms |       0.03% |   2.054 us |   0.03% |   SLOW   |
|   I64   |      I32      |      2^30      |       2^14       |  98.549 ms |       0.00% |  98.501 ms |       0.01% | -47.741 us |  -0.05% |   FAST   |
|   I64   |      I32      |      2^22      |       2^16       | 858.224 us |       0.87% | 858.410 us |       0.87% |   0.185 us |   0.02% |   SAME   |
|   I64   |      I32      |      2^26      |       2^16       |   8.536 ms |       0.06% |   8.537 ms |       0.07% |   0.573 us |   0.01% |   SAME   |
|   I64   |      I32      |      2^30      |       2^16       | 129.724 ms |       0.01% | 129.732 ms |       0.01% |   8.807 us |   0.01% |   SAME   |
|   I64   |      I32      |      2^22      |       2^18       |   3.130 ms |       0.41% |   3.130 ms |       0.40% |   0.247 us |   0.01% |   SAME   |
|   I64   |      I32      |      2^26      |       2^18       |  10.152 ms |       0.14% |  10.151 ms |       0.14% |  -1.284 us |  -0.01% |   SAME   |
|   I64   |      I32      |      2^30      |       2^18       | 132.692 ms |       0.02% | 132.699 ms |       0.02% |   7.341 us |   0.01% |   SAME   |
|  I128   |      I32      |      2^22      |       2^10       | 791.931 us |       0.13% | 792.645 us |       0.12% |   0.714 us |   0.09% |   SAME   |
|  I128   |      I32      |      2^26      |       2^10       |  11.930 ms |       0.01% |  11.931 ms |       0.02% |   0.634 us |   0.01% |   SAME   |
|  I128   |      I32      |      2^30      |       2^10       | 190.126 ms |       0.00% | 190.125 ms |       0.00% |  -0.867 us |  -0.00% |   SAME   |
|  I128   |      I32      |      2^22      |       2^12       | 981.659 us |       1.09% | 982.300 us |       1.09% |   0.641 us |   0.07% |   SAME   |
|  I128   |      I32      |      2^26      |       2^12       |  13.819 ms |       0.07% |  13.812 ms |       0.07% |  -6.587 us |  -0.05% |   SAME   |
|  I128   |      I32      |      2^30      |       2^12       | 218.975 ms |       0.00% | 219.000 ms |       0.00% |  24.528 us |   0.01% |   SLOW   |
|  I128   |      I32      |      2^22      |       2^14       |   1.498 ms |       2.41% |   1.497 ms |       2.41% |  -0.946 us |  -0.06% |   SAME   |
|  I128   |      I32      |      2^26      |       2^14       |  18.036 ms |       0.12% |  18.024 ms |       0.12% | -12.753 us |  -0.07% |   SAME   |
|  I128   |      I32      |      2^30      |       2^14       | 283.190 ms |       0.01% | 283.147 ms |       0.01% | -43.699 us |  -0.02% |   FAST   |
|  I128   |      I32      |      2^22      |       2^16       |   2.668 ms |       1.05% |   2.667 ms |       0.97% |  -1.037 us |  -0.04% |   SAME   |
|  I128   |      I32      |      2^26      |       2^16       |  20.253 ms |       0.50% |  20.248 ms |       0.49% |  -5.091 us |  -0.03% |   SAME   |
|  I128   |      I32      |      2^30      |       2^16       | 301.640 ms |       0.03% | 301.631 ms |       0.03% |  -9.594 us |  -0.00% |   SAME   |
|  I128   |      I32      |      2^22      |       2^18       |   8.759 ms |       0.28% |   8.760 ms |       0.28% |   0.541 us |   0.01% |   SAME   |
|  I128   |      I32      |      2^26      |       2^18       |  25.593 ms |       1.61% |  25.591 ms |       1.55% |  -1.800 us |  -0.01% |   SAME   |
|  I128   |      I32      |      2^30      |       2^18       | 303.196 ms |       0.10% | 303.203 ms |       0.10% |   7.309 us |   0.00% |   SAME   |
|   F32   |      I32      |      2^22      |       2^10       | 468.629 us |       0.26% | 469.329 us |       0.37% |   0.700 us |   0.15% |   SAME   |
|   F32   |      I32      |      2^26      |       2^10       |   6.901 ms |       0.01% |   6.901 ms |       0.01% |  -0.127 us |  -0.00% |   SAME   |
|   F32   |      I32      |      2^30      |       2^10       | 109.856 ms |       0.00% | 109.844 ms |       0.00% | -11.930 us |  -0.01% |   FAST   |
|   F32   |      I32      |      2^22      |       2^12       | 151.548 us |       0.76% | 151.593 us |       0.61% |   0.045 us |   0.03% |   SAME   |
|   F32   |      I32      |      2^26      |       2^12       |   1.861 ms |       0.11% |   1.861 ms |       0.10% |   0.457 us |   0.02% |   SAME   |
|   F32   |      I32      |      2^30      |       2^12       |  29.214 ms |       0.01% |  29.214 ms |       0.01% |   0.398 us |   0.00% |   SAME   |
|   F32   |      I32      |      2^22      |       2^14       | 191.075 us |       0.43% | 190.952 us |       0.48% |  -0.123 us |  -0.06% |   SAME   |
|   F32   |      I32      |      2^26      |       2^14       |   2.496 ms |       0.04% |   2.496 ms |       0.05% |  -0.214 us |  -0.01% |   SAME   |
|   F32   |      I32      |      2^30      |       2^14       |  38.804 ms |       0.01% |  38.811 ms |       0.01% |   7.065 us |   0.02% |   SLOW   |
|   F32   |      I32      |      2^22      |       2^16       | 251.983 us |       0.36% | 252.402 us |       0.39% |   0.419 us |   0.17% |   SAME   |
|   F32   |      I32      |      2^26      |       2^16       |   2.422 ms |       0.08% |   2.421 ms |       0.08% |  -0.820 us |  -0.03% |   SAME   |
|   F32   |      I32      |      2^30      |       2^16       |  36.399 ms |       0.01% |  36.404 ms |       0.01% |   5.584 us |   0.02% |   SLOW   |
|   F32   |      I32      |      2^22      |       2^18       | 898.390 us |       0.49% | 898.497 us |       0.49% |   0.107 us |   0.01% |   SAME   |
|   F32   |      I32      |      2^26      |       2^18       |   3.060 ms |       0.19% |   3.060 ms |       0.19% |   0.034 us |   0.00% |   SAME   |
|   F32   |      I32      |      2^30      |       2^18       |  40.003 ms |       0.03% |  40.004 ms |       0.03% |   1.122 us |   0.00% |   SAME   |
|   F64   |      I32      |      2^22      |       2^10       | 482.282 us |       0.20% | 481.918 us |       0.16% |  -0.364 us |  -0.08% |   SAME   |
|   F64   |      I32      |      2^26      |       2^10       |   7.066 ms |       0.01% |   7.066 ms |       0.02% |   0.055 us |   0.00% |   SAME   |
|   F64   |      I32      |      2^30      |       2^10       | 112.433 ms |       0.00% | 112.434 ms |       0.01% |   1.040 us |   0.00% |   SAME   |
|   F64   |      I32      |      2^22      |       2^12       | 435.209 us |       0.76% | 434.724 us |       0.76% |  -0.485 us |  -0.11% |   SAME   |
|   F64   |      I32      |      2^26      |       2^12       |   5.755 ms |       0.05% |   5.753 ms |       0.05% |  -1.459 us |  -0.03% |   SAME   |
|   F64   |      I32      |      2^30      |       2^12       |  91.086 ms |       0.00% |  91.078 ms |       0.00% |  -8.178 us |  -0.01% |   FAST   |
|   F64   |      I32      |      2^22      |       2^14       | 536.342 us |       1.99% | 536.492 us |       2.00% |   0.150 us |   0.03% |   SAME   |
|   F64   |      I32      |      2^26      |       2^14       |   5.869 ms |       0.21% |   5.869 ms |       0.21% |   0.440 us |   0.01% |   SAME   |
|   F64   |      I32      |      2^30      |       2^14       |  91.671 ms |       0.03% |  91.652 ms |       0.02% | -19.260 us |  -0.02% |   SAME   |
|   F64   |      I32      |      2^22      |       2^16       | 910.842 us |       0.91% | 910.145 us |       0.80% |  -0.696 us |  -0.08% |   SAME   |
|   F64   |      I32      |      2^26      |       2^16       |   6.499 ms |       0.70% |   6.500 ms |       0.70% |   1.223 us |   0.02% |   SAME   |
|   F64   |      I32      |      2^30      |       2^16       |  96.445 ms |       0.05% |  96.450 ms |       0.05% |   4.769 us |   0.00% |   SAME   |
|   F64   |      I32      |      2^22      |       2^18       |   3.227 ms |       0.47% |   3.227 ms |       0.47% |  -0.835 us |  -0.03% |   SAME   |
|   F64   |      I32      |      2^26      |       2^18       |   8.533 ms |       2.22% |   8.534 ms |       2.25% |   1.262 us |   0.01% |   SAME   |
|   F64   |      I32      |      2^30      |       2^18       |  95.969 ms |       0.13% |  95.959 ms |       0.13% | -10.684 us |  -0.01% |   SAME   |

@IlyaGrebnov
Contributor Author

The crux of the issue is that segments in the [113, 352] range are currently classified as “large” and routed to the block-level radix sort instead of the faster warp-level merge sort. This PR changes Policy860 to use 32 threads per segment (a full warp), restoring the intended “medium” cutoff. Alternatively, we could keep the half-warp (16-thread) configuration and double the items per thread, which would also restore the intended “medium” cutoff. @NaderAlAwar is this something you can help with?
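
To make the cutoff arithmetic concrete, here is a minimal sketch of how the medium-segment capacity follows from threads per segment times items per thread. `MediumPolicySketch`, `medium_capacity`, and `routed_to_large_path` are hypothetical names for illustration only, not CUB's actual policy types, and the 7 items/thread figure is the one quoted in the PR description.

```cpp
#include <cassert>

// Hypothetical stand-in for a medium-segment policy (NOT CUB's real type).
struct MediumPolicySketch {
  int threads_per_segment;
  int items_per_thread;
};

// Largest segment that still fits the warp-level "medium" path.
constexpr int medium_capacity(MediumPolicySketch p) {
  return p.threads_per_segment * p.items_per_thread;
}

// Anything above the medium capacity falls through to the "large" path.
constexpr bool routed_to_large_path(int segment_size, MediumPolicySketch p) {
  return segment_size > medium_capacity(p);
}

constexpr MediumPolicySketch half_warp{16, 7};           // Policy860 before this PR
constexpr MediumPolicySketch full_warp{32, 7};           // full warp, same items/thread
constexpr MediumPolicySketch half_warp_doubled{16, 14};  // the alternative mentioned above

static_assert(medium_capacity(half_warp) == 112, "16 x 7");
static_assert(medium_capacity(full_warp) == 224, "32 x 7");
static_assert(medium_capacity(half_warp_doubled) == 224, "16 x 14");
static_assert(routed_to_large_path(113, half_warp), "113 items overflow the half-warp cutoff");
static_assert(!routed_to_large_path(113, full_warp), "113 items fit a full warp");
```

Both options give the same capacity on paper; which one is faster in practice is a tuning question that only benchmarks can answer.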

@bernhardmgruber
Contributor

As a side note, we are implementing a publicly accessible interface for users to specify tunings themselves, so hopefully soon you will be able to fully override our tuning values with whatever you want. I can't give you an estimate of when this will land, though.

@b0nes164

b0nes164 commented Nov 19, 2025

If you don't have a hard requirement to use CUB and are open to a more experimental flavor, you could try one of the libraries from these papers:

"Fast segmented sort on GPUs" - Hou et al. https://dl.acm.org/doi/10.1145/3079079.3079105
https://github.com/vtsynergy/bb_segsort

"Faster segmented sort on GPUs" - Kobus et al.
https://dl.acm.org/doi/10.1007/978-3-031-39698-4_45
https://gitlab.rlp.net/pararch/faster-segmented-sort-on-gpus

There's also mine, but it's likely less heavily tested than those above, and I haven't had the time to work on it. But if your sort keys are less than 32 bits, it will likely be faster than anything else, as it's a radix sort down to the finest granularity (sub-warp level). (Kobus/Hou perform a shuffle-based bitonic sort up to the warp level, then a bottom-up merge sort, and don't start radix sorting until the segments are quite large. For CUB, I'm not sure.)
https://github.com/b0nes164/GPUSorting/tree/cleanup/GPUSortingCUDA/SegSort

The last time I benchmarked CUB (whatever version shipped with CUDA toolkit 12.5), its performance was the worst of all the implementations, I think because it's not as granular in its casing of segment lengths. (See the performance cliff past segment length 128 in CUB::SegmentedSort.)

image

This data was captured using the benchmarking suite from the Kobus paper repo.

@IlyaGrebnov
Contributor Author

@b0nes164,

GPUSorting is an excellent project. I have learned a few things from your radix sort implementation, which I am reusing for BWT LF-mapping (it is similar to a single pass of radix sort, but instead of scattering the key you write the rank/position where each element needs to go).
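
As an illustration of that idea, here is a minimal sequential host-side sketch, assuming byte-sized symbols: a counting pass (histogram, exclusive prefix sum) that records each element's destination instead of scattering the key. The function name `lf_mapping` is hypothetical, and this is not the GPU implementation referred to above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sketch: LF-mapping of a BWT string computed like one stable counting-sort
// pass, except that the destination index is stored rather than the key.
std::vector<std::size_t> lf_mapping(const std::string& bwt) {
  std::array<std::size_t, 256> next{};  // zero-initialized histogram

  // 1. Histogram of symbols.
  for (unsigned char c : bwt) {
    ++next[c];
  }

  // 2. Exclusive prefix sum: next[c] becomes the first output slot for c.
  std::size_t running = 0;
  for (std::size_t& slot : next) {
    const std::size_t count = slot;
    slot = running;
    running += count;
  }

  // 3. Where a radix pass would scatter the key, write the rank instead.
  std::vector<std::size_t> lf(bwt.size());
  for (std::size_t i = 0; i < bwt.size(); ++i) {
    lf[i] = next[static_cast<unsigned char>(bwt[i])]++;
  }
  return lf;
}
```

For example, calling `lf_mapping("annb$aa")` (the BWT of `banana$`) maps the `$` at index 4 to position 0, since `$` is the smallest symbol.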

But the reason I cannot directly integrate GPUSorting is that, in my case, the segments are not contiguous. During suffix sorting, some suffixes become unique and are already in sorted order, which leaves “holes” in the array, as I only need to sort the segments that remain unsorted (each segment has length ≥ 2, and they are separated by these gaps). Because of that, I need an interface where I can pass both offsets (i.e., the start and end positions of each segment).
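
A host-side reference for the semantics described here might look like the sketch below: each segment is a half-open range given by its own begin and end offset, so positions outside every range are left untouched. The names `sort_segments` and `example_with_gaps` and the sample data are made up for illustration; CUB's `DeviceSegmentedSort` accepts separate begin and end offset iterators in a similar spirit, but this sketch uses only `std::sort` and does not call CUB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sort each half-open range [begin_offsets[s], end_offsets[s]) in place.
// Elements that fall in the gaps between segments are not moved.
void sort_segments(std::vector<int>& keys,
                   const std::vector<std::size_t>& begin_offsets,
                   const std::vector<std::size_t>& end_offsets) {
  assert(begin_offsets.size() == end_offsets.size());
  for (std::size_t s = 0; s < begin_offsets.size(); ++s) {
    assert(begin_offsets[s] <= end_offsets[s]);
    assert(end_offsets[s] <= keys.size());
    std::sort(keys.begin() + static_cast<std::ptrdiff_t>(begin_offsets[s]),
              keys.begin() + static_cast<std::ptrdiff_t>(end_offsets[s]));
  }
}

// Illustrative input: indices 0, 4 and 7 play the role of already-unique
// suffixes (the "holes"); only [1,4), [5,7) and [8,10) are sorted.
std::vector<int> example_with_gaps() {
  std::vector<int> keys = {9, 3, 1, 2, 50, 8, 7, 60, 5, 4};
  sort_segments(keys, {1, 5, 8}, {4, 7, 10});
  return keys;
}
```

With a single offsets array of length `num_segments + 1`, the gaps would have to be represented as extra one-element segments; passing begin and end offsets separately avoids that.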

@b0nes164

b0nes164 commented Nov 19, 2025

Super interesting; if I'm understanding it right, the suffixes are already sorted by length into bins, and you want to sort within those bins. If the bin has only one suffix, the bin is already sorted, creating the gap.

(Or possibly they're not sorted by length; instead, you just save the write offset as you say, and scatter directly into the segmented sort?)

Hmm, if it's just changing the interface, that sounds like it could be a very fun holiday project. :^)

@gonidelis
Member

@IlyaGrebnov We recently opened a relevant tracking issue for this: #6696

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[BUG]: Dated segmented sort tuning

5 participants