Increase threads per segment from 16 to 32 for segmented_sort #6636
base: main
Conversation
On Policy860 (SM 8.6+, Ada/Hopper), MediumSegmentPolicy uses 16 threads per segment, while the other architecture policies use 32 (a full warp). That cuts the “medium” capacity roughly in half:
Policy860: 16 threads × 7 items/thread = 112 items
Policy800: 32 threads × 7–11 items/thread = 224–352 items
As a result, segments in the [113, 352] range are classified as “large” and routed to the block-level radix sort instead of the faster warp-level merge sort.
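A minimal sketch of that arithmetic, using simplified names and the constants quoted above (not the actual CUB tuning source):

```cpp
// Illustrative only: simplified constants, not the actual CUB tuning policies.
// The "medium" path can handle at most threads_per_segment * items_per_thread
// keys per segment; anything longer falls through to block-level radix sort.
constexpr int medium_threads_860 = 16; // half warp (before this PR)
constexpr int medium_threads_800 = 32; // full warp
constexpr int items_per_thread   = 7;  // lower bound quoted above

constexpr int medium_capacity_860 = medium_threads_860 * items_per_thread; // 112
constexpr int medium_capacity_800 = medium_threads_800 * items_per_thread; // 224

// A segment is routed to the "large" (block-level radix sort) path once it
// exceeds the medium capacity:
constexpr bool is_large(int segment_size, int medium_capacity)
{
  return segment_size > medium_capacity;
}
```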
Hi! Thank you for bringing this to our attention! I think we should run some benchmarks ourselves with your proposed change and then get back to you.
@IlyaGrebnov I benchmarked your changes by first running our benchmark on main, then checking out your branch and rerunning the benchmark. The results I obtained are below. Overall, there are significant speedups for segment sizes in the range you mentioned ([113, 352]).
The crux of the issue is that segments in the [113, 352] range are currently classified as “large” and routed to the block-level radix sort instead of the faster warp-level merge sort. This PR changes Policy860 to use 32 threads per segment (a full warp), restoring the intended “medium” cutoff. Alternatively, we could keep the half-warp (16-thread) configuration and double the items per thread, which would also restore the intended cutoff (see the sketch below). @NaderAlAwar is this something you can help with?
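For illustration, both options yield the same cutoff (numbers assume the 7 items/thread lower bound quoted in the description):

```cpp
// Both options restore a 224-key "medium" capacity (illustrative numbers):
constexpr int full_warp_capacity = 32 * 7;  // this PR: full warp
constexpr int half_warp_capacity = 16 * 14; // alternative: half warp, doubled items/thread
static_assert(full_warp_capacity == half_warp_capacity, "same medium cutoff");
```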
As a side note, we are implementing a publicly accessible interface for users to specify tunings themselves, so hopefully in the near future you will be able to fully override our tuning values with whatever you want. I can't give you an estimate on when this will land, though.
If you don't have a hard requirement to use CUB, and are open to something more experimental, you could try one of the libraries from these papers:
"Fast segmented sort on GPUs" - Hou et al. https://dl.acm.org/doi/10.1145/3079079.3079105
"Faster segmented sort on GPUs" - Kobus et al.
There's also mine, but it's likely less heavily tested than those above, and I haven't had the time to work on it. But if your sort keys are under 32 bits, it will likely be faster than anything else, as it's a radix sort down to the finest granularity (sub-warp level). (Kobus/Hou perform a shuffle-based bitonic sort up to the warp level, then a bottom-up merge sort, and don't start radix sorting until quite large segments; a sketch of that warp-level step follows below. I'm not sure what CUB does.) The last time I benchmarked CUB (whatever version shipped with CUDA Toolkit 12.5), its performance was the worst of all the implementations, I think because it's not as granular in its casing of segment lengths. (See the performance cliff past segment length 128 in CUB::SegmentedSort.)
This data was captured using the benchmarking suite from the Kobus paper repo.
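For readers unfamiliar with the warp-level step mentioned above, here is a minimal sketch of a shuffle-based bitonic sort of 32 keys across a warp; it illustrates the general technique and is not code from the Hou or Kobus repositories:

```cpp
// Sketch: bitonic sort of 32 keys, one per warp lane, using warp shuffles.
__device__ unsigned warp_bitonic_sort(unsigned key)
{
  const unsigned lane = threadIdx.x & 31u;
  for (int k = 2; k <= 32; k <<= 1)      // size of the bitonic subsequences
  {
    for (int j = k >> 1; j > 0; j >>= 1) // compare-exchange distance
    {
      unsigned partner   = __shfl_xor_sync(0xffffffffu, key, j);
      bool     ascending = (lane & k) == 0; // direction of this subsequence
      bool     is_lower  = (lane & j) == 0; // lower lane of the pair
      key = (is_lower == ascending) ? min(key, partner) : max(key, partner);
    }
  }
  return key; // lane i now holds the i-th smallest key
}

__global__ void sort_one_warp(unsigned* keys) // launch with <<<1, 32>>>
{
  keys[threadIdx.x] = warp_bitonic_sort(keys[threadIdx.x]);
}
```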
GPUSorting is an excellent project. I have learned a few things from your radix sort implementation, which I am reusing for BWT LF-mapping (it is similar to a single pass of radix sort, but instead of scattering the keys you write the rank/position where each element needs to go). The reason I can not directly integrate GPUSorting is that, in my case, the segments are not contiguous. During suffix sorting, some suffixes become unique and are already in sorted order, which leaves “holes” in the array: I only need to sort the segments that remain unsorted (each segment has length ≥ 2, and they are separated by these gaps). Because of that, I need an interface where I can pass explicit offsets (the start and end positions of each segment).
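For context, cub::DeviceSegmentedSort already takes independent begin/end offsets per segment, so segments need not cover the whole input; below is a minimal sketch of that interface (the wrapper name sort_sparse_segments and the omitted error handling are illustrative):

```cpp
#include <cub/device/device_segmented_sort.cuh>

#include <cstddef>
#include <cstdint>

// Hypothetical wrapper: sorts num_segments independent key ranges given by
// [d_begin_offsets[i], d_end_offsets[i]); elements in the gaps between
// segments are not part of any sort.
void sort_sparse_segments(std::uint32_t* d_keys_in, std::uint32_t* d_keys_out,
                          int num_items, int num_segments,
                          const int* d_begin_offsets, const int* d_end_offsets)
{
  void* d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;

  // First call: query the required temporary storage size.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items,
                                     num_segments, d_begin_offsets,
                                     d_end_offsets);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call: perform the segmented sort.
  cub::DeviceSegmentedSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                     d_keys_in, d_keys_out, num_items,
                                     num_segments, d_begin_offsets,
                                     d_end_offsets);
  cudaFree(d_temp_storage);
}
```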
Super interesting; if I'm understanding it right, the suffixes are already sorted by length into bins, and you want to sort within those bins. If a bin has only one suffix, it is already sorted, creating the gap. (Or possibly they're not sorted by length, and instead you just save the write offset as you say and scatter directly into the segmented sort?) Hmm, if it's just a matter of changing the interface, that sounds like it could be a very fun holiday project. :^)
@IlyaGrebnov We recently opened a relevant tracking issue for this: #6696

Description
On Policy860 (SM 8.6+, Ada/Hopper), MediumSegmentPolicy was using 16 threads per segment, while the other architecture policies use a full warp (32 threads). That effectively cuts the “medium” capacity roughly in half:
Policy860: 16 threads × 7 items/thread = 112 items
Policy800: 32 threads × 7–11 items/thread = 224–352 items
As a result, segments in the [113, 352] range were classified as “large” and routed to the block-level radix sort instead of the faster warp-level merge sort. This PR changes Policy860 to use 32 threads per segment (a full warp), aligning its MediumSegmentPolicy with the other architectures and restoring the intended “medium” cutoff on Ada/Hopper.
On a standard industry benchmark using the libcubwt library for Burrows-Wheeler transform construction, throughput improves by ~7% on average, with the best-affected cases (e.g., proteins.001.1, rs.13) speeding up by as much as ~50%. A few inputs fluctuate within ~2%, which looks like acceptable run-to-run variation.
closes #6173
Checklist