Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585

tenpercent · 2026-01-16T03:17:37Z

Summary

- Replace recursive template instantiation in sequence_gen and uniform_sequence_gen with the compiler intrinsic __make_integer_seq and pack expansion
- Reduce maximum template nesting depth from 90 to 26 levels (71% reduction)
- Cut template instantiation time by ~50% (36.8s to 18.7s wall-clock)

Performance Results

Measured on example_grouped_conv_fwd_xdl_fp16:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Maximum Nesting Depth | 90 | 26 | 71% reduction |
| Wall-Clock Template Time | 36.8s | 18.7s | 49% faster |
| Cumulative Template Time | 56.6s | 25.8s | 54% faster |

Technical Details

The previous implementation used recursive divide-and-conquer template instantiation:

- sequence_gen_impl<IBegin, NRemain, G> recursively split the range and merged the results
- This gave O(log N) nesting depth from the recursion itself, plus additional depth from each sequence_merge instantiation
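For readers unfamiliar with the pattern, the divide-and-conquer scheme described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: `Sequence`, `Number`, `sequence_merge2`, the halving split, and the `Doubler` functor are simplified, hypothetical stand-ins, and only the name `sequence_gen_impl<IBegin, NRemain, G>` is taken from the description.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the library's Sequence and Number types.
template <int... Is>
struct Sequence {};

template <int I>
struct Number {};

// Hypothetical pairwise merge: concatenates two sequences.
template <class A, class B>
struct sequence_merge2;

template <int... As, int... Bs>
struct sequence_merge2<Sequence<As...>, Sequence<Bs...>> {
    using type = Sequence<As..., Bs...>;
};

// Recursive generator: split [IBegin, IBegin + NRemain) into two parts,
// generate each part, then merge. Every level of the split adds nested
// instantiations (the halving here is an assumption of this sketch).
template <int IBegin, int NRemain, class G>
struct sequence_gen_impl {
    static constexpr int NLeft  = NRemain / 2;
    static constexpr int NRight = NRemain - NLeft;
    using type = typename sequence_merge2<
        typename sequence_gen_impl<IBegin, NLeft, G>::type,
        typename sequence_gen_impl<IBegin + NLeft, NRight, G>::type>::type;
};

// Base case: a single element, produced by calling the functor.
template <int I, class G>
struct sequence_gen_impl<I, 1, G> {
    using type = Sequence<G{}(Number<I>{})>;
};

// Base case: an empty range.
template <int I, class G>
struct sequence_gen_impl<I, 0, G> {
    using type = Sequence<>;
};

// Hypothetical functor for illustration: maps index I to 2 * I.
struct Doubler {
    template <int I>
    constexpr int operator()(Number<I>) const { return 2 * I; }
};

static_assert(std::is_same<sequence_gen_impl<0, 5, Doubler>::type,
                           Sequence<0, 2, 4, 6, 8>>::value,
              "recursive generator yields 0, 2, 4, 6, 8");
```

Each call to `sequence_gen_impl` instantiates two children plus a merge, which is where the extra nesting depth comes from.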

The new implementation uses __make_integer_seq (Clang/HIP compiler intrinsic):

- Generates the indices 0..N-1 in a single compiler operation (O(1) depth)
- Applies the functor via pack expansion: Sequence<F{}(Number<Is>{})...>
- Requires no recursive template instantiation
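The flat approach listed above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in this PR: `Sequence`, `Number`, `make_indices`, `sequence_gen_sketch`, and `Square` are hypothetical names, and the fallback to `std::make_integer_sequence` is added here only so the snippet also compiles on toolchains without the intrinsic.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the library's Sequence and Number types.
template <int... Is>
struct Sequence {};

template <int I>
struct Number {};

// Use the intrinsic when the compiler advertises it; otherwise fall back
// to the standard facility (a portability assumption of this sketch).
#if defined(__has_builtin)
#  if __has_builtin(__make_integer_seq)
#    define SKETCH_USE_INTRINSIC 1
#  endif
#endif

#ifdef SKETCH_USE_INTRINSIC
// __make_integer_seq<S, T, N> expands to S<T, 0, 1, ..., N-1> in one step.
template <int N>
using make_indices = __make_integer_seq<std::integer_sequence, int, N>;
#else
template <int N>
using make_indices = std::make_integer_sequence<int, N>;
#endif

template <class Indices, class F>
struct sequence_gen_sketch_impl;

// A single partial specialization applies F to every index via pack
// expansion; no recursion is involved.
template <int... Is, class F>
struct sequence_gen_sketch_impl<std::integer_sequence<int, Is...>, F> {
    using type = Sequence<F{}(Number<Is>{})...>;
};

template <int N, class F>
using sequence_gen_sketch =
    typename sequence_gen_sketch_impl<make_indices<N>, F>::type;

// Hypothetical functor for illustration: maps index I to I * I.
struct Square {
    template <int I>
    constexpr int operator()(Number<I>) const { return I * I; }
};

static_assert(std::is_same<sequence_gen_sketch<4, Square>,
                           Sequence<0, 1, 4, 9>>::value,
              "flat generator yields 0, 1, 4, 9");
```

The functor must be default-constructible and callable in a constant expression, since `F{}(Number<Is>{})` is evaluated as a template argument.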

Test plan

- Build example_grouped_conv_fwd_xdl_fp16 successfully
- Run full CI test suite
- Verify no functional regressions

Replace recursive template instantiation with compiler intrinsic __make_integer_seq and pack expansion for O(1) instantiation depth.

Before: Maximum nesting depth of 90 levels with recursive divide-and-conquer
After: Maximum nesting depth of 26 levels using flat pack expansion

Performance improvements measured on example_grouped_conv_fwd_xdl_fp16:
- Template instantiation wall-clock time: 36.8s -> 18.7s (49% faster)
- Template instantiation cumulative time: 56.6s -> 25.8s (54% faster)
- Maximum nesting depth: 90 -> 26 (71% reduction)

The key changes:
- sequence_gen: Uses __make_integer_seq to generate indices 0..N-1, then applies functor F via pack expansion in a single step
- uniform_sequence_gen: Uses __make_integer_seq with pack expansion to generate N copies of a constant value

Co-Authored-By: Claude <[email protected]>
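The uniform_sequence_gen change in the commit message above can be illustrated with a similar sketch. The names `Sequence` and `uniform_sequence_sketch` are hypothetical, and `std::make_integer_sequence` is used here as a portable stand-in for calling `__make_integer_seq` directly.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the library's Sequence type.
template <int... Is>
struct Sequence {};

template <class Indices, int Value>
struct uniform_sequence_sketch_impl;

// Each index Is is discarded; the expansion only uses the pack's length
// to repeat Value that many times.
template <int... Is, int Value>
struct uniform_sequence_sketch_impl<std::integer_sequence<int, Is...>, Value> {
    using type = Sequence<((void)Is, Value)...>;
};

// Portable stand-in: the PR description names __make_integer_seq instead.
template <int N, int Value>
using uniform_sequence_sketch = typename uniform_sequence_sketch_impl<
    std::make_integer_sequence<int, N>, Value>::type;

static_assert(std::is_same<uniform_sequence_sketch<3, 7>,
                           Sequence<7, 7, 7>>::value,
              "three copies of 7");
```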

Replace linear recursive instantiation with direct pack expansion for 1-4 sequences, and binary tree reduction for larger cases.

Before: O(N) depth for merging N sequences
After: O(log N) depth with O(1) for up to 4 sequences

This further reduces maximum nesting depth from 26 to 22 levels when combined with the previous sequence_gen optimization.

Co-Authored-By: Claude <[email protected]>
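One way to picture the merge strategy from the commit message above is the sketch below: direct pack expansion for up to four sequences, and a split-in-half reduction beyond that. All names (`Sequence`, `type_list`, `slice`, `merge_list`, `merge_sketch_t`) are hypothetical, and slicing via `std::tuple_element_t` is just one convenient way to halve a type pack; the actual change may do this differently.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the library's Sequence type.
template <int... Is>
struct Sequence {};

template <class... Ts>
struct type_list {};

// slice<Offset, List, index_sequence<Is...>> picks elements Offset + Is.
template <std::size_t Offset, class List, class Indices>
struct slice;

template <std::size_t Offset, class... Ts, std::size_t... Is>
struct slice<Offset, type_list<Ts...>, std::index_sequence<Is...>> {
    using type =
        type_list<std::tuple_element_t<Offset + Is, std::tuple<Ts...>>...>;
};

// General case (5 or more sequences): split in half, merge each half,
// then merge the two results, giving O(log N) nesting depth.
// Only intended for lists of Sequence types.
template <class List>
struct merge_list;

template <class... Ts>
struct merge_list<type_list<Ts...>> {
    static constexpr std::size_t Half = sizeof...(Ts) / 2;
    using left  = typename slice<0, type_list<Ts...>,
                                 std::make_index_sequence<Half>>::type;
    using right = typename slice<Half, type_list<Ts...>,
                                 std::make_index_sequence<sizeof...(Ts) - Half>>::type;
    using type  = typename merge_list<
        type_list<typename merge_list<left>::type,
                  typename merge_list<right>::type>>::type;
};

// Direct cases: 0 to 4 sequences are concatenated in one pack expansion.
template <>
struct merge_list<type_list<>> { using type = Sequence<>; };

template <int... A>
struct merge_list<type_list<Sequence<A...>>> { using type = Sequence<A...>; };

template <int... A, int... B>
struct merge_list<type_list<Sequence<A...>, Sequence<B...>>> {
    using type = Sequence<A..., B...>;
};

template <int... A, int... B, int... C>
struct merge_list<type_list<Sequence<A...>, Sequence<B...>, Sequence<C...>>> {
    using type = Sequence<A..., B..., C...>;
};

template <int... A, int... B, int... C, int... D>
struct merge_list<type_list<Sequence<A...>, Sequence<B...>, Sequence<C...>,
                            Sequence<D...>>> {
    using type = Sequence<A..., B..., C..., D...>;
};

template <class... Seqs>
using merge_sketch_t = typename merge_list<type_list<Seqs...>>::type;

static_assert(
    std::is_same<merge_sketch_t<Sequence<1>, Sequence<2, 3>, Sequence<>,
                                Sequence<4>, Sequence<5, 6>, Sequence<7>>,
                 Sequence<1, 2, 3, 4, 5, 6, 7>>::value,
    "six sequences merge in order");
```

With six inputs the general case splits into two groups of three, each handled by a direct specialization, and the two results are joined by the two-sequence specialization.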

tenpercent and others added 2 commits January 15, 2026 21:15

tenpercent force-pushed the tenpercent/old-ck-pack-rewrites branch from a477221 to 57c8cb1 Compare January 16, 2026 03:34

tenpercent mentioned this pull request Jan 16, 2026 in: Rewrite StaticallyIndexedArray to use C-array instead of Tuple #3587 (Draft, 2 tasks)
