@tenpercent
Contributor

Summary

  • Replace recursive template instantiation in sequence_gen and uniform_sequence_gen with the compiler intrinsic __make_integer_seq and pack expansion
  • Reduces maximum template nesting depth from 90 to 26 levels (71% reduction)
  • Cuts template instantiation time roughly in half (49% less wall-clock time, 54% less cumulative time)

Performance Results

Measured on example_grouped_conv_fwd_xdl_fp16:

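| Metric                   | Before | After | Improvement   |
|--------------------------|--------|-------|---------------|
| Maximum Nesting Depth    | 90     | 26    | 71% reduction |
| Wall-Clock Template Time | 36.8s  | 18.7s | 49% faster    |
| Cumulative Template Time | 56.6s  | 25.8s | 54% faster    |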

Technical Details

The previous implementation used recursive divide-and-conquer template instantiation:

  • sequence_gen_impl<IBegin, NRemain, G> recursively split the range and merged results
  • This created O(log N) depth for the recursion plus additional depth from sequence_merge
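
For readers who have not seen the old pattern, here is a minimal sketch of a divide-and-conquer generator of the kind described above. It is an illustration only, not the code removed by this PR: `Sequence`, `Number`, `merge2`, and the `TimesTwo` functor are simplified stand-ins, and the exact splitting rule is an assumption.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the library's Sequence and Number types.
template <int... Is>
struct Sequence {};

template <int I>
struct Number {};

// Concatenates two sequences (simplified stand-in for sequence_merge).
template <typename A, typename B>
struct merge2;

template <int... Xs, int... Ys>
struct merge2<Sequence<Xs...>, Sequence<Ys...>> {
    using type = Sequence<Xs..., Ys...>;
};

// Recursive case: split [IBegin, IBegin + NRemain) in two, generate each
// half, then merge. Every split adds one level of instantiation depth,
// and the merge adds more on top of that.
template <int IBegin, int NRemain, typename G>
struct sequence_gen_impl {
    static constexpr int NLeft = NRemain / 2;
    using type = typename merge2<
        typename sequence_gen_impl<IBegin, NLeft, G>::type,
        typename sequence_gen_impl<IBegin + NLeft, NRemain - NLeft, G>::type>::type;
};

// Base case: a single index, mapped through the functor G.
template <int I, typename G>
struct sequence_gen_impl<I, 1, G> {
    using type = Sequence<G{}(Number<I>{})>;
};

// Base case: an empty range.
template <int I, typename G>
struct sequence_gen_impl<I, 0, G> {
    using type = Sequence<>;
};

// Hypothetical functor used only for this illustration.
struct TimesTwo {
    template <int I>
    constexpr int operator()(Number<I>) const { return 2 * I; }
};

static_assert(std::is_same<sequence_gen_impl<0, 5, TimesTwo>::type,
                           Sequence<0, 2, 4, 6, 8>>::value,
              "five indices mapped through TimesTwo");
```

The checks run at compile time, so the snippet only needs to compile (C++14 or later).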

The new implementation uses __make_integer_seq (a Clang/HIP compiler intrinsic):

  • Generates indices 0..N-1 in a single compiler operation (O(1) depth)
  • Applies the functor via pack expansion: Sequence<F{}(Number<Is>{})...>
  • No recursive template instantiation required
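
The flat approach can be sketched as follows. This is a hedged illustration under stated assumptions rather than the PR's actual code: `Sequence` and `Number` are simplified stand-ins, `detail::make_indices`, `detail::sequence_gen_helper`, and the `SKETCH_HAS_MAKE_INTEGER_SEQ` macro are hypothetical names, and the fallback to `std::make_integer_sequence` is added here only so the sketch also compiles on compilers without the intrinsic.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the library's Sequence and Number types.
template <int... Is>
struct Sequence {};

template <int I>
struct Number {};

// Detect the intrinsic; the macro name is made up for this sketch.
#if defined(__has_builtin)
#if __has_builtin(__make_integer_seq)
#define SKETCH_HAS_MAKE_INTEGER_SEQ 1
#endif
#endif

namespace detail {
#ifdef SKETCH_HAS_MAKE_INTEGER_SEQ
// One compiler operation yields std::integer_sequence<int, 0, ..., N-1>.
template <int N>
using make_indices = __make_integer_seq<std::integer_sequence, int, N>;
#else
// Portable fallback (an assumption of this sketch, not part of the PR).
template <int N>
using make_indices = std::make_integer_sequence<int, N>;
#endif

template <typename F, typename Indices>
struct sequence_gen_helper;

// Receives the whole index pack and applies F to every index in a single
// pack expansion, so no recursive instantiation is needed.
template <typename F, int... Is>
struct sequence_gen_helper<F, std::integer_sequence<int, Is...>> {
    using type = Sequence<F{}(Number<Is>{})...>;
};
} // namespace detail

template <int N, typename F>
struct sequence_gen {
    using type = typename detail::sequence_gen_helper<F, detail::make_indices<N>>::type;
};

// Hypothetical functor used only for this illustration.
struct Square {
    template <int I>
    constexpr int operator()(Number<I>) const { return I * I; }
};

static_assert(std::is_same<sequence_gen<4, Square>::type,
                           Sequence<0, 1, 4, 9>>::value,
              "indices 0..3 squared");
```

The functor must be default-constructible with a constexpr call operator, because `F{}(Number<Is>{})` is evaluated as a template argument.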

Test plan

  • Build example_grouped_conv_fwd_xdl_fp16 successfully
  • Run full CI test suite
  • Verify no functional regressions

tenpercent and others added 2 commits January 15, 2026 21:15
Replace recursive template instantiation with compiler intrinsic
__make_integer_seq and pack expansion for O(1) instantiation depth.

Before: Maximum nesting depth of 90 levels with recursive divide-and-conquer
After: Maximum nesting depth of 26 levels using flat pack expansion

Performance improvements measured on example_grouped_conv_fwd_xdl_fp16:
- Template instantiation wall-clock time: 36.8s -> 18.7s (49% faster)
- Template instantiation cumulative time: 56.6s -> 25.8s (54% faster)
- Maximum nesting depth: 90 -> 26 (71% reduction)

The key changes:
- sequence_gen: Uses __make_integer_seq to generate indices 0..N-1,
  then applies functor F via pack expansion in a single step
- uniform_sequence_gen: Uses __make_integer_seq with pack expansion
  to generate N copies of a constant value

Co-Authored-By: Claude <[email protected]>
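
As a rough illustration of the uniform_sequence_gen change described in this commit, the sketch below expands a generated index pack while discarding each index. It assumes a simplified `Sequence` type; `detail::make_indices`, `detail::uniform_helper`, and the `SKETCH_HAS_MAKE_INTEGER_SEQ` macro are hypothetical names, and the `std::make_integer_sequence` fallback exists only so the sketch compiles without the intrinsic.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the library's Sequence type.
template <int... Is>
struct Sequence {};

#if defined(__has_builtin)
#if __has_builtin(__make_integer_seq)
#define SKETCH_HAS_MAKE_INTEGER_SEQ 1
#endif
#endif

namespace detail {
#ifdef SKETCH_HAS_MAKE_INTEGER_SEQ
template <int N>
using make_indices = __make_integer_seq<std::integer_sequence, int, N>;
#else
// Portable fallback (an assumption of this sketch, not part of the PR).
template <int N>
using make_indices = std::make_integer_sequence<int, N>;
#endif

template <int Value, typename Indices>
struct uniform_helper;

// The index pack only controls how many times the expansion repeats:
// each index is discarded and Value is emitted in its place.
template <int Value, int... Is>
struct uniform_helper<Value, std::integer_sequence<int, Is...>> {
    using type = Sequence<(static_cast<void>(Is), Value)...>;
};
} // namespace detail

template <int N, int Value>
struct uniform_sequence_gen {
    using type = typename detail::uniform_helper<Value, detail::make_indices<N>>::type;
};

static_assert(std::is_same<uniform_sequence_gen<3, 7>::type,
                           Sequence<7, 7, 7>>::value,
              "three copies of 7");
```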

Replace linear recursive instantiation with direct pack expansion
for 1-4 sequences, and binary tree reduction for larger cases.

Before: O(N) depth for merging N sequences
After: O(log N) depth with O(1) for up to 4 sequences

This further reduces maximum nesting depth from 26 to 22 levels
when combined with the previous sequence_gen optimization.

Co-Authored-By: Claude <[email protected]>
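
The merge strategy in this commit message could look roughly like the sketch below: dedicated specializations concatenate up to four sequences in one pack expansion, and longer argument lists are split in half and merged recursively. This is an assumption-laden illustration, not the PR's code. `detail::merge_halves` is a hypothetical helper, and `std::tuple_element` is used here only as a convenient way to pick arguments by position; depending on the standard library, it may add instantiation depth of its own.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the library's Sequence type.
template <int... Is>
struct Sequence {};

template <typename... Seqs>
struct sequence_merge;

namespace detail {
template <typename Tuple, typename LeftIdx, typename RightIdx>
struct merge_halves;

// Merges the first sizeof...(L) sequences and the remaining sizeof...(R)
// sequences separately, then joins the two results.
template <typename... Seqs, std::size_t... L, std::size_t... R>
struct merge_halves<std::tuple<Seqs...>, std::index_sequence<L...>, std::index_sequence<R...>> {
    using tuple_t = std::tuple<Seqs...>;
    using left = typename sequence_merge<
        typename std::tuple_element<L, tuple_t>::type...>::type;
    using right = typename sequence_merge<
        typename std::tuple_element<sizeof...(L) + R, tuple_t>::type...>::type;
    using type = typename sequence_merge<left, right>::type;
};
} // namespace detail

// General case (five or more sequences): binary split, O(log N) depth.
template <typename... Seqs>
struct sequence_merge {
    static constexpr std::size_t N = sizeof...(Seqs);
    using type = typename detail::merge_halves<
        std::tuple<Seqs...>,
        std::make_index_sequence<N / 2>,
        std::make_index_sequence<N - N / 2>>::type;
};

// Zero to four sequences: a single direct pack expansion, no recursion.
template <>
struct sequence_merge<> {
    using type = Sequence<>;
};

template <int... As>
struct sequence_merge<Sequence<As...>> {
    using type = Sequence<As...>;
};

template <int... As, int... Bs>
struct sequence_merge<Sequence<As...>, Sequence<Bs...>> {
    using type = Sequence<As..., Bs...>;
};

template <int... As, int... Bs, int... Cs>
struct sequence_merge<Sequence<As...>, Sequence<Bs...>, Sequence<Cs...>> {
    using type = Sequence<As..., Bs..., Cs...>;
};

template <int... As, int... Bs, int... Cs, int... Ds>
struct sequence_merge<Sequence<As...>, Sequence<Bs...>, Sequence<Cs...>, Sequence<Ds...>> {
    using type = Sequence<As..., Bs..., Cs..., Ds...>;
};

static_assert(std::is_same<
                  sequence_merge<Sequence<0>, Sequence<1, 2>, Sequence<>,
                                 Sequence<3>, Sequence<4, 5>, Sequence<6>>::type,
                  Sequence<0, 1, 2, 3, 4, 5, 6>>::value,
              "six sequences merged via one binary split");
```

With six inputs the general case splits into two groups of three, each handled by the three-sequence specialization, and the two results are joined by the two-sequence specialization.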