
Conversation

tenpercent (Contributor) commented on Jan 16, 2026

Summary

  • Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
  • Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
  • Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes suffered excessive template instantiation from two sources:

  1. Nested static_for loops with lambdas (918 applier::operator() instantiations)
  2. generate_tuple with lambdas (78+ instantiations per class)

Results (example_grouped_conv_fwd_xdl_fp16)

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Template instantiation time | 23.4 s | 19.1 s | 18% reduction |
| applier instantiations | 1132 | 127 | 89% reduction |
| generate_tuple lambdas | 178 | 96 | 46% reduction |

find_in_tuple_of_sequences Helper

The helper uses O(1) template depth via pack expansion instead of O(N) template recursion; a sketch follows the table below:

| Metric | Recursive | O(1) Pack | Improvement |
| --- | --- | --- | --- |
| Instantiations | 541 | 273 | 50% reduction |
| Time | 430 ms | 133 ms | 69% reduction |
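
Below is a minimal self-contained sketch of the technique, with simple stand-ins for composable_kernel's Sequence and tuple machinery; all names, signatures, and the FindResult type here are illustrative assumptions, not the actual helper. Each sequence is scanned inside a constexpr function, and a single pack expansion collects the per-sequence results into plain constexpr arrays, keeping template depth constant in the number of sequences:

```cpp
#include <array>
#include <cstddef>

// Stand-in for CK's Sequence (illustrative only).
template <std::size_t... Is>
struct Sequence
{
    static constexpr std::size_t size() { return sizeof...(Is); }
};

// Which sequence contains the value, and where inside it.
struct FindResult
{
    std::size_t seq_index;
    std::size_t pos_in_seq;
    bool found;
};

namespace detail {

// Position of Value inside one sequence, or the sequence's size if absent.
template <std::size_t Value, std::size_t... Is>
constexpr std::size_t find_pos(Sequence<Is...>)
{
    constexpr std::size_t arr[] = {Is..., Value}; // trailing sentinel avoids a zero-size array
    for(std::size_t j = 0; j < sizeof...(Is); ++j)
        if(arr[j] == Value)
            return j;
    return sizeof...(Is);
}

} // namespace detail

// One pack expansion turns the whole search into plain constexpr data:
// template depth is O(1) in the number of sequences, and no per-iteration
// closure types (the source of the applier instantiations) are created.
template <std::size_t Value, typename... Seqs>
constexpr FindResult find_in_tuple_of_sequences()
{
    constexpr std::size_t n = sizeof...(Seqs);
    constexpr std::array<std::size_t, n> positions = {detail::find_pos<Value>(Seqs{})...};
    constexpr std::array<std::size_t, n> lengths   = {Seqs::size()...};
    for(std::size_t i = 0; i < n; ++i)
        if(positions[i] < lengths[i])
            return {i, positions[i], true};
    return {n, 0, false};
}

// Example: value 5 lives in sequence index 1, at position 1.
static_assert(find_in_tuple_of_sequences<5, Sequence<0, 1>, Sequence<4, 5, 6>>().seq_index == 1);
static_assert(find_in_tuple_of_sequences<5, Sequence<0, 1>, Sequence<4, 5, 6>>().pos_in_seq == 1);
```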

Test Plan

  • Waiting for full CI

PR Stack

| # | PR | Description |
| --- | --- | --- |
| 1 | #3585 | sequence_gen with __make_integer_seq |
| 2 | #3588 | generate_identity_sequences helper |
| 3 | #3589 | Named functors in transform_tensor_descriptor |
| 4 | #3590 | container_concat optimization |
| 5 | #3596 | O(1) pack expansion rewrites |
| 6 | #3600 | TensorDescriptor/TensorAdaptor lambda elimination |

The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).

Replace them with the find_in_tuple_of_sequences helper, which uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.
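
To illustrate how this collapses the nested search, here is a hypothetical use of the sketch above (again assumed names, not the real TensorDescriptor internals): the sequence index identifies the transform, and the position within that sequence identifies the upper dimension.

```cpp
// Hypothetical example reusing the sketch types above: suppose transform 0
// produces hidden dims {3, 4} and transform 1 produces {5, 6, 7}. Searching
// for hidden dim 6 picks out transform 1, upper dimension 1, with a single
// helper instantiation instead of nested static_for lambdas.
constexpr auto result = find_in_tuple_of_sequences<6, Sequence<3, 4>, Sequence<5, 6, 7>>();
static_assert(result.found);
static_assert(result.seq_index == 1);  // the transform that owns the hidden dim
static_assert(result.pos_in_seq == 1); // its position among that transform's upper dims
```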

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
…tSize

The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385 ms).

Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.
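
As a minimal sketch of the pattern, with std::tuple standing in for CK's containers (all names here are assumptions, not the actual InitializeElementSize code): a lambda handed to a generate-style helper acquires a unique closure type per enclosing instantiation, whereas a named function with direct pack expansion creates none.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Before (shape of the old pattern): a generate-style helper taking a
// callable; every enclosing instantiation stamps out a fresh closure type.
template <typename F, std::size_t... Is>
constexpr std::size_t generate_product(F f, std::index_sequence<Is...>)
{
    return (f(std::integral_constant<std::size_t, Is>{}) * ... * std::size_t{1});
}

template <typename... Ls>
constexpr std::size_t element_size_old(const std::tuple<Ls...>& lengths)
{
    return generate_product(
        [&](auto i) { return std::get<decltype(i)::value>(lengths); },
        std::index_sequence_for<Ls...>{});
}

// After: direct pack expansion through a named function -- no closure type,
// and one instantiation shared by every caller with the same tuple type.
template <typename... Ls, std::size_t... Is>
constexpr std::size_t element_size_impl(const std::tuple<Ls...>& lengths,
                                        std::index_sequence<Is...>)
{
    return (std::get<Is>(lengths) * ... * std::size_t{1});
}

template <typename... Ls>
constexpr std::size_t element_size_new(const std::tuple<Ls...>& lengths)
{
    return element_size_impl(lengths, std::index_sequence_for<Ls...>{});
}

static_assert(element_size_old(std::make_tuple(2, 3, 4)) == 24);
static_assert(element_size_new(std::make_tuple(2, 3, 4)) == 24);
```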

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from f5ada17 to 9942fd6 on January 17, 2026 at 03:51
tenpercent force-pushed the mpodkory/find-transform-optimization branch from 11a8eed to e6040e1 on January 17, 2026 at 03:51
TensorAdaptor has the same InitializeElementSize and
GetTransformAndItsUpperDimension patterns as TensorDescriptor.
Apply the same optimizations:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17 ms eliminated)
tenpercent force-pushed the mpodkory/find-transform-optimization branch from e6040e1 to a565d87 on January 17, 2026 at 05:39