
Conversation

tenpercent (Contributor) commented on Jan 16, 2026

Summary

  • Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
  • Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
  • Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes suffered excessive template instantiation from two sources:

  1. Nested static_for loops with lambdas (918 applier::operator() instantiations)
  2. generate_tuple with lambdas (78+ instantiations per class)

Results (example_grouped_conv_fwd_xdl_fp16)

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Template instantiation time | 23.4 s | 19.1 s | 18% reduction |
| applier instantiations | 1132 | 127 | 89% reduction |
| generate_tuple lambdas | 178 | 96 | 46% reduction |

find_in_tuple_of_sequences Helper

The helper uses O(1) template depth via pack expansion instead of O(N) template recursion; a sketch follows the table below:

| Metric | Recursive | O(1) Pack | Improvement |
| --- | --- | --- | --- |
| Instantiations | 541 | 273 | 50% reduction |
| Time | 430 ms | 133 ms | 69% reduction |
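
Below is a minimal self-contained sketch of the technique, with simple stand-ins for composable_kernel's Sequence and tuple machinery; all names, signatures, and the FindResult type here are illustrative assumptions, not the actual helper. Each sequence is scanned inside a constexpr function, and a single pack expansion collects the per-sequence results into plain constexpr arrays, keeping template depth constant in the number of sequences:

```cpp
#include <array>
#include <cstddef>

// Stand-in for CK's Sequence (illustrative only).
template <std::size_t... Is>
struct Sequence
{
    static constexpr std::size_t size() { return sizeof...(Is); }
};

// Which sequence contains the value, and where inside it.
struct FindResult
{
    std::size_t seq_index;
    std::size_t pos_in_seq;
    bool found;
};

namespace detail {

// Position of Value inside one sequence, or the sequence's size if absent.
template <std::size_t Value, std::size_t... Is>
constexpr std::size_t find_pos(Sequence<Is...>)
{
    constexpr std::size_t arr[] = {Is..., Value}; // trailing sentinel avoids a zero-size array
    for(std::size_t j = 0; j < sizeof...(Is); ++j)
        if(arr[j] == Value)
            return j;
    return sizeof...(Is);
}

} // namespace detail

// One pack expansion turns the whole search into plain constexpr data:
// template depth is O(1) in the number of sequences, and no per-iteration
// closure types (the source of the applier instantiations) are created.
template <std::size_t Value, typename... Seqs>
constexpr FindResult find_in_tuple_of_sequences()
{
    constexpr std::size_t n = sizeof...(Seqs);
    constexpr std::array<std::size_t, n> positions = {detail::find_pos<Value>(Seqs{})...};
    constexpr std::array<std::size_t, n> lengths   = {Seqs::size()...};
    for(std::size_t i = 0; i < n; ++i)
        if(positions[i] < lengths[i])
            return {i, positions[i], true};
    return {n, 0, false};
}

// Example: value 5 lives in sequence index 1, at position 1.
static_assert(find_in_tuple_of_sequences<5, Sequence<0, 1>, Sequence<4, 5, 6>>().seq_index == 1);
static_assert(find_in_tuple_of_sequences<5, Sequence<0, 1>, Sequence<4, 5, 6>>().pos_in_seq == 1);
```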

Test Plan

  • Waiting for full CI

PR Stack

| # | PR | Description |
| --- | --- | --- |
| 1 | #3585 | sequence_gen with __make_integer_seq |
| 2 | #3588 | generate_identity_sequences helper |
| 3 | #3589 | Named functors in transform_tensor_descriptor |
| 4 | #3590 | container_concat optimization |
| 5 | #3596 | O(1) pack expansion rewrites |
| 6 | #3600 | TensorDescriptor/TensorAdaptor lambda elimination |

The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).

Replace them with the find_in_tuple_of_sequences helper, which uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.
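
To illustrate how this collapses the nested search, here is a hypothetical use of the sketch above (again assumed names, not the real TensorDescriptor internals): the sequence index identifies the transform, and the position within that sequence identifies the upper dimension.

```cpp
// Hypothetical example reusing the sketch types above: suppose transform 0
// produces hidden dims {3, 4} and transform 1 produces {5, 6, 7}. Searching
// for hidden dim 6 picks out transform 1, upper dimension 1, with a single
// helper instantiation instead of nested static_for lambdas.
constexpr auto result = find_in_tuple_of_sequences<6, Sequence<3, 4>, Sequence<5, 6, 7>>();
static_assert(result.found);
static_assert(result.seq_index == 1);  // the transform that owns the hidden dim
static_assert(result.pos_in_seq == 1); // its position among that transform's upper dims
```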

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
…tSize

The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385 ms).

Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.
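
As a minimal sketch of the pattern, with std::tuple standing in for CK's containers (all names here are assumptions, not the actual InitializeElementSize code): a lambda handed to a generate-style helper acquires a unique closure type per enclosing instantiation, whereas a named function with direct pack expansion creates none.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Before (shape of the old pattern): a generate-style helper taking a
// callable; every enclosing instantiation stamps out a fresh closure type.
template <typename F, std::size_t... Is>
constexpr std::size_t generate_product(F f, std::index_sequence<Is...>)
{
    return (f(std::integral_constant<std::size_t, Is>{}) * ... * std::size_t{1});
}

template <typename... Ls>
constexpr std::size_t element_size_old(const std::tuple<Ls...>& lengths)
{
    return generate_product(
        [&](auto i) { return std::get<decltype(i)::value>(lengths); },
        std::index_sequence_for<Ls...>{});
}

// After: direct pack expansion through a named function -- no closure type,
// and one instantiation shared by every caller with the same tuple type.
template <typename... Ls, std::size_t... Is>
constexpr std::size_t element_size_impl(const std::tuple<Ls...>& lengths,
                                        std::index_sequence<Is...>)
{
    return (std::get<Is>(lengths) * ... * std::size_t{1});
}

template <typename... Ls>
constexpr std::size_t element_size_new(const std::tuple<Ls...>& lengths)
{
    return element_size_impl(lengths, std::index_sequence_for<Ls...>{});
}

static_assert(element_size_old(std::make_tuple(2, 3, 4)) == 24);
static_assert(element_size_new(std::make_tuple(2, 3, 4)) == 24);
```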

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
tenpercent force-pushed the mpodkory/recursive-to-pack-expansion branch from f5ada17 to 9942fd6 on January 17, 2026 at 03:51
tenpercent force-pushed the mpodkory/find-transform-optimization branch from 11a8eed to e6040e1 on January 17, 2026 at 03:51
TensorAdaptor has the same InitializeElementSize and
GetTransformAndItsUpperDimension patterns as TensorDescriptor.
Apply the same optimizations:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17 ms eliminated)
tenpercent force-pushed the mpodkory/find-transform-optimization branch from e6040e1 to a565d87 on January 17, 2026 at 05:39