sme dot: size kc for L2 stripe reuse, not L1 per-call stripe #10005
kasper0406 wants to merge 2 commits into google:master
Conversation
Stacked PR on top of #10004
Correctly supply the A stride for transpose_a kernels
schedule_dot's k-block formula previously used block_n in the denominator, which sized kc so that a per-kernel-call stripe (kc × block_n) fit in the given cache. The real reuse pattern is different: inside an outer k iteration the scheduler sweeps all (m, n) tiles, so the cross-m-iteration stripe (kc × n) is what needs to stay resident in cache.
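As a rough sketch of the difference (hypothetical names, not the actual schedule_dot code):

```c++
#include <algorithm>
#include <cstddef>

// Old rule: size kc so that one kernel call's stripe (kc x block_n) fits in
// the cache.
size_t kc_per_call(size_t cache_bytes, size_t block_n, size_t elem_size) {
  return std::max<size_t>(1, cache_bytes / (block_n * elem_size));
}

// Reuse-aware rule: size kc so that the stripe swept across every (m, n) tile
// of one outer k iteration (kc x n) stays resident instead.
size_t kc_cross_m(size_t cache_bytes, size_t n, size_t elem_size) {
  return std::max<size_t>(1, cache_bytes / (n * elem_size));
}
```

Since n is typically much larger than block_n, the reuse-aware rule picks a correspondingly smaller kc for large matrices.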
Force-pushed from 5439074 to c826a71
Thank you for the interesting work. A few comments.

First, a mechanical issue: the work this change is based on was rolled back in #9993. We will almost certainly roll it forward again (possibly with no changes at all), so you'll either need to rebase, or wait until the roll-forward lands.

Second, we are hoping to avoid using real cache sizes and just use a reasonable fixed guess. The reason is that if the cache sizes vary, then our numerical results are not consistent across different hardware, because we will tile K by different sizes depending on the cache size. The rollback of the underlying change is why we want to avoid this: I suspect that change was rolled back due to a test that is overly sensitive to numerical issues (which is also why it might get rolled forward again with no changes). Of course, if using the real cache size is a significant performance improvement over a fixed guess, then we will use the real cache size.

If we can, let's split the cache-size change and the scheduling change into separate PRs, so we can evaluate them separately (this would be preferable anyway). If the current "estimate" in the code performs badly, maybe we can have a different guess for x86 and ARM while still avoiding the real cache size. (There will be differences in numerical results between x86 and ARM anyway, so this wouldn't hurt to do.)
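A minimal sketch of that per-ISA fixed-guess idea; the constant name and the ARM number are purely illustrative, not values anyone has proposed:

```c++
#include <cstddef>

// Hypothetical fixed guesses: hardware-independent within each ISA, so the K
// tiling (and therefore the numerics) stays reproducible across machines of
// that ISA, while still letting ARM and x86 differ.
#if defined(__aarch64__)
constexpr size_t kScheduleCacheGuess = 512 * 1024;  // illustrative ARM guess
#else
constexpr size_t kScheduleCacheGuess = 128 * 1024;  // matches the current default
#endif
```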
Ah, that's good information @dsharlet; I hadn't considered the numerical-consistency impact of different cache-size thresholds. Disclaimer: I used Claude to orchestrate these benchmarks, so take them with a grain of salt; probably worth double-checking on your end before adding the change back.

I made a benchmark sweep (single-threaded, on an M4 chip). The reference 0% is master with the 128 K L2 cache size.
So for this PR to be a net win, the cache_size_l2 used by schedule_dot would need to be significantly larger than the current 128 K default.

I also did a performance sweep of the rolled-back change, so just a heads-up: make sure to do a perf measurement prior to re-introducing it. All benchmarks were run using binaries produced by
Summary
schedule_dot's kc-block formula was sizing the wrong stripe against the wrong cache. It used block_n in the denominator (a per-kernel-call footprint) when the stripe that actually stays hot and is reused is kc × n, and it was matched against a hardcoded 128 KiB L1-style budget instead of the real L2.

This PR switches the formula to size the cross-m-iteration stripe against an L2-scale budget pulled from cpuinfo, splits k evenly when it overflows kc_max, and reserves a small safety headroom so the stripe never lands exactly at cache capacity.

On shared-L2 clusters (M-series P-cluster, big.LITTLE P-cluster) a single-threaded GEMM gets near-full L2, so the cpuinfo per-thread share under-counts the available cache; the budget is max(per_thread × 2, total × 3/4) to handle that. On per-core-L2 systems (almost all x86, most mobile ARM) the first branch always wins and behavior is unchanged.
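A minimal sketch of the budget heuristic, assuming pytorch/cpuinfo's cache API; the body here is illustrative, not the PR's literal get_l2_cache_size():

```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cpuinfo.h>

size_t l2_budget_bytes() {
  // Fall back to the old fixed 128 KiB guess if cpuinfo can't see an L2.
  if (!cpuinfo_initialize() || cpuinfo_get_l2_caches_count() == 0) {
    return 128 * 1024;
  }
  const struct cpuinfo_cache* l2 = cpuinfo_get_l2_cache(0);
  const size_t total = l2->size;
  // Naive per-thread share; on a shared-L2 cluster this under-counts what a
  // single-threaded GEMM can actually use.
  const size_t per_thread =
      total / std::max<uint32_t>(1u, l2->processor_count);
  // On per-core-L2 systems per_thread equals total, so the first branch wins,
  // matching the note above that behavior there is unchanged.
  return std::max(per_thread * 2, total * 3 / 4);
}
```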
Benchmark results

Apple M4, single-threaded
dot_fp32_sme2, square GEMMs. N = 12 samples per cell, randomised interleaved order, 180 s initial cooldown + 10 s between samples. Mean ± stddev GFLOPS.

Geomean: +8.2%. No regressions outside the noise floor.
At n ≥ 7168 the old schedule's fixed kc = 512 stripe overflows the M4 Pro's L2 and drags the old formula into DRAM-bound territory; this PR's stripe stays at ~11.7 MiB across the full range.
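To put numbers on the overflow (assuming fp32, 4 bytes per element): at n = 7168 the old stripe is kc × n × 4 B = 512 × 7168 × 4 B = 14,680,064 B = 14 MiB, comfortably past this PR's ~11.7 MiB budget, whereas the new formula shrinks kc to keep the stripe under that budget.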
Caveats

Tuned for single-threaded latency. On multi-threaded workloads over a physically shared L2 cluster, each thread using 3/4 of the total oversubscribes the cache; the long-term fix is plumbing the active-thread count through from pthreadpool, flagged in get_l2_cache_size().

I didn't test it on anything but an M4 chip.
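A hypothetical shape of that long-term fix (not part of this PR): scale the cluster budget by the thread count pthreadpool reports as active.

```c++
#include <algorithm>
#include <cstddef>

// Divide the shared-cluster L2 budget across the threads actually running the
// GEMM, so concurrent threads stop oversubscribing the cache. Plumbing
// active_threads through from pthreadpool is the part flagged in
// get_l2_cache_size().
size_t l2_budget_per_thread(size_t cluster_budget_bytes,
                            size_t active_threads) {
  return cluster_budget_bytes / std::max<size_t>(1, active_threads);
}
```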