sme dot: size kc for L2 stripe reuse, not L1 per-call stripe #10005
kasper0406 wants to merge 2 commits into google:master
Conversation
Stacked PR on top of #10004
Correctly supply the A stride for transpose_a kernels
schedule_dot's k-block formula previously used block_n in the denominator, which sized kc so that a per-kernel-call stripe (kc × block_n) fit in the given cache. The real reuse pattern is different: inside an outer k iteration the scheduler sweeps all (m, n) tiles, so the cross-m-iteration stripe (kc × n) is what needs to stay resident in cache.
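As a rough sketch of the difference (hypothetical names, not the actual schedule_dot code):

```c++
#include <algorithm>
#include <cstddef>

// Old rule: size kc so that one kernel call's stripe (kc x block_n) fits in
// the cache.
size_t kc_per_call(size_t cache_bytes, size_t block_n, size_t elem_size) {
  return std::max<size_t>(1, cache_bytes / (block_n * elem_size));
}

// Reuse-aware rule: size kc so that the stripe swept across every (m, n) tile
// of one outer k iteration (kc x n) stays resident instead.
size_t kc_cross_m(size_t cache_bytes, size_t n, size_t elem_size) {
  return std::max<size_t>(1, cache_bytes / (n * elem_size));
}
```

Since n is typically much larger than block_n, the reuse-aware rule picks a correspondingly smaller kc for large matrices.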
Force-pushed from 5439074 to c826a71
Thank you for the interesting work. A few comments.

First, a mechanical issue: the work this change is based on was rolled back in #9993. We will almost certainly roll it forward again (possibly with no changes at all), so you'll either need to rebase, or wait until the roll-forward lands.

Second, we are hoping to avoid using real cache sizes and just use a reasonable fixed guess. The reason is that if the cache sizes vary, then our numerical results are not consistent across different hardware, because we will tile K by different sizes depending on the cache size. The rollback of the underlying change is why we want to avoid this: I suspect that change was rolled back due to a test that is overly sensitive to numerical issues (which is also why it might get rolled forward again with no changes). Of course, if using the real cache size is a significant performance improvement over a fixed guess, then we will use the real cache size.

If we can, let's split the cache-size change and the scheduling change into separate PRs, so we can evaluate them separately (this would be preferable anyway). If the current "estimate" in the code performs badly, maybe we can have a different guess for x86 and ARM while still avoiding the real cache size. (There will be differences in numerical results between x86 and ARM anyway, so this wouldn't hurt to do.)
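A minimal sketch of that per-ISA fixed-guess idea; the constant name and the ARM number are purely illustrative, not values anyone has proposed:

```c++
#include <cstddef>

// Hypothetical fixed guesses: hardware-independent within each ISA, so the K
// tiling (and therefore the numerics) stays reproducible across machines of
// that ISA, while still letting ARM and x86 differ.
#if defined(__aarch64__)
constexpr size_t kScheduleCacheGuess = 512 * 1024;  // illustrative ARM guess
#else
constexpr size_t kScheduleCacheGuess = 128 * 1024;  // matches the current default
#endif
```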
Ah, that's good information @dsharlet; I hadn't considered the numerical-consistency impact of different cache-size thresholds. Disclaimer: I used Claude to orchestrate these benchmarks, so take them with a grain of salt; probably worth double-checking on your end before adding the change back.

I made a benchmark sweep (single-threaded, on an M4 chip). The reference 0% is master with the 128 K L2 cache size.
So for this PR to be a net win, the cache_size_l2 used by schedule_dot would need to be significantly larger than the current 128 K default.

I also did a performance sweep of the rolled-back change, so just a heads-up: make sure to do a perf measurement prior to re-introducing it. All benchmarks were run using binaries produced by
Summary
schedule_dot's kc-block formula was sizing the wrong stripe against the wrong cache. It used block_n in the denominator (a per-kernel-call footprint) when the stripe that actually stays hot and is reused is kc × n, and it was matched against a hardcoded 128 KiB L1-style budget instead of the real L2.

This PR switches the formula to size the cross-m-iteration stripe against an L2-scale budget pulled from cpuinfo, splits k evenly when it overflows kc_max, and reserves a small safety headroom so the stripe never lands exactly at cache capacity.

On shared-L2 clusters (M-series P-cluster, big.LITTLE P-cluster) a single-threaded GEMM gets near-full L2, so the cpuinfo per-thread share under-counts the available cache; the budget is max(per_thread × 2, total × 3/4) to handle that. On per-core-L2 systems (almost all x86, most mobile ARM) the first branch always wins and behavior is unchanged.
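A minimal sketch of the budget heuristic, assuming pytorch/cpuinfo's cache API; the body here is illustrative, not the PR's literal get_l2_cache_size():

```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cpuinfo.h>

size_t l2_budget_bytes() {
  // Fall back to the old fixed 128 KiB guess if cpuinfo can't see an L2.
  if (!cpuinfo_initialize() || cpuinfo_get_l2_caches_count() == 0) {
    return 128 * 1024;
  }
  const struct cpuinfo_cache* l2 = cpuinfo_get_l2_cache(0);
  const size_t total = l2->size;
  // Naive per-thread share; on a shared-L2 cluster this under-counts what a
  // single-threaded GEMM can actually use.
  const size_t per_thread =
      total / std::max<uint32_t>(1u, l2->processor_count);
  // On per-core-L2 systems per_thread equals total, so the first branch wins,
  // matching the note above that behavior there is unchanged.
  return std::max(per_thread * 2, total * 3 / 4);
}
```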
Benchmark results

Apple M4, single-threaded
dot_fp32_sme2, square GEMMs. N = 12 samples per cell, randomised interleaved order, 180 s initial cooldown + 10 s between samples. Mean ± stddev GFLOPS.

Geomean: +8.2%. No regressions outside the noise floor.
At n ≥ 7168 the old schedule's fixed kc = 512 stripe overflows the M4 Pro's L2 and drags the old formula into DRAM-bound territory; this PR's stripe stays at ~11.7 MiB across the full range.
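To put numbers on the overflow (assuming fp32, 4 bytes per element): at n = 7168 the old stripe is kc × n × 4 B = 512 × 7168 × 4 B = 14,680,064 B = 14 MiB, comfortably past this PR's ~11.7 MiB budget, whereas the new formula shrinks kc to keep the stripe under that budget.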
Caveats

Tuned for single-threaded latency. On multi-threaded workloads over a physically shared L2 cluster, each thread using 3/4 of the total oversubscribes the cache; the long-term fix is plumbing the active-thread count through from pthreadpool, flagged in get_l2_cache_size().

I didn't test it on anything but an M4 chip.
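A hypothetical shape of that long-term fix (not part of this PR): scale the cluster budget by the thread count pthreadpool reports as active.

```c++
#include <algorithm>
#include <cstddef>

// Divide the shared-cluster L2 budget across the threads actually running the
// GEMM, so concurrent threads stop oversubscribing the cache. Plumbing
// active_threads through from pthreadpool is the part flagged in
// get_l2_cache_size().
size_t l2_budget_per_thread(size_t cluster_budget_bytes,
                            size_t active_threads) {
  return cluster_budget_bytes / std::max<size_t>(1, active_threads);
}
```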