perf: Sumcheck, GKR & WHIR proving optimizations by Barnadrot · Pull Request #234 · leanEthereum/leanMultisig

Barnadrot · 2026-05-25T19:00:48Z

Summary

Architecture-independent prover optimizations that improve end-to-end proving time by 10% on Hetzner (x86 AVX-512) and 7.4% on M4 Mac (ARM NEON).

Benchmarks

Cross-platform benchmarks (1550 XMSS signatures, log_inv_rate=1, warm average of 8 proofs):

	Hetzner time (s)	Hetzner XMSS/s	M4 Mac time (s)	M4 Mac XMSS/s
`main` (`2c70162`)	2.201	704	2.290	677
This PR (`74f9e05`)	1.980	783	2.121	731
Delta	-10.0%	+11.2%	-7.4%	+8.0%

Hetzner AX42-U - 64GB RAM - AMD Ryzen 7 PRO 8700GE
M4 Mac Mini - 32 GB RAM

Changes

Parallelize GKR phase2 sumcheck — the pair_coeffs inner loop was serial over up to 4M pairs. Parallelized via par_iter + reduce above a threshold. Replaced per-round eval_eq recomputation with incremental pairwise-addition fold.
Fused dual-eq materialization in WHIR — new compute_eval_eq_packed_dual builds two eq-polynomials in a single pass when the first two WHIR opening statements are full-domain. Removes one full eq-table traversal. Generalizes the base×ext sumcheck product path (removes hardcoded DIMENSION == 5 dispatch).
WHIR initial folding factor 7→8 — increases first FRI fold to consume 8 evaluation variables per round, reducing one commitment round. Recursion circuit updated to handle num_chunks=32.
Fix GKR pivot computation — corrected min → max for bytecode section log height, removing suboptimal pivot selection that caused unnecessary GKR work.
Remove 8 dead Poseidon outputs_right columns — the right half of permutation output was committed but never verified. Removes the columns, their 8 AIR constraints (99→91), and halves the result memory lookup (16→8 values). ~350MB RSS reduction.
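The threshold-gated parallel reduction from the first item can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the PR uses `par_iter` + `reduce`, whereas this sketch swaps in `std::thread::scope`. The modulus `P`, the value of `PARALLEL_THRESHOLD`, and the sum-of-products workload are assumptions standing in for the real `pair_coeffs` computation over extension-field elements.

```rust
use std::thread;

/// Illustrative prime modulus standing in for the prover's field (assumption).
const P: u64 = 2_130_706_433;
/// Hypothetical cutoff; the PR's actual PARALLEL_THRESHOLD value is not shown.
const PARALLEL_THRESHOLD: usize = 1 << 12;

/// Serial baseline: sum of a*b over all pairs, reduced mod P.
fn serial_sum(pairs: &[(u64, u64)]) -> u64 {
    pairs
        .iter()
        .fold(0, |acc, &(a, b)| (acc + (a % P) * (b % P) % P) % P)
}

/// Same result as `serial_sum`. Inputs above the threshold are split into
/// one chunk per available thread, and the partial sums are reduced at the end.
fn parallel_sum(pairs: &[(u64, u64)]) -> u64 {
    if pairs.len() <= PARALLEL_THRESHOLD {
        // Small inputs: thread start-up would cost more than it saves.
        return serial_sum(pairs);
    }
    let n_threads = thread::available_parallelism().map_or(1, |n| n.get());
    let chunk_len = (pairs.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = pairs
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || serial_sum(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .fold(0, |acc, part| (acc + part) % P)
    })
}
```

Because modular addition is associative and commutative, the chunked result matches the serial one regardless of how the input is split.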

Tradeoffs

Proof size: +15.4% (344 KiB → 397 KiB). Primarily from the WHIR
initial folding factor 7→8 (larger first-round polynomial), partially
offset by 8 fewer committed Poseidon columns.

Test plan

cargo test --release --test test_multisignatures (4 tests) — pass on both platforms
cargo test --release --test test_zk_alloc (1 test) — pass on both platforms
Paired A/B wall-clock benchmarks on Hetzner and M4 Mac Mini
All changes are architecture-independent (no platform-specific code paths)

The GKR phase2 pair_coeffs loop was completely sequential, processing up to 4M pairs on a single thread for the initial layer. This change:
1. Replaces the sequential for-loop with par_iter + parallel reduction (gated by PARALLEL_THRESHOLD for small arrays)
2. Replaces per-round eval_eq recomputation with incremental pairwise addition fold, saving O(2^k) ext-field muls per round
Measured: -14.1% e2e (phase2 handles ~70% of GKR layer proving).
Origin: blake3-autoresearch h19 (fc7cd33), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
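One reading of the "incremental pairwise addition fold" rests on the identity (1 - r) + r = 1: summing the two eq-table entries that differ only in one variable yields the eq table over the remaining variables, with no multiplications. The sketch below checks that identity over an illustrative prime field. The modulus, the function names, and the variable ordering (last challenge indexes adjacent entries) are assumptions; the PR's actual code works over an extension field and is not shown in the source.

```rust
/// Illustrative prime modulus standing in for the prover's field (assumption).
const P: u64 = 2_130_706_433;

fn add(a: u64, b: u64) -> u64 {
    (a + b) % P
}
fn sub(a: u64, b: u64) -> u64 {
    (a + P - b) % P
}
fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}

/// Full eq table for challenges `r`: entry x is the product over i of
/// (r_i if bit i of x is set, else 1 - r_i). The last challenge ends up
/// as the least significant bit, so its two branches are adjacent.
fn eval_eq(r: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &ri in r {
        let mut next = Vec::with_capacity(table.len() * 2);
        for &e in &table {
            next.push(mul(e, sub(1, ri)));
            next.push(mul(e, ri));
        }
        table = next;
    }
    table
}

/// Drops the last variable by adding adjacent entries:
/// e*(1 - r) + e*r = e, so no field multiplication is needed.
fn fold_eq_by_addition(table: &[u64]) -> Vec<u64> {
    table.chunks_exact(2).map(|p| add(p[0], p[1])).collect()
}
```

Folding the table for `[r0, r1, r2]` this way gives exactly the table for `[r0, r1]`, which is what a from-scratch `eval_eq` call on the shorter challenge list would have produced at the cost of one multiplication per entry.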

- Fused dual-point eq computation: process both full-domain eq polynomials in single recursive pass, eliminates 1.28GB DRAM round-trip (h42b)
- Packed SIMD first-round product sumcheck (h68)
- combine_statement zero_vec skip: uninitialized buffer + STORE path when first OOD statement covers full array (h37)
Measured: -2.54% (h42b), -1.18% (h68), combined ~-3.7%.
Origin: pw5-clean (3e95117), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
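A rough sketch of what fusing two eq materializations into one pass could look like is given below. It is scalar and unpacked, so it does not reflect the SIMD layout of `compute_eval_eq_packed_dual`; the modulus, function names, and scaling factors `s1`/`s2` are all hypothetical. The point is only that a single recursive traversal can store `s1 * eq(r1, x) + s2 * eq(r2, x)` directly, so the output buffer is written once and never needs to be zeroed first.

```rust
/// Illustrative prime modulus standing in for the prover's field (assumption).
const P: u64 = 2_130_706_433;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}
fn one_minus(a: u64) -> u64 {
    (1 + P - a) % P
}

/// Reference path: out[x] += scale * eq(r, x). One full pass over `out`
/// per point, and `out` must start zeroed.
fn add_eq(r: &[u64], scale: u64, out: &mut [u64]) {
    assert_eq!(out.len(), 1 << r.len());
    match r.split_first() {
        None => out[0] = (out[0] + scale) % P,
        Some((&r0, rest)) => {
            let (lo, hi) = out.split_at_mut(out.len() / 2);
            add_eq(rest, mul(scale, one_minus(r0)), lo);
            add_eq(rest, mul(scale, r0), hi);
        }
    }
}

/// Fused path: out[x] = s1 * eq(r1, x) + s2 * eq(r2, x), stored in a single
/// pass. Previous contents of `out` are overwritten, not accumulated into.
fn store_dual_eq(r1: &[u64], r2: &[u64], s1: u64, s2: u64, out: &mut [u64]) {
    assert_eq!(r1.len(), r2.len());
    assert_eq!(out.len(), 1 << r1.len());
    match (r1.split_first(), r2.split_first()) {
        (Some((&a, rest1)), Some((&b, rest2))) => {
            let (lo, hi) = out.split_at_mut(out.len() / 2);
            store_dual_eq(rest1, rest2, mul(s1, one_minus(a)), mul(s2, one_minus(b)), lo);
            store_dual_eq(rest1, rest2, mul(s1, a), mul(s2, b), hi);
        }
        _ => out[0] = (s1 + s2) % P,
    }
}
```

Calling `add_eq` twice touches every entry of the buffer twice; `store_dual_eq` touches each entry once, which is the kind of saved memory traffic the commit message describes.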

Doubles the initial WHIR folding width (2^7=128 → 2^8=256 evaluation points per fold), which halves FFT rows (2^20 → 2^19), halves Merkle tree leaves, and eliminates one subsequent WHIR round (3 → 2).
Adds num_chunks=32 support to decompose_and_verify_merkle_batch in the recursion circuit verifier (256/8=32 chunks per Merkle leaf).
Measured: -2.73% e2e.
Origin: blake3-autoresearch h14v2 (3aab3db), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
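The shape arithmetic in this commit message can be reproduced with a few shifts. In the sketch below, `LOG_TABLE_SIZE = 27` is inferred from the quoted figures (2^20 rows of 2^7 points, or 2^19 rows of 2^8 points) and `CHUNK_WIDTH = 8` from the "256/8=32" expression; the struct and function names are hypothetical.

```rust
/// log2 of the total number of evaluation points, inferred from
/// 2^20 rows * 2^7 points per row (assumption).
const LOG_TABLE_SIZE: u32 = 27;
/// Chunk width implied by the "256/8=32" arithmetic in the commit message.
const CHUNK_WIDTH: usize = 8;

/// Shape of the first fold for a given initial folding factor.
struct FoldShape {
    /// Evaluation points per Merkle leaf (2^folding_factor).
    leaf_width: usize,
    /// Number of FFT rows, which is also the number of Merkle leaves.
    rows: usize,
    /// Number of CHUNK_WIDTH-sized chunks in one leaf.
    chunks_per_leaf: usize,
}

fn fold_shape(folding_factor: u32) -> FoldShape {
    let leaf_width = 1usize << folding_factor;
    FoldShape {
        leaf_width,
        rows: 1usize << (LOG_TABLE_SIZE - folding_factor),
        chunks_per_leaf: leaf_width / CHUNK_WIDTH,
    }
}
```

With a folding factor of 7 this gives 128-wide leaves, 2^20 rows, and 16 chunks per leaf; with 8 it gives 256-wide leaves, 2^19 rows, and 32 chunks, which is why the recursion verifier needed the new `num_chunks=32` case.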

Fix min_section_log calculation to use .max() before .min(), ensuring the bytecode section doesn't unnecessarily pull down the GKR pivot. Also updates the surface assertion for WHIR folding factor 8. The MIN_LOG_N_ROWS_PER_TABLE stays at 8 (no blake3 small table to pad).
Measured: -8.1% e2e on blake3 branch (primarily from the pivot fix enabling the ENDIANNESS_PIVOT_GKR=12 fast path).
Origin: blake3-autoresearch h24 (bc3bd7e), logup fix independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Compilation retry messages should go to stderr, not stdout, to avoid polluting JSON benchmark output.
Origin: blake3-autoresearch (89a8b61), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
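The separation described here can be illustrated with generic writers, which also makes it testable. This is a sketch only: the function name, the retry message text, and the `proving_time_s` JSON field are hypothetical and do not come from the repository.

```rust
use std::io::{self, Write};

/// Writes diagnostics to `err` and the machine-readable result to `out`,
/// so a consumer parsing `out` as JSON never sees the retry messages.
fn report<O: Write, E: Write>(
    out: &mut O,
    err: &mut E,
    retries: u32,
    seconds: f64,
) -> io::Result<()> {
    for attempt in 1..=retries {
        // Human-oriented progress message: diagnostics stream only.
        writeln!(err, "compilation retry {attempt}")?;
    }
    // Benchmark result: the only thing written to the output stream.
    writeln!(out, "{{\"proving_time_s\": {seconds}}}")
}
```

In a binary, the two writers would be `io::stdout()` and `io::stderr()`; in the simplest case the same effect comes from using `eprintln!` instead of `println!` for the retry message.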

Remove outputs_right from Poseidon1Cols16 struct, reducing committed columns by 8. The right half of the permutation output is no longer committed or looked up.
Changes:
- Poseidon1Cols16: removed outputs_right field (-8 columns)
- bus_interactions: result lookup reduced from DIGEST_LEN*2 to DIGEST_LEN
- eval: removed 8 flag_permute*(state[i+8]-outputs_right[i]) constraints
- n_constraints: 99 → 91
- trace_gen: removed outputs_right generation
- trace override: simplified to only handle half_output for outputs_left
The lookup now writes only outputs_left (8 values) to memory.
For permute rows: outputs_left = state (matches permuted output in memory).
For compression rows: outputs_left = state + input (matches output in memory).
For half_output rows: outputs_left[4..7] overridden with memory values.
ALL 5 TESTS PASS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TomWambsgans · 2026-05-25T22:42:23Z

Hi, thanks for the PR! About the removal of 8 Poseidon columns: I think we cannot do this, we need to commit to them for soundness (the AIR constraints are there, and there is the lookup). About the folding factor of 8, I will come back to it later. About all the rest (with the 2 points mentioned above removed): I tried to replicate it (see #235). TL;DR: close to neutral on M4 Max, very nice perf improvement on the Hetzner server. I will continue working on this tomorrow. This is something good to merge, I just want to understand why the Mac doesn't improve (and ensure it's not slowed down). Will keep you updated.

Mac M4 Max BEFORE:

                                                    time      size       ± %      cycles      memory   poseidons  extension-ops
             ┌──▸ ◇ 1550 R=1/2 ·· ▸ 1293 XMSS/s - 1.199s   344 KiB    ± 0.9%     993,664   3,898,904     259,056      30,602
             ├──▸ ◇ 508 R=1/4 ··· ▸  656 XMSS/s - 0.775s   208 KiB    ± 1.1%     326,333   1,283,093      85,042      10,068
        ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.570s   197 KiB    ± 0.5%     255,888     757,341      28,196     102,605
        │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  937 XMSS/s - 1.654s   230 KiB    ± 0.7%     993,646   3,898,904     259,056      30,619
        │    ├──▸ ◇ 508 R=1/4 ··· ▸  648 XMSS/s - 0.784s   207 KiB    ± 0.4%     326,383   1,283,093      85,042      10,118
        ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.538s   196 KiB    ± 1.2%     225,020     649,967      23,018      82,584
   ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.449s   296 KiB    ± 1.7%     258,948     746,312      29,495      78,816
   │    ┌──▸ ◇ 775 R=1/4 ········ ▸  913 XMSS/s - 0.848s   208 KiB    ± 0.9%     497,406   1,953,378     129,631      15,361
   │    ├──▸ ◇ 775 R=1/4 ········ ▸  912 XMSS/s - 0.850s   209 KiB    ± 0.9%     497,451   1,953,378     129,631      15,363
   ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.535s   197 KiB    ± 1.4%     216,475     617,386      21,172      88,208

┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               0.804s   198 KiB    ± 0.4%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               0.626s   131 KiB    ± 1.2%      96,949     325,015       9,164      44,263

Mac M4 Max AFTER:

             ┌──▸ ◇ 1550 R=1/2 ·· ▸ 1286 XMSS/s - 1.205s   345 KiB    ± 1.0%     993,664   3,898,904     259,056      30,602
             ├──▸ ◇ 508 R=1/4 ··· ▸  651 XMSS/s - 0.780s   208 KiB    ± 1.0%     326,333   1,283,093      85,042      10,068
        ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.567s   196 KiB    ± 0.6%     255,888     757,341      28,196     102,605
        │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  939 XMSS/s - 1.650s   229 KiB    ± 1.1%     993,646   3,898,904     259,056      30,619
        │    ├──▸ ◇ 508 R=1/4 ··· ▸  651 XMSS/s - 0.780s   208 KiB    ± 0.9%     326,383   1,283,093      85,042      10,118
        ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.538s   196 KiB    ± 1.6%     225,020     649,967      23,018      82,584
   ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.452s   296 KiB    ± 1.8%     258,948     746,312      29,495      78,816
   │    ┌──▸ ◇ 775 R=1/4 ········ ▸  919 XMSS/s - 0.843s   208 KiB    ± 0.5%     497,406   1,953,378     129,631      15,361
   │    ├──▸ ◇ 775 R=1/4 ········ ▸  916 XMSS/s - 0.846s   209 KiB    ± 1.2%     497,451   1,953,378     129,631      15,363
   ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.526s   196 KiB    ± 0.7%     216,475     617,386      21,172      88,208

┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               0.802s   197 KiB    ± 1.3%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               0.629s   131 KiB    ± 0.5%      96,949     325,015       9,164      44,263

Hetzner AX42U BEFORE:

Aggregation program: 215,624 instructions

                                                    time      size       ± %      cycles      memory   poseidons  extension-ops
             ┌──▸ ◇ 1550 R=1/2 ·· ▸  535 XMSS/s - 2.895s   345 KiB    ± 1.5%     993,664   3,898,904     259,056      30,602
             ├──▸ ◇ 508 R=1/4 ··· ▸  289 XMSS/s - 1.759s   209 KiB    ± 1.1%     326,333   1,283,093      85,042      10,068
        ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.988s   196 KiB    ± 0.3%     255,888     757,341      28,196     102,605
        │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  377 XMSS/s - 4.107s   229 KiB    ± 2.1%     993,646   3,898,904     259,056      30,619
        │    ├──▸ ◇ 508 R=1/4 ··· ▸  292 XMSS/s - 1.741s   209 KiB    ± 0.7%     326,383   1,283,093      85,042      10,118
        ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.947s   196 KiB    ± 0.2%     225,020     649,967      23,018      82,584
   ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.785s   296 KiB    ± 1.0%     258,948     746,312      29,495      78,816
   │    ┌──▸ ◇ 775 R=1/4 ········ ▸  424 XMSS/s - 1.829s   209 KiB    ± 0.9%     497,406   1,953,378     129,631      15,361
   │    ├──▸ ◇ 775 R=1/4 ········ ▸  424 XMSS/s - 1.828s   208 KiB    ± 0.5%     497,451   1,953,378     129,631      15,363
   ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.940s   195 KiB    ± 1.3%     216,475     617,386      21,172      88,208

┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               1.631s   198 KiB    ± 1.3%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               1.357s   131 KiB    ± 0.8%      96,949     325,015       9,164      44,263

Hetzner AX42U AFTER:

                                                    time      size       ± %      cycles      memory   poseidons  extension-ops
             ┌──▸ ◇ 1550 R=1/2 ·· ▸  587 XMSS/s - 2.640s   343 KiB    ± 0.5%     993,664   3,898,904     259,056      30,602
             ├──▸ ◇ 508 R=1/4 ··· ▸  322 XMSS/s - 1.576s   209 KiB    ± 0.4%     326,333   1,283,093      85,042      10,068
        ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.938s   195 KiB    ± 0.4%     255,888     757,341      28,196     102,605
        │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  387 XMSS/s - 4.000s   230 KiB    ± 1.0%     993,646   3,898,904     259,056      30,619
        │    ├──▸ ◇ 508 R=1/4 ··· ▸  323 XMSS/s - 1.574s   209 KiB    ± 0.6%     326,383   1,283,093      85,042      10,118
        ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.884s   197 KiB    ± 0.5%     225,020     649,967      23,018      82,584
   ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.716s   296 KiB    ± 0.3%     258,948     746,312      29,495      78,816
   │    ┌──▸ ◇ 775 R=1/4 ········ ▸  479 XMSS/s - 1.619s   208 KiB    ± 0.8%     497,406   1,953,378     129,631      15,361
   │    ├──▸ ◇ 775 R=1/4 ········ ▸  479 XMSS/s - 1.619s   207 KiB    ± 0.7%     497,451   1,953,378     129,631      15,363
   ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.882s   195 KiB    ± 0.6%     216,475     617,386      21,172      88,208

┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               1.503s   198 KiB    ± 0.5%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               1.328s   130 KiB    ± 1.7%      96,949     325,015       9,164      44,263

TomWambsgans · 2026-05-25T22:55:53Z

About initial folding factor = 8: it indeed makes the prover a bit faster, at the cost of bigger proofs. Recursion: cycles are decreased, poseidon + extension ops increased. Need to think about this. My initial intuition is that it's not worth it, but not totally sure.

Barnadrot · 2026-05-26T07:26:26Z

Agree with you regarding the Poseidon change; that was an incorrect cherry-pick from the larger blake3 branch. It was architecture dependent.

Ran it on the M4 Mini and I am also getting a neutral effect, at -0.3%.

Re folding factor: this was my formula earlier; it was removed from this work due to its characteristics. It could be used to evaluate the perf vs. size tradeoff:

Any change that increases proof size is scored with: net = throughput_pct - 3 × proof_size_increase_pct. Hard ceiling at +20% proof size.
This would clearly favor PR235
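As a worked instance of the scoring rule quoted above, the sketch below encodes it as a function. The names are hypothetical, and the handling of changes that do not grow the proof (scored on throughput alone) is an assumption, since the rule as stated only covers size increases.

```rust
/// Weight applied to the proof-size increase, from the quoted formula.
const SIZE_PENALTY: f64 = 3.0;
/// Hard ceiling on proof-size growth, in percent, from the quoted formula.
const MAX_SIZE_INCREASE_PCT: f64 = 20.0;

/// Returns `None` when the change exceeds the size ceiling, otherwise
/// net = throughput_pct - 3 * proof_size_increase_pct.
fn net_score(throughput_pct: f64, proof_size_increase_pct: f64) -> Option<f64> {
    if proof_size_increase_pct > MAX_SIZE_INCREASE_PCT {
        return None;
    }
    // Assumption: a change that does not grow the proof incurs no penalty.
    let penalty = SIZE_PENALTY * proof_size_increase_pct.max(0.0);
    Some(throughput_pct - penalty)
}
```

Plugging in this PR's Hetzner figures from the description (+11.2% throughput, +15.4% proof size) gives 11.2 - 3 × 15.4 = -35.0, whereas a change with no proof-size growth keeps its full throughput gain as its score.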

* attempt to replicate parts of #234
Co-Authored-By: Barnadrot <kbarna.drot@gmail.com>
* undo GKR pivot change
* simplify sumcheck_utils
* simplify open.rs
* fmt
* > instead of >= for parallel threshold in sumcheck_utils.rs (faster on M4 Max, slightly slower on ax42u)
---------
Co-authored-by: Tom Wambsgans <TomWambsgans@users.noreply.github.com>
Co-authored-by: Barnadrot <kbarna.drot@gmail.com>

TomWambsgans · 2026-05-26T12:58:59Z

Thanks again.
e45a0ed

TomWambsgans · 2026-05-26T13:19:19Z

About initial folding factor of 8:

BEFORE (7)

                                        time      size      cycles      memory   poseidons  extension-ops
  ┌──▸ ◇ 775 R=1/4 ·· ▸  909 XMSS/s - 0.853s   208 KiB     497,410   1,953,378     129,631      15,370
  ├──▸ ◇ 775 R=1/4 ·· ▸  907 XMSS/s - 0.854s   208 KiB     497,265   1,953,378     129,631      15,296
  ◆ 1550 R=1/4 ······ ▸               0.541s   196 KiB     216,455     617,276      21,175      88,2

AFTER (8)

                                        time      size      cycles      memory   poseidons  extension-ops
  ┌──▸ ◇ 775 R=1/4 ·· ▸  926 XMSS/s - 0.837s   240 KiB     497,410   1,953,378     129,631      15,370
  ├──▸ ◇ 775 R=1/4 ·· ▸  937 XMSS/s - 0.827s   241 KiB     497,265   1,953,378     129,631      15,296
  ◆ 1550 R=1/4 ······ ▸               0.552s   229 KiB     205,749     642,462      23,525     108,905

About recursion in particular:

size: 196 → 229 KiB (+16.84%)
cycles: 216,455 → 205,749 (−4.95%)
memory: 617,276 → 642,462 (+4.08%)
poseidons: 21,175 → 23,525 (+11.10%)
ext-ops: 88,211 → 108,905 (+23.46%)

The +23.46% in extension-ops is non-negligible. It may not directly show up in the measured throughput, since the count pads to the same power of 2 (2^17) before and after, but this may not hold in the future.
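The percentages and the padding remark can be checked with a short calculation. The helper names below are hypothetical, and the sketch assumes that "padding" means rounding the operation count up to the next power of two.

```rust
/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Table length after padding, assuming padding rounds up to a power of two.
fn padded_len(n: usize) -> usize {
    n.next_power_of_two()
}
```

For the extension-ops counts, `pct_change(88_211.0, 108_905.0)` is about 23.46, yet `padded_len(88_211)` and `padded_len(108_905)` are both 131,072 = 2^17, so the padded size is unchanged. The headroom is limited, though: any count above 131,072 would pad to 2^18.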

My opinion: we should keep 7.

What do you think? If you agree, I will close this PR (now that e45a0ed has been merged)

Barnadrot · 2026-05-26T13:24:00Z

Agree, closing this PR in favor of the already merged #235

Barnadrot and others added 7 commits May 24, 2026 19:48

style: fix rustfmt and remove dead test module reference

ec0d7a7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Barnadrot changed the title ~~perf: prover optimizations (-10% Hetzner, -7.4% M4 Mac)~~ perf: prover optimizations May 25, 2026

Barnadrot changed the title ~~perf: prover optimizations~~ perf: Sumcheck, GKR & WHIR proving optimizations May 25, 2026

Barnadrot closed this May 26, 2026

Barnadrot wants to merge 7 commits into leanEthereum:main from Barnadrot:non-blake3-dependent-optimizations
