perf: Sumcheck, GKR & WHIR proving optimizations#234

Closed
Barnadrot wants to merge 7 commits into leanEthereum:main from Barnadrot:non-blake3-dependent-optimizations

Conversation

@Barnadrot
Contributor

@Barnadrot Barnadrot commented May 25, 2026

Summary

Architecture-independent prover optimizations that improve end-to-end proving time by 10% on Hetzner (x86 AVX-512) and 7.4% on M4 Mac (ARM NEON).

Benchmarks

Cross-platform benchmarks (1550 XMSS signatures, log_inv_rate=1, warm average of 8 proofs):

                    Hetzner (s)   XMSS/s   M4 Mac (s)   XMSS/s
main (2c70162)            2.201      704        2.290      677
This PR (74f9e05)         1.980      783        2.121      731
Delta                    -10.0%   +11.2%        -7.4%    +8.0%

Hetzner AX42-U - 64GB RAM - AMD Ryzen 7 PRO 8700GE
M4 Mac Mini - 32 GB RAM
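
The throughput and delta figures in the table follow directly from the wall-clock times. As a small sanity check (a sketch only; the helper names are made up for illustration and are not part of the benchmark harness), they can be recomputed like this:

```rust
/// Signatures proven per second when `n_sigs` signatures take `secs` seconds.
fn xmss_per_sec(n_sigs: u32, secs: f64) -> f64 {
    n_sigs as f64 / secs
}

/// Relative change from `before` to `after`, in percent.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // (platform, time on main, time on this PR), copied from the table above.
    let rows = [("Hetzner", 2.201, 1.980), ("M4 Mac", 2.290, 2.121)];
    for (name, before, after) in rows {
        let tp_before = xmss_per_sec(1550, before);
        let tp_after = xmss_per_sec(1550, after);
        println!(
            "{name}: time {:+.1}%, throughput {:.0} -> {:.0} XMSS/s ({:+.1}%)",
            pct_change(before, after),
            tp_before,
            tp_after,
            pct_change(tp_before, tp_after),
        );
    }
}
```

Note that the throughput gain is larger than the time reduction in absolute terms because throughput is the reciprocal of time.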

Changes

  1. Parallelize GKR phase2 sumcheck — the pair_coeffs inner loop was serial over up to 4M pairs. Parallelized via par_iter + reduce above a threshold. Replaced per-round eval_eq recomputation with incremental pairwise-addition fold.
  2. Fused dual-eq materialization in WHIR — new compute_eval_eq_packed_dual builds two eq-polynomials in a single pass when the first two WHIR opening statements are full-domain. Removes one full eq-table traversal. Generalizes the base×ext sumcheck product path (removes hardcoded DIMENSION == 5 dispatch).
  3. WHIR initial folding factor 7→8 — increases first FRI fold to consume 8 evaluation variables per round, reducing one commitment round. Recursion circuit updated to handle num_chunks=32.
  4. Fix GKR pivot computation — corrected minmax for bytecode section log height, removing suboptimal pivot selection that caused unnecessary GKR work.
  5. Remove 8 dead Poseidon outputs_right columns — the right half of permutation output was committed but never verified. Removes the columns, their 8 AIR constraints (99→91), and halves the result memory lookup (16→8 values). ~350MB RSS reduction.
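
One plausible reading of the "incremental pairwise-addition fold" in item 1: summing an eq table over one Boolean variable yields the eq table over the remaining variables, because `r_i + (1 - r_i) = 1`. The sketch below checks that identity over a toy prime field. The modulus, the function names and the variable ordering are assumptions for illustration, not the PR's actual code.

```rust
// Toy prime modulus (2^31 - 1); an assumption, not the field used by the prover.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Table of eq(r, x) for all x in {0,1}^k. The last coordinate of `r`
/// corresponds to the least significant bit of the table index.
fn eval_eq(r: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &ri in r {
        let mut next = Vec::with_capacity(table.len() * 2);
        for &t in &table {
            next.push(mul(t, sub(1, ri))); // x_i = 0
            next.push(mul(t, ri)); // x_i = 1
        }
        table = next;
    }
    table
}

/// Sums adjacent entries, which removes the last variable using additions only.
fn fold_pairs(table: &[u64]) -> Vec<u64> {
    table.chunks_exact(2).map(|c| add(c[0], c[1])).collect()
}

fn main() {
    let r = [3, 1_000_003, 42, 123_456_789];
    let mut table = eval_eq(&r);
    // After each fold the table must equal a from-scratch table on one fewer variable.
    for k in (0..r.len()).rev() {
        table = fold_pairs(&table);
        assert_eq!(table, eval_eq(&r[..k]));
    }
    println!("incremental fold matches recomputation in every round");
}
```

Recomputing a table of size 2^k costs on the order of 2^k multiplications, while the fold needs only 2^(k-1) additions, which is consistent with the saving described in the commit message.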

Tradeoffs

  • Proof size: +15.4% (344 KiB → 397 KiB). Primarily from the WHIR
    initial folding factor 7→8 (larger first-round polynomial), partially
    offset by 8 fewer committed Poseidon columns.

Test plan

  • cargo test --release --test test_multisignatures (4 tests) — pass on both platforms
  • cargo test --release --test test_zk_alloc (1 test) — pass on both platforms
  • Paired A/B wall-clock benchmarks on Hetzner and M4 Mac Mini
  • All changes are architecture-independent (no platform-specific code paths)

Barnadrot and others added 7 commits May 24, 2026 19:48
The GKR phase2 pair_coeffs loop was completely sequential — processing up
to 4M pairs on a single thread for the initial layer. This change:

1. Replaces the sequential for-loop with par_iter + parallel reduction
   (gated by PARALLEL_THRESHOLD for small arrays)
2. Replaces per-round eval_eq recomputation with incremental pairwise
   addition fold, saving O(2^k) ext-field muls per round

Measured: -14.1% e2e (phase2 handles ~70% of GKR layer proving).
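
A minimal sketch of a threshold-gated parallel reduction, using only the standard library. The PR itself uses `par_iter` + `reduce`; scoped threads are swapped in here so the example has no external dependencies. The threshold value and the per-pair work (`pair_term`) are placeholders, not the real `pair_coeffs` logic.

```rust
use std::thread;

/// Hypothetical cutoff: at or below it the serial loop avoids thread start-up cost.
const PARALLEL_THRESHOLD: usize = 1 << 14;

/// Stand-in for the real per-pair coefficient computation.
fn pair_term(lo: u64, hi: u64) -> u64 {
    lo.wrapping_mul(hi)
}

fn sum_pairs_serial(pairs: &[(u64, u64)]) -> u64 {
    pairs
        .iter()
        .fold(0u64, |acc, &(lo, hi)| acc.wrapping_add(pair_term(lo, hi)))
}

/// Splits the input into one chunk per thread, reduces each chunk serially,
/// then combines the partial sums. Small inputs stay on the calling thread.
fn sum_pairs(pairs: &[(u64, u64)], n_threads: usize) -> u64 {
    if pairs.len() <= PARALLEL_THRESHOLD || n_threads <= 1 {
        return sum_pairs_serial(pairs);
    }
    let chunk_len = pairs.len().div_ceil(n_threads);
    thread::scope(|s| {
        let handles: Vec<_> = pairs
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || sum_pairs_serial(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .fold(0u64, u64::wrapping_add)
    })
}

fn main() {
    let pairs: Vec<(u64, u64)> = (0..100_000u64).map(|i| (i, i + 1)).collect();
    let n_threads = thread::available_parallelism().map_or(1, |n| n.get());
    assert_eq!(sum_pairs(&pairs, n_threads), sum_pairs_serial(&pairs));
    println!("parallel and serial sums agree");
}
```

The reduction is only valid because the combining operation is associative and commutative; field addition has the same property, so partial sums can be merged in any order.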

Origin: blake3-autoresearch h19 (fc7cd33), independent of blake3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fused dual-point eq computation: process both full-domain eq polynomials
  in single recursive pass, eliminates 1.28GB DRAM round-trip (h42b)
- Packed SIMD first-round product sumcheck (h68)
- combine_statement zero_vec skip: uninitialized buffer + STORE path
  when first OOD statement covers full array (h37)

Measured: -2.54% (h42b), -1.18% (h68), combined ~-3.7%.
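
To illustrate the fused dual-point idea, here is a rough sketch that extends two eq tables inside the same loop and checks the result against two separate builds. It is not `compute_eval_eq_packed_dual`: there is no SIMD packing, the field is a toy prime, and all names are hypothetical.

```rust
// Toy prime modulus (2^31 - 1); an assumption, not the field used by the prover.
const P: u64 = 2_147_483_647;

fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Reference: one eq table, built on its own.
fn eval_eq(r: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &ri in r {
        table = table
            .iter()
            .flat_map(|&t| [mul(t, sub(1, ri)), mul(t, ri)])
            .collect();
    }
    table
}

/// Builds eq(a, .) and eq(b, .) together: every inner iteration extends
/// both tables, so the index space is walked once instead of twice.
fn eval_eq_dual(a: &[u64], b: &[u64]) -> (Vec<u64>, Vec<u64>) {
    assert_eq!(a.len(), b.len());
    let n = 1usize << a.len();
    let (mut ta, mut tb) = (vec![0u64; n], vec![0u64; n]);
    ta[0] = 1;
    tb[0] = 1;
    let mut len = 1;
    for (&ai, &bi) in a.iter().zip(b) {
        // Walk backwards so in-place writes never overwrite an unread entry.
        for j in (0..len).rev() {
            let (xa, xb) = (ta[j], tb[j]);
            ta[2 * j + 1] = mul(xa, ai);
            ta[2 * j] = mul(xa, sub(1, ai));
            tb[2 * j + 1] = mul(xb, bi);
            tb[2 * j] = mul(xb, sub(1, bi));
        }
        len *= 2;
    }
    (ta, tb)
}

fn main() {
    let (a, b) = ([3u64, 5, 7, 9], [11u64, 13, 17, 19]);
    let (ta, tb) = eval_eq_dual(&a, &b);
    assert_eq!(ta, eval_eq(&a));
    assert_eq!(tb, eval_eq(&b));
    println!("fused pass matches two separate passes");
}
```

The arithmetic cost is unchanged; any gain from fusing would come from memory traffic, which matches the commit's framing of the saving as a DRAM round-trip.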

Origin: pw5-clean (3e95117), independent of blake3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Doubles the initial WHIR folding width (2^7=128 → 2^8=256 evaluation
points per fold), which halves FFT rows (2^20 → 2^19), halves Merkle
tree leaves, and eliminates one subsequent WHIR round (3 → 2).

Adds num_chunks=32 support to decompose_and_verify_merkle_batch in the
recursion circuit verifier (256/8=32 chunks per Merkle leaf).
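
The size arithmetic in this commit message can be reproduced in a few lines. The sketch assumes a total of 2^27 evaluations (2^20 rows times 2^7 values per row, as implied above) and a chunk length of 8 values; the function names are illustrative only.

```rust
/// Rows of the folded evaluation matrix: domain size divided by the fold width.
fn fft_rows(log_domain: u32, folding_factor: u32) -> u64 {
    1u64 << (log_domain - folding_factor)
}

/// Chunks per Merkle leaf when a leaf holds 2^folding_factor values
/// split into groups of `chunk_len`.
fn chunks_per_leaf(folding_factor: u32, chunk_len: u64) -> u64 {
    (1u64 << folding_factor) / chunk_len
}

fn main() {
    let log_domain = 27; // assumed: 2^20 rows x 2^7 values per row
    for k in [7, 8] {
        println!(
            "folding factor {k}: {} rows (= Merkle leaves), {} chunks per leaf",
            fft_rows(log_domain, k),
            chunks_per_leaf(k, 8),
        );
    }
}
```

Each leaf doubles in width while the number of leaves halves, which is why the verifier side needs the new `num_chunks=32` case.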

Measured: -2.73% e2e.

Origin: blake3-autoresearch h14v2 (3aab3db), independent of blake3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix min_section_log calculation to use .max() before .min(), ensuring
the bytecode section doesn't unnecessarily pull down the GKR pivot.
Also updates the surface assertion for WHIR folding factor 8.

The MIN_LOG_N_ROWS_PER_TABLE stays at 8 (no blake3 small table to pad).

Measured: -8.1% e2e on blake3 branch (primarily from the pivot fix
enabling the ENDIANNESS_PIVOT_GKR=12 fast path).

Origin: blake3-autoresearch h24 (bc3bd7e), logup fix independent of blake3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compilation retry messages should go to stderr, not stdout, to avoid
polluting JSON benchmark output.
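
A minimal sketch of the stdout/stderr split described here. The message text, the `log_retry` helper and the JSON field are hypothetical; only the separation of the two streams reflects the commit.

```rust
use std::io::{self, Write};

/// Writes a retry notice to `diag`; the caller decides which stream that is.
fn log_retry<W: Write>(diag: &mut W, attempt: u32) -> io::Result<()> {
    writeln!(diag, "compilation failed, retrying (attempt {attempt})")
}

fn main() -> io::Result<()> {
    // Diagnostics go to stderr...
    log_retry(&mut io::stderr(), 2)?;
    // ...so stdout carries only the machine-readable result (made-up payload).
    println!("{{\"proving_time_s\": 1.98}}");
    Ok(())
}
```

With this split, redirecting stdout to a file or piping it into a JSON parser yields only the result line, while retry notices stay visible on the terminal.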

Origin: blake3-autoresearch (89a8b61), independent of blake3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove outputs_right from Poseidon1Cols16 struct, reducing committed
columns by 8. The right half of the permutation output is no longer
committed or looked up.

Changes:
- Poseidon1Cols16: removed outputs_right field (-8 columns)
- bus_interactions: result lookup reduced from DIGEST_LEN*2 to DIGEST_LEN
- eval: removed 8 flag_permute*(state[i+8]-outputs_right[i]) constraints
- n_constraints: 99 → 91
- trace_gen: removed outputs_right generation
- trace override: simplified to only handle half_output for outputs_left

The lookup now writes only outputs_left (8 values) to memory.
For permute rows: outputs_left = state (matches permuted output in memory).
For compression rows: outputs_left = state + input (matches output in memory).
For half_output rows: outputs_left[4..7] overridden with memory values.

ALL 5 TESTS PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Barnadrot Barnadrot changed the title perf: prover optimizations (-10% Hetzner, -7.4% M4 Mac) perf: prover optimizations May 25, 2026
@Barnadrot Barnadrot changed the title perf: prover optimizations perf: Sumcheck, GKR & WHIR proving optimizations May 25, 2026
@TomWambsgans
Collaborator

TomWambsgans commented May 25, 2026

Hi, thanks for the PR! About the removal of 8 Poseidon columns: I think we cannot do this, we need to commit to them for soundness (the AIR constraints are here + there is the lookup). About the folding factor of 8, I will come back to it later. About all the rest (where the 2 points mentioned above are removed): I tried to replicate it (see #235). TL;DR: close to neutral on M4 Max, very nice perf improvement on the Hetzner server. I will continue working on this tomorrow. This is something good to merge, I just want to understand why the Mac doesn't improve (and ensure it's not slowed down). Will keep you updated.

Mac M4 Max BEFORE:

                                                      time      size       ± %      cycles      memory   poseidons  extension-ops
               ┌──▸ ◇ 1550 R=1/2 ·· ▸ 1293 XMSS/s - 1.199s   344 KiB    ± 0.9%     993,664   3,898,904     259,056      30,602
               ├──▸ ◇ 508 R=1/4 ··· ▸  656 XMSS/s - 0.775s   208 KiB    ± 1.1%     326,333   1,283,093      85,042      10,068
          ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.570s   197 KiB    ± 0.5%     255,888     757,341      28,196     102,605
          │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  937 XMSS/s - 1.654s   230 KiB    ± 0.7%     993,646   3,898,904     259,056      30,619
          │    ├──▸ ◇ 508 R=1/4 ··· ▸  648 XMSS/s - 0.784s   207 KiB    ± 0.4%     326,383   1,283,093      85,042      10,118
          ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.538s   196 KiB    ± 1.2%     225,020     649,967      23,018      82,584
     ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.449s   296 KiB    ± 1.7%     258,948     746,312      29,495      78,816
     │    ┌──▸ ◇ 775 R=1/4 ········ ▸  913 XMSS/s - 0.848s   208 KiB    ± 0.9%     497,406   1,953,378     129,631      15,361
     │    ├──▸ ◇ 775 R=1/4 ········ ▸  912 XMSS/s - 0.850s   209 KiB    ± 0.9%     497,451   1,953,378     129,631      15,363
     ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.535s   197 KiB    ± 1.4%     216,475     617,386      21,172      88,208
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               0.804s   198 KiB    ± 0.4%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               0.626s   131 KiB    ± 1.2%      96,949     325,015       9,164      44,263

Mac M4 Max AFTER

               ┌──▸ ◇ 1550 R=1/2 ·· ▸ 1286 XMSS/s - 1.205s   345 KiB    ± 1.0%     993,664   3,898,904     259,056      30,602
               ├──▸ ◇ 508 R=1/4 ··· ▸  651 XMSS/s - 0.780s   208 KiB    ± 1.0%     326,333   1,283,093      85,042      10,068
          ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.567s   196 KiB    ± 0.6%     255,888     757,341      28,196     102,605
          │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  939 XMSS/s - 1.650s   229 KiB    ± 1.1%     993,646   3,898,904     259,056      30,619
          │    ├──▸ ◇ 508 R=1/4 ··· ▸  651 XMSS/s - 0.780s   208 KiB    ± 0.9%     326,383   1,283,093      85,042      10,118
          ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.538s   196 KiB    ± 1.6%     225,020     649,967      23,018      82,584
     ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.452s   296 KiB    ± 1.8%     258,948     746,312      29,495      78,816
     │    ┌──▸ ◇ 775 R=1/4 ········ ▸  919 XMSS/s - 0.843s   208 KiB    ± 0.5%     497,406   1,953,378     129,631      15,361
     │    ├──▸ ◇ 775 R=1/4 ········ ▸  916 XMSS/s - 0.846s   209 KiB    ± 1.2%     497,451   1,953,378     129,631      15,363
     ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.526s   196 KiB    ± 0.7%     216,475     617,386      21,172      88,208
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               0.802s   197 KiB    ± 1.3%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               0.629s   131 KiB    ± 0.5%      96,949     325,015       9,164      44,263

Hetzner AX42U BEFORE:

Aggregation program: 215,624 instructions

                                                      time      size       ± %      cycles      memory   poseidons  extension-ops
               ┌──▸ ◇ 1550 R=1/2 ·· ▸  535 XMSS/s - 2.895s   345 KiB    ± 1.5%     993,664   3,898,904     259,056      30,602
               ├──▸ ◇ 508 R=1/4 ··· ▸  289 XMSS/s - 1.759s   209 KiB    ± 1.1%     326,333   1,283,093      85,042      10,068
          ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.988s   196 KiB    ± 0.3%     255,888     757,341      28,196     102,605
          │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  377 XMSS/s - 4.107s   229 KiB    ± 2.1%     993,646   3,898,904     259,056      30,619
          │    ├──▸ ◇ 508 R=1/4 ··· ▸  292 XMSS/s - 1.741s   209 KiB    ± 0.7%     326,383   1,283,093      85,042      10,118
          ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.947s   196 KiB    ± 0.2%     225,020     649,967      23,018      82,584
     ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.785s   296 KiB    ± 1.0%     258,948     746,312      29,495      78,816
     │    ┌──▸ ◇ 775 R=1/4 ········ ▸  424 XMSS/s - 1.829s   209 KiB    ± 0.9%     497,406   1,953,378     129,631      15,361
     │    ├──▸ ◇ 775 R=1/4 ········ ▸  424 XMSS/s - 1.828s   208 KiB    ± 0.5%     497,451   1,953,378     129,631      15,363
     ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.940s   195 KiB    ± 1.3%     216,475     617,386      21,172      88,208
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               1.631s   198 KiB    ± 1.3%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               1.357s   131 KiB    ± 0.8%      96,949     325,015       9,164      44,263

Hetzner AX42U AFTER:

                                                      time      size       ± %      cycles      memory   poseidons  extension-ops
               ┌──▸ ◇ 1550 R=1/2 ·· ▸  587 XMSS/s - 2.640s   343 KiB    ± 0.5%     993,664   3,898,904     259,056      30,602
               ├──▸ ◇ 508 R=1/4 ··· ▸  322 XMSS/s - 1.576s   209 KiB    ± 0.4%     326,333   1,283,093      85,042      10,068
          ┌──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.938s   195 KiB    ± 0.4%     255,888     757,341      28,196     102,605
          │    ┌──▸ ◇ 1550 R=1/4 ·· ▸  387 XMSS/s - 4.000s   230 KiB    ± 1.0%     993,646   3,898,904     259,056      30,619
          │    ├──▸ ◇ 508 R=1/4 ··· ▸  323 XMSS/s - 1.574s   209 KiB    ± 0.6%     326,383   1,283,093      85,042      10,118
          ├──▸ ◆ 2058 - 10 R=1/4 ·· ▸               0.884s   197 KiB    ± 0.5%     225,020     649,967      23,018      82,584
     ┌──▸ ◆ 4096 + 25 - 5 R=1/2 ··· ▸               0.716s   296 KiB    ± 0.3%     258,948     746,312      29,495      78,816
     │    ┌──▸ ◇ 775 R=1/4 ········ ▸  479 XMSS/s - 1.619s   208 KiB    ± 0.8%     497,406   1,953,378     129,631      15,361
     │    ├──▸ ◇ 775 R=1/4 ········ ▸  479 XMSS/s - 1.619s   207 KiB    ± 0.7%     497,451   1,953,378     129,631      15,363
     ├──▸ ◆ 1550 - 5 R=1/4 ········ ▸               0.882s   195 KiB    ± 0.6%     216,475     617,386      21,172      88,208
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸               1.503s   198 KiB    ± 0.5%     304,202     868,006      34,786      97,818
◆ 5669 R=1/16 ····················· ▸               1.328s   130 KiB    ± 1.7%      96,949     325,015       9,164      44,263

@TomWambsgans
Collaborator

About initial folding factor = 8: it indeed makes the prover a bit faster, at the cost of bigger proofs. Recursion: cycles are decreased, poseidon + extension ops increased. Need to think about this. My initial intuition is that it's not worth it, but not totally sure.

@Barnadrot
Contributor Author

Agreed regarding the Poseidon change: that was an incorrect cherry-pick from the larger blake3 branch. It was architecture dependent.

I ran it on the M4 Mini and I am also getting a neutral effect, at -0.3%.

Re folding factor: this was the formula I used earlier; it was removed from this work due to its characteristics. It could be used to evaluate the perf vs. size tradeoff:

Any change that increases proof size is scored with: net = throughput_pct - 3 × proof_size_increase_pct. Hard ceiling at +20% proof size.
This would clearly favor #235.
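
As a rough sketch, the scoring rule above can be written out and applied to numbers quoted in this thread. Treating a size decrease as zero increase is an assumption (the rule only speaks about increases), and the function and constant names are made up for illustration.

```rust
/// Hard ceiling on proof-size growth, in percent.
const MAX_SIZE_INCREASE_PCT: f64 = 20.0;

/// net = throughput_pct - 3 x proof_size_increase_pct.
/// Returns `None` when the change exceeds the hard ceiling.
fn net_score(throughput_pct: f64, proof_size_increase_pct: f64) -> Option<f64> {
    if proof_size_increase_pct > MAX_SIZE_INCREASE_PCT {
        return None;
    }
    // Assumption: a smaller proof earns no bonus, so clamp at zero.
    let size_penalty = 3.0 * proof_size_increase_pct.max(0.0);
    Some(throughput_pct - size_penalty)
}

fn main() {
    // This PR on Hetzner: +11.2% throughput, +15.4% proof size.
    println!("#234: {:?}", net_score(11.2, 15.4));
    // Replicated subset on Hetzner (535 -> 587 XMSS/s, about +9.7%), no size increase.
    println!("#235: {:?}", net_score(9.7, 0.0));
}
```

Under this rule the full PR scores about -35 while the size-neutral subset scores about +9.7, which is the sense in which the formula favors the subset.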

TomWambsgans added a commit that referenced this pull request May 26, 2026
* attempt to replicate parts of #234

Co-Authored-By: Barnadrot <kbarna.drot@gmail.com>

* undo GKR pivot change

* simplify sumcheck_utils

* simplify open.rs

* fmt

* > instead of >= for parallel threshold in sumcheck_utils.rs (faster on M4 Max, slightly slower on ax42u)

---------

Co-authored-by: Tom Wambsgans <TomWambsgans@users.noreply.github.com>
Co-authored-by: Barnadrot <kbarna.drot@gmail.com>
@TomWambsgans
Collaborator

Thanks again.
e45a0ed

@TomWambsgans
Collaborator

About initial folding factor of 8:

BEFORE (7)

                                        time      size      cycles      memory   poseidons  extension-ops
  ┌──▸ ◇ 775 R=1/4 ·· ▸  909 XMSS/s - 0.853s   208 KiB     497,410   1,953,378     129,631      15,370
  ├──▸ ◇ 775 R=1/4 ·· ▸  907 XMSS/s - 0.854s   208 KiB     497,265   1,953,378     129,631      15,296
  ◆ 1550 R=1/4 ······ ▸               0.541s   196 KiB     216,455     617,276      21,175      88,211

AFTER (8)

                                        time      size      cycles      memory   poseidons  extension-ops
  ┌──▸ ◇ 775 R=1/4 ·· ▸  926 XMSS/s - 0.837s   240 KiB     497,410   1,953,378     129,631      15,370
  ├──▸ ◇ 775 R=1/4 ·· ▸  937 XMSS/s - 0.827s   241 KiB     497,265   1,953,378     129,631      15,296
  ◆ 1550 R=1/4 ······ ▸               0.552s   229 KiB     205,749     642,462      23,525     108,905

About recursion in particular:

  • size: 196 → 229 KiB (+16.84%)
  • cycles: 216,455 → 205,749 (−4.95%)
  • memory: 617,276 → 642,462 (+4.08%)
  • poseidons: 21,175 → 23,525 (+11.10%)
  • ext-ops: 88,211 → 108,905 (+23.46%)

The +23.46% in extension-ops is not negligible. It may not directly show up in the measured throughput, since the count lands on the same power of 2 after padding (2^17), but this may not be true in the future.
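
The padding argument can be checked quickly. The sketch assumes the extension-ops table is padded to the next power of two of its row count; `padded_log2` is a made-up helper name.

```rust
/// log2 of the table height after padding `n_ops` rows to a power of two.
fn padded_log2(n_ops: u64) -> u32 {
    n_ops.next_power_of_two().trailing_zeros()
}

fn main() {
    let (before, after) = (88_211u64, 108_905u64);
    for ops in [before, after] {
        println!("{ops} ext-ops pad to 2^{}", padded_log2(ops));
    }
    // Rows still free before the table would spill into the next power of two.
    let headroom = (1u64 << padded_log2(after)) - after;
    println!("headroom at folding factor 8: {headroom} rows");
}
```

Both counts pad to 2^17 = 131,072 rows, so the extra operations are currently absorbed by padding, but the remaining headroom shrinks from 42,861 to 22,167 rows.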

My opinion: we should keep 7.

What do you think? If you agree, I will close this PR (now that e45a0ed has been merged)

@Barnadrot
Contributor Author

Agree, closing this PR in favor of the already merged #235

@Barnadrot Barnadrot closed this May 26, 2026