perf: Sumcheck, GKR & WHIR proving optimizations #234
The GKR phase2 pair_coeffs loop was completely sequential, processing up to 4M pairs on a single thread for the initial layer. This change:
1. Replaces the sequential for-loop with par_iter + parallel reduction (gated by PARALLEL_THRESHOLD for small arrays)
2. Replaces per-round eval_eq recomputation with an incremental pairwise-addition fold, saving O(2^k) ext-field muls per round

Measured: -14.1% e2e (phase2 handles ~70% of GKR layer proving).
Origin: blake3-autoresearch h19 (fc7cd33), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
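The incremental fold in point 2 can be illustrated on a toy field. The sketch below is an assumption-laden illustration, not the PR's code: it uses a small prime modulus instead of the extension field, hypothetical function names, and a table layout in which the last coordinate of the point maps to the lowest index bit. Under that layout, adjacent entries differ only by the factors (1 - r_k) and r_k, which sum to 1, so adding them yields the eq table for one fewer variable with no multiplications.

```rust
/// Toy prime modulus (2^31 - 1), standing in for the real extension field.
const P: u64 = 2_147_483_647;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Full eq table: entry x holds prod_i (r_i * x_i + (1 - r_i) * (1 - x_i)),
/// with the last coordinate of `r` mapped to the lowest bit of x.
/// Costs one multiplication per table entry per variable.
fn eval_eq_table(r: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &ri in r {
        let mut next = Vec::with_capacity(table.len() * 2);
        for &t in &table {
            let hi = mul(t, ri);   // branch x_i = 1: t * r_i
            next.push(sub(t, hi)); // branch x_i = 0: t * (1 - r_i)
            next.push(hi);
        }
        table = next;
    }
    table
}

/// Drops the last coordinate using additions only:
/// t * (1 - r_k) + t * r_k = t.
fn fold_eq_table(table: &[u64]) -> Vec<u64> {
    table.chunks_exact(2).map(|p| add(p[0], p[1])).collect()
}
```

Folding the table for k variables should give the same result as recomputing the table for the first k - 1 coordinates, which is the property the per-round saving relies on.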
- Fused dual-point eq computation: process both full-domain eq polynomials in a single recursive pass, eliminating a 1.28GB DRAM round-trip (h42b)
- Packed SIMD first-round product sumcheck (h68)
- combine_statement zero_vec skip: uninitialized buffer + STORE path when the first OOD statement covers the full array (h37)

Measured: -2.54% (h42b), -1.18% (h68), combined ~-3.7%.
Origin: pw5-clean (3e95117), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Doubles the initial WHIR folding width (2^7=128 → 2^8=256 evaluation points per fold), which halves FFT rows (2^20 → 2^19), halves Merkle tree leaves, and eliminates one subsequent WHIR round (3 → 2).

Adds num_chunks=32 support to decompose_and_verify_merkle_batch in the recursion circuit verifier (256/8=32 chunks per Merkle leaf).

Measured: -2.73% e2e.
Origin: blake3-autoresearch h14v2 (3aab3db), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
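The figures in this commit message are related by simple arithmetic. The sketch below assumes an evaluation domain of 2^27 points (inferred from 2^20 rows of 2^7 points, not stated explicitly) and chunks of 8 values (taken from 256/8=32). Both function names are hypothetical and do not come from the codebase.

```rust
/// log2 of the number of rows in the folded evaluation matrix:
/// each row holds 2^folding_factor evaluation points.
fn fft_rows_log2(domain_log2: u32, folding_factor: u32) -> u32 {
    domain_log2 - folding_factor
}

/// Number of chunks per Merkle leaf when a leaf of 2^folding_factor
/// values is split into chunks of `chunk_width` values.
fn chunks_per_leaf(folding_factor: u32, chunk_width: usize) -> usize {
    (1usize << folding_factor) / chunk_width
}
```

With these assumptions, a folding factor of 7 gives 2^20 rows and a folding factor of 8 gives 2^19 rows with 32 chunks per leaf, matching the numbers quoted above.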
Fix min_section_log calculation to use .max() before .min(), ensuring the bytecode section doesn't unnecessarily pull down the GKR pivot. Also updates the surface assertion for WHIR folding factor 8. MIN_LOG_N_ROWS_PER_TABLE stays at 8 (no blake3 small table to pad).

Measured: -8.1% e2e on blake3 branch (primarily from the pivot fix enabling the ENDIANNESS_PIVOT_GKR=12 fast path).
Origin: blake3-autoresearch h24 (bc3bd7e), logup fix independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compilation retry messages should go to stderr, not stdout, to avoid polluting JSON benchmark output.

Origin: blake3-autoresearch (89a8b61), independent of blake3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
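As a minimal sketch of the stdout/stderr split described here: the helper below is hypothetical (its name, message text, and JSON payload are illustrative assumptions, not the project's code). It takes the two streams as generic writers so the separation can be checked in isolation.

```rust
use std::io::{self, Write};

/// Hypothetical helper: the retry diagnostic goes to `diag`,
/// and only the machine-readable result goes to `out`.
fn report<O: Write, D: Write>(
    out: &mut O,
    diag: &mut D,
    attempt: u32,
    json: &str,
) -> io::Result<()> {
    writeln!(diag, "compilation failed, retrying (attempt {attempt})")?;
    writeln!(out, "{json}")
}
```

In a real binary this would be called with `io::stdout()` and `io::stderr()`, or the diagnostic would simply use `eprintln!` instead of `println!`, so that a consumer parsing stdout as JSON never sees the retry message.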
Remove outputs_right from Poseidon1Cols16 struct, reducing committed columns by 8. The right half of the permutation output is no longer committed or looked up.

Changes:
- Poseidon1Cols16: removed outputs_right field (-8 columns)
- bus_interactions: result lookup reduced from DIGEST_LEN*2 to DIGEST_LEN
- eval: removed 8 flag_permute*(state[i+8]-outputs_right[i]) constraints
- n_constraints: 99 → 91
- trace_gen: removed outputs_right generation
- trace override: simplified to only handle half_output for outputs_left

The lookup now writes only outputs_left (8 values) to memory.
For permute rows: outputs_left = state (matches permuted output in memory).
For compression rows: outputs_left = state + input (matches output in memory).
For half_output rows: outputs_left[4..7] overridden with memory values.

All 5 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hi, thanks for the PR!

About the removal of 8 Poseidon columns: I think we cannot do this; we need to commit to them for soundness (the AIR constraints are there, and there is the lookup).

About the folding factor of 8: I will come back to it later.

About all the rest (with the 2 points mentioned above removed): I tried to replicate it (see #235). TL;DR: close to neutral on M4 Max, very nice perf improvement on the Hetzner server. I will continue working on this tomorrow. This is something good to merge; I just want to understand why the Mac doesn't improve (and ensure it's not slowed down). Will keep you updated.

Mac M4 Max BEFORE:
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸ 0.804s 198 KiB ± 0.4% 304,202 868,006 34,786 97,818
Mac M4 Max AFTER:
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸ 0.802s 197 KiB ± 1.3% 304,202 868,006 34,786 97,818
Hetzner AX42U BEFORE:
Aggregation program: 215,624 instructions
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸ 1.631s 198 KiB ± 1.3% 304,202 868,006 34,786 97,818
Hetzner AX42U AFTER:
┌──▸ ◆ 5661 + 10 - 2 R=1/4 ········ ▸ 1.503s 198 KiB ± 0.5% 304,202 868,006 34,786 97,818
|
About initial folding factor = 8: it indeed makes the prover a bit faster, at the cost of bigger proofs. Recursion: cycles are decreased, poseidon + extension ops increased. Need to think about this. My initial intuition is that it's not worth it, but not totally sure. |
|
Agreed regarding the Poseidon change: that was an incorrect cherry-pick from the larger blake3 branch. It was architecture dependent. I ran it on an M4 Mini and am also getting a neutral effect, at -0.3%.

Re folding factor: this was my formula earlier; it was removed from this work due to its characteristics. It could be used to evaluate the perf vs. size tradeoff:
|
* attempt to replicate parts of #234
  Co-Authored-By: Barnadrot <kbarna.drot@gmail.com>
* undo GKR pivot change
* simplify sumcheck_utils
* simplify open.rs
* fmt
* > instead of >= for parallel threshold in sumcheck_utils.rs (faster on M4 Max, slightly slower on AX42U)

---------

Co-authored-by: Tom Wambsgans <TomWambsgans@users.noreply.github.com>
Co-authored-by: Barnadrot <kbarna.drot@gmail.com>
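The threshold-gated parallel reduction discussed in this thread can be sketched without external crates. This is an illustration only. It uses std::thread::scope instead of rayon's par_iter, wrapping u64 arithmetic instead of field operations, and a made-up PARALLEL_THRESHOLD value. The function names are hypothetical. The strict `>` comparison mirrors the last bullet above.

```rust
use std::thread;

/// Hypothetical cutoff; the real PARALLEL_THRESHOLD value is not given in this thread.
const PARALLEL_THRESHOLD: usize = 1 << 10;

/// Sequential kernel: sum of products of adjacent pairs (wrapping u64
/// arithmetic stands in for field arithmetic).
fn pair_sum(chunk: &[u64]) -> u64 {
    chunk
        .chunks_exact(2)
        .fold(0u64, |acc, p| acc.wrapping_add(p[0].wrapping_mul(p[1])))
}

/// Runs the kernel in parallel only when the input is strictly larger
/// than the threshold; small inputs avoid the thread overhead.
fn sum_pair_products(values: &[u64], workers: usize) -> u64 {
    if values.len() > PARALLEL_THRESHOLD && workers > 1 {
        // Round the chunk length up to an even number so no pair straddles two chunks.
        let chunk_len = ((values.len() + workers - 1) / workers + 1) & !1;
        thread::scope(|s| {
            let handles: Vec<_> = values
                .chunks(chunk_len)
                .map(|chunk| s.spawn(move || pair_sum(chunk)))
                .collect();
            // Reduce the per-chunk partial sums into one result.
            handles
                .into_iter()
                .fold(0u64, |acc, h| acc.wrapping_add(h.join().unwrap()))
        })
    } else {
        pair_sum(values)
    }
}
```

Because the reduction is associative, the parallel and sequential paths return the same value. Where the cutoff sits (and whether the comparison is `>` or `>=`) only affects speed, which is consistent with the platform-dependent timings reported above.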
|
Thanks again. |
|
About the initial folding factor of 8:
BEFORE (7)
AFTER (8)
About recursion in particular:
the +23.46% extension-ops is non negligible. It may not directly reflect in the measured throughput, since it's the same power of 2 after padding (2^17), but this may not be true in the future. My opinion: we should keep 7. What do you think? If you agree, I will close this PR (now that e45a0ed has been merged) |
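The padding argument can be made concrete with a small sketch. The operation counts used in the note below are purely hypothetical (the actual counts are not given in this thread) and the function name is an assumption. Only the rounding behaviour is the point.

```rust
/// log2 of the padded table size: a count is rounded up to the next power of two.
fn padded_log2(count: u64) -> u32 {
    count.next_power_of_two().trailing_zeros()
}
```

For instance, a hypothetical 100,000 operations and the same count increased by 23.46% (123,460) both pad to 2^17, so the increase is invisible after padding. A hypothetical starting count of 110,000 would grow to about 135,806 and spill into 2^18, which is the future risk mentioned above.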
|
Agree, closing this PR in favor of the already merged #235 |
Summary
Architecture-independent prover optimizations that improve end-to-end proving time by 10% on Hetzner (x86 AVX-512) and 7.4% on M4 Mac (ARM NEON).
Benchmarks
Cross-platform benchmarks (1550 XMSS signatures, log_inv_rate=1, warm average of 8 proofs):
main (2c70162)
Hetzner AX42-U - 64GB RAM - AMD Ryzen 7 PRO 8700GE
M4 Mac Mini - 32 GB RAM
Changes
- pair_coeffs inner loop was serial over up to 4M pairs. Parallelized via par_iter + reduce above a threshold. Replaced per-round eval_eq recomputation with incremental pairwise-addition fold.
- compute_eval_eq_packed_dual builds two eq-polynomials in a single pass when the first two WHIR opening statements are full-domain. Removes one full eq-table traversal. Generalizes the base×ext sumcheck product path (removes hardcoded DIMENSION == 5 dispatch).
- num_chunks=32.
- min→max for bytecode section log height, removing suboptimal pivot selection that caused unnecessary GKR work.
- outputs_right columns: the right half of permutation output was committed but never verified. Removes the columns, their 8 AIR constraints (99→91), and halves the result memory lookup (16→8 values). ~350MB RSS reduction.
Tradeoffs
initial folding factor 7→8 (larger first-round polynomial), partially
offset by 8 fewer committed Poseidon columns.
Test plan
- cargo test --release --test test_multisignatures (4 tests): pass on both platforms
- cargo test --release --test test_zk_alloc (1 test): pass on both platforms