perf(whir_zk): IRS-coeff residency + output-pruned NTT for f̂ opens by shreyas-londhe · Pull Request #252 · worldfnd/whir

shreyas-londhe · 2026-05-13T11:25:22Z

Memory + wall optimisations for `whir_zk::prove_blinded_polynomial`

Five cumulative commits on top of upstream/v1. Net trade versus upstream/v1: −204 MB peak for +220 ms wall. Memory savings are the headline. Wall is still a small regression versus v1, but the pruned-NTT commits recover ~500 ms of the +720 ms regression that the IRS-coeff change alone introduced.

Stack

| Commit | What | Why |
| --- | --- | --- |
| `c183108` | Drop `linear_forms` after covector build | Linear forms only needed to produce the covector; holding them past that point wastes peak. |
| `fc3e614` | Hold IRS coefficients, re-encode codeword on demand | `f_hat_witness.matrix` is the dominant resident allocation between commit and the two f̂ opens. Drop it after commit; re-encode just for each open. Costs +720 ms wall (the codeword is re-encoded twice, once per [[f̂]] open). |
| `97fea7c` | Output-pruned NTT for f̂ opens | Replaces the transient full re-encode (~`O(N log N)` flops) with an output-pruned NTT (~`O(N + k log N)` flops) that materialises only the queried codeword rows. Recovers ~500 ms of the `fc3e614` wall regression. |
| `d4ad1fb` | Parallelise `partial_interleaved_rs_encode` batches | Sequential pruned NTT lost to parallel `ntt_batch`. Restore batch-axis rayon via batch-major intermediate + transpose. |
| `c39ce01` | Clippy + signature cleanup | `f_hat_witness: &mut` is no longer accurate (never mutated). |

Measurements (complete_age_check, m=20, BN254 Fr, 8 threads, 30 s cooldown, 3 iters, σ ≈ 1%)

Same machine, same provekit (db8ff0fd), low-noise methodology (cooldown between iters, no other load, single cached binary per ref):

| Metric | upstream/v1 (`0a68627`) | HEAD (`c39ce01`) | Δ |
| --- | --- | --- | --- |
| Wall (min/med, ms) | 2630 / 2650 | 2860 / 2870 | +220 ms / +8.3% |
| Peak (MB) | 880.0 | 676.0 | −204 / −23.2% |
| Allocs | 3.56M | 3.56M | 0% |

Peak and allocs are deterministic. Wall σ on cooled-down runs is ~1%, so the +8.3% wall delta is real, not noise.

Intermediate stage (for context)

The IRS-coeff change alone (fc3e614) was measured by the original author at +720 ms wall for −99 MB peak. The pruned-NTT path (97fea7c + d4ad1fb) recovers ~500 ms of that wall regression — bringing the cumulative wall cost from +720 ms down to +220 ms while preserving (and slightly extending) the peak savings.

Thread scaling (HEAD binary, complete_age_check)

| Threads | Wall (min/med, ms) | Peak (MB) |
| --- | --- | --- |
| 1 | 10300 / 10400 | 676 |
| 2 | 6120 / 6140 | 676 |
| 4 | 3850 / 3860 | 676 |
| 8 | 2860 / 2870 | 676 |

Peak is identical at all thread counts: the 676 MB is set by the IRS-coeff resident witness, not by encode work buffers. Wall scales 1.0× → 3.6× from 1 to 8 cores. Solving Amdahl's law, 1 / ((1 − p) + p/8) = 3.6, gives a parallel fraction p ≈ 83%.

When this PR is a win

OOM-bound deployments: 880 → 676 MB unlocks complete_age_check on machines that previously OOM'd, at an 8% wall cost.
Provers that batch many proofs: the −204 MB peak compounds — more concurrent provers per machine.

When it is not

Single-prove, latency-critical, no memory pressure: the +220 ms wall is a real regression with no return.

Algorithm (commit `97fea7c`)

Output-pruned radix-2 DIT NTT (Sorensen & Burrus 1993; Skinner 1976). For query set I of size k out of N, walks the butterfly DAG backwards from I to mark only the cone of butterflies contributing to the queried outputs. Field-op count drops from O(N log N) per NTT (full) to O(N + k log N) per NTT (pruned). Mask is built once per (N, I) via PartialNttPlan and reused across all num_polys × interleaving_depth NTTs of an interleaved IRS open.

Marking rule: at radix-2 DIT stage s, the butterfly at (a = base + j, b = base + j + half), with j the offset inside the block (not the query-set size k), reads both inputs from and writes both outputs to positions (a, b). If either output is in the live set at stage s, both inputs are marked live at stage s − 1. This is the standard contrapositive of "Y[i] depends on a[k] iff k ∈ cone(i)" for radix-2 DIT.
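The marking rule can be sketched on a toy field. This is a minimal illustration, not the PR's `PartialNttPlan` or `NttEngine::ntt_partial`: the field (integers mod 65537 with generator 3), the function names `build_mask`, `ntt_pruned` and `dft_at`, and the per-stage `Vec<Vec<bool>>` mask layout are all assumptions made for the example. It builds the live mask backwards from the query set, runs only the marked butterflies, and counts how many were executed.

```rust
// Toy field for illustration only: integers mod the Fermat prime 65537,
// where 3 generates the multiplicative group of order 2^16.
const P: u64 = 65_537;

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// live[s][i] is true iff position i is needed after stage s
/// (stage 0 is the bit-reversed input, stage log2(n) is the output).
fn build_mask(n: usize, queries: &[usize]) -> Vec<Vec<bool>> {
    let log_n = n.trailing_zeros() as usize;
    let mut live = vec![vec![false; n]; log_n + 1];
    for &q in queries {
        live[log_n][q] = true;
    }
    for s in (1..=log_n).rev() {
        let half = 1usize << (s - 1);
        for base in (0..n).step_by(2 * half) {
            for j in 0..half {
                let (a, b) = (base + j, base + j + half);
                // Marking rule: a live output makes both inputs live.
                if live[s][a] || live[s][b] {
                    live[s - 1][a] = true;
                    live[s - 1][b] = true;
                }
            }
        }
    }
    live
}

/// Returns the queried outputs and the number of butterflies executed.
fn ntt_pruned(input: &[u64], queries: &[usize]) -> (Vec<u64>, usize) {
    let n = input.len();
    assert!(n.is_power_of_two() && n >= 2 && n <= 65_536);
    let log_n = n.trailing_zeros() as usize;
    let live = build_mask(n, queries);
    // Bit-reversal permutation into the working buffer.
    let mut x = vec![0u64; n];
    for (i, &v) in input.iter().enumerate() {
        x[i.reverse_bits() >> (usize::BITS as usize - log_n)] = v % P;
    }
    let omega = pow_mod(3, 65_536 / n as u64); // primitive n-th root of unity
    let mut butterflies = 0;
    for s in 1..=log_n {
        let half = 1usize << (s - 1);
        let w_m = pow_mod(omega, (n >> s) as u64); // primitive 2^s-th root
        for base in (0..n).step_by(2 * half) {
            let mut w = 1;
            for j in 0..half {
                let (a, b) = (base + j, base + j + half);
                // Skip butterflies outside the cone of the queried outputs.
                if live[s][a] || live[s][b] {
                    let (u, t) = (x[a], w * x[b] % P);
                    x[a] = (u + t) % P;
                    x[b] = (u + P - t) % P;
                    butterflies += 1;
                }
                w = w * w_m % P;
            }
        }
    }
    (queries.iter().map(|&q| x[q]).collect(), butterflies)
}

/// Reference: direct evaluation of a single output q, O(n) field ops.
fn dft_at(input: &[u64], q: usize) -> u64 {
    let n = input.len() as u64;
    let omega = pow_mod(3, 65_536 / n);
    let mut acc = 0;
    for (j, &v) in input.iter().enumerate() {
        acc = (acc + v % P * pow_mod(omega, q as u64 * j as u64)) % P;
    }
    acc
}
```

Calling `ntt_pruned(&input, &[3, 10])` on a length-16 input returns the two queried outputs plus a butterfly count that can be compared against the 32 butterflies of the full transform. A skipped butterfly leaves stale values in place, which is safe because no marked butterfly at a later stage reads them.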

Twiddle-factor indexing: roots[k * (roots.len() / m)] retrieves ω_m^k correctly even when the shared roots table is at an order larger than m — same pattern as the existing apply_twiddles.
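The stride argument can be checked on a small example. The sketch below assumes a toy field (integers mod 17, where 3 has multiplicative order 16) and hypothetical helpers `roots_table` and `twiddle`; it is not the repository's `apply_twiddles` or its roots cache, only an illustration of why stepping through a larger table by `roots.len() / m` yields powers of ω_m.

```rust
// Toy prime for illustration: 3 has multiplicative order 16 modulo 17.
const P: u64 = 17;

/// Hypothetical shared table: roots[j] = g^j for a generator g of the given order.
fn roots_table(g: u64, order: usize) -> Vec<u64> {
    let mut roots = Vec::with_capacity(order);
    let mut acc = 1;
    for _ in 0..order {
        roots.push(acc);
        acc = acc * g % P;
    }
    roots
}

/// Reads ω_m^k from a table whose order is a multiple of m.
/// Since ω_m = g^(order / m), ω_m^k = g^(k * order / m) = roots[k * (order / m)].
fn twiddle(roots: &[u64], m: usize, k: usize) -> u64 {
    assert!(roots.len() % m == 0 && k < m);
    roots[k * (roots.len() / m)]
}
```

With a table of order 16, `twiddle(&roots, 4, 1)` reads `roots[4]`, which is 3^4 mod 17 = 13, a primitive 4th root of unity in this field.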

References

Sorensen & Burrus, "Efficient computation of the DFT with only a subset of input or output points," IEEE TSP 41(3), 1993.
Skinner, "Pruning the decimation in-time FFT algorithm," IEEE ASSP-24(2), 1976.

Doc comments on NttEngine::ntt_partial and PartialNttPlan carry the full reference.

Verification

155 whir lib tests pass, including 3 new property tests:
- test_ntt_partial_matches_full — random subsets across sizes 4..2^15 vs full NTT.
- test_ntt_partial_zero_padded_input — M < N case.
- test_partial_interleaved_rs_encode_matches_full — byte-identity vs full encode across four IRS shapes.
30 whir_zk tests pass. test_rejects_g_claim_forgery_via_rho (which was already broken on fc3e614 from a different upstream change) is fixed and still rejects the forged eval.
cargo clippy -- -D warnings clean.
Transcript byte-identity audited: open_inner_from_coeffs emits the same prover_hint_ark(&submatrix) + matrix_commit.open sequence as the original open_inner. No transcript divergence.
Independent correctness review confirmed the marking rule, twiddle indexing, bit-reversal, edge cases, and layout match.

Open follow-ups (out of scope)

Close the +220 ms gap: candidates are (a) per-call mask allocation in PartialNttPlan::new (~10 MB bools for N=2^19), (b) pruned-NTT per-butterfly branching overhead, (c) re-encode inherently costs more than holding the matrix resident.
Input pruning (combined Sorensen-Burrus): exploits M = N / 4 zero-padding for another 2–4× flop reduction per NTT.
IRS-coeff witness reduction: the resident IRS-coeff Vec now sets the peak. Reducing it is long-term zkWHIR v3 work.
Blinding-poly encode: small codeword (~32 KB), still uses full encode (intentional).

Take linear_forms by value in prepare_and_sumcheck / prove_blinded_polynomial and drop it as soon as the combined covector has been built. Each Covector in linear_forms holds num_witnesses field elements; for R1CS circuits with 3 matrices (A, B, C) and millions of witnesses this is ~100 MB freed before the WHIR commit phase, where peak memory is hit.

Measured peak reduction on provekit (m=20 circuits):
- complete_age_check: 880 -> 805 MB (-8.5%)
- t_add_dsc_1850: 533 -> 497 MB (-6.8%)
- t_add_id_data_1850: 222 -> 203 MB (-8.6%)
- poseidon-rounds: 467 -> 467 MB (no change, small linear forms)

Protocol-equivalent. Transcript byte-identical. E2E prove+verify passes.
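The ownership pattern behind this commit can be sketched as follows. The names `combine_then_commit` and `heavy_phase`, and the use of `u64` with wrapping arithmetic in place of field elements, are assumptions for illustration and not the signatures in `whir_zk`.

```rust
/// Hypothetical stand-in for the prover step: `linear_forms` is taken by value,
/// so this function owns it and can free it before the memory-heavy phase.
fn combine_then_commit(linear_forms: Vec<Vec<u64>>, weights: &[u64]) -> u64 {
    let n = linear_forms.first().map_or(0, Vec::len);
    let mut combined = vec![0u64; n];
    for (form, &w) in linear_forms.iter().zip(weights) {
        for (acc, &c) in combined.iter_mut().zip(form) {
            *acc = acc.wrapping_add(w.wrapping_mul(c));
        }
    }
    // Only `combined` is needed from here on; release the inputs now
    // rather than at the end of the function.
    drop(linear_forms);
    heavy_phase(&combined)
}

/// Placeholder for the commit phase, where peak memory would be reached.
fn heavy_phase(covector: &[u64]) -> u64 {
    covector.iter().fold(0u64, |a, &c| a.wrapping_add(c))
}
```

Had the function borrowed `&[Vec<u64>]` instead, the caller's allocation would stay alive across `heavy_phase` no matter what the callee did.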

codspeed-hq · 2026-05-13T11:31:35Z

Merging this PR will degrade performance by 18.82%

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 4 regressed benchmarks
✅ 6 untouched benchmarks
⏩ 22 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
| --- | --- | --- | --- | --- | --- |
| ❌ | Simulation | `interleaved_rs_encode[(18, 4, 3)]` | 155.2 ms | 179.9 ms | -13.72% |
| ❌ | Simulation | `interleaved_rs_encode[(20, 4, 4)]` | 613.3 ms | 807.9 ms | -24.09% |
| ❌ | Simulation | `interleaved_rs_encode[(22, 4, 4)]` | 3.4 s | 4.1 s | -16.92% |
| ❌ | Simulation | `interleaved_rs_encode[(18, 2, 2)]` | 68.8 ms | 86.3 ms | -20.2% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing shreyas-londhe:perf/drop-linear-forms (c39ce01) with main (ec7aa32)²

¹ 22 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.
² No successful run was found on v1 (0a68627) during the generation of this report, so main (ec7aa32) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

The initial IRS commit witnesses (f_hat and blinding_poly) previously held their full Reed-Solomon encoded codewords resident from commit through the entire whir_zk::prove. The codeword is only consumed at open time (Merkle path generation + queried row extraction); the coefficients are smaller by the blowup factor (e.g. 4x at rate 1/4) and already retained for other protocol uses.

Drop matrix immediately after commit. Re-encode transiently around each open and drop again after. Three encodes per whir_zk::prove call: one for each of f_hat's two opens (ood_stir_and_rounds, gamma_check) and one for blinding_poly's open in prove_blinding_polynomial.

Measured on complete_age_check (m=20, N=5 interleaved):
- peak: 805 -> 706 MB (-99 MB / -12.3%)
- wall (median): 3500 -> 4220 ms (+20.6%, +720 ms)
- allocs: 3.56M -> 3.61M (+50k)

Combined with linear_forms drop (c183108) versus unoptimised v1:
- peak: 880 -> 706 MB (-174 MB / -19.8%)

Protocol-equivalent. Prove + verify roundtrip passes byte-identically. Re-encoded codeword matches the original since interleaved_rs_encode is deterministic.
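A minimal sketch of the residency trade described in this commit, under loud assumptions: the `Witness` struct, the `commit` / `open` functions, the naive evaluation-based `encode`, and the rolling-hash "digest" are all hypothetical stand-ins for the real IRS commit, Reed-Solomon encode and Merkle commitment. The point shown is only that the codeword is rebuilt from the retained coefficients at each open and is never stored in the witness.

```rust
// Toy modulus for illustration; the PR works over BN254 Fr.
const P: u64 = 65_537;

/// Toy Reed-Solomon encode: evaluate the polynomial at x = 0..codeword_len (Horner).
fn encode(coeffs: &[u64], codeword_len: usize) -> Vec<u64> {
    (0..codeword_len as u64)
        .map(|x| coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P))
        .collect()
}

/// Hypothetical witness that keeps only the coefficients resident.
struct Witness {
    coeffs: Vec<u64>,
}

/// Encodes once to derive a commitment, then lets the codeword go out of scope.
fn commit(coeffs: Vec<u64>, codeword_len: usize) -> (u64, Witness) {
    let codeword = encode(&coeffs, codeword_len);
    // Stand-in for a Merkle root over the codeword.
    let digest = codeword.iter().fold(0, |h, &v| (h * 31 + v) % P);
    (digest, Witness { coeffs })
}

/// Re-encodes transiently; the codeword is freed when this function returns.
fn open(witness: &Witness, codeword_len: usize, indices: &[usize]) -> Vec<u64> {
    let codeword = encode(&witness.coeffs, codeword_len);
    indices.iter().map(|&i| codeword[i]).collect()
}
```

Because `encode` is deterministic, every `open` sees the same codeword that `commit` hashed, which is the property the commit message relies on for transcript byte-identity. Note that `open` takes `&Witness`, since nothing in the witness is mutated.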

Replace full Reed-Solomon re-encode at the two [[f̂]] open sites (ood_stir_and_rounds, gamma_check) with an output-pruned NTT that materialises only the queried codeword rows. The full `(num_cols × codeword_length)` codeword matrix is never resident: peak memory at the IRS-coeff bottleneck drops by a factor of `codeword_length / in_domain_samples` (≈ 4000× at m=20, k=127), and the per-encode flop count drops from O(N log N) to O(N + k log N).

Algorithm: Sorensen-Burrus radix-2 DIT, walking the butterfly DAG backwards from the query set to mark only the cone of butterflies contributing to the requested outputs. Reuses the existing roots-of-unity cache.

Reference: Sorensen & Burrus, "Efficient computation of the DFT with only a subset of input or output points" (IEEE TSP 41, 1993). See doc comment on `NttEngine::ntt_partial`.

API additions:
- `NttEngine::ntt_partial` + `ntt_partial_with_plan_into`
- `PartialNttPlan` (per-(size,indices) pruning plan, reusable across batched NTTs that share the same query set)
- `ntt::partial_interleaved_rs_encode` (mirrors `interleaved_rs_encode` but emits only the rows at `indices`)
- `irs_commit::Config::{open_from_coeffs, open_at_indices_from_coeffs}` (functionally identical transcripts to `open`/`open_at_indices`; do not require `witness.matrix` to be populated)

The blinding-poly re-encode in `prove()` is left untouched (small codeword, negligible cost).

Tests:
- Randomised property tests vs full NTT across sizes 4..2^15, sparse and dense query subsets, zero-padded M<N inputs, and edge cases (empty, singletons, repeated indices, size=1).
- `partial_interleaved_rs_encode` byte-identity against `interleaved_rs_encode` + row extraction across four shapes spanning the regimes used in whir_zk (depth 1 vs 8, single vs multi-poly, rate-1/4 blowup).
- All 155 existing whir tests still pass; fixed the pre-existing `test_rejects_g_claim_forgery_via_rho` to mirror the production open path (re-encode blinding_poly_witness before `prove_blinding_polynomial`; use new partial-encode opens for f̂).

Each (poly_idx, slot_idx) NTT in the partial encode is independent. Switch to a batch-major intermediate (`(num_cols, k)`) populated via `par_chunks_exact_mut` and transpose to the row-major output. Brings the partial encode in line with the parallel batching the existing `ntt_batch` performs inside the full encode.
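The batch-major layout plus transpose can be sketched with the standard library alone. This is an assumption-laden illustration: `encode_rows` and its `column_job` callback are hypothetical, `u64` stands in for field elements, and `std::thread::scope` with one thread per column is used here in place of rayon's `par_chunks_exact_mut`, which schedules chunks on a thread pool.

```rust
use std::thread;

/// Fills a batch-major (num_cols × k) buffer, one independent job per column,
/// then transposes it into the row-major (k × num_cols) output.
fn encode_rows<F>(num_cols: usize, k: usize, column_job: F) -> Vec<u64>
where
    F: Fn(usize, &mut [u64]) + Sync,
{
    assert!(k > 0);
    let mut batch_major = vec![0u64; num_cols * k];
    thread::scope(|scope| {
        // Each chunk is a disjoint &mut slice, so the jobs cannot alias.
        for (col, chunk) in batch_major.chunks_exact_mut(k).enumerate() {
            let job = &column_job;
            scope.spawn(move || job(col, chunk));
        }
    });
    // Transpose: entry (col, row) of the intermediate becomes (row, col).
    let mut row_major = vec![0u64; k * num_cols];
    for col in 0..num_cols {
        for row in 0..k {
            row_major[row * num_cols + col] = batch_major[col * k + row];
        }
    }
    row_major
}
```

Writing each column's k outputs contiguously is what lets the buffer be split into disjoint mutable chunks. Writing directly into the row-major output would interleave the columns and make every job touch every row.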

Reviewer flagged that `f_hat_witness: &mut` in `ood_stir_and_rounds`, `gamma_check`, and `prove_blinded_polynomial` is no longer accurate: the partial-encode path never mutates the witness. Switch to `&` and drop the now-redundant `&*` reborrows. Misleading `&mut` could mask future bugs where the witness is unintentionally mutated.

Also applies the smaller clippy/fmt nits the reviewer surfaced:
- ntt_partial: allow(dead_code) (kept pub for external callers; the hot path uses ntt_partial_with_plan_into)
- PartialNttPlan::size: const fn
- ntt_partial_with_plan_into: allow(significant_drop_tightening); the roots-table RwLockReadGuard is intentionally held across all DIT stages, mirroring ntt_dispatch
- assertion comparison form: `> n` instead of `>= n + 1`
- cargo fmt

`cargo clippy -- -D warnings` is now clean; 155 lib tests still pass.

shreyas-londhe changed the title ~~perf(whir_zk): drop linear_forms after covector build~~ perf(whir_zk): reduce peak memory in zk prover (IRS coeff + linear_forms drop) May 15, 2026

shreyas-londhe added 3 commits May 15, 2026 14:28

shreyas-londhe changed the title ~~perf(whir_zk): reduce peak memory in zk prover (IRS coeff + linear_forms drop)~~ perf(whir_zk): IRS-coeff residency + output-pruned NTT for f̂ opens May 15, 2026

shreyas-londhe wants to merge 5 commits into worldfnd:v1 from shreyas-londhe:perf/drop-linear-forms
