perf(whir_zk): IRS-coeff residency + output-pruned NTT for f̂ opens #252
shreyas-londhe wants to merge 5 commits into
Take linear_forms by value in prepare_and_sumcheck / prove_blinded_polynomial and drop it as soon as the combined covector has been built. Each Covector in linear_forms holds num_witnesses field elements; for R1CS circuits with 3 matrices (A, B, C) and millions of witnesses this is ~100 MB freed before the WHIR commit phase, where peak memory is hit.

Measured peak reduction on provekit (m=20 circuits):
- complete_age_check: 880 -> 805 MB (-8.5%)
- t_add_dsc_1850: 533 -> 497 MB (-6.8%)
- t_add_id_data_1850: 222 -> 203 MB (-8.6%)
- poseidon-rounds: 467 -> 467 MB (no change, small linear forms)

Protocol-equivalent. Transcript byte-identical. E2E prove+verify passes.
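The ownership change described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Covector` is a hypothetical stand-in type, `combine_and_release` is not a function from this PR, and the toy modulus `P` replaces the real field.

```rust
const P: u64 = 65_537; // toy prime standing in for the real field modulus (assumption)

/// Hypothetical stand-in for a covector: one field element per witness.
struct Covector(Vec<u64>);

/// Takes `linear_forms` by value, so this function owns the buffers and can
/// free them as soon as the combined covector exists.
fn combine_and_release(linear_forms: Vec<Covector>, weights: &[u64]) -> Vec<u64> {
    assert_eq!(linear_forms.len(), weights.len());
    let n = linear_forms.first().map_or(0, |f| f.0.len());
    let mut combined = vec![0u64; n];
    for (form, &w) in linear_forms.iter().zip(weights) {
        for (acc, &x) in combined.iter_mut().zip(&form.0) {
            // Random linear combination: combined += w * form (mod P).
            *acc = (*acc + w % P * (x % P)) % P;
        }
    }
    // Explicit drop: the per-form buffers are released here rather than at
    // the end of the caller's scope, before any memory-heavy phase begins.
    drop(linear_forms);
    // ... a commit phase with its own large allocations would run here ...
    combined
}
```

With a `&[Covector]` parameter the caller would keep the buffers alive for the whole call; passing the `Vec` by value is what makes the early `drop` possible.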
Merging this PR will degrade performance by 18.82%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `interleaved_rs_encode[(18, 4, 3)]` | 155.2 ms | 179.9 ms | -13.72% |
| ❌ | Simulation | `interleaved_rs_encode[(20, 4, 4)]` | 613.3 ms | 807.9 ms | -24.09% |
| ❌ | Simulation | `interleaved_rs_encode[(22, 4, 4)]` | 3.4 s | 4.1 s | -16.92% |
| ❌ | Simulation | `interleaved_rs_encode[(18, 2, 2)]` | 68.8 ms | 86.3 ms | -20.2% |
Comparing shreyas-londhe:perf/drop-linear-forms (c39ce01) with main (ec7aa32) [2]

Footnotes
1. 22 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
2. No successful run was found on v1 (0a68627) during the generation of this report, so main (ec7aa32) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
The initial IRS commit witnesses (f_hat and blinding_poly) previously held their full Reed-Solomon encoded codewords resident from commit through the entire whir_zk::prove. The codeword is only consumed at open time (Merkle path generation + queried row extraction); the coefficients are smaller by the blowup factor (e.g. 4x at rate 1/4) and already retained for other protocol uses.

Drop matrix immediately after commit. Re-encode transiently around each open and drop again after. Three encodes per whir_zk::prove call: one for each of f_hat's two opens (ood_stir_and_rounds, gamma_check) and one for blinding_poly's open in prove_blinding_polynomial.

Measured on complete_age_check (m=20, N=5 interleaved):
- peak: 805 -> 706 MB (-99 MB / -12.3%)
- wall (median): 3500 -> 4220 ms (+20.6%, +720 ms)
- allocs: 3.56M -> 3.61M (+50k)

Combined with linear_forms drop (c183108) versus unoptimised v1:
- peak: 880 -> 706 MB (-174 MB / -19.8%)

Protocol-equivalent. Prove + verify roundtrip passes byte-identically. Re-encoded codeword matches the original since interleaved_rs_encode is deterministic.
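A minimal sketch of the residency trade described above, under stated assumptions: `encode`, `commit`, `open`, and `Witness` are hypothetical names, the encoder is a toy Horner evaluation over a small prime field, and the digest is a placeholder rather than a Merkle root. It only shows the shape of the idea: keep coefficients, rebuild the codeword for each open.

```rust
const P: u64 = 65_537; // toy prime field modulus (assumption)

/// Toy Reed-Solomon encode: evaluate the polynomial at x = 0, 1, 2, ...
/// The codeword is `blowup` times longer than the coefficient vector.
fn encode(coeffs: &[u64], blowup: usize) -> Vec<u64> {
    (0..(coeffs.len() * blowup) as u64)
        .map(|x| coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P))
        .collect()
}

/// The witness keeps only coefficients; no codeword field exists at all.
struct Witness {
    coeffs: Vec<u64>,
    blowup: usize,
    digest: u64,
}

fn commit(coeffs: Vec<u64>, blowup: usize) -> Witness {
    let codeword = encode(&coeffs, blowup);
    // Placeholder digest standing in for a real commitment to the codeword.
    let digest = codeword.iter().fold(0, |h, &v| (h * 31 + v) % P);
    Witness { coeffs, blowup, digest } // `codeword` is freed when this returns
}

fn open(witness: &Witness, indices: &[usize]) -> Vec<u64> {
    // Transient re-encode: deterministic, so it reproduces the committed
    // codeword exactly, and it is freed again when this function returns.
    let codeword = encode(&witness.coeffs, witness.blowup);
    indices.iter().map(|&i| codeword[i]).collect()
}
```

Each `open` pays one full encode, which is the wall-time cost the measurements above report in exchange for the lower peak.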
Replace full Reed-Solomon re-encode at the two [[f̂]] open sites
(ood_stir_and_rounds, gamma_check) with an output-pruned NTT that
materialises only the queried codeword rows. The full
`(num_cols × codeword_length)` codeword matrix is never resident:
peak memory at the IRS-coeff bottleneck drops by a factor of
`codeword_length / in_domain_samples` (≈ 4000× at m=20, k=127), and
the per-encode flop count drops from O(N log N) to O(N + k log N).
Algorithm: Sorensen-Burrus radix-2 DIT, walking the butterfly DAG
backwards from the query set to mark only the cone of butterflies
contributing to the requested outputs. Reuses the existing roots-of-
unity cache. Reference: Sorensen & Burrus, "Efficient computation of
the DFT with only a subset of input or output points" (IEEE TSP 41,
1993). See doc comment on `NttEngine::ntt_partial`.
API additions:
- `NttEngine::ntt_partial` + `ntt_partial_with_plan_into`
- `PartialNttPlan` (per-(size,indices) pruning plan, reusable
across batched NTTs that share the same query set)
- `ntt::partial_interleaved_rs_encode` (mirrors
`interleaved_rs_encode` but emits only the rows at `indices`)
- `irs_commit::Config::{open_from_coeffs, open_at_indices_from_coeffs}`
(functionally identical transcripts to `open`/`open_at_indices`;
do not require `witness.matrix` to be populated)
The blinding-poly re-encode in `prove()` is left untouched (small
codeword, negligible cost).
Tests:
- Randomised property tests vs full NTT across sizes 4..2^15,
sparse and dense query subsets, zero-padded M<N inputs, and
edge cases (empty, singletons, repeated indices, size=1).
- `partial_interleaved_rs_encode` byte-identity against
`interleaved_rs_encode` + row extraction across four shapes
spanning the regimes used in whir_zk (depth 1 vs 8, single vs
multi-poly, rate-1/4 blowup).
- All 155 existing whir tests still pass; fixed the pre-existing
`test_rejects_g_claim_forgery_via_rho` to mirror the production
open path (re-encode blinding_poly_witness before
`prove_blinding_polynomial`; use new partial-encode opens for
f̂).
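The backward marking and pruned butterfly pass described in this commit message can be sketched as follows. This is an illustrative toy, not the code in `NttEngine::ntt_partial`: the field is the small prime 65537 with generator 3, and `PrunedPlan`, `roots_table`, and `ntt_pruned` are hypothetical names chosen for the example.

```rust
const P: u64 = 65_537; // toy NTT-friendly prime, 2^16 + 1 (assumption)

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// roots[i] = w^i for a primitive `order`-th root of unity w (order | 2^16).
fn roots_table(order: usize) -> Vec<u64> {
    let w = pow_mod(3, 65_536 / order as u64); // 3 generates the group mod P
    let mut roots = vec![1u64; order];
    for i in 1..order { roots[i] = roots[i - 1] * w % P; }
    roots
}

/// Per-(size, indices) plan: for each DIT stage, the upper index `a` of
/// every butterfly in the cone of the queried outputs.
struct PrunedPlan { n: usize, indices: Vec<usize>, stages: Vec<Vec<usize>> }

impl PrunedPlan {
    fn new(n: usize, indices: &[usize]) -> Self {
        assert!(n.is_power_of_two());
        let log_n = n.trailing_zeros() as usize;
        let mut live = vec![false; n];
        for &i in indices { assert!(i < n); live[i] = true; }
        let mut stages = vec![Vec::new(); log_n];
        // Walk from the last stage back to the first.
        for s in (0..log_n).rev() {
            let half = 1usize << s;
            let mut next = vec![false; n];
            for i in (0..n).filter(|&i| live[i]) {
                let a = i & !half; // upper leg of the butterfly touching i
                if !next[a] { stages[s].push(a); }
                next[a] = true;        // both inputs become live
                next[a | half] = true; // at the previous stage
            }
            live = next;
        }
        Self { n, indices: indices.to_vec(), stages }
    }
}

/// Evaluates only the planned outputs; `coeffs` may be shorter than `n`
/// (implicitly zero-padded) and `roots` may have order larger than `n`.
fn ntt_pruned(coeffs: &[u64], plan: &PrunedPlan, roots: &[u64]) -> Vec<u64> {
    let n = plan.n;
    assert!(coeffs.len() <= n && roots.len() % n == 0);
    let bits = n.trailing_zeros();
    let mut x = vec![0u64; n];
    for (k, &c) in coeffs.iter().enumerate() {
        // Bit-reversal permutation of the input, as in an in-place DIT.
        let r = if bits == 0 { 0 } else { k.reverse_bits() >> (usize::BITS - bits) };
        x[r] = c % P;
    }
    for (s, butterflies) in plan.stages.iter().enumerate() {
        let half = 1usize << s;
        let m = half << 1;
        for &a in butterflies {
            let j = a & (half - 1); // position inside the size-m block
            // w_m^j read from a table of possibly larger order.
            let t = roots[j * (roots.len() / m)] * x[a + half] % P;
            let u = x[a];
            x[a] = (u + t) % P;
            x[a + half] = (u + P - t) % P;
        }
    }
    plan.indices.iter().map(|&i| x[i]).collect()
}
```

For a single queried index at size 16 the plan holds 1 + 2 + 4 + 8 = 15 butterflies against 32 for the full transform. Building the plan in this sketch scans a length-`n` boolean mask per stage, so it is meant to be built once and reused across every NTT that shares the query set.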
Each (poly_idx, slot_idx) NTT in the partial encode is independent. Switch to a batch-major intermediate (`(num_cols, k)`) populated via `par_chunks_exact_mut` and transpose to the row-major output. Brings the partial encode in line with the parallel batching the existing `ntt_batch` performs inside the full encode.
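The batch-major layout and transpose can be sketched with the standard library alone. This is an assumption-laden illustration: `eval_column` is a hypothetical per-column kernel (toy Horner evaluation instead of a pruned NTT), and scoped threads stand in for the rayon `par_chunks_exact_mut` call the commit names.

```rust
use std::thread;

const P: u64 = 65_537; // toy prime field modulus (assumption)

/// Hypothetical per-column kernel: evaluate one coefficient column at the
/// `k` queried points, writing into that column's own output slice.
fn eval_column(coeffs: &[u64], points: &[u64], out: &mut [u64]) {
    for (o, &x) in out.iter_mut().zip(points) {
        *o = coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P);
    }
}

/// Returns a row-major `(k, num_cols)` matrix of the queried rows.
fn partial_encode(columns: &[Vec<u64>], points: &[u64]) -> Vec<u64> {
    let (num_cols, k) = (columns.len(), points.len());
    if num_cols == 0 || k == 0 {
        return Vec::new();
    }
    // Batch-major intermediate: column c owns the contiguous slice
    // [c * k, (c + 1) * k), so disjoint chunks can be filled in parallel.
    let mut batch_major = vec![0u64; num_cols * k];
    thread::scope(|scope| {
        for (col, out) in columns.iter().zip(batch_major.chunks_exact_mut(k)) {
            scope.spawn(move || eval_column(col, points, out));
        }
    });
    // Transpose (num_cols, k) -> (k, num_cols) for the row-major output.
    let mut rows = vec![0u64; k * num_cols];
    for c in 0..num_cols {
        for r in 0..k {
            rows[r * num_cols + c] = batch_major[c * k + r];
        }
    }
    rows
}
```

Spawning one thread per column is only for demonstration; a work-stealing pool over the same disjoint chunks avoids that overhead. The design point is that writing into a row-major buffer directly would interleave the columns' outputs, which rules out handing each worker a contiguous mutable slice.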
Reviewer flagged that `f_hat_witness: &mut` in `ood_stir_and_rounds`, `gamma_check`, and `prove_blinded_polynomial` is no longer accurate: the partial-encode path never mutates the witness. Switch to `&` and drop the now-redundant `&*` reborrows. Misleading `&mut` could mask future bugs where the witness is unintentionally mutated.

Also applies the smaller clippy/fmt nits the reviewer surfaced:
- ntt_partial: allow(dead_code) (kept pub for external callers; the hot path uses ntt_partial_with_plan_into)
- PartialNttPlan::size: const fn
- ntt_partial_with_plan_into: allow(significant_drop_tightening); the roots-table RwLockReadGuard is intentionally held across all DIT stages, mirroring ntt_dispatch
- assertion comparison form: `> n` instead of `>= n + 1`
- cargo fmt

`cargo clippy -- -D warnings` is now clean; 155 lib tests still pass.
Memory + wall optimisations for `whir_zk::prove_blinded_polynomial`

Five cumulative commits on top of upstream/v1. Net trade: −204 MB peak for +220 ms wall vs upstream/v1. Memory savings are the headline; wall is a small regression vs v1 but recovers ~500 ms of the regression that the IRS-coeff change alone introduced.

Stack
- `c183108`: Drop `linear_forms` after covector build.
- `fc3e614`: `f_hat_witness.matrix` is the dominant resident allocation between commit and the two f̂ opens. Drop it after commit; re-encode just for each open. Costs +720 ms wall (the codeword is re-encoded twice, once per [[f̂]] open).
- `97fea7c`: Replace the full re-encode (`O(N log N)` flops) with an output-pruned NTT (`O(N + k log N)` flops) that materialises only the queried codeword rows. Recovers ~500 ms of the `fc3e614` wall regression.
- `d4ad1fb`: `partial_interleaved_rs_encode` batches … `ntt_batch`. Restore batch-axis rayon via batch-major intermediate + transpose.
- `c39ce01`: `f_hat_witness: &mut` is no longer accurate (never mutated).

Measurements (complete_age_check, m=20, BN254 Fr, 8 threads, 30 s cooldown, 3 iters, σ ≈ 1%)
Same machine, same provekit (`db8ff0fd`), low-noise methodology (cooldown between iters, no other load, single cached binary per ref). Refs compared: upstream/v1 (`0a68627`) and this PR (`c39ce01`).

Peak and allocs are deterministic. Wall σ on cooled-down runs is ~1%, so the +8.3% wall delta is real, not noise.
Intermediate stage (for context)
The IRS-coeff change alone (`fc3e614`) was measured by the original author at +720 ms wall for −99 MB peak. The pruned-NTT path (`97fea7c` + `d4ad1fb`) recovers ~500 ms of that wall regression, bringing the cumulative wall cost from +720 ms down to +220 ms while preserving (and slightly extending) the peak savings.

Thread scaling (HEAD binary, complete_age_check)
Peak is identical at all thread counts: the 676 MB is set by the IRS-coeff resident witness, not by encode work buffers. Wall scales 1.0× → 3.6× from 1 to 8 cores (parallel fraction ≈ 83% by Amdahl's law: p = (1 − 1/3.6) / (1 − 1/8) ≈ 0.83).
When this PR is a win
- `complete_age_check` on machines that previously OOM'd, at an 8% wall cost.

When it is not
Algorithm (commit `97fea7c`)

Output-pruned radix-2 DIT NTT (Sorensen & Burrus 1993; Skinner 1976). For query set `I` of size `k` out of `N`, walks the butterfly DAG backwards from `I` to mark only the cone of butterflies contributing to the queried outputs. Field-op count drops from `O(N log N)` per NTT (full) to `O(N + k log N)` per NTT (pruned). Mask is built once per `(N, I)` via `PartialNttPlan` and reused across all `num_polys × interleaving_depth` NTTs of an interleaved IRS open.

Marking rule: at radix-2 DIT stage `s`, the butterfly at `(a=base+k, b=base+k+half)` has both inputs and both outputs at `(a, b)`. If either output is in the live set at stage `s`, both inputs are marked live at stage `s − 1`. Standard contrapositive of "`Y[i]` depends on `a[k]` iff `k ∈ cone(i)`" for radix-2 DIT.

Twiddle-factor indexing: `roots[k * (roots.len() / m)]` retrieves ω_m^k correctly even when the shared roots table is at an order larger than `m`, the same pattern as the existing `apply_twiddles`.

References
Doc comments on `NttEngine::ntt_partial` and `PartialNttPlan` carry the full reference.

Verification
- `test_ntt_partial_matches_full`: random subsets across sizes 4..2^15 vs full NTT.
- `test_ntt_partial_zero_padded_input`: M < N case.
- `test_partial_interleaved_rs_encode_matches_full`: byte-identity vs full encode across four IRS shapes.
- `test_rejects_g_claim_forgery_via_rho` (which was already broken on `fc3e614` from a different upstream change) is fixed and still rejects the forged eval.
- `cargo clippy -- -D warnings` clean.
- `open_inner_from_coeffs` emits the same `prover_hint_ark(&submatrix)` + `matrix_commit.open` sequence as the original `open_inner`. No transcript divergence.

Open follow-ups (out of scope)
- … `PartialNttPlan::new` (~10 MB bools for N=2^19), (b) pruned-NTT per-butterfly branching overhead, (c) re-encode inherently costs more than holding the matrix resident.
- … `M = N / 4` zero-padding for another 2–4× flop reduction per NTT.
- … `Vec` now sets the peak. Reducing it is long-term zkWHIR v3 work.