
perf(whir_zk): IRS-coeff residency + output-pruned NTT for f̂ opens #252

Open
shreyas-londhe wants to merge 5 commits into worldfnd:v1 from shreyas-londhe:perf/drop-linear-forms

shreyas-londhe commented on May 13, 2026

Memory + wall optimisations for whir_zk::prove_blinded_polynomial

Five cumulative commits on top of upstream/v1. Net trade versus upstream/v1: −204 MB peak (−23.2%) for +220 ms wall (+8.3%). Memory savings are the headline. Wall time still regresses against v1, but the pruned-NTT commits recover ~500 ms of the +720 ms regression that the IRS-coeff change alone introduced.

Stack

| Commit | What | Why |
|---|---|---|
| c183108 | Drop linear_forms after covector build | Linear forms only needed to produce the covector; holding them past that point wastes peak. |
| fc3e614 | Hold IRS coefficients, re-encode codeword on demand | f_hat_witness.matrix is the dominant resident allocation between commit and the two f̂ opens. Drop it after commit; re-encode just for each open. Costs +720 ms wall (the codeword is re-encoded twice, once per [[f̂]] open). |
| 97fea7c | Output-pruned NTT for f̂ opens | Replaces the transient full re-encode (O(N log N) flops) with an output-pruned NTT (O(N + k log N) flops) that materialises only the queried codeword rows. Recovers ~500 ms of the fc3e614 wall regression. |
| d4ad1fb | Parallelise partial_interleaved_rs_encode batches | Sequential pruned NTT lost to parallel ntt_batch. Restore batch-axis rayon via batch-major intermediate + transpose. |
| c39ce01 | Clippy + signature cleanup | f_hat_witness: &mut is no longer accurate (never mutated). |

Measurements (complete_age_check, m=20, BN254 Fr, 8 threads, 30 s cooldown, 3 iters, σ ≈ 1%)

Same machine, same provekit (db8ff0fd), low-noise methodology (cooldown between iters, no other load, single cached binary per ref):

| Metric | upstream/v1 (0a68627) | HEAD (c39ce01) | Δ |
|---|---|---|---|
| Wall (min/med, ms) | 2630 / 2650 | 2860 / 2870 | +220 ms / +8.3% |
| Peak (MB) | 880.0 | 676.0 | −204 / −23.2% |
| Allocs | 3.56M | 3.56M | 0% |

Peak and allocs are deterministic. Wall σ on cooled-down runs is ~1%, so the +8.3% wall delta is real, not noise.

Intermediate stage (for context)

The IRS-coeff change alone (fc3e614) was measured by the original author at +720 ms wall for −99 MB peak. The pruned-NTT path (97fea7c + d4ad1fb) recovers ~500 ms of that wall regression — bringing the cumulative wall cost from +720 ms down to +220 ms while preserving (and slightly extending) the peak savings.

Thread scaling (HEAD binary, complete_age_check)

| Threads | Wall (min/med ms) | Peak (MB) |
|---|---|---|
| 1 | 10300 / 10400 | 676 |
| 2 | 6120 / 6140 | 676 |
| 4 | 3850 / 3860 | 676 |
| 8 | 2860 / 2870 | 676 |

Peak is identical at all thread counts: the 676 MB is set by the IRS-coeff resident witness, not by encode work buffers. Wall scales 1.0× → 3.6× from 1 to 8 cores (parallel fraction ≈ 83% by Amdahl, from p = (1 − 1/S) / (1 − 1/n) with S ≈ 3.6 and n = 8).
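As a sanity check on the scaling figure, Amdahl's law S = 1 / ((1 − p) + p / n) can be inverted for the parallel fraction p. The sketch below is a minimal illustration using the median wall times from the table; the function name is hypothetical and not part of the PR. With these inputs it evaluates to roughly 83%.

```rust
/// Invert Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p.
/// (Hypothetical helper, not part of the PR.)
fn parallel_fraction(speedup: f64, threads: f64) -> f64 {
    (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads)
}

fn main() {
    // Median wall times (ms) at 1 and 8 threads, taken from the table above.
    let (t1, t8) = (10400.0_f64, 2870.0_f64);
    let speedup = t1 / t8;
    let p = parallel_fraction(speedup, 8.0);
    println!("speedup = {speedup:.2}x, parallel fraction = {:.1}%", 100.0 * p);
}
```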

When this PR is a win

  • OOM-bound deployments: 880 → 676 MB unlocks complete_age_check on machines that previously OOM'd, at an 8% wall cost.
  • Provers that batch many proofs: the −204 MB peak compounds — more concurrent provers per machine.

When it is not

  • Single-prove, latency-critical, no memory pressure: the +220 ms wall is a real regression with no return.

Algorithm (commit 97fea7c)

Output-pruned radix-2 DIT NTT (Sorensen & Burrus 1993; Skinner 1976). For query set I of size k out of N, walks the butterfly DAG backwards from I to mark only the cone of butterflies contributing to the queried outputs. Field-op count drops from O(N log N) per NTT (full) to O(N + k log N) per NTT (pruned). Mask is built once per (N, I) via PartialNttPlan and reused across all num_polys × interleaving_depth NTTs of an interleaved IRS open.

Marking rule: at radix-2 DIT stage s, the butterfly at (a=base+k, b=base+k+half) has both inputs and both outputs at (a, b). If either output is in the live set at stage s, both inputs are marked live at stage s − 1. Standard contrapositive of "Y[i] depends on a[k] iff k ∈ cone(i)" for radix-2 DIT.
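The marking rule can be illustrated with a toy implementation over the small prime field F_65537. This is a sketch under stated assumptions, not the PR's code: `ToyPlan`, `ntt_partial`, and `dft_at` here are hypothetical stand-ins, the field is not BN254 Fr, and every butterfly is still visited for the liveness check, so only the field-operation count is pruned.

```rust
const P: u64 = 65_537; // Fermat prime 2^16 + 1; 3 generates its multiplicative group

fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

fn bit_rev(i: usize, bits: u32) -> usize {
    if bits == 0 { 0 } else { i.reverse_bits() >> (usize::BITS - bits) }
}

/// Toy pruning plan (hypothetical, not the PR's `PartialNttPlan`).
/// `live[s][i]` is true when position `i` must hold a correct value after
/// stage `s`; stage 0 is the bit-reversed input.
struct ToyPlan {
    n: usize,
    indices: Vec<usize>,
    live: Vec<Vec<bool>>,
}

impl ToyPlan {
    fn new(n: usize, indices: &[usize]) -> Self {
        assert!(n.is_power_of_two() && indices.iter().all(|&i| i < n));
        let log_n = n.trailing_zeros() as usize;
        let mut live = vec![vec![false; n]; log_n + 1];
        for &i in indices {
            live[log_n][i] = true;
        }
        // Walk the butterfly DAG backwards: a live output marks both inputs.
        for s in (1..=log_n).rev() {
            let half = 1usize << (s - 1);
            for base in (0..n).step_by(2 * half) {
                for k in 0..half {
                    let (a, b) = (base + k, base + k + half);
                    if live[s][a] || live[s][b] {
                        live[s - 1][a] = true;
                        live[s - 1][b] = true;
                    }
                }
            }
        }
        ToyPlan { n, indices: indices.to_vec(), live }
    }
}

/// Evaluate only the planned outputs of the size-n NTT of `coeffs`.
fn ntt_partial(coeffs: &[u64], plan: &ToyPlan) -> Vec<u64> {
    let n = plan.n;
    assert_eq!(coeffs.len(), n);
    let log_n = n.trailing_zeros();
    let omega = pow_mod(3, (P - 1) / n as u64); // primitive n-th root of unity
    let roots: Vec<u64> = (0..n as u64).map(|i| pow_mod(omega, i)).collect();
    let mut x: Vec<u64> = (0..n).map(|i| coeffs[bit_rev(i, log_n)] % P).collect();
    for s in 1..=log_n as usize {
        let (half, m) = (1usize << (s - 1), 1usize << s);
        for base in (0..n).step_by(m) {
            for k in 0..half {
                let (a, b) = (base + k, base + k + half);
                // Skip butterflies outside the cone of the queried outputs.
                if plan.live[s][a] || plan.live[s][b] {
                    let t = roots[k * (n / m)] * x[b] % P;
                    let u = x[a];
                    x[a] = (u + t) % P;
                    x[b] = (u + P - t) % P;
                }
            }
        }
    }
    plan.indices.iter().map(|&i| x[i]).collect()
}

/// Reference: a single output of the naive DFT sum.
fn dft_at(coeffs: &[u64], i: usize) -> u64 {
    let omega = pow_mod(3, (P - 1) / coeffs.len() as u64);
    coeffs.iter().enumerate().fold(0, |acc, (j, &c)| {
        (acc + c % P * pow_mod(omega, (i * j) as u64)) % P
    })
}

fn main() {
    let coeffs: Vec<u64> = (0..16).map(|j| j * j + 1).collect();
    let queries = [3usize, 10];
    let plan = ToyPlan::new(16, &queries);
    let got = ntt_partial(&coeffs, &plan);
    for (value, &i) in got.iter().zip(&queries) {
        assert_eq!(*value, dft_at(&coeffs, i));
    }
    println!("queried outputs: {got:?}");
}
```

Skipped butterflies leave stale values in place; that is safe because a later stage only reads positions that the backward walk marked live.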

Twiddle-factor indexing: roots[k * (roots.len() / m)] retrieves ω_m^k correctly even when the shared roots table is at an order larger than m — same pattern as the existing apply_twiddles.
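A small worked instance of that stride rule, assuming a toy field F_17 in which 3 is a primitive 16th root of unity. The helper names are illustrative and do not come from the whir codebase.

```rust
const P: u64 = 17; // toy field; 3 has multiplicative order 16 mod 17

/// Powers omega^0 .. omega^(order-1) of a primitive `order`-th root of unity.
fn roots_table(omega: u64, order: usize) -> Vec<u64> {
    let mut roots = Vec::with_capacity(order);
    let mut acc = 1;
    for _ in 0..order {
        roots.push(acc);
        acc = acc * omega % P;
    }
    roots
}

/// Look up ω_m^k in a shared table whose order is a multiple of m.
fn twiddle(roots: &[u64], m: usize, k: usize) -> u64 {
    assert!(roots.len() % m == 0 && k < m);
    roots[k * (roots.len() / m)]
}

fn main() {
    let roots = roots_table(3, 16); // table at order 16
    // ω_4 = ω_16^4, so the stride into the table is 16 / 4 = 4.
    let w4: Vec<u64> = (0..4).map(|k| twiddle(&roots, 4, k)).collect();
    println!("powers of ω_4: {w4:?}");
}
```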

References

  • Sorensen & Burrus, "Efficient computation of the DFT with only a subset of input or output points," IEEE TSP 41(3), 1993.
  • Skinner, "Pruning the decimation in-time FFT algorithm," IEEE ASSP-24(2), 1976.

Doc comments on NttEngine::ntt_partial and PartialNttPlan carry the full reference.

Verification

  • 155 whir lib tests pass, including 3 new property tests:
    • test_ntt_partial_matches_full — random subsets across sizes 4..2^15 vs full NTT.
    • test_ntt_partial_zero_padded_input — M < N case.
    • test_partial_interleaved_rs_encode_matches_full — byte-identity vs full encode across four IRS shapes.
  • 30 whir_zk tests pass. test_rejects_g_claim_forgery_via_rho (which was already broken on fc3e614 from a different upstream change) is fixed and still rejects the forged eval.
  • cargo clippy -- -D warnings clean.
  • Transcript byte-identity audited: open_inner_from_coeffs emits the same prover_hint_ark(&submatrix) + matrix_commit.open sequence as the original open_inner. No transcript divergence.
  • Independent correctness review confirmed the marking rule, twiddle indexing, bit-reversal, edge cases, and layout match.

Open follow-ups (out of scope)

  • Close the +220 ms gap: candidates are (a) per-call mask allocation in PartialNttPlan::new (~10 MB bools for N=2^19), (b) pruned-NTT per-butterfly branching overhead, (c) re-encode inherently costs more than holding the matrix resident.
  • Input pruning (combined Sorensen-Burrus): exploits M = N / 4 zero-padding for another 2–4× flop reduction per NTT.
  • IRS-coeff witness reduction: the resident IRS-coeff Vec now sets the peak. Reducing it is long-term zkWHIR v3 work.
  • Blinding-poly encode: small codeword (~32 KB), still uses full encode (intentional).

Take linear_forms by value in prepare_and_sumcheck / prove_blinded_polynomial
and drop it as soon as the combined covector has been built. Each Covector
in linear_forms holds num_witnesses field elements; for R1CS circuits with
3 matrices (A, B, C) and millions of witnesses this is ~100 MB freed
before the WHIR commit phase, where peak memory is hit.

Measured peak reduction on provekit (m=20 circuits):
- complete_age_check: 880 -> 805 MB (-8.5%)
- t_add_dsc_1850:     533 -> 497 MB (-6.8%)
- t_add_id_data_1850: 222 -> 203 MB (-8.6%)
- poseidon-rounds:    467 -> 467 MB (no change, small linear forms)

Protocol-equivalent. Transcript byte-identical. E2E prove+verify passes.
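The ownership pattern described in this commit message can be sketched as follows. Everything here is a hypothetical stand-in: `Covector`, `combine`, `commit_phase`, and `prove` are illustrative names, and wrapping `u64` arithmetic stands in for field operations. The point is only that taking the forms by value lets the caller's allocation be released before the memory-heavy phase begins.

```rust
/// Hypothetical stand-in for a covector; not the whir type.
struct Covector {
    coeffs: Vec<u64>,
}

/// Fold the forms into one combined covector (assumed weighted sum).
fn combine(forms: &[Covector], weights: &[u64]) -> Vec<u64> {
    let mut combined = vec![0u64; forms[0].coeffs.len()];
    for (form, &w) in forms.iter().zip(weights) {
        for (acc, &c) in combined.iter_mut().zip(&form.coeffs) {
            *acc = (*acc).wrapping_add(w.wrapping_mul(c));
        }
    }
    combined
}

/// Placeholder for the memory-heavy commit phase.
fn commit_phase(combined: &[u64]) -> u64 {
    combined.iter().fold(0, |a, &b| a.wrapping_add(b))
}

/// Takes `linear_forms` by value so it can be freed mid-function.
fn prove(linear_forms: Vec<Covector>, weights: &[u64]) -> u64 {
    let combined = combine(&linear_forms, weights);
    drop(linear_forms); // released before the peak, not at end of scope
    commit_phase(&combined)
}

fn main() {
    let forms = vec![
        Covector { coeffs: vec![1, 2] },
        Covector { coeffs: vec![3, 4] },
    ];
    println!("result = {}", prove(forms, &[10, 100]));
}
```

With a `&[Covector]` parameter the caller would keep the allocation alive for the whole call, which is why the signature change matters for peak memory.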

codspeed-hq Bot commented May 13, 2026

Merging this PR will degrade performance by 18.82%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 4 regressed benchmarks
✅ 6 untouched benchmarks
⏩ 22 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | interleaved_rs_encode[(18, 4, 3)] | 155.2 ms | 179.9 ms | -13.72% |
| Simulation | interleaved_rs_encode[(20, 4, 4)] | 613.3 ms | 807.9 ms | -24.09% |
| Simulation | interleaved_rs_encode[(22, 4, 4)] | 3.4 s | 4.1 s | -16.92% |
| Simulation | interleaved_rs_encode[(18, 2, 2)] | 68.8 ms | 86.3 ms | -20.2% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing shreyas-londhe:perf/drop-linear-forms (c39ce01) with main (ec7aa32) [2]

Footnotes

  1. 22 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on v1 (0a68627) during the generation of this report, so main (ec7aa32) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

The initial IRS commit witnesses (f_hat and blinding_poly) previously held
their full Reed-Solomon encoded codewords resident from commit through the
entire whir_zk::prove. The codeword is only consumed at open time (Merkle
path generation + queried row extraction); the coefficients are smaller by
the blowup factor (e.g. 4x at rate 1/4) and already retained for other
protocol uses.

Drop matrix immediately after commit. Re-encode transiently around each
open and drop again after. Three encodes per whir_zk::prove call: one
for each of f_hat's two opens (ood_stir_and_rounds, gamma_check) and one
for blinding_poly's open in prove_blinding_polynomial.
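The commit-then-drop, re-encode-per-open lifecycle can be sketched with a toy encoder. This is an illustration under assumptions, not the whir API: `encode` is a plain Horner evaluation over F_65537 rather than an interleaved Reed-Solomon encode, `digest` stands in for the Merkle commitment, and `Witness`, `commit`, and `open` are hypothetical names.

```rust
const P: u64 = 65_537;

/// Toy deterministic encode: evaluate the polynomial at 0..len*blowup (Horner).
fn encode(coeffs: &[u64], blowup: usize) -> Vec<u64> {
    (0..(coeffs.len() * blowup) as u64)
        .map(|x| coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P))
        .collect()
}

/// Stand-in for a Merkle root over the codeword.
fn digest(codeword: &[u64]) -> u64 {
    codeword.iter().fold(0, |h, &v| (h * 31 + v) % P)
}

/// Only the coefficients stay resident between commit and open.
struct Witness {
    coeffs: Vec<u64>,
    blowup: usize,
}

fn commit(coeffs: Vec<u64>, blowup: usize) -> (Witness, u64) {
    let codeword = encode(&coeffs, blowup);
    let root = digest(&codeword);
    (Witness { coeffs, blowup }, root)
    // `codeword` is dropped here instead of being stored in the witness
}

fn open(witness: &Witness, indices: &[usize]) -> Vec<u64> {
    // Transient re-encode; deterministic, so it matches the committed codeword.
    let codeword = encode(&witness.coeffs, witness.blowup);
    indices.iter().map(|&i| codeword[i]).collect()
}

fn main() {
    let (witness, root) = commit(vec![1, 2, 3], 4); // 1 + 2x + 3x^2, rate 1/4
    let rows = open(&witness, &[0, 5]);
    println!("root = {root}, opened rows = {rows:?}");
}
```

The trade is visible in the shape of the code: resident state shrinks by the blowup factor, and each `open` pays for one extra encode.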

Measured on complete_age_check (m=20, N=5 interleaved):
- peak: 805 -> 706 MB (-99 MB / -12.3%)
- wall (median): 3500 -> 4220 ms (+20.6%, +720 ms)
- allocs: 3.56M -> 3.61M (+50k)

Combined with linear_forms drop (c183108) versus unoptimised v1:
- peak: 880 -> 706 MB (-174 MB / -19.8%)

Protocol-equivalent. Prove + verify roundtrip passes byte-identically.
Re-encoded codeword matches the original since interleaved_rs_encode is
deterministic.

shreyas-londhe changed the title from "perf(whir_zk): drop linear_forms after covector build" to "perf(whir_zk): reduce peak memory in zk prover (IRS coeff + linear_forms drop)" on May 15, 2026

Replace full Reed-Solomon re-encode at the two [[f̂]] open sites
(ood_stir_and_rounds, gamma_check) with an output-pruned NTT that
materialises only the queried codeword rows. The full
`(num_cols × codeword_length)` codeword matrix is never resident:
the transient encode output at each open shrinks by a factor of
`codeword_length / in_domain_samples` (≈ 4000× at m=20, k=127), and
the per-encode flop count drops from O(N log N) to O(N + k log N).

Algorithm: Sorensen-Burrus radix-2 DIT, walking the butterfly DAG
backwards from the query set to mark only the cone of butterflies
contributing to the requested outputs. Reuses the existing roots-of-
unity cache. Reference: Sorensen & Burrus, "Efficient computation of
the DFT with only a subset of input or output points" (IEEE TSP 41,
1993). See doc comment on `NttEngine::ntt_partial`.

API additions:
  - `NttEngine::ntt_partial` + `ntt_partial_with_plan_into`
  - `PartialNttPlan` (per-(size,indices) pruning plan, reusable
    across batched NTTs that share the same query set)
  - `ntt::partial_interleaved_rs_encode` (mirrors
    `interleaved_rs_encode` but emits only the rows at `indices`)
  - `irs_commit::Config::{open_from_coeffs, open_at_indices_from_coeffs}`
    (functionally identical transcripts to `open`/`open_at_indices`;
    do not require `witness.matrix` to be populated)

The blinding-poly re-encode in `prove()` is left untouched (small
codeword, negligible cost).

Tests:
  - Randomised property tests vs full NTT across sizes 4..2^15,
    sparse and dense query subsets, zero-padded M<N inputs, and
    edge cases (empty, singletons, repeated indices, size=1).
  - `partial_interleaved_rs_encode` byte-identity against
    `interleaved_rs_encode` + row extraction across four shapes
    spanning the regimes used in whir_zk (depth 1 vs 8, single vs
    multi-poly, rate-1/4 blowup).
  - All 155 existing whir tests still pass; fixed the pre-existing
    `test_rejects_g_claim_forgery_via_rho` to mirror the production
    open path (re-encode blinding_poly_witness before
    `prove_blinding_polynomial`; use new partial-encode opens for
    f̂).

Each (poly_idx, slot_idx) NTT in the partial encode is independent.
Switch to a batch-major intermediate (`(num_cols, k)`) populated via
`par_chunks_exact_mut` and transpose to the row-major output. Brings
the partial encode in line with the parallel batching the existing
`ntt_batch` performs inside the full encode.
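A minimal sketch of the batch-major intermediate plus transpose, using `std::thread::scope` in place of rayon so the example has no external dependencies. `kernel` is a placeholder for one pruned NTT, and `encode_rows` and its layout are assumptions made for illustration rather than the PR's implementation.

```rust
use std::thread;

/// Placeholder for one independent per-column transform producing k outputs.
fn kernel(col: usize, out: &mut [u64]) {
    for (j, o) in out.iter_mut().enumerate() {
        *o = (col * 1000 + j) as u64;
    }
}

/// Fill a batch-major (num_cols, k) buffer in parallel, then transpose it
/// into the row-major (k, num_cols) output.
fn encode_rows(num_cols: usize, k: usize, threads: usize) -> Vec<u64> {
    assert!(k > 0 && threads > 0);
    let mut batch = vec![0u64; num_cols * k];
    let per_thread = num_cols.div_ceil(threads).max(1);
    thread::scope(|s| {
        // Each thread owns a disjoint run of columns, so no locking is needed.
        for (t, chunk) in batch.chunks_mut(per_thread * k).enumerate() {
            s.spawn(move || {
                for (i, out) in chunk.chunks_exact_mut(k).enumerate() {
                    kernel(t * per_thread + i, out);
                }
            });
        }
    });
    let mut rows = vec![0u64; k * num_cols];
    for c in 0..num_cols {
        for j in 0..k {
            rows[j * num_cols + c] = batch[c * k + j];
        }
    }
    rows
}

fn main() {
    let rows = encode_rows(5, 3, 2);
    println!("row 0 = {:?}", &rows[..5]);
}
```

Writing each transform into its own contiguous slice is what makes the disjoint mutable split possible; the final transpose restores the row layout that the open path expects.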
Reviewer flagged that `f_hat_witness: &mut` in `ood_stir_and_rounds`,
`gamma_check`, and `prove_blinded_polynomial` is no longer accurate —
the partial-encode path never mutates the witness. Switch to `&` and
drop the now-redundant `&*` reborrows. Misleading `&mut` could mask
future bugs where the witness is unintentionally mutated.

Also applies the smaller clippy/fmt nits the reviewer surfaced:
- ntt_partial: allow(dead_code) (kept pub for external callers; the
  hot path uses ntt_partial_with_plan_into)
- PartialNttPlan::size: const fn
- ntt_partial_with_plan_into: allow(significant_drop_tightening); the
  roots-table RwLockReadGuard is intentionally held across all DIT
  stages, mirroring ntt_dispatch
- assertion comparison form: `> n` instead of `>= n + 1`
- cargo fmt

`cargo clippy -- -D warnings` is now clean; 155 lib tests still pass.

shreyas-londhe changed the title from "perf(whir_zk): reduce peak memory in zk prover (IRS coeff + linear_forms drop)" to "perf(whir_zk): IRS-coeff residency + output-pruned NTT for f̂ opens" on May 15, 2026