compsec-epfl · z-tech · Mar 26, 2026
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,4 @@
 **/lag-poly-benches/target/
 .vscode
 .DS_Store
+.claude/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,8 +5,16 @@ All notable changes to this project will be documented in this file.
 ## [Unreleased]
 
 ### Added
-- **Base/Extension field support**: `multilinear_sumcheck` and `inner_product_sumcheck` now take two type parameters `<BF, EF>` — base field for evaluations, extension field for challenges. Set `EF = BF` when no extension is needed.
-- `pairwise::cross_field_reduce` — parallel helper for folding `BF` evaluations with an `EF` challenge.
+- **SIMD auto-dispatch** for Goldilocks (NEON + AVX-512 IFMA) across all three sumcheck variants.
+- **`poly_ops` module** — zero-allocation polynomial arithmetic on coefficient slices.
+- **`RoundPolyEvaluator` trait** for `coefficient_sumcheck` — user implements per-pair math, library handles iteration, parallelism, and reductions.
+- **Base/Extension field support** (`<BF, EF>`) for `multilinear_sumcheck` and `inner_product_sumcheck`.
+
+### Changed
+- **Inner product sumcheck**: 2 prover messages per round instead of 3 (verifier derives the third).
+- **Coefficient sumcheck**: sends d coefficients per round instead of d+1.
+- **`protogalaxy::fold`**: rewritten with flat buffers (93× faster at scale).
+- **`coefficient_sumcheck`** takes `&impl RoundPolyEvaluator<F>` instead of a closure.
 
 ## [0.0.2] - 2026-02-11
 

diff --git a/Cargo.toml b/Cargo.toml
@@ -15,7 +15,7 @@ ark-std ="0.5.0"
 memmap2 = "0.9.5"
 nohash-hasher = "0.2.0"
 rayon = { version = "1.10", optional = true }
-spongefish = { git = "https://github.com/arkworks-rs/spongefish", branch = "main", features = ["ark-ff"] }
+spongefish = { git = "https://github.com/z-tech/spongefish.git", branch = "smallfp-support", features = ["ark-ff"] }
 
 [dev-dependencies]
 criterion = "0.8"
@@ -33,3 +33,14 @@ parallel = [
 name = "provers"
 path = "benches/provers.rs"
 harness = false
+
+[[bench]]
+name = "simd_vs_generic"
+path = "benches/simd_vs_generic.rs"
+harness = false
+
+[patch.crates-io]
+ark-ff = { git = "https://github.com/arkworks-rs/algebra.git", branch = "master" }
+ark-poly = { git = "https://github.com/arkworks-rs/algebra.git", branch = "master" }
+ark-serialize = { git = "https://github.com/arkworks-rs/algebra.git", branch = "master" }
+spongefish = { git = "https://github.com/z-tech/spongefish.git", branch = "smallfp-support" }
diff --git a/README.md b/README.md
@@ -55,31 +55,45 @@ let sumcheck_transcript: ProductSumcheck<EF> = inner_product_sumcheck::<BF, EF>(
 claim = \sum_{x \in \{0,1\}^n} p(x), \quad \deg_{x_i}(p) \leq d
 ```
 
-Unlike the multilinear and inner product variants where `p` is multilinear (degree 1 in each variable, yielding degree-1 round polynomials), `coefficient_sumcheck` handles polynomials with arbitrary per-variable degree `d`, producing degree-`d` round polynomials. The user supplies a closure `compute_round_poly` that computes each round polynomial; the library handles transcript interaction and table reductions (both pairwise and tablewise) automatically.
+Unlike the multilinear and inner product variants where `p` is multilinear (degree 1 in each variable, yielding degree-1 round polynomials), `coefficient_sumcheck` handles polynomials with arbitrary per-variable degree `d`, producing degree-`d` round polynomials. The user implements `RoundPolyEvaluator` to define how a single pair of even/odd rows contributes to the round polynomial; the library handles iteration, parallelism, transcript interaction, and table reductions automatically.
 
 ```rust
-use efficient_sumcheck::coefficient_sumcheck::{coefficient_sumcheck, CoefficientSumcheck};
+use efficient_sumcheck::coefficient_sumcheck::{
+    coefficient_sumcheck, CoefficientSumcheck, RoundPolyEvaluator,
+};
 use efficient_sumcheck::transcript::SanityTranscript;
 use ark_poly::univariate::DensePolynomial;
 
+struct MyEvaluator;
+impl RoundPolyEvaluator<F> for MyEvaluator {
+    fn degree(&self) -> usize { 1 }
+
+    fn accumulate_pair(
+        &self,
+        coeffs: &mut [F],         // pre-zeroed buffer of length degree + 1
+        tw: &[(&[F], &[F])],      // (even_row, odd_row) per tablewise table
+        pw: &[(F, F)],            // (even, odd) per pairwise table
+    ) {
+        let (even, odd) = pw[0];
+        coeffs[0] += even;        // add to constant coefficient
+        coeffs[1] += odd - even;  // add to linear coefficient
+    }
+}
+
 let mut tablewise: Vec<Vec<Vec<F>>> = /* multi-column tables */;
 let mut pairwise: Vec<Vec<F>> = /* flat evaluation vectors */;
 let mut transcript = SanityTranscript::new(&mut rng);
 
 let result: CoefficientSumcheck<F> = coefficient_sumcheck(
-  |tablewise, pairwise| {
-      // Compute h(X) as a DensePolynomial<F> from current table state.
-      // Return coefficients in ascending order: [c0, c1, ..., cd].
-      DensePolynomial::from_coefficients_vec(vec![/* ... */])
-  },
+  &MyEvaluator,
   &mut tablewise,
   &mut pairwise,
   n_rounds,
   &mut transcript,
 );
 ```
 
-The closure receives immutable references to the current tables; after each round the library automatically reduces all pairwise and tablewise entries by folding with the verifier challenge.
+The evaluator receives one pair of rows at a time; the library iterates over all pairs (in parallel when the `parallel` feature is enabled), sums the per-pair polynomials, and reduces all pairwise and tablewise entries by folding with the verifier challenge after each round.
 
 ## Examples
 
@@ -103,37 +117,66 @@ Here, `batched_constraint_poly` merges dense evaluation vectors (out-of-domain s
 
 ### 2) WARP - Twin Constraint Batching
 
-[WARP](https://github.com/compsec-epfl/warp) also uses `coefficient_sumcheck` with `folding::protogalaxy::fold` to batch a codeword check and an R1CS constraint check into a single sumcheck. The codewords, witness vectors, and folding coefficients are stored as tablewise tables and the equality polynomial evaluations as a pairwise vector:
+[WARP](https://github.com/compsec-epfl/warp) also uses `coefficient_sumcheck` with `folding::protogalaxy::fold` to batch a codeword check and an R1CS constraint check into a single sumcheck. The user implements `RoundPolyEvaluator` to define the per-pair math; the library handles iteration, parallelism, and reductions:
 
 ```rust
-use efficient_sumcheck::coefficient_sumcheck::coefficient_sumcheck;
+use efficient_sumcheck::coefficient_sumcheck::{coefficient_sumcheck, RoundPolyEvaluator};
 use efficient_sumcheck::folding::protogalaxy;
 
+struct TwinConstraintEvaluator { r1cs: ..., omega: F, degree: usize }
+
+impl RoundPolyEvaluator<F> for TwinConstraintEvaluator {
+    fn degree(&self) -> usize { self.degree }
+    fn accumulate_pair(&self, coeffs: &mut [F], tw: &[(&[F], &[F])], pw: &[(F, F)]) {
+        let f = protogalaxy::fold(/* alpha pairs */, /* codeword polys */);
+        let p = protogalaxy::fold(/* beta pairs  */, /* constraint polys */);
+        let t = [pw[0].0, pw[0].1 - pw[0].0]; // linear tau polynomial
+        // h(X) = (f(X) + ω·p(X)) · t(X) — accumulated directly into coeffs
+        // ... using poly_ops::add_scaled and poly_ops::mul_add_into
+    }
+}
+
 let mut tablewise = [codewords, z_vecs, alpha_vecs, beta_vecs];
 let mut pairwise = [tau_eq_evals];
 
 let sc = coefficient_sumcheck(
-    |tw, pw| {
-        let (u, z, a, b) = (&tw[0], &tw[1], &tw[2], &tw[3]);
-        let tau = &pw[0];
-
-        let f = protogalaxy::fold(/* ... */, /* codeword polys */);
-        let p = protogalaxy::fold(/* ... */, /* constraint polys */);
-        let t = linear_poly(tau[0], tau[1]);
-
-        // h(X) = (f(X) + ω·p(X)) · t(X)
-        (f + p * omega).naive_mul(&t)
-    },
+    &TwinConstraintEvaluator { r1cs, omega, degree },
     &mut tablewise,
     &mut pairwise,
     log_l,
     &mut prover_state,
 );
-let gamma = sc.verifier_messages;
 ```
 
 After each round `coefficient_sumcheck` reduces all four tablewise tables and the pairwise equality evaluations by folding with the verifier's challenge.
 
+## SIMD Acceleration
+
+All three sumcheck variants auto-dispatch to SIMD-accelerated backends for Goldilocks (p = 2^64 − 2^32 + 1):
+
+- **aarch64 (NEON)**: 2-wide vectorized add/sub, scalar multiply fallback
+- **x86_64 (AVX-512 IFMA)**: 8-wide vectorized add/sub/mul via 52-bit fused multiply-accumulate
+
+The dispatch is transparent — no code changes needed. LLVM constant-folds the field detection at compile time, so the non-SIMD path has zero overhead.
+
+## Zero-Allocation Polynomial Arithmetic (`poly_ops`)
+
+The `poly_ops` module provides slice-based polynomial arithmetic with no heap allocation:
+
+```rust
+use efficient_sumcheck::poly_ops;
+
+let a = [F::from(1u64), F::from(2u64)];  // 1 + 2x
+let b = [F::from(3u64), F::from(4u64)];  // 3 + 4x
+let mut out = [F::ZERO; 3];
+
+poly_ops::mul_into(&mut out, &a, &b);         // out = a * b
+poly_ops::add_scaled(&mut out, s, &c);        // out += s * c (scalar `s` and slice `c` defined elsewhere)
+let val = poly_ops::eval_at(&out, challenge); // Horner evaluation
+```
+
+These are designed for hot loops where `DensePolynomial` allocation overhead dominates — protogalaxy folding, R1CS constraint evaluation, etc. The `protogalaxy::fold` function uses them internally, achieving up to 93× speedup over the naive `DensePolynomial` approach.
+
 ## Advanced Usage
 
 Supporting the high-level interfaces are raw implementations of sumcheck [[LFKN92](#references)] using three proving algorithms: