Add BlindFold ZK support for all operators#239
Merged
Conversation
BlindFold constraint methods on all sumcheck params: - Single-stage: Square, Add, Sub, Neg, Mul, And, Cube, Iff, Concat, Reshape, Slice, ScalarConstDiv - Custom-flow: Div (with range checks + one-hot), Rsqrt (dual range checks) - Neural teleportation: Sigmoid, Tanh, Erf (default flow), Cos, Sin (custom flow with transcript-derived r_node_output) - Lookup: Relu (OpLookupProvider + shout one-hot) - Matrix ops: Einsum (8 variants), Sum - Gather: GatherLarge (shout one-hot), GatherSmall (HammingBooleanity) - SoftmaxLastAxis (4-stage pipeline with Pedersen-committed inter-stage claims) Sub-protocol constraints: - ReadRafSumcheckParams (ps_shout), ReadRafParams (simple shout) - HammingWeightSumcheckParams, BooleanitySumcheckParams, RaSumcheckParams - HammingBooleanitySumcheckParams, IdentityRCParams - TeleportDivisionParams ZK pipeline infrastructure: - prove/verify helpers for each custom-flow pattern - Batched constraint combining with offset challenge indices - Opening value reordering to match constraint required_openings order - Round polynomial coefficient padding for UniPoly trailing-zero stripping - Multi-chain initial claims for BlindFoldWitness - ZK-mode fallback for verifier append_virtual/append_dense/append_sparse - Prover append_sparse pushes to pending_claims for BlindFold - Production Pedersen generators from HyperKZG SRS - Softmax auxiliary_claims (public only) + inter_stage_commitments - zk_sumcheck_num_instances for transcript batching coefficient replay 29 ZK tests (25 single-op + 4 multi-op chains).
joltworks: ReadRafParams.table is only read inside the zk-gated output_constraint_challenge_values method. jolt-atlas-core: recip_mult_params is only mutated inside the zk-gated inv_sum_evals assignment in build_stage1_instances. Without the zk feature both trip dead_code/unused_mut, which CI promotes to errors via -D warnings.
Refactor SoftmaxLastAxisVerifier so its stage methods take decomposed args (accumulator, transcript, scale_bits) instead of &mut Verifier, and split each stageN into a build_stageN_verifiers method that returns the boxed instance vector. Drive the actual sumcheck verification from a new run_stage helper on the verifier struct (non-ZK path) or from verify_zk_sumcheck_instances (ZK path). This lets verify_softmax_zk drop the temp-Verifier construction with mem::take, drop the manual transcript-replay closure that was emulating the IOP appends, and call full per-stage verify_zk_sumcheck_instances for stages 1-4 instead of just absorbing the proof commitments. Add cache_R_zk and cache_r_exp_zk for the two private inter-stage claims: insert explicit (r, F::zero()) placeholders at the right OpeningId rather than relying on the accumulator's permissive zk_mode silent-zero fallback, and absorb the Pedersen commitment from the bundle into the transcript (mirror of the prover side in prove_softmax_zk). operand_link is left out of the ZK path with an explanatory comment: it is a verifier-side algebraic identity over SoftmaxZHi, SoftmaxZLo, SoftmaxSatDiff, X(r2), all of which are private openings that the verifier only sees as zk_mode placeholder zeros. Encoding the identity as a BlindFold R1CS constraint is the right fix and is tracked separately.
Expand the NOTE in verify_softmax_zk so whoever picks up the operand_link follow-up has the full recipe: which BlindFold-side primitive is missing (an AlgebraicIdentity layout variant emitting (SoP)*1 = 0 with no output_var / blinding / Pedersen commitment), why the existing extra_constraints mechanism is the wrong shape (commits and binds output_value via Pedersen, used today only for PCS binding and currently dormant in prove_zk), the exact OutputClaimConstraint terms for the softmax operand_link, and the estimated cost. No code change; comment only.
Replace the if-else-if matches!() chains in prove_zk and verify_zk with exhaustive match expressions over &node.operator. Adding a new Operator variant now fails to compile until it is routed to a specific flow or to one of the default branches, instead of silently falling through to the catch-all. Verifier dispatch additionally separates the previous `has_sumcheck` runtime check into two explicit arms: a no-sumcheck arm (Input, Identity, Broadcast, MoveAxis, Constant, IsNan) that runs only verify_zk_eval_reduction, and a default-with-sumcheck arm for the ordinary algebraic ops. Behavior preserved exactly: Clamp stays in the with-sumcheck arm to match the original verifier (the prover side returns None for Clamp, so this remains a latent assert mismatch if ever exercised; tracked as a separate follow-up not addressed here). MeanOfSquares similarly kept in the with-sumcheck arm to match the original (the prover panics on it via the unimplemented-arm panic in create_prover_instance, so it is broken either way until ZK support lands).
Clamp: the prover treats Clamp as a no-op (create_prover_instance returns None), but the verifier was routing it to the with-sumcheck arm, where it would hit an assert_eq! mismatch on proof_node_idx if ever exercised. Move Clamp to the no-sumcheck arm on the verifier so both sides only run eval_reduction. Also split the prover's default branch into explicit no-sumcheck and with-sumcheck arms so the dispatch shape is symmetric with the verifier and the no-op semantics for Clamp are visible at dispatch time rather than buried inside a runtime create_prover_instance branch. MeanOfSquares: not implemented in the OperatorProofTrait system at all (see ops/mod.rs:254 macro fallback). Make this surface as an explicit unimplemented!() at both prove and verify dispatch with a matching message, instead of a deep panic in create_prover_instance and an opaque assertion failure on the verifier. Behavior preserved: any model that contains MeanOfSquares fails loudly on either side.
The prover's `append_virtual`/`append_dense`/`append_sparse` on `ProverOpeningAccumulator` unconditionally appended cleartext claims to the transcript while the verifier in `zk_mode` did not. Add a `zk_mode` flag on the prover that suppresses those cleartext appends (keeping the `pending_claims` push so the downstream Pedersen commit step still hides them). `prove_zk` sets `zk_mode = true` at entry and toggles it off only for the two intentionally-public scalars: the output-claim append and Softmax's `send_auxiliary_vectors` / `cache_exp_sum`. Verifier-side, reorder `append_virtual` / `append_dense` / `append_sparse` to check `zk_mode` first. Without this, when the same `opening_id` is registered twice in a sumcheck batch (e.g. `SoftmaxExpQ` from both `RecipMult` and `ExpSum` cache_openings), the second call would fall into the standard branch (key now exists) and emit a stray `transcript.append_scalar(F::zero())` the prover never produces. Pull Broadcast / MoveAxis / Identity / IsNan / Clamp out of the no-sumcheck arm in `prove_zk` / `verify_zk` and run the operator's input-cache step. Without this their inputs (e.g. ScalarConstDiv feeding a Broadcast) saw empty openings and the next default-branch sumcheck panicked looking up `reduced_evaluations[node]`. `prove_zk_eval_reduction` now skips empty openings, matching the pre-existing skip in `verify_zk_eval_reduction`. `examples/gpt2_zk_bench.rs`: bump `PedersenGenerators::deterministic` to 4096 (16 was sized for the synthetic single-op ZK tests). `e2e_tests.rs`: add `test_gpt2_zk` as a regression target, mirroring `test_gpt2` and `#[ignore]`d on the model download. GPT-2 seq_len=16, MacBook M3: | stage | non-zk | zk | ratio | | ------ | ------ | ------- | ----- | | prove | 25.6 s | 50.4 s | 1.97x | | verify | 2.5 s | 4.4 s | 1.78x | | size | 2.43 M | 2.68 M | 1.10x | Run with `RUST_MIN_STACK=33554432 RAYON_NUM_THREADS=2` on macOS: the ZK pipeline overflows rayon's default 2 MB worker stack on GPT-2-sized MLE work, and the patched arkworks MSM (`ec/src/scalar_mul/variable_base/mod.rs:813-857`) builds a nested 2-thread `ThreadPoolBuilder` per chunk per call, so without the RAYON cap macOS hits `pthread_create EAGAIN` under heavy parallelism.
Removes the need to invoke them with `RUST_MIN_STACK=33554432 RAYON_NUM_THREADS=2`. The constraint comes from the patched arkworks MSM (`ec/src/scalar_mul/variable_base/mod.rs:813-857`) building a fresh nested 2-thread `rayon::ThreadPoolBuilder` per chunk per MSM call: with default rayon (≈ num CPUs) and many MSMs in rapid succession on GPT-2-sized polynomials, macOS hits the per-process pthread limit (`pthread_create` → EAGAIN). The ZK pipeline also overflows rayon's default 2 MB worker stack on deep MLE recursion. Both call sites now call `rayon::ThreadPoolBuilder::new().num_threads(2) .stack_size(32 MB).build_global()` at entry. `build_global` is a no-op if rayon was already initialised, so user-set env vars still win. A proper fix would patch arkworks-algebra to reuse a single pool instead of building one per MSM chunk, but that's upstream work.
The prover materializes Q = sum_i gamma_i * P_i_padded (`build_materialized_rlc`) and produces a HyperKZG opening on Q at the batched-opening `r_sumcheck`. `joint_claim` includes the per-poly Lagrange factor for zero-padded variables, matching the existing non-zk `compute_joint_claim`. The verifier reconstructs the joint commitment via `combine_commitments(bundle.commitments, gamma_powers)` and calls `PCS::verify`, binding `bundle.commitments` to the batched-opening reduction. `effective_coeffs = gamma_i * lagrange_i` are threaded as the BlindFold extra-constraint challenge values so the R1CS enforces the same identity on `y_com`. Adds `prove_batch_opening_sumcheck_zk` on `ProverOpeningAccumulator`, `sumchecks_keys()` on `VerifierOpeningAccumulator`, and a `BlindFoldBatchOpening` opening-id registration in `Opening::cache_openings` on both sides. Bench scopes the bounded rayon pool to ZK calls only so the non-zk path keeps full parallelism; Pedersen generators bumped to 4096 to cover the larger sumcheck stage. Adds `test_gpt2_zk` (ignored) mirroring `test_gpt2`. All 53 zk e2e tests pass; `gpt2_zk_bench` runs end-to-end. Known gap: the verifier reads `effective_coeffs` from `bundle.baked.extra_constraint_challenges` instead of regenerating them from `gamma_powers_v` and `r_sumcheck`. Tracked in wiki/jolt-atlas/.../batched-opening-sound-verifier.md.
The ZK verifier was skipping `Constant::verify` (`ops/constant.rs:28-43`) and `Input::verify` (`ops/input.rs:27-49`), which in non-zk locally evaluate the public tensor's MLE at the reduced opening point and check against the prover's reduced-eval claim. On GPT-2 `Constant::verify` alone accumulates 618 ms across 299 invocations, so the gap meant ZK verify ran faster than non-zk verify by skipping a real check, not by doing equivalent work cheaper. Port the check: - New `ZkProofBundle::public_node_reduced_claims: BTreeMap<usize, F>` carries cleartext reduced-eval claims for Input/Constant nodes. The prover populates from `prover.accumulator.reduced_evaluations[node_idx]`, which is already cleartext on the prover side. - The `Input | Constant` arm of `verify_zk` now evaluates `MultilinearPolynomial::from(public_tensor).evaluate(r_reduced)` and asserts equality with the bundle value, then writes the verified claim back into `accumulator.reduced_evaluations` (overwriting the `F::zero()` placeholder). Bench (patched arkworks, full 14-thread parallelism, GPT-2): | | non-zk | zk (before) | zk (after) | |--------|---------|-------------|-------------| | prove | 13.14 s | 22.4 s | 21.8 s | | verify | 1.04 s | 0.63 s ⚠ | 1.53 s ✓ | | ratio | - | 0.60× | 1.47× | ZK verify is now correctly slower than non-zk (1.47×) instead of faster (the previous 0.60× was the soundness symptom). All 53 zk e2e tests pass. Bench harness: `gpt2_zk_bench.rs` accepts a `ZK_BENCH_THREADS` env var (defaults to 2 to preserve baseline behaviour with upstream arkworks) to flip the local ZK rayon pool size between the throttled and the full-parallelism configurations without code edits. Remaining soundness gap (future work, documented in wiki/jolt-atlas/.../zk-prove-overhead.md): the cleartext check is honest-prover defense. Active malice (correct bundle value but wrong BlindFold witness value for the constant opening) would still verify. Jolt closes the analogous gap by baking public values into R1CS matrices via `ValueSource::Constant(i128)` or by committing public data as PCS polynomials.
Picks up formatting drift in zk.rs / opening_proof.rs / gpt2_zk_bench.rs from earlier commits on this branch so `cargo fmt --check` (CI) passes.
913f2b6 to
b166aaa
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR subsumes all the other previous PRs on BlindFold. I will leave the other ones for reference, but we can close them. In particular, it implements BlindFold constraint methods on:
Test PR:
cargo run --release --features zk -p jolt-atlas-core --example gpt2_zk_bench.On a 14-core Apple M3:
blindfold-all-opsblindfold-all-opsBoth paths run on the same workload (
gpt2_zk_bench, 16-token GPT-2, max_num_vars=20, deterministic Pedersen generators).ZK_BENCH_THREADS=14for the ZK rayon pool.Prover (target: ~23.3 s)
commit_witness_polynomialspolynomial_map+ 631 mscommit_to_polynomials).BatchedSumcheck::prove_zk, cumulative busy 10.96 s (avg 9.94 ms per call). One per ONNX operator stage. Dominated by per-round Pedersen commits.prove_batch_opening_sumcheck_zk(Stage 7)build_materialized_rlc(Step 8)HyperKZG::prove(Step 8)r_sumcheck.BlindFoldProver::proveInside
BlindFoldProver::prove(1.99 s):BlindFold::sample_random_satisfying_pair(Nova cross-term sampling): 1.04 s (~52%)BlindFold::commit_cross_term_rows: 271 msBlindFoldInnerSumcheckProver::new: 117 msRelaxedR1CSInstance::fold: 109 msBlindFoldSpartanProver::witness_contributions: 86 msRelaxedR1CSWitness::fold: 35 msBlindFold::compute_cross_term: 11 msBlindFoldSpartanProver::new: 7 msTop three prover costs: per-node ZK sumchecks (~47%), bundle assembly + Hyrax row MSMs (~24%), BlindFold folding tail (~9%). Step 8 (~10%) and Stage 7 (~7.5%) are the remaining major chunks.
Verifier (target: ~2.44 s)
verify_zk_*loopConstant::verify-equivalent MLE eval (cum 645 ms on the non-zk path for comparison) + 1×Input::verify.BatchedSumcheck::verify_zkitself is essentially free (~60 ms across 1102 calls — just transcript absorbs).public_mle(opening_point), check baked challenge equals it, check eval-commitment is identity G1. Currently does the per-op MLE eval a second time — dedup opportunity.BlindFoldVerifier::verifyeval_commitmentscheck.HyperKZG::verifyVerifierR1CSBuilder::build+ bookkeeping.Top three verifier costs: Constant/Input local MLE evaluations (~57%), R1CS binding consistency (~25%, mostly redundant), BlindFold Spartan-verify (~16%). Step 8 PCS verify is essentially free.