Skip to content

Add BlindFold ZK support for all operators#239

Merged
Forpee merged 12 commits into
mainfrom
blindfold-all-ops
May 18, 2026
Merged

Add BlindFold ZK support for all operators#239
Forpee merged 12 commits into
mainfrom
blindfold-all-ops

Conversation

@Acentelles
Copy link
Copy Markdown
Collaborator

@Acentelles Acentelles commented May 3, 2026

This PR subsumes all the other previous PRs on BlindFold. I will leave the other ones for reference, but we can close them. In particular, it implements BlindFold constraint methods on:

  • Single-stage: Square, Add, Sub, Neg, Mul, And, Cube, Iff, Concat, Reshape, Slice, ScalarConstDiv
  • Custom-flow: Div (with range checks + one-hot), Rsqrt (dual range checks)
  • Neural teleportation: Sigmoid, Tanh, Erf (default flow), Cos, Sin (custom flow with transcript-derived r_node_output)
  • Lookup: Relu (OpLookupProvider + shout one-hot)
  • Matrix ops: Einsum (8 variants), Sum
  • Gather: GatherLarge (shout one-hot), GatherSmall (HammingBooleanity)
  • SoftmaxLastAxis (4-stage pipeline with Pedersen-committed inter-stage claims)

Test PR: cargo run --release --features zk -p jolt-atlas-core --example gpt2_zk_bench.

On a 14-core Apple M3:

Path Prove (mean ± range) Verify (mean ± range) Proof size
non-zk on blindfold-all-ops 12.81 s (12.5-13.2) 1.04 s (1.0-1.1) 2504.6 KiB
zk (BlindFold) on blindfold-all-ops 23.41 s (23.2-23.9) 2.43 s (2.2-2.6) 2729.96 KiB
ratio 1.83× 2.34× 1.09×

Both paths run on the same workload (gpt2_zk_bench, 16-token GPT-2, max_num_vars=20, deterministic Pedersen generators). ZK_BENCH_THREADS=14 for the ZK rayon pool.

Prover (target: ~23.3 s)

Stage Wall / busy Share What it does
commit_witness_polynomials 725 ms 3.1% Pedersen-commit every witness MLE (62.9 ms polynomial_map + 631 ms commit_to_polynomials).
Per-node ZK sumcheck loop ~11 s wall ~47% 1103× BatchedSumcheck::prove_zk, cumulative busy 10.96 s (avg 9.94 ms per call). One per ONNX operator stage. Dominated by per-round Pedersen commits.
prove_batch_opening_sumcheck_zk (Stage 7) 1.75 s 7.5% Single batched ZK sumcheck reducing all witness-MLE openings to one point.
build_materialized_rlc (Step 8) 335 ms 1.4% Build gamma-weighted RLC of all witness polys.
HyperKZG::prove (Step 8) 1.96 s 8.4% Single HyperKZG opening at r_sumcheck.
Bundle / Hyrax-row commits (uninstrumented) ~5.6 s ~24% Build R1CS, assign witness, Pedersen-commit witness rows.
BlindFoldProver::prove 1.99 s 8.5% Nova folding + Spartan tail. Sub-split below.

Inside BlindFoldProver::prove (1.99 s):

  • BlindFold::sample_random_satisfying_pair (Nova cross-term sampling): 1.04 s (~52%)
  • BlindFold::commit_cross_term_rows: 271 ms
  • BlindFoldInnerSumcheckProver::new: 117 ms
  • RelaxedR1CSInstance::fold: 109 ms
  • BlindFoldSpartanProver::witness_contributions: 86 ms
  • RelaxedR1CSWitness::fold: 35 ms
  • BlindFold::compute_cross_term: 11 ms
  • BlindFoldSpartanProver::new: 7 ms

Top three prover costs: per-node ZK sumchecks (~47%), bundle assembly + Hyrax row MSMs (~24%), BlindFold folding tail (~9%). Step 8 (~10%) and Stage 7 (~7.5%) are the remaining major chunks.

Verifier (target: ~2.44 s)

Stage Wall / busy Share What it does
Per-op verify_zk_* loop ~1.4 s ~57% 299× Constant::verify-equivalent MLE eval (cum 645 ms on the non-zk path for comparison) + 1× Input::verify. BatchedSumcheck::verify_zk itself is essentially free (~60 ms across 1102 calls — just transcript absorbs).
Public-node R1CS binding consistency ~0.6 s ~25% Per-binding: recompute public_mle(opening_point), check baked challenge equals it, check eval-commitment is identity G1. Currently does the per-op MLE eval a second time — dedup opportunity.
BlindFoldVerifier::verify 385 ms 16% Spartan + Nova-folded eval_commitments check.
Step 8 HyperKZG::verify 1.4 ms 0.06% One pairing + MSM.
Other (R1CS build, transcript) ~50 ms ~2% VerifierR1CSBuilder::build + bookkeeping.

Top three verifier costs: Constant/Input local MLE evaluations (~57%), R1CS binding consistency (~25%, mostly redundant), BlindFold Spartan-verify (~16%). Step 8 PCS verify is essentially free.

BlindFold constraint methods on all sumcheck params:
- Single-stage: Square, Add, Sub, Neg, Mul, And, Cube, Iff, Concat,
  Reshape, Slice, ScalarConstDiv
- Custom-flow: Div (with range checks + one-hot), Rsqrt (dual range checks)
- Neural teleportation: Sigmoid, Tanh, Erf (default flow),
  Cos, Sin (custom flow with transcript-derived r_node_output)
- Lookup: Relu (OpLookupProvider + shout one-hot)
- Matrix ops: Einsum (8 variants), Sum
- Gather: GatherLarge (shout one-hot), GatherSmall (HammingBooleanity)
- SoftmaxLastAxis (4-stage pipeline with Pedersen-committed inter-stage claims)

Sub-protocol constraints:
- ReadRafSumcheckParams (ps_shout), ReadRafParams (simple shout)
- HammingWeightSumcheckParams, BooleanitySumcheckParams, RaSumcheckParams
- HammingBooleanitySumcheckParams, IdentityRCParams
- TeleportDivisionParams

ZK pipeline infrastructure:
- prove/verify helpers for each custom-flow pattern
- Batched constraint combining with offset challenge indices
- Opening value reordering to match constraint required_openings order
- Round polynomial coefficient padding for UniPoly trailing-zero stripping
- Multi-chain initial claims for BlindFoldWitness
- ZK-mode fallback for verifier append_virtual/append_dense/append_sparse
- Prover append_sparse pushes to pending_claims for BlindFold
- Production Pedersen generators from HyperKZG SRS
- Softmax auxiliary_claims (public only) + inter_stage_commitments
- zk_sumcheck_num_instances for transcript batching coefficient replay

29 ZK tests (25 single-op + 4 multi-op chains).
@Acentelles Acentelles changed the base branch from main to blindfold-einsum-sum May 3, 2026 23:14
@Acentelles Acentelles changed the base branch from blindfold-einsum-sum to main May 3, 2026 23:15
@Acentelles Acentelles marked this pull request as draft May 3, 2026 23:16
Acentelles added 6 commits May 4, 2026 00:20
joltworks: ReadRafParams.table is only read inside the zk-gated
output_constraint_challenge_values method.
jolt-atlas-core: recip_mult_params is only mutated inside the zk-gated
inv_sum_evals assignment in build_stage1_instances.

Without the zk feature both trip dead_code/unused_mut, which CI promotes
to errors via -D warnings.
Refactor SoftmaxLastAxisVerifier so its stage methods take decomposed
args (accumulator, transcript, scale_bits) instead of &mut Verifier, and
split each stageN into a build_stageN_verifiers method that returns the
boxed instance vector. Drive the actual sumcheck verification from a new
run_stage helper on the verifier struct (non-ZK path) or from
verify_zk_sumcheck_instances (ZK path).

This lets verify_softmax_zk drop the temp-Verifier construction with
mem::take, drop the manual transcript-replay closure that was emulating
the IOP appends, and call full per-stage verify_zk_sumcheck_instances
for stages 1-4 instead of just absorbing the proof commitments.

Add cache_R_zk and cache_r_exp_zk for the two private inter-stage
claims: insert explicit (r, F::zero()) placeholders at the right
OpeningId rather than relying on the accumulator's permissive zk_mode
silent-zero fallback, and absorb the Pedersen commitment from the
bundle into the transcript (mirror of the prover side in
prove_softmax_zk).

operand_link is left out of the ZK path with an explanatory comment:
it is a verifier-side algebraic identity over SoftmaxZHi, SoftmaxZLo,
SoftmaxSatDiff, X(r2), all of which are private openings that the
verifier only sees as zk_mode placeholder zeros. Encoding the identity
as a BlindFold R1CS constraint is the right fix and is tracked
separately.
Expand the NOTE in verify_softmax_zk so whoever picks up the operand_link
follow-up has the full recipe: which BlindFold-side primitive is missing
(an AlgebraicIdentity layout variant emitting (SoP)*1 = 0 with no
output_var / blinding / Pedersen commitment), why the existing
extra_constraints mechanism is the wrong shape (commits and binds
output_value via Pedersen, used today only for PCS binding and currently
dormant in prove_zk), the exact OutputClaimConstraint terms for the
softmax operand_link, and the estimated cost.

No code change; comment only.
Replace the if-else-if matches!() chains in prove_zk and verify_zk with
exhaustive match expressions over &node.operator. Adding a new Operator
variant now fails to compile until it is routed to a specific flow or
to one of the default branches, instead of silently falling through to
the catch-all.

Verifier dispatch additionally separates the previous `has_sumcheck`
runtime check into two explicit arms: a no-sumcheck arm (Input,
Identity, Broadcast, MoveAxis, Constant, IsNan) that runs only
verify_zk_eval_reduction, and a default-with-sumcheck arm for the
ordinary algebraic ops. Behavior preserved exactly: Clamp stays in the
with-sumcheck arm to match the original verifier (the prover side
returns None for Clamp, so this remains a latent assert mismatch if
ever exercised; tracked as a separate follow-up not addressed here).
MeanOfSquares similarly kept in the with-sumcheck arm to match the
original (the prover panics on it via the unimplemented-arm panic in
create_prover_instance, so it is broken either way until ZK support
lands).
Clamp: the prover treats Clamp as a no-op (create_prover_instance
returns None), but the verifier was routing it to the with-sumcheck
arm, where it would hit an assert_eq! mismatch on proof_node_idx if
ever exercised. Move Clamp to the no-sumcheck arm on the verifier so
both sides only run eval_reduction. Also split the prover's default
branch into explicit no-sumcheck and with-sumcheck arms so the dispatch
shape is symmetric with the verifier and the no-op semantics for Clamp
are visible at dispatch time rather than buried inside a runtime
create_prover_instance branch.

MeanOfSquares: not implemented in the OperatorProofTrait system at all
(see ops/mod.rs:254 macro fallback). Make this surface as an explicit
unimplemented!() at both prove and verify dispatch with a matching
message, instead of a deep panic in create_prover_instance and an
opaque assertion failure on the verifier. Behavior preserved: any model
that contains MeanOfSquares fails loudly on either side.
@Acentelles Acentelles marked this pull request as ready for review May 4, 2026 17:58
The prover's `append_virtual`/`append_dense`/`append_sparse` on
`ProverOpeningAccumulator` unconditionally appended cleartext claims to
the transcript while the verifier in `zk_mode` did not. Add a `zk_mode`
flag on the prover that suppresses those cleartext appends (keeping the
`pending_claims` push so the downstream Pedersen commit step still hides
them). `prove_zk` sets `zk_mode = true` at entry and toggles it off only
for the two intentionally-public scalars: the output-claim append and
Softmax's `send_auxiliary_vectors` / `cache_exp_sum`.

Verifier-side, reorder `append_virtual` / `append_dense` / `append_sparse`
to check `zk_mode` first. Without this, when the same `opening_id` is
registered twice in a sumcheck batch (e.g. `SoftmaxExpQ` from both
`RecipMult` and `ExpSum` cache_openings), the second call would fall
into the standard branch (key now exists) and emit a stray
`transcript.append_scalar(F::zero())` the prover never produces.

Pull Broadcast / MoveAxis / Identity / IsNan / Clamp out of the
no-sumcheck arm in `prove_zk` / `verify_zk` and run the operator's
input-cache step. Without this their inputs (e.g. ScalarConstDiv
feeding a Broadcast) saw empty openings and the next default-branch
sumcheck panicked looking up `reduced_evaluations[node]`.

`prove_zk_eval_reduction` now skips empty openings, matching the
pre-existing skip in `verify_zk_eval_reduction`.

`examples/gpt2_zk_bench.rs`: bump `PedersenGenerators::deterministic` to
4096 (16 was sized for the synthetic single-op ZK tests).

`e2e_tests.rs`: add `test_gpt2_zk` as a regression target, mirroring
`test_gpt2` and `#[ignore]`d on the model download.

GPT-2 seq_len=16, MacBook M3:

  | stage  | non-zk | zk      | ratio |
  | ------ | ------ | ------- | ----- |
  | prove  | 25.6 s | 50.4 s  | 1.97x |
  | verify | 2.5 s  | 4.4 s   | 1.78x |
  | size   | 2.43 M | 2.68 M  | 1.10x |

Run with `RUST_MIN_STACK=33554432 RAYON_NUM_THREADS=2` on macOS:
the ZK pipeline overflows rayon's default 2 MB worker stack on
GPT-2-sized MLE work, and the patched arkworks MSM
(`ec/src/scalar_mul/variable_base/mod.rs:813-857`) builds a nested
2-thread `ThreadPoolBuilder` per chunk per call, so without the
RAYON cap macOS hits `pthread_create EAGAIN` under heavy parallelism.
Removes the need to invoke them with `RUST_MIN_STACK=33554432
RAYON_NUM_THREADS=2`. The constraint comes from the patched arkworks
MSM (`ec/src/scalar_mul/variable_base/mod.rs:813-857`) building a
fresh nested 2-thread `rayon::ThreadPoolBuilder` per chunk per MSM
call: with default rayon (≈ num CPUs) and many MSMs in rapid
succession on GPT-2-sized polynomials, macOS hits the per-process
pthread limit (`pthread_create` → EAGAIN). The ZK pipeline also
overflows rayon's default 2 MB worker stack on deep MLE recursion.

Both call sites now call `rayon::ThreadPoolBuilder::new().num_threads(2)
.stack_size(32 MB).build_global()` at entry. `build_global` is a no-op
if rayon was already initialised, so user-set env vars still win.

A proper fix would patch arkworks-algebra to reuse a single pool
instead of building one per MSM chunk, but that's upstream work.
The prover materializes Q = sum_i gamma_i * P_i_padded
(`build_materialized_rlc`) and produces a HyperKZG opening on Q at the
batched-opening `r_sumcheck`. `joint_claim` includes the per-poly
Lagrange factor for zero-padded variables, matching the existing non-zk
`compute_joint_claim`. The verifier reconstructs the joint commitment
via `combine_commitments(bundle.commitments, gamma_powers)` and calls
`PCS::verify`, binding `bundle.commitments` to the batched-opening
reduction.

`effective_coeffs = gamma_i * lagrange_i` are threaded as the BlindFold
extra-constraint challenge values so the R1CS enforces the same
identity on `y_com`. Adds `prove_batch_opening_sumcheck_zk` on
`ProverOpeningAccumulator`, `sumchecks_keys()` on
`VerifierOpeningAccumulator`, and a `BlindFoldBatchOpening` opening-id
registration in `Opening::cache_openings` on both sides.

Bench scopes the bounded rayon pool to ZK calls only so the non-zk path
keeps full parallelism; Pedersen generators bumped to 4096 to cover the
larger sumcheck stage. Adds `test_gpt2_zk` (ignored) mirroring
`test_gpt2`.

All 53 zk e2e tests pass; `gpt2_zk_bench` runs end-to-end.

Known gap: the verifier reads `effective_coeffs` from
`bundle.baked.extra_constraint_challenges` instead of regenerating them
from `gamma_powers_v` and `r_sumcheck`. Tracked in
wiki/jolt-atlas/.../batched-opening-sound-verifier.md.
The ZK verifier was skipping `Constant::verify` (`ops/constant.rs:28-43`)
and `Input::verify` (`ops/input.rs:27-49`), which in non-zk locally
evaluate the public tensor's MLE at the reduced opening point and check
against the prover's reduced-eval claim. On GPT-2 `Constant::verify`
alone accumulates 618 ms across 299 invocations, so the gap meant ZK
verify ran faster than non-zk verify by skipping a real check, not by
doing equivalent work cheaper.

Port the check:
- New `ZkProofBundle::public_node_reduced_claims: BTreeMap<usize, F>`
  carries cleartext reduced-eval claims for Input/Constant nodes. The
  prover populates from `prover.accumulator.reduced_evaluations[node_idx]`,
  which is already cleartext on the prover side.
- The `Input | Constant` arm of `verify_zk` now evaluates
  `MultilinearPolynomial::from(public_tensor).evaluate(r_reduced)` and
  asserts equality with the bundle value, then writes the verified
  claim back into `accumulator.reduced_evaluations` (overwriting the
  `F::zero()` placeholder).

Bench (patched arkworks, full 14-thread parallelism, GPT-2):

|        | non-zk  | zk (before) | zk (after)  |
|--------|---------|-------------|-------------|
| prove  | 13.14 s | 22.4 s      | 21.8 s      |
| verify | 1.04 s  | 0.63 s ⚠    | 1.53 s ✓    |
| ratio  | -       | 0.60×       | 1.47×       |

ZK verify is now correctly slower than non-zk (1.47×) instead of
faster (the previous 0.60× was the soundness symptom). All 53 zk e2e
tests pass.

Bench harness: `gpt2_zk_bench.rs` accepts a `ZK_BENCH_THREADS` env var
(defaults to 2 to preserve baseline behaviour with upstream arkworks)
to flip the local ZK rayon pool size between the throttled and the
full-parallelism configurations without code edits.

Remaining soundness gap (future work, documented in
wiki/jolt-atlas/.../zk-prove-overhead.md): the cleartext check is
honest-prover defense. Active malice (correct bundle value but wrong
BlindFold witness value for the constant opening) would still verify.
Jolt closes the analogous gap by baking public values into R1CS
matrices via `ValueSource::Constant(i128)` or by committing public
data as PCS polynomials.
Picks up formatting drift in zk.rs / opening_proof.rs / gpt2_zk_bench.rs
from earlier commits on this branch so `cargo fmt --check` (CI) passes.
@Acentelles Acentelles force-pushed the blindfold-all-ops branch from 913f2b6 to b166aaa Compare May 14, 2026 15:10
@Forpee Forpee merged commit 625259a into main May 18, 2026
7 checks passed
@Forpee Forpee deleted the blindfold-all-ops branch May 18, 2026 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants