Record: 0.2880 BPB — Complementary Training + Per-Order Multipliers + Distributed Prefill + 15-Gram + EBLS #796
Open
Robby955 wants to merge 3 commits into openai:main from
Conversation
3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003) 8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB. Key innovation: distributed cache pre-fill using pure numpy. Each GPU rank pre-populates n-gram hash tables with ALL preceding token positions before scoring, producing results mathematically identical to single-GPU sequential evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
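The pre-fill idea above can be sketched in pure Python/numpy. This is a minimal illustration, not the submission's code: `prefill_and_score` is a hypothetical name, and the "most frequent continuation" prediction rule is an assumption. The key property it demonstrates is the one claimed in the commit message: each rank first inserts every position preceding its shard into the table, then scores its shard score-first, so the union of all ranks' outputs equals single-GPU sequential evaluation.

```python
import numpy as np

def prefill_and_score(tokens, order, rank, world_size):
    """Sketch of distributed n-gram cache pre-fill (hypothetical helper).

    Each rank scores one contiguous shard of positions, but first
    pre-populates its n-gram table with EVERY position that precedes
    the shard, so lookups match single-GPU sequential evaluation."""
    n = len(tokens)
    shard = np.array_split(np.arange(order, n), world_size)[rank]
    table = {}  # context tuple -> list of continuations seen so far
    # Pre-fill: insert all contexts that end before this shard begins.
    for i in range(order, int(shard[0])):
        ctx = tuple(tokens[i - order:i])
        table.setdefault(ctx, []).append(tokens[i])
    preds = {}
    for i in shard:
        ctx = tuple(tokens[i - order:i])
        hits = table.get(ctx, [])
        # Score first (assumed rule: most frequent continuation)...
        preds[int(i)] = max(set(hits), key=hits.count) if hits else None
        # ...then insert the current position, as sequential eval would.
        table.setdefault(ctx, []).append(tokens[i])
    return preds
```

Merging the per-rank outputs reproduces the sequential result exactly, which is why no NCCL communication is needed during scoring.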
…ptive gating 3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Down from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).
nice 🔥🔥🔥🔥
Add complementary training (from @pentxayc, openai#803) and per-order multipliers (from @AayushBaniya2006, openai#809) on top of distributed prefill + 15-gram + order-adaptive gating. New 3-seed results: 0.28798 / 0.28804 / 0.28810. All seeds under 16MB, training under 560s, eval under 330s. Updated README with legality hedge, full ablation, and credits.
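The complementary-training idea referenced above can be sketched as a per-token loss weighting. This is an illustrative reconstruction, not the PR's code: the function name is hypothetical, and the linear warmup ramp is an assumption (only COMP_ALPHA=0.50 and the 200-step warmup are stated in the description). Tokens the n-gram cache already predicts correctly get their cross-entropy loss downweighted, pushing the neural model toward what caching can't handle.

```python
import numpy as np

def complementary_loss_weights(ngram_correct, step, alpha=0.50, warmup=200):
    """Sketch of complementary training (hypothetical helper).

    ngram_correct: boolean array, True where the n-gram cache already
    predicts the target token. Those positions get their loss scaled
    by (1 - alpha); the downweighting ramps in over `warmup` steps
    (linear ramp is an assumption)."""
    ramp = min(step / warmup, 1.0)
    weights = np.ones(ngram_correct.shape, dtype=np.float64)
    weights[ngram_correct] -= alpha * ramp
    return weights
```

The returned weights would multiply the per-token loss before reduction, so with alpha=0.5 a cache-predictable token contributes half as much gradient once warmup completes.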
Summary
val_bpb: 0.2880 (3-seed mean, std 0.00006) | ~15.3 MB | 8xH100 SXM | 560s train + ~330s eval
Major update from our previous 0.4374 submission. Two additional techniques stacked on top:
- Complementary training (from @pentxayc, #803): downweight loss on n-gram-predictable tokens during training, so the neural model specializes on what caching can't handle. COMP_ALPHA=0.50, orders 2-5, 200-step warmup.
- Per-order multipliers (from @AayushBaniya2006, #809): bigrams/trigrams suppressed to 0.3x alpha, orders 5-15 boosted to 2.0x, capped at alpha_max=0.95.

Plus our previous contributions:

- Distributed cache pre-fill: each GPU rank pre-populates 15-gram hash tables with ALL preceding positions via vectorized numpy, making 8-GPU eval mathematically identical to single-GPU sequential. No NCCL needed.
- Order-adaptive entropy gating (inspired by @travispchen, #798): per-order entropy thresholds. 15-gram matches are trusted aggressively (center=2.5); bigrams only when the model is confused (center=4.5).
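The interaction of per-order multipliers and entropy gating can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function names are hypothetical, the multiplier for order 4 (between the suppressed 2-3 and boosted 5-15 bands) is assumed to be 1.0, the sigmoid gate shape and its width are assumptions, and the gate centers are interpolated linearly between the two stated anchors (2.5 for 15-grams, 4.5 for bigrams).

```python
import numpy as np

ALPHA_MAX = 0.95  # cap on the final mixing weight, per the description

def order_multiplier(order):
    """Per-order alpha multipliers: bigrams/trigrams suppressed to 0.3x,
    orders 5-15 boosted to 2.0x. Order 4 at 1.0x is an assumption."""
    if order <= 3:
        return 0.3
    if order >= 5:
        return 2.0
    return 1.0

def entropy_gate(entropy, order, width=1.0):
    """Sigmoid gate on model entropy: a 15-gram match is trusted even at
    low entropy (center 2.5), a bigram only when the model is confused
    (center 4.5). Linear interpolation of centers is an assumption."""
    center = np.interp(order, [2, 15], [4.5, 2.5])
    return 1.0 / (1.0 + np.exp(-(entropy - center) / width))

def mix(model_probs, ngram_probs, order, base_alpha, entropy):
    """Blend model and n-gram distributions with the gated, capped alpha."""
    alpha = base_alpha * order_multiplier(order) * entropy_gate(entropy, order)
    alpha = min(alpha, ALPHA_MAX)
    return (1 - alpha) * model_probs + alpha * ngram_probs
```

Because alpha is a convex mixing weight capped below 1, the output remains a valid probability distribution whenever both inputs are.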
3-seed results
How we got here (ablation)
Each row adds one thing on top of the previous:
Architecture
EBLS (Empirical Bayes Layer Sharing): 3 shared transformer blocks looped 3x + 2 unique = 11 layers. Per-virtual-layer LoRA rank 8. 512d, 8 heads, 4 KV heads (GQA), MLP 3x LeakyReLU(0.5)^2, XSA-all(11), VRL(1-10), Val-GPTQ int6 + LZMA preset 9. 27.1M parameters.
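The virtual-layer layout described above can be made concrete with a small schedule expander. This is a sketch, not the training code: the function name and tuple layout are hypothetical; it only demonstrates the stated arithmetic (3 shared blocks looped 3x + 2 unique = 11 virtual layers, each with its own rank-8 LoRA adapter while sharing base weights).

```python
def ebls_schedule(shared=3, loops=3, unique=2):
    """Expand the EBLS layer plan into virtual layers (hypothetical helper).

    Each entry is (kind, physical_block_index, lora_index): shared blocks
    are reused across loops, but every virtual layer gets its own LoRA
    adapter, so adapters outnumber physical blocks."""
    layers = []
    for _ in range(loops):
        for b in range(shared):
            layers.append(("shared", b, len(layers)))
    for u in range(unique):
        layers.append(("unique", u, len(layers)))
    return layers
```

With the defaults this yields 11 virtual layers backed by only 5 physical transformer blocks, which is where the parameter savings relative to a plain 11-layer stack come from.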
Compliance
Legality
N-gram caching legality has not been formally resolved by OpenAI. @valerio-oai commented on PR #659 that it "is not illegal" and suggested entropy-based gating, but no definitive ruling has been issued. We believe our implementation is compliant — strictly backward-looking, score-first, no training data at eval time — but we respect whatever ruling is made.
We also maintain a separate neural-only submission (PR #734, 1.1198 BPB) that uses no n-gram techniques.
We welcome discussion — if there are concerns about any aspect of the approach, we're happy to address them.
Credits
This builds on a lot of community work:
Techniques we adopted:
N-gram cache lineage:
Architecture foundations:
Our novel contributions: distributed cache pre-fill, 15-gram extension, order-adaptive entropy gating, the combination/integration work, and the EBLS training architecture.
Feedback, questions, and corrections welcome.