Record: 0.2880 BPB — Complementary Training + Per-Order Multipliers + Distributed Prefill + 15-Gram + EBLS #796
Open
Robby955 wants to merge 3 commits into openai:main from
Conversation
3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003) 8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB. Key innovation: distributed cache pre-fill using pure numpy. Each GPU rank pre-populates n-gram hash tables with ALL preceding token positions before scoring, producing results mathematically identical to single-GPU sequential evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
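The pre-fill idea above can be sketched in pure Python/numpy. This is a minimal illustration, not the submission's code: `prefill_and_score` is a hypothetical name, and the "most frequent continuation" prediction rule is an assumption. The key property it demonstrates is the one claimed in the commit message: each rank first inserts every position preceding its shard into the table, then scores its shard score-first, so the union of all ranks' outputs equals single-GPU sequential evaluation.

```python
import numpy as np

def prefill_and_score(tokens, order, rank, world_size):
    """Sketch of distributed n-gram cache pre-fill (hypothetical helper).

    Each rank scores one contiguous shard of positions, but first
    pre-populates its n-gram table with EVERY position that precedes
    the shard, so lookups match single-GPU sequential evaluation."""
    n = len(tokens)
    shard = np.array_split(np.arange(order, n), world_size)[rank]
    table = {}  # context tuple -> list of continuations seen so far
    # Pre-fill: insert all contexts that end before this shard begins.
    for i in range(order, int(shard[0])):
        ctx = tuple(tokens[i - order:i])
        table.setdefault(ctx, []).append(tokens[i])
    preds = {}
    for i in shard:
        ctx = tuple(tokens[i - order:i])
        hits = table.get(ctx, [])
        # Score first (assumed rule: most frequent continuation)...
        preds[int(i)] = max(set(hits), key=hits.count) if hits else None
        # ...then insert the current position, as sequential eval would.
        table.setdefault(ctx, []).append(tokens[i])
    return preds
```

Merging the per-rank outputs reproduces the sequential result exactly, which is why no NCCL communication is needed during scoring.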
…ptive gating 3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Down from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).
nice 🔥🔥🔥🔥
Add complementary training (from @pentxayc, openai#803) and per-order multipliers (from @AayushBaniya2006, openai#809) on top of distributed prefill + 15-gram + order-adaptive gating. New 3-seed results: 0.28798 / 0.28804 / 0.28810. All seeds under 16MB, training under 560s, eval under 330s. Updated README with legality hedge, full ablation, and credits.
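The complementary-training idea referenced above can be sketched as a per-token loss weighting. This is an illustrative reconstruction, not the PR's code: the function name is hypothetical, and the linear warmup ramp is an assumption (only COMP_ALPHA=0.50 and the 200-step warmup are stated in the description). Tokens the n-gram cache already predicts correctly get their cross-entropy loss downweighted, pushing the neural model toward what caching can't handle.

```python
import numpy as np

def complementary_loss_weights(ngram_correct, step, alpha=0.50, warmup=200):
    """Sketch of complementary training (hypothetical helper).

    ngram_correct: boolean array, True where the n-gram cache already
    predicts the target token. Those positions get their loss scaled
    by (1 - alpha); the downweighting ramps in over `warmup` steps
    (linear ramp is an assumption)."""
    ramp = min(step / warmup, 1.0)
    weights = np.ones(ngram_correct.shape, dtype=np.float64)
    weights[ngram_correct] -= alpha * ramp
    return weights
```

The returned weights would multiply the per-token loss before reduction, so with alpha=0.5 a cache-predictable token contributes half as much gradient once warmup completes.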
Summary
val_bpb: 0.2880 (3-seed mean, std 0.00006) | ~15.3 MB | 8xH100 SXM | 560s train + ~330s eval
Major update from our previous 0.4374 submission. Two additional techniques stacked on top:
- Complementary training (from @pentxayc, #803): downweight loss on n-gram-predictable tokens during training, so the neural model specializes on what caching can't handle. COMP_ALPHA=0.50, orders 2-5, 200-step warmup.
- Per-order multipliers (from @AayushBaniya2006, #809): bigrams/trigrams suppressed to 0.3x alpha, orders 5-15 boosted to 2.0x, capped at alpha_max=0.95.

Plus our previous contributions:

- Distributed cache pre-fill: each GPU rank pre-populates 15-gram hash tables with ALL preceding positions via vectorized numpy, making 8-GPU eval mathematically identical to single-GPU sequential. No NCCL needed.
- Order-adaptive entropy gating (inspired by @travispchen, #798): per-order entropy thresholds. 15-gram matches are trusted aggressively (center=2.5); bigrams only when the model is confused (center=4.5).
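The interaction of per-order multipliers and entropy gating can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the function names are hypothetical, the multiplier for order 4 (between the suppressed 2-3 and boosted 5-15 bands) is assumed to be 1.0, the sigmoid gate shape and its width are assumptions, and the gate centers are interpolated linearly between the two stated anchors (2.5 for 15-grams, 4.5 for bigrams).

```python
import numpy as np

ALPHA_MAX = 0.95  # cap on the final mixing weight, per the description

def order_multiplier(order):
    """Per-order alpha multipliers: bigrams/trigrams suppressed to 0.3x,
    orders 5-15 boosted to 2.0x. Order 4 at 1.0x is an assumption."""
    if order <= 3:
        return 0.3
    if order >= 5:
        return 2.0
    return 1.0

def entropy_gate(entropy, order, width=1.0):
    """Sigmoid gate on model entropy: a 15-gram match is trusted even at
    low entropy (center 2.5), a bigram only when the model is confused
    (center 4.5). Linear interpolation of centers is an assumption."""
    center = np.interp(order, [2, 15], [4.5, 2.5])
    return 1.0 / (1.0 + np.exp(-(entropy - center) / width))

def mix(model_probs, ngram_probs, order, base_alpha, entropy):
    """Blend model and n-gram distributions with the gated, capped alpha."""
    alpha = base_alpha * order_multiplier(order) * entropy_gate(entropy, order)
    alpha = min(alpha, ALPHA_MAX)
    return (1 - alpha) * model_probs + alpha * ngram_probs
```

Because alpha is a convex mixing weight capped below 1, the output remains a valid probability distribution whenever both inputs are.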
3-seed results
How we got here (ablation)
Each row adds one thing on top of the previous:
Architecture
EBLS (Empirical Bayes Layer Sharing): 3 shared transformer blocks looped 3x + 2 unique = 11 layers. Per-virtual-layer LoRA rank 8. 512d, 8 heads, 4 KV heads (GQA), MLP 3x LeakyReLU(0.5)^2, XSA-all(11), VRL(1-10), Val-GPTQ int6 + LZMA preset 9. 27.1M parameters.
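The virtual-layer layout described above can be made concrete with a small schedule expander. This is a sketch, not the training code: the function name and tuple layout are hypothetical; it only demonstrates the stated arithmetic (3 shared blocks looped 3x + 2 unique = 11 virtual layers, each with its own rank-8 LoRA adapter while sharing base weights).

```python
def ebls_schedule(shared=3, loops=3, unique=2):
    """Expand the EBLS layer plan into virtual layers (hypothetical helper).

    Each entry is (kind, physical_block_index, lora_index): shared blocks
    are reused across loops, but every virtual layer gets its own LoRA
    adapter, so adapters outnumber physical blocks."""
    layers = []
    for _ in range(loops):
        for b in range(shared):
            layers.append(("shared", b, len(layers)))
    for u in range(unique):
        layers.append(("unique", u, len(layers)))
    return layers
```

With the defaults this yields 11 virtual layers backed by only 5 physical transformer blocks, which is where the parameter savings relative to a plain 11-layer stack come from.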
Compliance
Legality
N-gram caching legality has not been formally resolved by OpenAI. @valerio-oai commented on PR #659 that it "is not illegal" and suggested entropy-based gating, but no definitive ruling has been issued. We believe our implementation is compliant — strictly backward-looking, score-first, no training data at eval time — but we respect whatever ruling is made.
We also maintain a separate neural-only submission (PR #734, 1.1198 BPB) that uses no n-gram techniques.
We welcome discussion — if there are concerns about any aspect of the approach, we're happy to address them.
Credits
This builds on a lot of community work:
Techniques we adopted:
N-gram cache lineage:
Architecture foundations:
Our novel contributions: distributed cache pre-fill, 15-gram extension, order-adaptive entropy gating, the combination/integration work, and the EBLS training architecture.
Feedback, questions, and corrections welcome.