Record: 0.1663 BPB - N-gram-Aware Training + Frozen N-gram Oracle + Backoff TTT#834
Open
AnirudhRahul wants to merge 1 commit into openai:main from
Conversation
… + Backoff TTT

Replaces the heuristic entropy-adaptive alpha with a learned 7-expert gate (Linear 512→7) that routes between the neural model and n-gram orders 2-7. The gate is trained end-to-end during the main training loop using a frozen n-gram oracle pre-computed from training data (counted within wallclock).

3-seed results (8xH100 SXM, 600s):
- seed 1337: val_bpb=0.1661 (15.74 MB)
- seed 42: val_bpb=0.1663 (15.76 MB)
- seed 2024: val_bpb=0.1666 (15.25 MB)
- mean: val_bpb=0.1663 (std=0.0003)

Cleanup: removed dead code (adaptive LR, Polyak averaging, scalar mixer path, unused function params). Added a detailed order-of-operations section to the README proving the legality of the training and evaluation procedure.

Based on PR openai#779 (deanbrr): BackoffNgramMixer + DriftFreeTTT architecture.

Made-with: Cursor
878b7ed to 772ecb2
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request on Mar 26, 2026:
PR openai#834 inspired architecture:
- GpuNgramMixer: GPU-native n-gram backoff with torch.scatter_add_
- GateHead: Linear(512→7) softmax gate with neural floor
- Frozen oracle: pre-fill from all training shards at startup
- Gate trained jointly with model (0.1x auxiliary loss weight)
- KNNDatastore: GPU cosine similarity search (eval-time complement)

Enable: GATE_ENABLED=1
Config: GATE_ORDERS=6 GATE_BUCKETS=1048576

Eval integration pending (next commit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
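The GPU-native counting the commit describes can be sketched with a single `torch.scatter_add_` call over hashed context buckets. This is a hedged illustration only: the function names, the polynomial hashing scheme, and the flattened (bucket, token) indexing are assumptions, not the actual GpuNgramMixer code.

```python
import torch

def hash_contexts(tokens, order, num_buckets):
    """Hash each length-`order` context window to a bucket id."""
    windows = tokens.unfold(0, order, 1)[:-1]   # one context per predicted token
    pows = torch.full((order,), 1000003, dtype=torch.int64).cumprod(0)
    return (windows * pows).sum(-1) % num_buckets  # simple polynomial hash

def count_ngrams(tokens, vocab, order, num_buckets):
    """Accumulate (bucket, next_token) counts in one scatter_add_ call."""
    buckets = hash_contexts(tokens, order, num_buckets)
    nxt = tokens[order:]                        # the token each context predicts
    flat = buckets * vocab + nxt                # flatten the 2-D (bucket, token) index
    counts = torch.zeros(num_buckets * vocab)
    counts.scatter_add_(0, flat, torch.ones_like(flat, dtype=torch.float))
    return counts.view(num_buckets, vocab)
```

With `GATE_BUCKETS=1048576` the count table stays a fixed-size dense tensor on the GPU, so filling it from all training shards at startup is a streaming pass with no host-side dictionaries.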
Summary
A learned gate head (Linear(512 -> 7)) predicts, for each token, how much to rely on the neural model versus n-gram orders 2-7, and adds -log(p_mix) to the training loss. Gradients flow through the softmax gate into the new output head and back into the transformer hidden state, teaching the model when the n-gram is reliable and when it should fall back to the neural distribution.

How the Output Head Is Trained
For each training token t:
- Take the transformer hidden state h_t.
- Map h_t to 7 gate logits: one for the neural model and one for each n-gram order 2..7.
- Softmax the logits into weights w_{t,i} and mix the expert distributions: p_mix(t) = sum_i w_{t,i} * p_{t,i}.
- Add L_mix = -log(p_mix(t)) to the usual cross-entropy loss.
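The steps above can be sketched as a small PyTorch module. This is a minimal illustration, not the PR's code: the class name, tensor shapes, and the stacking of expert distributions into one `(T, n_experts, vocab)` tensor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateHead(nn.Module):
    """Softmax gate over 7 experts: the neural model plus n-gram orders 2..7."""
    def __init__(self, d_model=512, n_experts=7):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)

    def forward(self, h, expert_probs):
        # h: (T, d_model) hidden states; expert_probs: (T, n_experts, vocab),
        # where the oracle rows are fixed lookups (no grad into the tables)
        w = F.softmax(self.proj(h), dim=-1)              # (T, n_experts) weights
        p_mix = (w.unsqueeze(-1) * expert_probs).sum(1)  # (T, vocab) mixture
        return p_mix

def mixture_loss(p_mix, targets, eps=1e-9):
    """L_mix = -log p_mix(t)[target_t], averaged over tokens."""
    picked = p_mix.gather(-1, targets.unsqueeze(-1)).clamp_min(eps)
    return -picked.log().mean()
```

Because the loss is taken on the mixed distribution, gradients reach both the gate weights and (through `h`) the transformer, exactly the end-to-end signal the text describes.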
This means the gate head is not trained by distilling toward a heuristic alpha target. It is trained directly from the same next-token prediction signal as the main model. The oracle probabilities are treated as fixed lookup values, so gradients do not flow into the n-gram tables; they flow through the gate weights into the output head and transformer.
3-Seed Results
Key Design Decisions
Learned routing head
A Linear(512 -> 7) head reads the transformer hidden state and produces logits over 7 experts: the neural model plus n-gram orders 2-7.

Frozen n-gram oracle for training efficiency
The training-time n-gram tables are precomputed once and then frozen. During training we only do lookups, not live updates. This is an efficiency trick to keep the method fast enough for the 10-minute budget.
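A frozen oracle of this kind reduces to a plain tensor read at training time. The sketch below is an assumption-laden illustration: the class name and the add-one smoothing used to turn counts into probabilities are not taken from the PR.

```python
import torch

class FrozenNgramOracle:
    """Precomputed n-gram table: filled once before training, then read-only."""
    def __init__(self, counts: torch.Tensor, alpha: float = 1.0):
        # counts: (num_buckets, vocab), precomputed from the training shards.
        # Smooth and normalize once; store as a plain tensor (no parameters).
        smoothed = counts + alpha
        self.probs = smoothed / smoothed.sum(-1, keepdim=True)

    @torch.no_grad()
    def lookup(self, bucket_ids: torch.Tensor) -> torch.Tensor:
        # Pure table read during training; gradients never flow in here.
        return self.probs[bucket_ids]
```

Since the lookup is wrapped in `no_grad` and the table is not an `nn.Parameter`, the only learnable path is through the gate, matching the "fixed lookup values" described earlier.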
Causal eval procedure
Evaluation uses a fresh mixer built only from validation history. Each chunk is scored first, then added to the cache, then used for TTT.
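The score-then-update ordering can be made explicit in a short loop. The function below is a hedged sketch: `score_chunk`, `update_cache`, and `ttt_step` are hypothetical callables standing in for the PR's actual mixer and TTT interfaces.

```python
def causal_eval(chunks, score_chunk, update_cache, ttt_step):
    """Score each chunk first, then cache it, then adapt on it (TTT)."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += score_chunk(chunk)  # 1) score before any state update
        total_tokens += len(chunk)
        update_cache(chunk)              # 2) then extend the validation n-gram cache
        ttt_step(chunk)                  # 3) finally test-time-train on the chunk
    return total_nll / total_tokens     # average NLL per token
```

Keeping the three calls in this fixed order is what makes the procedure causal: no chunk ever influences the statistics or weights used to score it.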
Compliance
torch.compile happens before the wallclock starts and uses dummy data only.

Test Plan
seed 1337: 0.1661
seed 42: 0.1663
seed 2024: 0.1666