Record: 0.1663 BPB - N-gram-Aware Training + Frozen N-gram Oracle + Backoff TTT#834
Open
AnirudhRahul wants to merge 1 commit into openai:main from
Conversation
… + Backoff TTT

Replaces the heuristic entropy-adaptive alpha with a learned 7-expert gate (Linear 512→7) that routes between the neural model and n-gram orders 2-7. The gate is trained end-to-end during the main training loop using a frozen n-gram oracle pre-computed from training data (counted within wallclock).

3-seed results (8xH100 SXM, 600s):
- seed 1337: val_bpb=0.1661 (15.74 MB)
- seed 42: val_bpb=0.1663 (15.76 MB)
- seed 2024: val_bpb=0.1666 (15.25 MB)
- mean: val_bpb=0.1663 (std=0.0003)

Cleanup: removed dead code (adaptive LR, Polyak averaging, scalar mixer path, unused function params). Added a detailed order-of-operations section to the README proving the legality of the training and evaluation procedure.

Based on PR openai#779 (deanbrr): BackoffNgramMixer + DriftFreeTTT architecture.

Made-with: Cursor
878b7ed to 772ecb2
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request on Mar 26, 2026:
PR openai#834 inspired architecture:
- GpuNgramMixer: GPU-native n-gram backoff with torch.scatter_add_
- GateHead: Linear(512→7) softmax gate with neural floor
- Frozen oracle: pre-fill from all training shards at startup
- Gate trained jointly with model (0.1x auxiliary loss weight)
- KNNDatastore: GPU cosine similarity search (eval-time complement)

Enable: GATE_ENABLED=1
Config: GATE_ORDERS=6 GATE_BUCKETS=1048576

Eval integration pending (next commit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
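The GPU-native counting the commit describes can be sketched with a single `torch.scatter_add_` call over hashed context buckets. This is a hedged illustration only: the function names, the polynomial hashing scheme, and the flattened (bucket, token) indexing are assumptions, not the actual GpuNgramMixer code.

```python
import torch

def hash_contexts(tokens, order, num_buckets):
    """Hash each length-`order` context window to a bucket id."""
    windows = tokens.unfold(0, order, 1)[:-1]   # one context per predicted token
    pows = torch.full((order,), 1000003, dtype=torch.int64).cumprod(0)
    return (windows * pows).sum(-1) % num_buckets  # simple polynomial hash

def count_ngrams(tokens, vocab, order, num_buckets):
    """Accumulate (bucket, next_token) counts in one scatter_add_ call."""
    buckets = hash_contexts(tokens, order, num_buckets)
    nxt = tokens[order:]                        # the token each context predicts
    flat = buckets * vocab + nxt                # flatten the 2-D (bucket, token) index
    counts = torch.zeros(num_buckets * vocab)
    counts.scatter_add_(0, flat, torch.ones_like(flat, dtype=torch.float))
    return counts.view(num_buckets, vocab)
```

With `GATE_BUCKETS=1048576` the count table stays a fixed-size dense tensor on the GPU, so filling it from all training shards at startup is a streaming pass with no host-side dictionaries.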
Summary
A learned gate head (Linear(512 -> 7)) predicts, for each token, how much to rely on the neural model versus n-gram orders 2-7, and adds -log(p_mix) to the training loss. Gradients flow through the softmax gate into the new output head and back into the transformer hidden state, teaching the model when the n-gram is reliable and when it should fall back to the neural distribution.

How the Output Head Is Trained
For each training token t:
- Take the transformer hidden state h_t.
- Map h_t to 7 gate logits: one for the neural model and one for each n-gram order 2..7.
- Softmax the logits into weights w_{t,i} and mix the expert distributions: p_mix(t) = sum_i w_{t,i} * p_{t,i}.
- Add L_mix = -log(p_mix(t)) to the usual cross-entropy loss.
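The steps above can be sketched as a small PyTorch module. This is a minimal illustration, not the PR's code: the class name, tensor shapes, and the stacking of expert distributions into one `(T, n_experts, vocab)` tensor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GateHead(nn.Module):
    """Softmax gate over 7 experts: the neural model plus n-gram orders 2..7."""
    def __init__(self, d_model=512, n_experts=7):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)

    def forward(self, h, expert_probs):
        # h: (T, d_model) hidden states; expert_probs: (T, n_experts, vocab),
        # where the oracle rows are fixed lookups (no grad into the tables)
        w = F.softmax(self.proj(h), dim=-1)              # (T, n_experts) weights
        p_mix = (w.unsqueeze(-1) * expert_probs).sum(1)  # (T, vocab) mixture
        return p_mix

def mixture_loss(p_mix, targets, eps=1e-9):
    """L_mix = -log p_mix(t)[target_t], averaged over tokens."""
    picked = p_mix.gather(-1, targets.unsqueeze(-1)).clamp_min(eps)
    return -picked.log().mean()
```

Because the loss is taken on the mixed distribution, gradients reach both the gate weights and (through `h`) the transformer, exactly the end-to-end signal the text describes.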
This means the gate head is not trained by distilling toward a heuristic alpha target. It is trained directly from the same next-token prediction signal as the main model. The oracle probabilities are treated as fixed lookup values, so gradients do not flow into the n-gram tables; they flow through the gate weights into the output head and transformer.
3-Seed Results
Key Design Decisions
Learned routing head
A Linear(512 -> 7) head reads the transformer hidden state and produces logits over 7 experts: the neural model plus n-gram orders 2-7.

Frozen n-gram oracle for training efficiency
The training-time n-gram tables are precomputed once and then frozen. During training we only do lookups, not live updates. This is an efficiency trick to keep the method fast enough for the 10-minute budget.
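A frozen oracle of this kind reduces to a plain tensor read at training time. The sketch below is an assumption-laden illustration: the class name and the add-one smoothing used to turn counts into probabilities are not taken from the PR.

```python
import torch

class FrozenNgramOracle:
    """Precomputed n-gram table: filled once before training, then read-only."""
    def __init__(self, counts: torch.Tensor, alpha: float = 1.0):
        # counts: (num_buckets, vocab), precomputed from the training shards.
        # Smooth and normalize once; store as a plain tensor (no parameters).
        smoothed = counts + alpha
        self.probs = smoothed / smoothed.sum(-1, keepdim=True)

    @torch.no_grad()
    def lookup(self, bucket_ids: torch.Tensor) -> torch.Tensor:
        # Pure table read during training; gradients never flow in here.
        return self.probs[bucket_ids]
```

Since the lookup is wrapped in `no_grad` and the table is not an `nn.Parameter`, the only learnable path is through the gate, matching the "fixed lookup values" described earlier.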
Causal eval procedure
Evaluation uses a fresh mixer built only from validation history. Each chunk is scored first, then added to the cache, then used for TTT.
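The score-then-update ordering can be made explicit in a short loop. The function below is a hedged sketch: `score_chunk`, `update_cache`, and `ttt_step` are hypothetical callables standing in for the PR's actual mixer and TTT interfaces.

```python
def causal_eval(chunks, score_chunk, update_cache, ttt_step):
    """Score each chunk first, then cache it, then adapt on it (TTT)."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += score_chunk(chunk)  # 1) score before any state update
        total_tokens += len(chunk)
        update_cache(chunk)              # 2) then extend the validation n-gram cache
        ttt_step(chunk)                  # 3) finally test-time-train on the chunk
    return total_nll / total_tokens     # average NLL per token
```

Keeping the three calls in this fixed order is what makes the procedure causal: no chunk ever influences the statistics or weights used to score it.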
Compliance
torch.compile happens before the wallclock starts and uses dummy data only.

Test Plan
seed 1337: 0.1661
seed 42: 0.1663
seed 2024: 0.1666