
Record: 0.1663 BPB - N-gram-Aware Training + Frozen N-gram Oracle + Backoff TTT#834

Open
AnirudhRahul wants to merge 1 commit into openai:main from AnirudhRahul:learned-multi-expert-gate-frozen-oracle

Conversation


@AnirudhRahul AnirudhRahul commented Mar 26, 2026

Summary

  • PR #779 (BackoffNgramMixer + Drift-Free TTT, 3-seed mean val_bpb=0.6683) showed that n-gram mixing is very strong, but it used a hand-written entropy heuristic to decide when to trust the n-gram. The goal here is to let the model learn that decision itself.
  • To do that, we add a small output head on top of the transformer (Linear(512 -> 7)) that predicts, for each token, how much to rely on the neural model versus n-gram orders 2-7.
  • During training, that head is optimized directly against the true next-token objective: for each token we compute the probability assigned to the correct next token by each expert, form a weighted mixture using the learned gate weights, and add -log(p_mix) to the training loss. Gradients flow through the softmax gate into the new output head and back into the transformer hidden state, teaching the model when the n-gram is reliable and when it should fall back to the neural distribution.
  • We use a frozen n-gram oracle as an efficiency trick for training: prefill the n-gram tables once at startup, count that work inside the 10-minute wallclock budget, and then keep the tables read-only during optimization. This avoids live cache-update overhead during training and makes end-to-end gate learning practical.
  • This record submission achieves 0.1663 BPB mean over 3 seeds on the 10min / 16MB track.

How the Output Head Is Trained

For each training token t:

  1. The transformer produces hidden state h_t.
  2. The new output head maps h_t to 7 gate logits: one for the neural model and one for each n-gram order 2..7.
  3. A masked softmax turns those logits into expert weights, masking out n-gram orders that are not valid for that context.
  4. In parallel, the frozen oracle provides the n-gram probabilities for the correct next token, and the neural model provides its own probability for that same token.
  5. We form the mixed probability assigned to the correct token:
    p_mix(t) = sum_i w_{t,i} * p_{t,i}
  6. We add the mixer loss
    L_mix = -log(p_mix(t))
    to the usual cross-entropy loss.

This means the gate head is not trained by distilling toward a heuristic alpha target. It is trained directly from the same next-token prediction signal as the main model. The oracle probabilities are treated as fixed lookup values, so gradients do not flow into the n-gram tables; they flow through the gate weights into the output head and transformer.
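The training step above can be sketched as follows. This is a minimal illustration under assumed shapes and names (`gate_head`, `neural_p`, `ngram_p`, `ngram_valid` are hypothetical, not the repo's actual identifiers); the oracle probabilities are detached so gradients only reach the gate head and transformer, as described.

```python
# Sketch of the mixer loss (assumed names/shapes; PyTorch).
import torch
import torch.nn.functional as F

def mixer_loss(hidden, gate_head, neural_p, ngram_p, ngram_valid):
    """
    hidden:      (B, T, 512)  transformer hidden states
    gate_head:   nn.Linear(512, 7) -> logits over [neural, n-gram orders 2..7]
    neural_p:    (B, T)       neural model's prob of the correct next token
    ngram_p:     (B, T, 6)    frozen-oracle probs for orders 2..7 (no grad)
    ngram_valid: (B, T, 6)    bool mask, False where an order has no valid context
    """
    logits = gate_head(hidden)                                   # (B, T, 7)
    # Masked softmax: the neural expert is always a valid choice.
    mask = torch.cat([torch.ones_like(ngram_valid[..., :1]), ngram_valid], dim=-1)
    w = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
    # Oracle probs are fixed lookups: detach so no gradient enters the tables.
    p_expert = torch.cat([neural_p.unsqueeze(-1), ngram_p.detach()], dim=-1)
    p_mix = (w * p_expert).sum(-1).clamp_min(1e-9)               # p_mix(t)
    return -p_mix.log().mean()                                   # L_mix
```

Because `L_mix` is a plain negative log-likelihood of the mixture, the gate learns to upweight whichever expert assigns the correct token higher probability, with no heuristic target involved.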

3-Seed Results

Seed   Post-TTT BPB   Artifact
1337   0.1661         15.74 MB
42     0.1663         15.76 MB
2024   0.1666         15.25 MB
Mean   0.1663

Key Design Decisions

  1. Learned routing head
    A Linear(512 -> 7) head reads the transformer hidden state and produces logits over 7 experts: the neural model plus n-gram orders 2-7.

  2. Frozen n-gram oracle for training efficiency
    The training-time n-gram tables are precomputed once and then frozen. During training we only do lookups, not live updates. This is an efficiency trick to keep the method fast enough for the 10-minute budget.

  3. Causal eval procedure
    Evaluation uses a fresh mixer built only from validation history. Each chunk is scored first, then added to the cache, then used for TTT.
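The frozen-oracle idea in decision 2 can be illustrated with a toy count table (the class and method names here are hypothetical; the real tables are presumably GPU-resident and hashed, not Python dicts). The key property is that `prefill` runs once inside the wallclock and `prob` is a pure lookup afterward:

```python
# Toy frozen n-gram oracle: count once, then read-only lookups.
from collections import defaultdict

class FrozenNgramOracle:
    def __init__(self, orders=(2, 3, 4, 5, 6, 7)):
        self.orders = orders
        self.counts = {n: defaultdict(lambda: defaultdict(int)) for n in orders}
        self.totals = {n: defaultdict(int) for n in orders}

    def prefill(self, tokens):
        # One pass over the training stream; this work is counted
        # inside the wallclock budget.
        for n in self.orders:
            for i in range(n - 1, len(tokens)):
                ctx = tuple(tokens[i - n + 1:i])
                self.counts[n][ctx][tokens[i]] += 1
                self.totals[n][ctx] += 1

    def prob(self, context, token, n):
        # Lookup only: no live cache updates during training.
        ctx = tuple(context[-(n - 1):])
        total = self.totals[n].get(ctx)
        if not total:
            return None  # order n has no context here -> masked out of the gate
        return self.counts[n][ctx].get(token, 0) / total
```

Returning `None` for unseen contexts corresponds to the validity mask used by the masked softmax over experts.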
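The causal eval ordering in decision 3 reduces to a strict per-chunk sequence. A sketch with placeholder callables (`score_chunk`, `update_cache`, `ttt_step` are illustrative names, not the repo's API): each chunk is scored before it can influence either the n-gram cache or the model weights.

```python
# Causal eval loop sketch: score -> cache update -> TTT, per chunk.
def evaluate(chunks, score_chunk, update_cache, ttt_step):
    total_bpb = 0.0
    for chunk in chunks:
        total_bpb += score_chunk(chunk)  # 1. score first (chunk still unseen)
        update_cache(chunk)              # 2. then fold the chunk into the cache
        ttt_step(chunk)                  # 3. then test-time-train on it
    return total_bpb / len(chunks)
```

Any reordering (e.g. updating the cache before scoring) would leak the chunk into its own score, which is what the compliance checklist below rules out.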

Compliance

  • N-gram prefill is counted inside the 10-minute wallclock
  • torch.compile happens before wallclock and uses dummy data only
  • Eval cache is fresh and causal
  • Each chunk is scored before cache update or TTT
  • Artifact is under 16 MB

Test Plan

  • Seed 1337: 0.1661
  • Seed 42: 0.1663
  • Seed 2024: 0.1666
  • All artifacts under 16 MB
  • Prefill counted inside wallclock
  • Eval causality verified

… + Backoff TTT

Replaces the heuristic entropy-adaptive alpha with a learned 7-expert gate
(Linear 512→7) that routes between the neural model and n-gram orders 2-7.
The gate is trained end-to-end during the main training loop using a frozen
n-gram oracle pre-computed from training data (counted within wallclock).

3-seed results (8xH100 SXM, 600s):
  seed 1337: val_bpb=0.1661 (15.74 MB)
  seed 42:   val_bpb=0.1663 (15.76 MB)
  seed 2024: val_bpb=0.1666 (15.25 MB)
  mean:      val_bpb=0.1663 (std=0.0003)

Cleanup: removed dead code (adaptive LR, Polyak averaging, scalar mixer
path, unused function params). Added detailed order-of-operations to
README proving legality of the training and evaluation procedure.

Based on PR openai#779 (deanbrr) BackoffNgramMixer + DriftFreeTTT architecture.

Made-with: Cursor
@AnirudhRahul force-pushed the learned-multi-expert-gate-frozen-oracle branch from 878b7ed to 772ecb2 on March 26, 2026 08:23
@AnirudhRahul changed the title from "Record: 0.1663 BPB — Learned Multi-Expert Gate + Frozen N-gram Oracle + Backoff TTT (10min, 16MB)" to "Record: 0.1663 BPB - Learned Multi-Expert Gate + Frozen N-gram Oracle + Backoff TTT (10min, 16MB)" on Mar 26, 2026
@AnirudhRahul changed the title from "Record: 0.1663 BPB - Learned Multi-Expert Gate + Frozen N-gram Oracle + Backoff TTT (10min, 16MB)" to "Record: 0.1663 BPB - N-gram-Aware Training + Frozen N-gram Oracle + Backoff TTT" on Mar 26, 2026
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request Mar 26, 2026
PR openai#834 inspired architecture:
- GpuNgramMixer: GPU-native n-gram backoff with torch.scatter_add_
- GateHead: Linear(512→7) softmax gate with neural floor
- Frozen oracle: pre-fill from all training shards at startup
- Gate trained jointly with model (0.1x auxiliary loss weight)
- KNNDatastore: GPU cosine similarity search (eval-time complement)

Enable: GATE_ENABLED=1
Config: GATE_ORDERS=6 GATE_BUCKETS=1048576

Eval integration pending (next commit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>