Record: 11L Parallel Muon + N-gram Backoff Cache — val_bpb 0.2841 (3-seed mean)#864

Closed
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/ngram-backoff-0.2841

Conversation

@aryanbhosale

Record: 11L Parallel Muon + N-gram Backoff Cache

val_bpb = 0.2841 (3-seed mean, std 0.0001) | ~15.85 MB | 8×H100 SXM

3-Seed Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | EMA bpb | Quantized bpb | N-gram bpb |
|------|----------|-------|---------|---------------|------------|
| 1337 | 88.6ms | 6,774 | 1.1193 | 1.1270 | 0.2841 |
| 42 | 88.8ms | 6,757 | 1.1194 | 1.1276 | 0.2840 |
| 2024 | 88.7ms | 6,769 | 1.1191 | 1.1275 | 0.2840 |
| **Mean** | 88.7ms | 6,767 | 1.1193 | 1.1274 | 0.2841 |

Key Innovation: N-gram Backoff Cache (Eval-Time Only)

An order 2-9, backward-looking n-gram cache with entropy-adaptive alpha blending, updated in 65K-token chunks:

  • For each scored token, blend the model's P(token) with the n-gram frequency estimate of P(token)
  • Alpha adapts to model entropy and n-gram order: high entropy plus high order means more n-gram weight
  • Per-order multipliers: orders 2-3 suppressed (0.3x), orders 5-9 boosted (2.0x)
  • Cache is updated ONLY after each 65K-token chunk has been scored (strictly backward-looking)
  • 4M hash buckets, XOR-of-products hashing
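The bullets above can be sketched as a minimal, CPU-only cache. Class and method names, the hash multipliers, and the exact alpha schedule are assumptions; the orders (2-9), per-order multipliers, ~4M bucket count, XOR-of-products hashing, and the update-only-after-scoring rule follow the description:

```python
from collections import Counter

NUM_BUCKETS = 1 << 22                          # ~4M hash buckets
ORDERS = range(2, 10)                          # backoff orders 2..9
ORDER_MULT = {n: (0.3 if n <= 3 else 2.0 if n >= 5 else 1.0) for n in ORDERS}
MULTS = [(2654435761 * (i + 1)) | 1 for i in range(8)]  # odd per-position multipliers (assumed)

def bucket(ctx):
    """XOR-of-products hash of a context tuple into NUM_BUCKETS."""
    h = 0
    for i, tok in enumerate(ctx):
        h ^= (tok * MULTS[i]) & 0xFFFFFFFF
    return h % NUM_BUCKETS

class NGramCache:
    def __init__(self, base_alpha=0.5):
        # per-order: bucket -> Counter of next-token counts
        self.counts = {n: {} for n in ORDERS}
        self.base_alpha = base_alpha

    def update(self, chunk):
        """Ingest a chunk ONLY after it has been scored (strictly backward-looking)."""
        for n in ORDERS:
            for j in range(n - 1, len(chunk)):
                b = bucket(tuple(chunk[j - n + 1:j]))
                self.counts[n].setdefault(b, Counter())[chunk[j]] += 1

    def blended_prob(self, history, token, p_model, entropy):
        """Back off from the highest matching order; the first cache hit wins."""
        for n in sorted(ORDERS, reverse=True):
            if len(history) < n - 1:
                continue
            counter = self.counts[n].get(bucket(tuple(history[-(n - 1):])))
            if counter and token in counter:
                p_ngram = counter[token] / sum(counter.values())
                # entropy-adaptive alpha: more n-gram weight at high model
                # entropy and high order (exact schedule is an assumption)
                alpha = min(0.95, self.base_alpha * entropy * ORDER_MULT[n])
                return (1 - alpha) * p_model + alpha * p_ngram
        return p_model
```

On a repetitive stream, a context seen in an earlier chunk yields a sharp p_ngram, and the blend pulls the scored probability far above the model's own, which is the mechanism behind the headline BPB drop.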

The n-gram cache cuts eval BPB roughly 4x (1.1274 -> 0.2841) by exploiting repeated phrases and patterns in the eval stream.

Architecture (26.8M params)

11L 512d, 8H/4KV (GQA), MLP 3x LeakyReLU(0.5)², Parallel Muon (parameter banking + batched NS5), SmearGate, BigramHash(1024), Value Residual, Gated Attention, XSA4, Partial RoPE(16/64), EMA(0.997)+SWA, Late QAT, GPTQ-lite int6+zstd-22, FA3, torch.compile(fullgraph=True).
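The headline dimensions in the summary above can be collected into a config sketch. Field names are assumptions; only the values come from the PR description, and the parameter check covers the attention + MLP core only:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 11        # "11L"
    d_model: int = 512       # "512d"
    n_head: int = 8          # 8 query heads
    n_kv_head: int = 4       # 4 KV heads -> GQA, 2 query heads per KV head
    head_dim: int = 64       # d_model // n_head
    rope_dims: int = 16      # partial RoPE: rotate 16 of the 64 head dims
    mlp_ratio: int = 3       # MLP hidden = 3 * d_model, LeakyReLU(0.5)^2
    bigram_hash_size: int = 1024
    ema_decay: float = 0.997

cfg = ModelConfig()

# Rough transformer-core parameter count (Q/K/V/O projections + MLP only;
# embeddings, norms, SmearGate, BigramHash, etc. are excluded):
attn = cfg.d_model * (2 * cfg.d_model + 2 * cfg.n_kv_head * cfg.head_dim)
mlp = 2 * cfg.d_model * (cfg.mlp_ratio * cfg.d_model)
core_params = cfg.n_layer * (attn + mlp)  # ~26M, consistent with 26.8M total
```

The GQA split (8 query heads over 4 KV heads) shrinks the K/V projections to half the size of Q/O, which is where most of the attention parameter savings come from.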

Timing

  • Training: 600s (6,770 steps at 88.7ms/step)
  • Eval (N-gram): ~420s
  • Total: within 600s train + 600s eval budgets

Compliance

  • Training under 600s on 8xH100
  • Eval under 600s on 8xH100 (~420s)
  • Total artifact under 16,000,000 bytes
  • N-gram cache strictly backward-looking (updated AFTER scoring)
  • No training data access during evaluation
  • No oracle/hindsight selection
  • 3-seed results with full logs

Credits
