
Record: 0.3212 BPB — Complementary N-gram 65K + Int5 GPTQ + LoRA TTT #850

Open
callithyia wants to merge 1 commit into openai:main from callithyia:record/complementary-ngram-65k-0.3212

Conversation

@callithyia

Summary

  • val_bpb: 0.3212 (3-seed mean, std 0.0003)
  • Complementary training (alpha=0.50) + order-9 n-gram eval cache with 65K-token chunks (cache refreshes 15x more often than with 1M chunks)
  • Full Hessian GPTQ int5 + LZMA compression (~14.9 MB artifact)
  • LoRA TTT (rank 8, Polyak averaging, score-first backward-looking)
  • LeakyReLU(0.9)² + XSA-4 + VRL + Gated Attention + Parallel Muon
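
The ~14.9 MB artifact budget comes from packing the int5-quantized weights into a dense 5-bit stream and compressing with LZMA. A minimal sketch of that packing step, assuming unsigned 5-bit codes (the `pack_int5`/`unpack_int5` helpers are illustrative, not the PR's actual code):

```python
import lzma
import numpy as np

def pack_int5(values: np.ndarray) -> bytes:
    """Pack unsigned 5-bit integers (0..31) into a dense bitstream."""
    # unpackbits is MSB-first, so columns 3: hold the low 5 bits of each byte
    bits = np.unpackbits(values.astype(np.uint8)[:, None], axis=1)[:, 3:]
    return np.packbits(bits.ravel()).tobytes()

def unpack_int5(data: bytes, n: int) -> np.ndarray:
    """Recover n 5-bit values from a packed bitstream."""
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))[: n * 5].reshape(n, 5)
    pad = np.zeros((n, 3), dtype=np.uint8)  # restore the 3 high zero bits
    return np.packbits(np.concatenate([pad, bits], axis=1), axis=1).ravel()

rng = np.random.default_rng(0)
q = rng.integers(0, 32, size=100_000, dtype=np.uint8)  # stand-in quantized weights
packed = pack_int5(q)                   # 5/8 of a byte per weight
artifact = lzma.compress(packed, preset=9)
assert np.array_equal(unpack_int5(packed, q.size), q)
```

Packing alone gives 0.625 bytes/weight; LZMA then squeezes out residual redundancy in the quantized codes, which is what keeps the artifact under the 16,000,000-byte cap.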

Results (8xH100 SXM)

| Seed | Steps | ms/step | val_bpb | Post-quant BPB | Artifact |
|------|-------|---------|---------|----------------|----------|
| 1337 | 5,457 | 101 | 0.3211 | 1.1817 | 14,965,401 bytes |
| 42 | 5,437 | 101 | 0.3210 | 1.1794 | 14,926,117 bytes |
| 2024 | 5,498 | 101 | 0.3216 | 1.1831 | 14,874,853 bytes |
| Mean | 5,464 | 101 | 0.3212 | 1.1814 | 14,922,124 bytes |

Key Techniques

  • Complementary training: Downweights bigram-predictable tokens, making the model deliberately weaker where n-grams are strong
  • 65K-token chunks: Cache updates 15x more frequently than 1M chunks, reducing cold-cache penalty
  • Per-order entropy centers + multipliers: Orders 5-9 boosted 2x, orders 2-3 suppressed 0.3x
  • Full Hessian GPTQ: Activation-order column permutation + Cholesky error compensation (not naive quantization)
  • LoRA TTT: Rank 8, Q+V on blocks 9-10, Polyak averaging decay=0.998
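
One plausible form of the complementary-training weighting above, sketched with alpha=0.50: scale each token's loss down in proportion to how confidently a bigram model already predicts it. (The exact rule lives in PR #803; this linear form is an assumption for illustration.)

```python
import numpy as np

def complementary_weights(bigram_probs: np.ndarray, alpha: float = 0.50) -> np.ndarray:
    """Per-token loss weights that downweight bigram-predictable tokens.

    bigram_probs[t] = p_bigram(x_t | x_{t-1}); the weight shrinks toward
    (1 - alpha) as the bigram becomes certain. Assumed linear form --
    the PR's actual rule may differ.
    """
    return 1.0 - alpha * bigram_probs

# A bigram-blind token keeps full weight; a fully predicted one drops to 0.5.
probs = np.array([0.0, 0.5, 1.0])
weights = complementary_weights(probs)  # -> [1.0, 0.75, 0.5]
```

Training then minimizes the weighted cross-entropy `mean(weights * nll)`, steering model capacity toward tokens where n-grams are weak, since the order-9 eval cache covers the predictable ones at test time.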

Compliance

  • 3 seeds on 8xH100 SXM (1337, 42, 2024)
  • All seeds train ≤600s, eval ≤600s (~570s)
  • Artifact ≤16,000,000 bytes (~14.9MB)
  • No validation data during training
  • TTT backward-looking (score-first per chunk)
  • No multi-pass rescoring
  • Reproducible single script
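
The "TTT backward-looking (score-first per chunk)" constraint means each 65K chunk must be scored before any gradient step touches it, so adaptation only ever uses past chunks. A minimal sketch of that loop, with a scalar standing in for the LoRA parameters and the Polyak average (decay=0.998 in the PR) used for scoring (illustrative, not the PR's code):

```python
def ttt_score_first(chunks, score_fn, update_fn, decay=0.998):
    """Backward-looking test-time training loop.

    Chunk i is scored with parameters adapted only on chunks 0..i-1,
    then the adapter is updated on chunk i. `theta` stands in for the
    raw LoRA params; `avg` is the Polyak average used for scoring.
    """
    theta = avg = 0.0
    losses = []
    for chunk in chunks:
        losses.append(score_fn(avg, chunk))        # score first: no look-ahead
        theta = update_fn(theta, chunk)            # then adapt on this chunk
        avg = decay * avg + (1.0 - decay) * theta  # Polyak averaging
    return losses

losses = ttt_score_first([1.0, 2.0, 3.0],
                         score_fn=lambda avg, c: avg,
                         update_fn=lambda t, c: t + c,
                         decay=0.5)
# losses[0] is produced with the unadapted parameters, as compliance requires
```

Because scoring happens strictly before the update, there is no multi-pass rescoring and no leakage of a chunk into its own score.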

Credits

Built on: PR #809 (n-gram cache), PR #803 (complementary training), PR #798 (entropy centers, Polyak TTT), PR #840 (65K chunks), PR #779 (integrated eval), PR #414 (GPTQ baseline).

3-seed mean 0.3212 (std 0.0003). Complementary training + order-9
n-gram eval cache with 65K-token chunks + Full Hessian GPTQ int5 +
LoRA TTT with Polyak averaging.