
PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache#769

Open
MatoTeziTanka wants to merge 3 commits into openai:main from MatoTeziTanka:proteus-styx-ngram-record

Conversation

MatoTeziTanka commented Mar 25, 2026

Summary

Results (8×H100 SXM, RunPod)

Current Seeds (v1.1 — sliding window fix + script cleanup)

| Seed | val_bpb | Artifact Size | Cache Hit Rate |
|------|---------|---------------|----------------|
| 42   | 0.8494  | 15,921,591 bytes | 98.2% |
| 1337 | 0.8482  | 15,919,103 bytes | 98.2% |
| 2024 | 0.8508  | 15,905,947 bytes | 98.2% |
| Mean | 0.8495  | std: 0.0013      |       |

Training loop exit controlled by MAX_WALLCLOCK_SECONDS=600. Logged wallclock includes torch.cuda.synchronize() overhead (~60-120ms beyond the 600s check).
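The budget-bounded exit can be sketched as follows (a minimal sketch; `run_until_budget` and the step callback are hypothetical names, not the submission's code):

```python
import time

MAX_WALLCLOCK_SECONDS = 600  # hard training budget from the run config

def run_until_budget(step_fn, budget_s=MAX_WALLCLOCK_SECONDS):
    """Call step_fn() repeatedly until the wallclock budget is exhausted.

    The budget is checked before each step, so the final step plus any
    post-loop device sync (torch.cuda.synchronize() in the real script)
    can push the logged wallclock slightly past the budget, which is
    where the ~60-120 ms overhead noted above comes from.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one optimizer step in the real training loop
        steps += 1
    return steps, time.monotonic() - start
```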

Superseded Seeds (v1.0)

We're showing the original v1.0 results for full transparency. They had two issues we caught in self-review: a seed 42 artifact that exceeded the 16MB cap, and a sliding window eval that never executed due to a double torch.compile invocation. Rather than quietly replace them, we're documenting what went wrong and why.

| Seed | val_bpb | Artifact Size | Note |
|------|---------|---------------|------|
| 42   | 0.8513  | 16,025,731 bytes | Over 16MB cap |
| 1337 | 0.8502  | 15,939,991 bytes | |
| 2024 | 0.8510  | 15,910,119 bytes | |
| Mean | 0.8508  | std: 0.0006      | |

These scores were from the int6 roundtrip eval path (non-sliding). The sliding window + n-gram cache eval path crashed silently under torchrun. Fixed in v1.1.

Overlap Verification

| Stride | BPB | Hit Rate | Overlap |
|--------|-----|----------|---------|
| 64 (standard)        | 0.8494 | 98.2% | 97% |
| 2048 (zero overlap)  | 0.8709 | 97.9% | 0%  |
| No cache             | 1.1477 |       |     |

The 0.02 BPB gap between stride=64 and stride=2048 is the overlap contribution. The remaining 0.26 BPB improvement is genuine cache benefit from backward-looking n-gram statistics.
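The 97% figure in the table follows directly from the window geometry (assuming a 2048-token eval window, which the zero-overlap row implies):

```python
def overlap_fraction(window: int, stride: int) -> float:
    # Consecutive sliding windows of length `window`, advanced by
    # `stride`, share (window - stride) tokens.
    return max(window - stride, 0) / window

print(f"{overlap_fraction(2048, 64):.1%}")    # 96.9%, the ~97% in the table
print(f"{overlap_fraction(2048, 2048):.0%}")  # 0% (zero-overlap control)
```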

Rule Compliance Checklist

  • Artifact ≤ 16,000,000 bytes — All 3 seeds: 15.91–15.92 MB (78–94 KB headroom)
  • Training ≤ 10 min on 8×H100 SXM — 600s wallclock, ~6800 steps
  • Evaluation ≤ 10 min on 8×H100 SXM — Sliding window eval completes in ~371s
  • No training data access during evaluation — Eval paths use val_tokens only
  • No training on validation data — Mid-training val checks are inference-only (model.eval() + torch.no_grad())
  • N-gram cache is backward-looking — Cache updated AFTER scoring each window
  • No oracle/hindsight selection — Fixed alpha (0.2), no min(NLL) comparison, no target-dependent gating
  • No external downloads or network calls during eval — Self-contained artifact
  • 3 seeds with tight std — std 0.0013 across seeds 42, 1337, 2024
  • Cross-model peer review — Independent audit by GPT Codex (gpt-5.4) verified compliance, cache ordering, and artifact sizes against competition rules
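The artifact-cap item is machine-checkable. A minimal sketch (the helper name and error format are ours, not the submission's):

```python
import os

ARTIFACT_CAP = 16_000_000  # bytes, per the competition rule

def check_artifact(path: str) -> int:
    """Return the remaining headroom in bytes; raise if over cap."""
    size = os.path.getsize(path)
    headroom = ARTIFACT_CAP - size
    if headroom < 0:
        raise ValueError(f"{path}: {size:,} bytes, {-headroom:,} over cap")
    return headroom
```

Applied to the three v1.1 artifact sizes above, this yields headrooms of 78,409 / 80,897 / 94,053 bytes, matching the stated 78–94 KB range.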

Note on N-gram Cache Legality

The competition README does not address n-gram eval caches. No rule in the official documentation prohibits or permits this technique. The README states: "TTT only on tokens already graded" — our cache satisfies this: it is updated only with already-scored tokens. We note that 15+ concurrent PRs (#779, #797, #795, #786, #796, #798, #800, #806, among others) employ the same backward-looking n-gram cache concept.

Architecture

11L, 512d, GQA 8H/4KV, MLP 3×, LeakyReLU(0.9)², XSA (last 4 layers), Value Embedding, BigramHash(2048→128), Partial RoPE(16/64), LN Scale, EMA(0.997), Muon optimizer. Tied embeddings. Mixed int6/int8 quantization + LZMA compression.
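Our reading of the LeakyReLU(0.9)² name, as a scalar sketch (the real MLP applies it elementwise on tensors, e.g. `F.leaky_relu(x, 0.9).square()` in PyTorch):

```python
def leaky_relu_sq(x: float, slope: float = 0.9) -> float:
    # LeakyReLU with negative_slope=0.9, followed by a square --
    # the squared-activation family common in speedrun-style GPT MLPs.
    y = x if x >= 0.0 else slope * x
    return y * y
```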

Technique: 5-gram Eval Cache

During sliding window evaluation, a hash-based n-gram cache accumulates token statistics from already-scored windows. For each new window, the cache provides empirical next-token probabilities which are blended with the neural model's predictions using a fixed mixing coefficient. The cache is strictly causal — it never sees tokens before they are scored.

This is a pure eval-time technique. No architectural changes, no retraining, no TTT. The trained model is identical with or without the cache.
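The mechanism can be sketched as follows (a simplified illustration under our reading of the description: class and method names are hypothetical, and the real implementation is hash-bucketed rather than dict-keyed):

```python
class NGramEvalCache:
    """Backward-looking n-gram cache for sliding-window eval.

    Counts next-token frequencies for each (n-1)-token context seen in
    already-scored windows, and blends the empirical distribution with
    the model's probability at a fixed alpha (0.2 in the submission).
    """

    def __init__(self, n: int = 5, alpha: float = 0.2):
        self.n = n
        self.alpha = alpha
        self.counts = {}  # (n-1)-gram context -> {next_token: count}

    def blended_prob(self, context, token, model_prob):
        """p = (1 - alpha) * p_model + alpha * p_cache."""
        key = tuple(context[-(self.n - 1):])
        bucket = self.counts.get(key)
        if not bucket:
            return model_prob  # unseen context: fall back to the model
        cache_prob = bucket.get(token, 0) / sum(bucket.values())
        return (1 - self.alpha) * model_prob + self.alpha * cache_prob

    def update(self, tokens):
        """Called only AFTER the window is scored -- strictly causal."""
        for i in range((self.n - 1), len(tokens)):
            key = tuple(tokens[i - self.n + 1:i])
            self.counts.setdefault(key, {})[tokens[i]] = \
                self.counts.setdefault(key, {}).get(tokens[i], 0) + 1
```

The key property is the call ordering: `blended_prob` for a window is computed before `update` is invoked on that window's tokens, so the cache never conditions on unscored data.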

Related Work

The n-gram eval cache concept has seen significant community adoption since our initial analysis on Issue #140.

Our LeakyReLU(0.9)² slope sweep was independently cited by PR #764 (@ndokutovich).

Context

Same team that posted the compliance guide, LeakyReLU slope sweep, and n-gram cache analysis on Issue #140.

Docker: matotezitanka/proteus-pytorch:2.11.0-cuda12.8
RunPod template: Deploy PROTEUS+STYX

Verification

This submission was independently audited by OpenAI Codex CLI (gpt-5.4) as a cross-model peer reviewer — verifying rule compliance, cache ordering, artifact sizes, and training logs against competition rules. Both Claude Code (Anthropic) and Codex (OpenAI) were used throughout development: Claude Code for architecture, implementation, and competition analysis; Codex for independent verification and audit.

Built with PROTEUS+STYX by Light Speed Up

3-seed mean: 0.8508 (std 0.0006), verified at stride=2048 (0.8709)
Beats SOTA openai#549 (1.1194) by 0.269 BPB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka changed the title from "Record: PROTEUS+STYX — val_bpb 0.8508 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache" to "PROTEUS+STYX — val_bpb 0.8508 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache" on Mar 25, 2026
MatoTeziTanka (Author) commented:

Update — size issue on seed 42

We got excited and rushed this submission. On closer audit:

  • Seed 42 artifact: 16,025,731 bytes — over the 16MB cap by 25,731 bytes
  • Seed 1337: 15,939,991 bytes — under cap
  • Seed 2024: 15,910,119 bytes — under cap

Also correcting: submission.json had artifact sizes copied from an earlier submission (PR #95), not this one. That's our mistake.

We need to fix the code size (99KB is bloated) or adjust compression to get all 3 seeds under 16MB before this is reviewable. Working on it — will update.

- Fixed torch.compile double-invocation that silently killed sliding window eval
- Trimmed train_gpt.py from 99KB to 72KB (removed dead TTT/QAT/LAWA/DTG code)
- All 3 seeds re-run with sliding window + n-gram cache eval
- New 3-seed mean: 0.8495 BPB (std 0.0013), all artifacts under 16,000,000 bytes
- Old v1.0 logs preserved for transparency
- Added rule compliance checklist, related work, cross-model audit (GPT Codex)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka changed the title from "PROTEUS+STYX — val_bpb 0.8508 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache" to "PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache" on Mar 26, 2026
MatoTeziTanka (Author) commented:

Update — v1.1 results (3 new seeds, sliding window fix, script cleanup)

Two fixes since the initial submission:

Script cleanup. The original train_gpt.py was 99,492 bytes — a kitchen-sink build from rapid iteration where we were focused on making things work, not on byte efficiency. When we originally wrote it, we didn't realize code bytes count toward the 16MB artifact cap. That bloated script included dead TTT scaffolding, unused QAT branches, a LAWA weight averaging path we never activated, a warmup block that restored both model and optimizer state (making it a no-op), and several experimental feature flags that were disabled by default. Once we understood code size matters, we stripped it to 72,603 bytes — removing every line that didn't contribute to the final trained model or eval. No functional changes, just dead code removal.

Sliding window eval fix. The original submission had a bug where torch.compile was called twice on the eval model — once at the module level for the int6 roundtrip eval, then again inside eval_val_sliding on forward_logits. This caused the sliding window eval to crash silently under torchrun, meaning the initially reported 0.8508 BPB was from the int6 roundtrip path alone, not the sliding window + n-gram cache path. Fix: removed the redundant torch.compile inside eval_val_sliding.

New 3-seed results (all re-run from scratch on 8×H100 SXM):

| Seed | Sliding BPB | Artifact Size |
|------|-------------|---------------|
| 42   | 0.8494      | 15,921,591 bytes |
| 1337 | 0.8482      | 15,919,103 bytes |
| 2024 | 0.8508      | 15,905,947 bytes |
| Mean | 0.8495      | std: 0.0013      |

All artifacts under 16,000,000 bytes. Updated logs, submission.json, and cleaned train_gpt.py included.

Verification. This submission was independently audited by OpenAI Codex CLI (gpt-5.4) as a cross-model peer reviewer — verifying rule compliance, cache ordering, artifact sizes, and training logs against competition rules. Both Claude Code (Anthropic) and Codex (OpenAI) were used throughout development: Claude Code for architecture, implementation, and competition analysis; Codex for independent verification and audit. We believe cross-model review catches blind spots that single-model workflows miss.

Built with PROTEUS+STYX by Light Speed Up

hypery11 commented:

nice 🔥🔥🔥🔥

