Record: Two-Pass N-gram Rescoring (val_bpb 0.1434)#846
himanshudongre wants to merge 1 commit into openai:main from
Conversation
This is most likely not a legal submission; see #573 (comment).
Peer review note — two-pass rescoring and the backward-looking rule

Nice work on the cache warmup analysis — the chunk-1-to-chunk-63 BPB curve is really interesting data.

One question on compliance: the backward-looking rule states that you can only test-time train on tokens "you've already evaluated your model on." In Pass 2, the early chunks (1–15) are rescored using an n-gram cache that includes statistics from chunks 16–63 — tokens that came after the ones being rescored. The tokens were technically evaluated in Pass 1, but the cache used in Pass 2 contains forward-looking information relative to those early chunks. Is this meaningfully different from a single-pass approach where each chunk only sees statistics from prior chunks?

Not challenging the submission — just flagging it as something the maintainers may want to weigh in on, since the rules don't explicitly address multi-pass evaluation.
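For readers following along, the two-pass flow under discussion can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the names (`NgramCache`, `score_chunk`, `two_pass_eval`) and the fixed trigram order are hypothetical, and the real cache covers multiple orders.

```python
# Hypothetical sketch of the two-pass rescoring flow under discussion.
# All names are illustrative; the real implementation lives in the PR diff.
from collections import Counter

class NgramCache:
    """Frequency table of n-gram co-occurrence counts (no scoring feedback)."""
    def __init__(self):
        self.counts = Counter()

    def update(self, tokens, order=3):
        for i in range(len(tokens) - order + 1):
            self.counts[tuple(tokens[i:i + order])] += 1

def two_pass_eval(chunks, score_chunk, rescore_first=15):
    cache = NgramCache()
    pass1 = []
    for chunk in chunks:
        # Pass 1: score each chunk, then fold its statistics into the cache,
        # so every chunk only sees counts from chunks before it.
        pass1.append(score_chunk(chunk, cache))
        cache.update(chunk)
    # Pass 2: rescore the early chunks with the *full* cache. This is the
    # step whose compliance is in question: the cache now contains counts
    # from chunks that came after the ones being rescored.
    pass2 = [score_chunk(c, cache) for c in chunks[:rescore_first]]
    return pass1, pass2
```

The asymmetry is visible even in this toy version: the first chunk is scored against an empty cache in Pass 1 but against the full cache in Pass 2.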
@she-llac @MatoTeziTanka Thanks for raising this — both fair points worth addressing.

How this differs from #573: no model weights change between passes (the neural model is frozen throughout).

An analogy: re-reading chapter 1 of a book after finishing the whole book. You're not changing the book or your reading ability — you just have more context. That's different from taking a test, seeing your score, and retaking it.

On the rules: that said, I recognise this is a grey area and respect that the rules may not have anticipated this pattern. I'd welcome an official ruling from @valerio-oai. If two-pass is deemed non-compliant, I'm happy to update the submission to report Pass 1 results only.

Regardless of the ruling, the cold-cache analysis (chunk 1 at 1.15 BPB vs chunk 63 at 0.12 BPB) is an interesting finding — exploring legal ways to address this asymmetry (chunk ordering strategies, progressive cache warming) seems like a worthwhile research direction.
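For context on the cold-cache numbers quoted above, bits-per-byte is just the chunk's total cross-entropy converted from nats to bits and normalised by the chunk's byte count. A small helper (hypothetical; the benchmark harness computes this its own way):

```python
# Standard bits-per-byte conversion; the function name and signature are
# illustrative, not the benchmark harness's API.
import math

def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a chunk's summed negative log-likelihood (in nats) to BPB."""
    return total_nll_nats / (math.log(2) * n_bytes)
```

Under this definition, a chunk whose summed NLL equals ln(2) nats per byte scores exactly 1.0 BPB, so the 1.15 → 0.12 drop across chunks reflects roughly a 10x reduction in per-byte cross-entropy.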
@himanshudongre Appreciate the detailed response — the distinction from #573 is well-articulated. The frozen model + deterministic scoring vs. oracle selection is a meaningful difference. The book analogy is fair: the frequency table doesn't carry scoring feedback, just co-occurrence counts.

The question is really about whether the rules intended "evaluated" to mean "scored once and done" or "processed in any order you like within the eval budget." That's a policy call above our pay grade. Agree that tagging @valerio-oai for an official ruling is the right move.

And regardless of the outcome, the cold-cache asymmetry data is a genuinely useful contribution — that chunk 1 → chunk 63 curve tells you a lot about where the BPB gains are actually coming from. Good luck with the ruling.
…ders

Combines PR openai#834's learned multi-expert routing head with PR openai#846's two-pass cold-cache rescoring.

Key changes:
- Extended n-gram orders from 2–7 to 2–12 with 8M-bucket hash tables
- Two-pass eval: rescore first 15 chunks with full cache after pass 1
- Per-chunk loss tracking for precise pass-1/pass-2 delta computation
- Configurable via env vars: NGRAM_MAX_ORDER, NGRAM_BUCKETS, TWO_PASS_ENABLED, TWO_PASS_RESCORE_CHUNKS

Based on PR openai#834 (AnirudhRahul) + PR openai#846 (himanshudongre) stack.
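A fixed-size bucketed count table in the spirit of the "8M-bucket hash tables" the commit describes might look like the sketch below. The class name, hash choice, and collision handling (colliding n-grams silently share a bucket) are assumptions for illustration, not the PR's actual data structure; only the env var names and defaults come from the commit message.

```python
# Hypothetical sketch of a fixed-size bucketed n-gram count table.
# Hash choice and collision behaviour are assumptions, not the PR's code.
import os

class BucketedNgramTable:
    def __init__(self, n_buckets=8_000_000, min_order=2, max_order=12):
        self.n_buckets = n_buckets
        self.min_order = min_order
        self.max_order = max_order
        self.counts = [0] * n_buckets  # colliding n-grams merge their counts

    def _bucket(self, ngram):
        return hash(tuple(ngram)) % self.n_buckets

    def update(self, tokens):
        # Count every n-gram of every order in [min_order, max_order].
        for order in range(self.min_order, self.max_order + 1):
            for i in range(len(tokens) - order + 1):
                self.counts[self._bucket(tokens[i:i + order])] += 1

    def count(self, ngram):
        # Upper bound on the true count (collisions can inflate it).
        return self.counts[self._bucket(ngram)]

def table_from_env():
    # Defaults mirror the values stated in the commit message.
    return BucketedNgramTable(
        n_buckets=int(os.environ.get("NGRAM_BUCKETS", "8000000")),
        max_order=int(os.environ.get("NGRAM_MAX_ORDER", "12")),
    )
```

The fixed bucket count keeps memory bounded regardless of how many distinct n-grams the eval stream contains, at the cost of occasional over-counting from hash collisions.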
This is an interesting thing to wake up to, and I need to know if I am studying it or not... my brain hurts. Hmm, it's questionable. There is "knowledge" of the answers within this method. I could see an issue with memorizing how many "X" there are in the answers being a peek at the answers. We should not know how many (X) the test should have, and then score based on those results.