feat(eval): surface absolute relevance scores in eval rows for OOD calibration (#249) #253
…libration (#249)

Retrieval eval rows carried only rank order, so `expect_none` / OOD robustness was unmeasurable: doc-level retrieval never abstains, and the fused `Hit.score` is rank-based (RRF) or per-leg min-max normalised (CombSUM/MNZ), so the top score is ~constant for any query, covered or not.

Surface two absolute-relevance numbers on every `PerQueryRow` + `NegativeRow`:
- `top1_vec_cosine`: the cosine of the best text-vector match in the corpus (max cosine = `1 - min distance`), via a new eval-only `HybridSearcher.top_vector_cosine` probe. This is the OOD signal (covered ~0.7 vs off-corpus ~0.2 with a real embedder). It is read straight from the vector leg before fusion: `search()` and the ranking path are byte-untouched, so retrieval metrics can't move (`no-baseline-needed`).
- `top1_score`: the top-ranked hit's fused `Hit.score` (the cross-encoder rerank score when the rerank leg ran).

`diagnostics_table` renders both per positive plus a dedicated OOD-negatives section so separation is eyeball-able.

The probe is gated to vector-using modes, so a pure `--retrieval bm25` ablation stays embedder-free (no extra embed calls, can't crash on an unreachable embedder); it also guards a blank query and a missing vector leg (`NotSupported` → `None`).

Scope: step 1 of #249 (surface the scores); the score-based `negative_satisfaction` metric is deliberately left for later.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
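The "~constant top score" point in the commit message above can be made concrete with a small sketch. Everything here is illustrative and not taken from the repository: `rrf_scores` and `min_max` are hypothetical helpers, `k = 60` is a commonly used RRF constant assumed for the example, and the doc ids and raw cosines are made up. It assumes both legs agree on the rank-1 document; when they disagree the fused top score shifts slightly, which is why the text says "~constant".

```python
from typing import Dict, List


def rrf_scores(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal rank fusion: each leg contributes 1 / (k + rank) per doc."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def min_max(raw: Dict[str, float]) -> Dict[str, float]:
    """Per-leg min-max normalisation: the best raw score always maps to 1.0."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in raw}
    return {doc_id: (score - lo) / (hi - lo) for doc_id, score in raw.items()}


# Covered query: both legs rank doc "a" first.
covered_top = max(rrf_scores([["a", "b", "c"], ["a", "c", "b"]]).values())
# Off-corpus query: nothing is relevant, but some doc still sits at rank 1.
ood_top = max(rrf_scores([["x", "y", "z"], ["x", "z", "y"]]).values())

# Made-up raw cosines: strong match for the covered query, weak for the OOD one.
covered_norm_top = max(min_max({"a": 0.72, "b": 0.40, "c": 0.31}).values())
ood_norm_top = max(min_max({"x": 0.21, "y": 0.18, "z": 0.15}).values())

print(covered_top == ood_top)            # the fused RRF tops are identical
print(covered_norm_top, ood_norm_top)    # both normalised tops are 1.0
```

In both fusion styles the information that separates the two queries (0.72 vs 0.21 in the made-up numbers) is discarded before the score reaches the row, which is the gap an absolute cosine is meant to fill.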
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…ec-None defensive return

Surface a blank-query assertion (covers the empty-query early return) and mark the q_vec-is-None branch no-cover: `embed()` of a non-blank query never returns `[]`, so it's an unreachable mypy-required guard (same idiom as the telemetry defensive guards). Keeps codecov/patch honest without a contrived empty-embedder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/README.md`:
- Around line 65-75: Clarify the caveats for `top1_vec_cosine` and `top1_score`
in the eval README by making them explicitly phase- and fusion-dependent. Update
the `top1_score` description to mention that it may be a reranker score when
reranking runs, or a fused retrieval score such as RRF/CombSUM/CombMNZ depending
on configuration, rather than only RRF. Also revise the `top1_vec_cosine` note
to state that `--retrieval bm25` only disables the query-time cosine probe,
while `run_eval()` may still use embeddings during ingest on cache misses, so
the eval is not fully embedder-free.
In `@tests/test_eval_runner.py`:
- Around line 365-370: Clarify the bm25 test in run_eval() so it only asserts
query-time embedder-free behavior, not full eval embedder independence. Update
the test around _run_queries and api.ingest(..., embedder=effective_embedder) to
reflect that mode="bm25" skips vector probing and OOD cosine checks, but
ingestion still uses the embedder. Keep the assertion focused on zero
top_vector_cosine/top1_vec_cosine-related calls and remove wording that implies
a bm25 eval cannot fail on an unreachable embedder.
📒 Files selected for processing (7)
- evals/BASELINES.md
- evals/README.md
- evals/datasets/mvp/queries.yaml
- src/dikw_core/domains/info/search.py
- src/dikw_core/eval/runner.py
- tests/test_eval_runner.py
- tests/test_search.py
- `top1_vec_cosine` — the cosine of the **best text-vector match anywhere in
  the corpus** (max cosine, not necessarily the top-*ranked* hit), so it
  answers "is any chunk close to this query?". This is the OOD signal: with a
  **real** embedder a covered query scores high (~0.7) and an off-corpus one
  low (~0.2). Caveats: text-vector only (the multimodal/asset leg isn't
  probed); `None` for a pure `--retrieval bm25` run (kept embedder-free); and
  under the hermetic `FakeEmbeddings` the cosine is lexical bag-of-words and
  won't separate OOD cleanly — use a real embedder to calibrate.
- `top1_score` — the **top-ranked** hit's fused `Hit.score` (the cross-encoder
  rerank score when the rerank leg ran; otherwise the rank-based RRF score,
  which is ~constant across queries and does *not* encode absolute relevance).
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Clarify that these caveats are fusion- and phase-dependent.
`top1_score` is only an RRF score under the default fusion; without reranking it can also be a CombSUM/CombMNZ fused score. Also, pure `--retrieval bm25` disables the query-time cosine probe, but `run_eval()` still embeds during ingest on cache misses, so the whole eval is not inherently embedder-free.
"""#249 / bm25 purity: a pure ``--retrieval bm25`` eval must stay
embedder-free — `search(mode="bm25")` makes no embed call, and the OOD cosine
probe is gated off too, so it issues zero extra embeds and a real-embedder
bm25 ablation can't crash on an unreachable embedder. Rows get
``top1_vec_cosine=None`` while ``top1_score`` (the fused bm25 score) still
surfaces.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Narrow the bm25 “embedder-free” claim to the query path.
`run_eval()` still calls `api.ingest(..., embedder=effective_embedder)` before `_run_queries`, so `mode="bm25"` only skips query-time vector probing here. This test proves "no `top_vector_cosine` calls," not "a bm25 eval can't fail on an unreachable embedder."
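The distinction drawn in this comment (ingest embeds, the bm25 query path does not) can be sketched with a counting stand-in. This is a minimal illustration under stated assumptions, not the project's test: `CountingEmbedder`, `ingest`, and `run_queries` are hypothetical substitutes for the real embedder, `api.ingest`, and `_run_queries`, and the stand-in ingest always embeds, whereas the comment says the real one does so on cache misses.

```python
from typing import Dict, List, Optional


class CountingEmbedder:
    """Hypothetical spy embedder: counts embed() calls, returns dummy vectors."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1
        return [[float(len(text))] for text in texts]


def ingest(docs: List[str], embedder: CountingEmbedder) -> None:
    """Stand-in for ingest: documents are embedded whatever the retrieval mode."""
    embedder.embed(docs)


def run_queries(
    queries: List[str], mode: str, embedder: CountingEmbedder
) -> List[Dict[str, Optional[float]]]:
    """Stand-in for the query loop: the cosine probe is skipped in bm25 mode."""
    rows: List[Dict[str, Optional[float]]] = []
    for query in queries:
        cosine: Optional[float] = None
        if mode != "bm25":
            embedder.embed([query])  # the query-time probe costs one embed call
            cosine = 0.0             # placeholder value, not a real cosine
        rows.append({"top1_vec_cosine": cosine})
    return rows


embedder = CountingEmbedder()
ingest(["alpha doc", "beta doc"], embedder)
ingest_calls = embedder.calls

rows = run_queries(["what is alpha?"], mode="bm25", embedder=embedder)
query_time_calls = embedder.calls - ingest_calls

print(ingest_calls, query_time_calls, rows)
```

Asserting on `query_time_calls` rather than on the embedder's total call count keeps the claim scoped to the query path, which is the narrowing the comment asks for.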
Closes #249.
Problem
Retrieval eval rows carried only rank order, so `expect_none` / OOD (out-of-distribution) robustness was unmeasurable. Doc-level retrieval never abstains (an off-corpus query still has a rank-1 doc), and the fused `Hit.score` is rank-based (RRF) or per-leg min-max normalised (CombSUM/MNZ), so the top score is ~constant for any query. The discriminating signal is the absolute relevance of the best match, which the rows dropped entirely.

Change (step 1 of #249: surface the scores)
Every `PerQueryRow` + `NegativeRow` now exposes two absolute-relevance numbers (which describe different chunks once a reranker is active, by design):
- `top1_vec_cosine`: the cosine of the best text-vector match in the corpus (`1 - min distance`), via a new eval-only `HybridSearcher.top_vector_cosine` probe. This is the OOD signal (covered ~0.7 vs off-corpus ~0.2 with a real embedder). Read straight from the vector leg: `search()` and the ranking path are byte-untouched, so retrieval metrics can't move (`no-baseline-needed`).
- `top1_score`: the top-ranked hit's fused `Hit.score` (the cross-encoder rerank score when the rerank leg ran).

`diagnostics_table` renders both per positive plus a dedicated OOD-negatives section so separation is eyeball-able.

The score-based `negative_satisfaction` metric (step 2 of the issue) is deliberately left for a follow-up; this PR makes it computable.

Robustness / scope
- The probe is gated to vector-using modes, so a pure `--retrieval bm25` ablation stays embedder-free (no extra embed calls, can't crash on an unreachable embedder; `top1_vec_cosine=None`). Pinned by a test that spies the probe (0 calls in bm25 mode).
- Guards a blank query and a missing vector leg (`NotSupported` → `None`), mirroring `search()`'s graceful vec leg.
- … `baseline.py` reads only `metrics`; client render skips diagnostics).

Tests
- `test_top_vector_cosine_discriminates_covered_vs_ood` + `test_eval_rows_surface_absolute_relevance_scores`: covered query out-scores a zero-overlap OOD query (deterministic, embedder-driven, not a tautology).
- `test_top_vector_cosine_none_without_text_vector_leg` / `..._none_when_no_text_version`: graceful `None` paths.
- `test_eval_bm25_mode_skips_vector_cosine_probe`: pure bm25 stays embedder-free.

Delivery receipt
step 3 (verify): `tools/check.py` green (ruff + mypy + 2409 passed / 138 skipped). Real-data: hermetic `mvp` (cache off) retrieval metrics unchanged (all 1.0; `search()` untouched); new fields populate. Honest caveat recorded in BASELINES: under `FakeEmbeddings` (lexical) the mvp negatives don't separate cleanly (min-pos cosine 0.334 < max-neg 0.445). This is expected: a real embedder is needed to calibrate, and the mechanism is pinned by the zero-overlap unit test.

step 5 (review): fresh-agent invariant/design review → PASS (Karpathy's rule held: the probe is deterministic scoping, not LLM reasoning, and eval-only; `1 - distance` correct; no ranking change). The one "bundling" concern was a stale-local-`main` diff artifact; the PR vs `origin/main` is #249-only. Workflow-backed `/code-review` (xhigh, 26 agents) → 8 findings, all resolved: probe-breaks-bm25-purity (gated off for bm25), missing blank-query guard (added), `NotSupported` not caught (added), `top1_score`/`top1_vec_cosine` semantics conflated (docs clarified), double-embed (accepted: the chosen eval-probe approach, +1/query, bm25 exempt), plus doc/cleanup nits.

🤖 Generated with Claude Code
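As a rough illustration of the probe semantics described in this PR (max cosine = `1 - min distance`, `None` for a blank query or a missing vector leg) and of the separation caveat from the delivery receipt, here is a minimal sketch. It is not the repository's `HybridSearcher.top_vector_cosine`: `sketch_top_vector_cosine`, the local `NotSupported` class, the `nearest_distances` callable, and `separates` are hypothetical names introduced only for this example, and the distances passed in are made up.

```python
from typing import Callable, Optional, Sequence


class NotSupported(Exception):
    """Hypothetical stand-in for the storage error raised without a vector leg."""


def sketch_top_vector_cosine(
    query: str, nearest_distances: Callable[[str], Sequence[float]]
) -> Optional[float]:
    """Return the best match's cosine (1 - min distance), or None when unavailable."""
    if not query.strip():
        return None  # blank query: nothing meaningful to embed
    try:
        distances = nearest_distances(query)
    except NotSupported:
        return None  # no vector leg: degrade gracefully instead of raising
    if not distances:
        return None  # empty corpus or no text vectors
    return 1.0 - min(distances)


def separates(positive_cosines: Sequence[float], negative_cosines: Sequence[float]) -> bool:
    """True when a single threshold can split covered queries from OOD ones."""
    return min(positive_cosines) > max(negative_cosines)


def no_vector_leg(query: str) -> Sequence[float]:
    raise NotSupported("no vector leg configured")


best = sketch_top_vector_cosine("covered query", lambda q: [0.8, 0.3, 0.55])
blank = sketch_top_vector_cosine("   ", lambda q: [0.1])
missing = sketch_top_vector_cosine("any query", no_vector_leg)

# Figures quoted in the PR: FakeEmbeddings on mvp (0.334 vs 0.445) does not
# separate; the approximate real-embedder figures (~0.7 vs ~0.2) would.
fake_separates = separates([0.334], [0.445])
real_separates = separates([0.7], [0.2])
```

A threshold-style check like `separates` is only one possible way to read the two columns; the PR leaves the actual `negative_satisfaction` metric for a follow-up.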