feat(eval): surface absolute relevance scores in eval rows for OOD calibration (#249) by helebest · Pull Request #253 · OpenDIKW/dikw-core

helebest · 2026-06-28T13:53:24Z

Closes #249.

Problem

Retrieval eval rows carried only rank order, so expect_none / OOD (out-of-distribution) robustness was unmeasurable. Doc-level retrieval never abstains — an off-corpus query still has a rank-1 doc — and the fused Hit.score is rank-based (RRF) or per-leg min-max normalised (CombSUM/MNZ), so the top score is ~constant for any query. The discriminating signal is the absolute relevance of the best match, which the rows dropped entirely.

Note: the issue assumed the absolute vector cosine already lived in Hit.score. It does not — under the default RRF fusion Hit.score is rank-based even in --retrieval vector; only the rerank score is absolute, and only when a reranker ran. So the raw cosine has to be captured separately (a deliberate read, not free plumbing).
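Why a rank-based fused score cannot act as an OOD signal can be seen in a small sketch. This is a textbook Reciprocal Rank Fusion, not dikw-core's implementation: `rrf_fuse`, the doc ids, and `k = 60` (the conventional constant) are illustrative assumptions.

```python
# Illustrative sketch only: textbook RRF, not the project's fusion code.
def rrf_fuse(legs: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked doc-id lists; each leg contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in legs:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# One "covered" and one "off-corpus" query. RRF only sees rank positions,
# so the underlying similarities never enter the fused score.
covered = rrf_fuse([["d1", "d2", "d3"], ["d1", "d3", "d2"]])
off_corpus = rrf_fuse([["d7", "d8", "d9"], ["d7", "d9", "d8"]])

top_covered = max(covered.values())
top_off_corpus = max(off_corpus.values())
print(top_covered == top_off_corpus)  # both equal 2 / (k + 1)
```

Whenever the same doc is rank 1 in both legs, the top fused score is 2 / (k + 1) regardless of how relevant that doc actually is, which is why the raw cosine has to be read separately.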

Change (step 1 of #249 — surface the scores)

Every PerQueryRow + NegativeRow now exposes two absolute-relevance numbers (which describe different chunks once a reranker is active, by design):

- top1_vec_cosine: the cosine of the best text-vector match in the corpus (1 - min distance), via a new eval-only HybridSearcher.top_vector_cosine probe. This is the OOD signal (covered ~0.7 vs off-corpus ~0.2 with a real embedder). It is read straight from the vector leg; search() and the ranking path are byte-untouched, so retrieval metrics can't move (no-baseline-needed).
- top1_score: the top-ranked hit's fused Hit.score (the cross-encoder rerank score when the rerank leg ran).
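The "1 - min distance" relation for top1_vec_cosine can be sketched in plain Python. This assumes the vector leg reports cosine distance and that vectors are non-zero; `cosine_distance` and `top_vector_cosine_sketch` are hypothetical helpers, not the real probe, which runs a nearest-neighbour search in the store rather than a linear scan.

```python
import math
from typing import Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_vector_cosine_sketch(
    query_vec: Sequence[float], corpus_vecs: Sequence[Sequence[float]]
) -> Optional[float]:
    """Best cosine anywhere in the corpus: max cosine = 1 - min distance."""
    if not corpus_vecs:
        return None  # no hits: nothing to report
    min_distance = min(cosine_distance(query_vec, v) for v in corpus_vecs)
    return 1.0 - min_distance


corpus = [[1.0, 0.0], [0.0, 1.0]]
near = top_vector_cosine_sketch([0.9, 0.1], corpus)    # close to the first vector
far = top_vector_cosine_sketch([-1.0, -1.0], corpus)   # points away from both
```

In this toy corpus `near` comes out close to 1 and `far` is negative, which is the kind of gap the covered vs off-corpus comparison relies on.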

diagnostics_table renders both per positive plus a dedicated OOD-negatives section so separation is eyeball-able.

The score-based negative_satisfaction metric (step 2 of the issue) is deliberately left for a follow-up — this PR makes it computable.

Robustness / scope

- The probe is gated to vector-using modes, so a pure --retrieval bm25 ablation stays embedder-free (no extra embed calls, can't crash on an unreachable embedder; top1_vec_cosine=None). Pinned by a test that spies the probe (0 calls in bm25 mode).
- Guards a blank query and a missing vector leg (NotSupported → None), mirroring search()'s graceful vec leg.
- Text-vector only: the multimodal/asset leg is out of scope for this text-OOD probe (documented).
- Additive optional DTO fields; no Storage/Provider Protocol change, no on-disk layout change. Consumers unaffected (baseline.py reads only metrics; client render skips diagnostics).
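The gating and guard behaviour listed above can be sketched as follows. This is a simplified illustration under stated assumptions: in the PR the blank-query and NotSupported guards live in the probe and the mode gate plus per-query cache live in the runner, whereas here they are folded into one hypothetical function, `gated_top_cosine`, with a stand-in `NotSupported` exception.

```python
from typing import Callable, Optional


class NotSupported(Exception):
    """Stand-in for the error raised when the store has no vector leg."""


VECTOR_MODES = {"vector", "hybrid"}


def gated_top_cosine(
    query: str,
    mode: str,
    probe: Callable[[str], Optional[float]],
    cache: dict[str, Optional[float]],
) -> Optional[float]:
    """Return the top cosine for a query, or None when the probe must not run."""
    if mode not in VECTOR_MODES:
        return None  # bm25: never call the probe, so no embed call happens
    if not query.strip():
        return None  # blank query guard
    if query in cache:
        return cache[query]  # one probe per distinct query
    try:
        value = probe(query)
    except NotSupported:
        value = None  # missing vector leg degrades to None
    cache[query] = value
    return value


calls: list[str] = []


def counting_probe(query: str) -> Optional[float]:
    """Fake probe that records each call, like the spy in the bm25 test."""
    calls.append(query)
    return 0.7


bm25_value = gated_top_cosine("what is dikw?", "bm25", counting_probe, {})
```

After the bm25 call above, `calls` is still empty and `bm25_value` is None, which is the property the spy-based test pins.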

Tests

- test_top_vector_cosine_discriminates_covered_vs_ood + test_eval_rows_surface_absolute_relevance_scores: covered query out-scores a zero-overlap OOD query (deterministic, embedder-driven, not a tautology).
- test_top_vector_cosine_none_without_text_vector_leg / ..._none_when_no_text_version: graceful None paths.
- test_eval_bm25_mode_skips_vector_cosine_probe: pure bm25 stays embedder-free.

Delivery receipt

step 3 — verify: tools/check.py green (ruff + mypy + 2409 passed / 138 skipped). Real-data: hermetic mvp (cache off) retrieval metrics unchanged (all 1.0 — search() untouched); new fields populate. Honest caveat recorded in BASELINES: under FakeEmbeddings (lexical) the mvp negatives don't separate cleanly (min-pos cosine 0.334 < max-neg 0.445) — expected, a real embedder is needed to calibrate; the mechanism is pinned by the zero-overlap unit test.
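The FakeEmbeddings caveat is simple arithmetic: positives and negatives separate with a single threshold only when the lowest positive cosine exceeds the highest negative one. A minimal sketch, where `separation_margin` is a hypothetical helper and the only inputs are the two extremes quoted above:

```python
def separation_margin(positive_cosines: list[float], negative_cosines: list[float]) -> float:
    """min(positives) - max(negatives); a value > 0 means one threshold separates them."""
    return min(positive_cosines) - max(negative_cosines)


# Only the extremes are reported (min-pos 0.334, max-neg 0.445 under FakeEmbeddings).
fake_margin = separation_margin([0.334], [0.445])
print(round(fake_margin, 3))  # -0.111
```

The negative margin means no cosine threshold can split the mvp positives from the negatives under the lexical fake embedder. With the approximate real-embedder figures quoted in the description (~0.7 vs ~0.2) the same calculation would give a margin of about +0.5.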

step 5 — review: fresh-agent invariant/design review → PASS (Karpathy's rule held: the probe is deterministic scoping, not LLM reasoning, and eval-only; 1 - distance correct; no ranking change). The one "bundling" concern was a stale-local-main diff artifact — the PR vs origin/main is #249-only. Workflow-backed /code-review (xhigh, 26 agents) → 8 findings, all resolved: probe-breaks-bm25-purity (gated off for bm25), missing blank-query guard (added), NotSupported not caught (added), top1_score/top1_vec_cosine semantics conflated (docs clarified), double-embed (accepted — the chosen eval-probe approach, +1/query, bm25 exempt), plus doc/cleanup nits.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added clearer evaluation diagnostics for covered vs. out-of-domain queries, including new relevance indicators in reports.
- Improved support for marking queries that should have no match, with better reporting for those cases.
Bug Fixes
- Evaluation summaries now avoid extra vector checks in BM25-only runs.
- Missing or unavailable vector data now fails gracefully instead of breaking reporting.

…libration (#249)

Retrieval eval rows carried only rank order, so `expect_none` / OOD robustness was unmeasurable: doc-level retrieval never abstains, and the fused `Hit.score` is rank-based (RRF) or per-leg min-max normalised (CombSUM/MNZ), so the top score is ~constant for any query, covered or not.

Surface two absolute-relevance numbers on every `PerQueryRow` + `NegativeRow`:

- `top1_vec_cosine`: the cosine of the best text-vector match in the corpus (max cosine = `1 - min distance`), via a new eval-only `HybridSearcher.top_vector_cosine` probe. This is the OOD signal (covered ~0.7 vs off-corpus ~0.2 with a real embedder). It is read straight from the vector leg before fusion; `search()` and the ranking path are byte-untouched, so retrieval metrics can't move (`no-baseline-needed`).
- `top1_score`: the top-ranked hit's fused `Hit.score` (the cross-encoder rerank score when the rerank leg ran).

`diagnostics_table` renders both per positive plus a dedicated OOD-negatives section so separation is eyeball-able.

The probe is gated to vector-using modes, so a pure `--retrieval bm25` ablation stays embedder-free (no extra embed calls, can't crash on an unreachable embedder); guards a blank query and a missing vector leg (`NotSupported` → None).

Scope: step 1 of #249 (surface the scores); the score-based `negative_satisfaction` metric is deliberately left for later.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-28T13:53:38Z

Walkthrough

Adds top1_score and top1_vec_cosine absolute relevance fields to PerQueryRow and NegativeRow. A new HybridSearcher.top_vector_cosine probe computes these values, gated to vector/hybrid modes with per-query caching. Diagnostics rendering, tests, documentation, and baselines are updated accordingly.

Changes

Absolute relevance scoring in eval rows

Layer / File(s)	Summary
`HybridSearcher.top_vector_cosine` probe `src/dikw_core/domains/info/search.py`	New eval-only method embeds query, runs single NN search (`limit=1`), returns `1.0 - distance` or `None` for blank query, missing embedder/model, `NotSupported`, or no hits.
Row dataclass schema extensions `src/dikw_core/eval/runner.py`	`PerQueryRow` and `NegativeRow` each gain optional `top1_score: float
Runner probe computation, caching, and row population `src/dikw_core/eval/runner.py`	`_run_queries` computes `top1_vec_cosine` only for `"vector"`/`"hybrid"` modes using a per-query cache, extracts `top1_score` from rank-0 `Hit.score`, and passes both into `NegativeRow` and `PerQueryRow` construction.
Diagnostics table rendering `src/dikw_core/eval/runner.py`	`EvalReport.diagnostics_table` displays `top1_score`/`top1_vec_cosine` for positive queries and adds a new OOD negatives section showing `vec_cos` and `top1` per `expect_none` query.
Tests `tests/test_eval_runner.py`, `tests/test_search.py`	Tests assert positive `top1_vec_cosine` exceeds negative's, diagnostics table contains OOD section, BM25 mode makes zero probe calls and yields `None` cosine, and the probe degrades gracefully without embedder or text index.
Docs and baselines `evals/README.md`, `evals/BASELINES.md`, `evals/datasets/mvp/queries.yaml`	README documents `expect_none` OOD query mode and new row fields; BASELINES.md records the 2026-06-28 entry; `queries.yaml` comment updated to reflect absolute relevance per row.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

OpenDIKW/dikw-core#251: Modifies the same _run_queries path in runner.py and HybridSearcher configuration surface in search.py, so the two PRs touch overlapping code paths in the eval runner and searcher.

Poem

🐇 A rabbit hops through corpus land,
measuring scores with careful hand.
OOD queries score oh so low,
while covered ones put on a show.
top1_vec_cosine tells the tale —
no empty rank shall pass the veil! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description covers the change well, but it does not follow the required template and omits the Summary, Test plan, and Notes sections.	Reformat the PR description to match the template, adding a Summary, checkbox Test plan, and Notes/breaking changes section.
Docstring Coverage	⚠️ Warning	Docstring coverage is 57.89% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly states the main change: surfacing absolute relevance scores for OOD calibration in eval rows.
Linked Issues check	✅ Passed	It surfaces top1_score and top1_vec_cosine in eval rows and adds the eval-only vector probe needed for OOD measurement.
Out of Scope Changes check	✅ Passed	The docs, tests, and eval-report changes all support the stated OOD score-surfacing objective, with no clear unrelated additions.

codecov · 2026-06-28T13:55:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

…ec-None defensive return

Surface a blank-query assertion (covers the empty-query early return) and mark the q_vec-is-None branch no-cover: embed() of a non-blank query never returns [], so it's an unreachable mypy-required guard (same idiom as the telemetry defensive guards). Keeps codecov/patch honest without a contrived empty-embedder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/README.md`:
- Around line 65-75: Clarify the caveats for `top1_vec_cosine` and `top1_score`
in the eval README by making them explicitly phase- and fusion-dependent. Update
the `top1_score` description to mention that it may be a reranker score when
reranking runs, or a fused retrieval score such as RRF/CombSUM/CombMNZ depending
on configuration, rather than only RRF. Also revise the `top1_vec_cosine` note
to state that `--retrieval bm25` only disables the query-time cosine probe,
while `run_eval()` may still use embeddings during ingest on cache misses, so
the eval is not fully embedder-free.

In `@tests/test_eval_runner.py`:
- Around line 365-370: Clarify the bm25 test in run_eval() so it only asserts
query-time embedder-free behavior, not full eval embedder independence. Update
the test around _run_queries and api.ingest(..., embedder=effective_embedder) to
reflect that mode="bm25" skips vector probing and OOD cosine checks, but
ingestion still uses the embedder. Keep the assertion focused on zero
top_vector_cosine/top1_vec_cosine-related calls and remove wording that implies
a bm25 eval cannot fail on an unreachable embedder.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 04b1a125-b425-4cd0-b119-6afb3caf38ed

📥 Commits

Reviewing files that changed from the base of the PR and between c4e6bb4 and 9295e6d.

📒 Files selected for processing (7)

evals/BASELINES.md
evals/README.md
evals/datasets/mvp/queries.yaml
src/dikw_core/domains/info/search.py
src/dikw_core/eval/runner.py
tests/test_eval_runner.py
tests/test_search.py

coderabbitai · 2026-06-28T14:09:34Z

+- `top1_vec_cosine` — the cosine of the **best text-vector match anywhere in
+  the corpus** (max cosine, not necessarily the top-*ranked* hit), so it
+  answers "is any chunk close to this query?". This is the OOD signal: with a
+  **real** embedder a covered query scores high (~0.7) and an off-corpus one
+  low (~0.2). Caveats: text-vector only (the multimodal/asset leg isn't
+  probed); `None` for a pure `--retrieval bm25` run (kept embedder-free); and
+  under the hermetic `FakeEmbeddings` the cosine is lexical bag-of-words and
+  won't separate OOD cleanly — use a real embedder to calibrate.
+- `top1_score` — the **top-ranked** hit's fused `Hit.score` (the cross-encoder
+  rerank score when the rerank leg ran; otherwise the rank-based RRF score,
+  which is ~constant across queries and does *not* encode absolute relevance).


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clarify that these caveats are fusion- and phase-dependent.

top1_score is only an RRF score under the default fusion; without reranking it can also be a CombSUM/CombMNZ fused score. Also, pure --retrieval bm25 disables the query-time cosine probe, but run_eval() still embeds during ingest on cache misses, so the whole eval is not inherently embedder-free.
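The same "top score is ~constant" effect holds for min-max normalised fusion, which a short sketch makes concrete. This is a generic CombSUM over per-leg min-max scores, not dikw-core's code; `min_max`, `comb_sum`, the raw scores, and the convention of mapping a flat leg to 1.0 are all illustrative assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one leg's scores to [0, 1]; the best doc always maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}  # assumed convention for a flat leg
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(legs: list[dict[str, float]]) -> dict[str, float]:
    """CombSUM: add each doc's normalised score across legs."""
    fused: dict[str, float] = {}
    for leg in legs:
        for doc, s in min_max(leg).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return fused


# Strong raw scores (covered query) vs weak raw scores (off-corpus query).
strong = comb_sum([{"d1": 12.0, "d2": 3.0}, {"d1": 0.71, "d2": 0.40}])
weak = comb_sum([{"d7": 0.9, "d8": 0.2}, {"d7": 0.21, "d8": 0.18}])

top_strong = max(strong.values())
top_weak = max(weak.values())
```

Both `top_strong` and `top_weak` equal 2.0 here, because normalisation discards the absolute magnitudes that distinguished the two queries.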

coderabbitai · 2026-06-28T14:09:34Z

+    """#249 / bm25 purity: a pure ``--retrieval bm25`` eval must stay
+    embedder-free — `search(mode="bm25")` makes no embed call, and the OOD cosine
+    probe is gated off too, so it issues zero extra embeds and a real-embedder
+    bm25 ablation can't crash on an unreachable embedder. Rows get
+    ``top1_vec_cosine=None`` while ``top1_score`` (the fused bm25 score) still
+    surfaces.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Narrow the bm25 “embedder-free” claim to the query path.

run_eval() still calls api.ingest(..., embedder=effective_embedder) before _run_queries, so mode="bm25" only skips query-time vector probing here. This test proves “no top_vector_cosine calls,” not “a bm25 eval can’t fail on an unreachable embedder.”

helebest added the no-baseline-needed label (Bypass Eval gate: change is genuinely non-functional w.r.t. retrieval/synth metrics) on Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026

helebest merged commit c15b4ee into main Jun 28, 2026
11 checks passed

helebest deleted the worktree-feat+eval-surface-relevance-scores branch June 28, 2026 14:10

helebest mentioned this pull request Jun 28, 2026

Flaky test: test_event_tape_replay_after_terminal — final event races the terminal status row #256

Closed

helebest merged 2 commits into main from worktree-feat+eval-surface-relevance-scores
