feat(eval): surface absolute relevance scores in eval rows for OOD calibration (#249) #253

Merged
helebest merged 2 commits into main from worktree-feat+eval-surface-relevance-scores
Jun 28, 2026

Conversation

@helebest helebest commented Jun 28, 2026

Collaborator

Closes #249.

Problem

Retrieval eval rows carried only rank order, so expect_none / OOD (out-of-distribution) robustness was unmeasurable. Doc-level retrieval never abstains — an off-corpus query still has a rank-1 doc — and the fused Hit.score is rank-based (RRF) or per-leg min-max normalised (CombSUM/MNZ), so the top score is ~constant for any query. The discriminating signal is the absolute relevance of the best match, which the rows dropped entirely.

Note: the issue assumed the absolute vector cosine already lived in Hit.score. It does not — under the default RRF fusion Hit.score is rank-based even in --retrieval vector; only the rerank score is absolute, and only when a reranker ran. So the raw cosine has to be captured separately (a deliberate read, not free plumbing).
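
To make the "~constant top score" point concrete, here is a minimal sketch in plain Python. It is not the dikw-core fusion code: `rrf_fuse`, `min_max`, the toy similarity values, and the conventional RRF constant `k=60` are all assumptions for illustration. It shows that a covered query and an off-corpus query with very different raw similarities end up with the same fused top score.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: each leg contributes 1 / (k + rank) per doc."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return dict(fused)


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Per-leg min-max normalisation, as used before CombSUM/CombMNZ."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def ranking(scores: dict[str, float]) -> list[str]:
    """Doc ids ordered best-first by raw score."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical raw similarities: a covered query vs an off-corpus query.
covered = {"a": 0.71, "b": 0.40, "c": 0.12}
off_corpus = {"a": 0.21, "b": 0.19, "c": 0.05}

# Two legs that agree on the order, for simplicity.
top_rrf_covered = max(rrf_fuse([ranking(covered)] * 2).values())
top_rrf_ood = max(rrf_fuse([ranking(off_corpus)] * 2).values())
top_mm_covered = max(min_max(covered).values())
top_mm_ood = max(min_max(off_corpus).values())

print(top_rrf_covered, top_rrf_ood)  # identical: only ranks enter the score
print(top_mm_covered, top_mm_ood)    # identical: the best doc is always rescaled to the top
```

Both fusions discard the absolute magnitude of the best match, which is why the raw cosine has to be read separately.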

Change (step 1 of #249 — surface the scores)

Every PerQueryRow + NegativeRow now exposes two absolute-relevance numbers (which describe different chunks once a reranker is active, by design):

  • top1_vec_cosine — the cosine of the best text-vector match in the corpus (1 - min distance), via a new eval-only HybridSearcher.top_vector_cosine probe. This is the OOD signal (covered ~0.7 vs off-corpus ~0.2 with a real embedder). Read straight from the vector leg — search() and the ranking path are byte-untouched, so retrieval metrics can't move (no-baseline-needed).
  • top1_score — the top-ranked hit's fused Hit.score (the cross-encoder rerank score when the rerank leg ran).

diagnostics_table renders both per positive plus a dedicated OOD-negatives section so separation is eyeball-able.

The score-based negative_satisfaction metric (step 2 of the issue) is deliberately left for a follow-up — this PR makes it computable.
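
The `1 - min distance` relationship can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the actual `HybridSearcher.top_vector_cosine` (which, per the review walkthrough, embeds the query and runs a single nearest-neighbour search against storage). The names `cosine_distance` and `top_vector_cosine_sketch` are hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 1.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def top_vector_cosine_sketch(
    query_vec: Sequence[float], corpus_vecs: Sequence[Sequence[float]]
) -> Optional[float]:
    """Cosine of the closest corpus vector, or None when there is nothing to match."""
    distances = [cosine_distance(query_vec, v) for v in corpus_vecs]
    if not distances:
        return None
    # The nearest neighbour has the smallest distance, hence the largest cosine.
    return 1.0 - min(distances)


corpus = [[1.0, 0.0], [0.0, 1.0]]
exact_match = top_vector_cosine_sketch([1.0, 0.0], corpus)
partial_match = top_vector_cosine_sketch([1.0, 1.0], corpus)
empty_corpus = top_vector_cosine_sketch([1.0, 0.0], [])
```

Because the value comes from the nearest vector anywhere in the corpus, it can describe a different chunk than the top-ranked hit behind `top1_score`.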

Robustness / scope

  • The probe is gated to vector-using modes, so a pure --retrieval bm25 ablation stays embedder-free (no extra embed calls, can't crash on an unreachable embedder; top1_vec_cosine=None). Pinned by a test that spies the probe (0 calls in bm25 mode).
  • Guards a blank query and a missing vector leg (NotSupported → None), mirroring search()'s graceful vec leg.
  • Text-vector only — the multimodal/asset leg is out of scope for this text-OOD probe (documented).
  • Additive optional DTO fields; no Storage/Provider Protocol change, no on-disk layout change. Consumers unaffected (baseline.py reads only metrics; client render skips diagnostics).
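
The gating and guards listed above might look roughly like this. It is a sketch only: in the PR the blank-query and `NotSupported` guards live inside the probe while the mode gate and per-query cache live in the runner, whereas this example folds them into one hypothetical helper, `gated_cosine`. The local `NotSupported` class and the `VECTOR_MODES` set are stand-ins, not the project's definitions.

```python
from typing import Callable, Dict, Optional


class NotSupported(Exception):
    """Stand-in for the error raised when storage has no vector leg."""


# Assumption: only these retrieval modes touch the embedder at query time.
VECTOR_MODES = {"vector", "hybrid"}


def gated_cosine(
    mode: str,
    query: str,
    probe: Callable[[str], Optional[float]],
    cache: Dict[str, Optional[float]],
) -> Optional[float]:
    """Run the cosine probe at most once per query, and never in bm25 mode."""
    if mode not in VECTOR_MODES:
        return None  # pure bm25 ablation: no embed call at query time
    if not query.strip():
        return None  # blank query: nothing meaningful to embed
    if query not in cache:
        try:
            cache[query] = probe(query)
        except NotSupported:
            cache[query] = None  # missing vector leg degrades to None
    return cache[query]


calls: list = []


def fake_probe(q: str) -> Optional[float]:
    calls.append(q)
    return 0.7


def no_vector_leg(q: str) -> Optional[float]:
    raise NotSupported()


cache: Dict[str, Optional[float]] = {}
bm25_result = gated_cosine("bm25", "hello", fake_probe, cache)
hybrid_result = gated_cosine("hybrid", "hello", fake_probe, cache)
vector_result = gated_cosine("vector", "hello", fake_probe, cache)  # served from cache
```

The cache keeps the extra cost at one embed per distinct query, matching the "+1/query, bm25 exempt" trade-off accepted in the review.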

Tests

  • test_top_vector_cosine_discriminates_covered_vs_ood + test_eval_rows_surface_absolute_relevance_scores — covered query out-scores a zero-overlap OOD query (deterministic, embedder-driven, not a tautology).
  • test_top_vector_cosine_none_without_text_vector_leg / ..._none_when_no_text_version — graceful None paths.
  • test_eval_bm25_mode_skips_vector_cosine_probe — pure bm25 stays embedder-free.
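
As an illustration of why a zero-overlap query makes the covered-vs-OOD test deterministic rather than tautological, here is a small sketch with a lexical bag-of-words embedder. The `embed` function is a hypothetical stand-in for a lexical fake embedder, not the project's `FakeEmbeddings`, and the corpus and queries are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical lexical embedder: a sparse bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


corpus = [
    "hybrid search fuses bm25 and vector legs",
    "eval rows record rank order",
]


def best_cosine(query: str) -> float:
    """Best cosine against any document, mirroring the probe's 'anywhere in the corpus' scope."""
    q = embed(query)
    return max(cosine(q, embed(doc)) for doc in corpus)


covered = best_cosine("how are bm25 and vector legs fused")
ood = best_cosine("sourdough starter hydration ratio")
```

With no shared tokens the OOD cosine is exactly zero, so the ordering follows from the embedder rather than from the assertion itself. The same lexical behaviour is also why partially overlapping negatives can score misleadingly high under a fake embedder.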

Delivery receipt

step 3 — verify: tools/check.py green (ruff + mypy + 2409 passed / 138 skipped). Real-data: hermetic mvp (cache off) retrieval metrics unchanged (all 1.0 — search() untouched); new fields populate. Honest caveat recorded in BASELINES: under FakeEmbeddings (lexical) the mvp negatives don't separate cleanly (min-pos cosine 0.334 < max-neg 0.445) — expected, a real embedder is needed to calibrate; the mechanism is pinned by the zero-overlap unit test.

step 5 — review: fresh-agent invariant/design review → PASS (Karpathy's rule held: the probe is deterministic scoping, not LLM reasoning, and eval-only; 1 - distance correct; no ranking change). The one "bundling" concern was a stale-local-main diff artifact — the PR vs origin/main is #249-only. Workflow-backed /code-review (xhigh, 26 agents) → 8 findings, all resolved: probe-breaks-bm25-purity (gated off for bm25), missing blank-query guard (added), NotSupported not caught (added), top1_score/top1_vec_cosine semantics conflated (docs clarified), double-embed (accepted — the chosen eval-probe approach, +1/query, bm25 exempt), plus doc/cleanup nits.
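
The FakeEmbeddings caveat in step 3 can be expressed as a simple separation check. This is a sketch under stated assumptions: `separation_margin` is a hypothetical helper, the fake-embedder lists contain only the two extremes reported above (min-pos 0.334, max-neg 0.445), and the real-embedder lists use the approximate ~0.7 / ~0.2 figures quoted in this PR rather than measured data.

```python
from typing import Sequence


def separation_margin(pos_cosines: Sequence[float], neg_cosines: Sequence[float]) -> float:
    """Gap between the weakest positive and the strongest negative.

    A positive margin means some threshold separates every covered query
    from every expect_none query; a negative margin means none does.
    """
    return min(pos_cosines) - max(neg_cosines)


# Only the extremes are reported for the hermetic mvp run.
fake_margin = separation_margin([0.334], [0.445])

# Approximate figures quoted for a real embedder (illustrative, not measured here).
real_margin = separation_margin([0.7], [0.2])

print(f"FakeEmbeddings margin: {fake_margin:+.3f}")
print(f"Real-embedder margin:  {real_margin:+.3f}")
```

A check along these lines is one possible building block for the follow-up `negative_satisfaction` metric, once a threshold has been calibrated with a real embedder.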

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added clearer evaluation diagnostics for covered vs. out-of-domain queries, including new relevance indicators in reports.
    • Improved support for marking queries that should have no match, with better reporting for those cases.
  • Bug Fixes

    • Evaluation summaries now avoid extra vector checks in BM25-only runs.
    • Missing or unavailable vector data now fails gracefully instead of breaking reporting.

…libration (#249)

Retrieval eval rows carried only rank order, so `expect_none` / OOD
robustness was unmeasurable: doc-level retrieval never abstains, and the
fused `Hit.score` is rank-based (RRF) or per-leg min-max normalised
(CombSUM/MNZ), so the top score is ~constant for any query — covered or not.

Surface two absolute-relevance numbers on every `PerQueryRow` + `NegativeRow`:
- `top1_vec_cosine` — the cosine of the best text-vector match in the corpus
  (max cosine = `1 - min distance`), via a new eval-only
  `HybridSearcher.top_vector_cosine` probe. This is the OOD signal (covered
  ~0.7 vs off-corpus ~0.2 with a real embedder). It is read straight from the
  vector leg before fusion — `search()` and the ranking path are byte-
  untouched, so retrieval metrics can't move (`no-baseline-needed`).
- `top1_score` — the top-ranked hit's fused `Hit.score` (the cross-encoder
  rerank score when the rerank leg ran).

`diagnostics_table` renders both per positive plus a dedicated OOD-negatives
section so separation is eyeball-able.

The probe is gated to vector-using modes, so a pure `--retrieval bm25`
ablation stays embedder-free (no extra embed calls, can't crash on an
unreachable embedder); guards a blank query and a missing vector leg
(`NotSupported` → None). Scope: step 1 of #249 (surface the scores); the
score-based `negative_satisfaction` metric is deliberately left for later.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
helebest added the no-baseline-needed label (Bypass Eval gate: change is genuinely non-functional w.r.t. retrieval/synth metrics) Jun 28, 2026
coderabbitai Bot commented Jun 28, 2026


Walkthrough

Adds top1_score and top1_vec_cosine absolute relevance fields to PerQueryRow and NegativeRow. A new HybridSearcher.top_vector_cosine probe computes these values, gated to vector/hybrid modes with per-query caching. Diagnostics rendering, tests, documentation, and baselines are updated accordingly.

Changes

Absolute relevance scoring in eval rows

| Layer / File(s) | Summary |
| --- | --- |
| HybridSearcher.top_vector_cosine probe (src/dikw_core/domains/info/search.py) | New eval-only method embeds query, runs single NN search (limit=1), returns 1.0 - distance or None for blank query, missing embedder/model, NotSupported, or no hits. |
| Row dataclass schema extensions (src/dikw_core/eval/runner.py) | PerQueryRow and NegativeRow each gain optional `top1_score: float` … |
| Runner probe computation, caching, and row population (src/dikw_core/eval/runner.py) | _run_queries computes top1_vec_cosine only for "vector"/"hybrid" modes using a per-query cache, extracts top1_score from rank-0 Hit.score, and passes both into NegativeRow and PerQueryRow construction. |
| Diagnostics table rendering (src/dikw_core/eval/runner.py) | EvalReport.diagnostics_table displays top1_score/top1_vec_cosine for positive queries and adds a new OOD negatives section showing vec_cos and top1 per expect_none query. |
| Tests (tests/test_eval_runner.py, tests/test_search.py) | Tests assert positive top1_vec_cosine exceeds negative's, diagnostics table contains OOD section, BM25 mode makes zero probe calls and yields None cosine, and the probe degrades gracefully without embedder or text index. |
| Docs and baselines (evals/README.md, evals/BASELINES.md, evals/datasets/mvp/queries.yaml) | README documents expect_none OOD query mode and new row fields; BASELINES.md records the 2026-06-28 entry; queries.yaml comment updated to reflect absolute relevance per row. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • OpenDIKW/dikw-core#251: Modifies the same _run_queries path in runner.py and HybridSearcher configuration surface in search.py, so the two PRs touch overlapping code paths in the eval runner and searcher.

Poem

🐇 A rabbit hops through corpus land,
measuring scores with careful hand.
OOD queries score oh so low,
while covered ones put on a show.
top1_vec_cosine tells the tale —
no empty rank shall pass the veil! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description covers the change well, but it does not follow the required template and omits the Summary, Test plan, and Notes sections. | Reformat the PR description to match the template, adding a Summary, checkbox Test plan, and Notes/breaking changes section. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.89%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly states the main change: surfacing absolute relevance scores for OOD calibration in eval rows. |
| Linked Issues check | ✅ Passed | It surfaces top1_score and top1_vec_cosine in eval rows and adds the eval-only vector probe needed for OOD measurement. |
| Out of Scope Changes check | ✅ Passed | The docs, tests, and eval-report changes all support the stated OOD score-surfacing objective, with no clear unrelated additions. |

codecov Bot commented Jun 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

…ec-None defensive return

Surface a blank-query assertion (covers the empty-query early return) and mark
the q_vec-is-None branch no-cover — embed() of a non-blank query never returns
[], so it's an unreachable mypy-required guard (same idiom as the telemetry
defensive guards). Keeps codecov/patch honest without a contrived empty-embedder.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@evals/README.md`:
- Around line 65-75: Clarify the caveats for `top1_vec_cosine` and `top1_score`
in the eval README by making them explicitly phase- and fusion-dependent. Update
the `top1_score` description to mention that it may be a reranker score when
reranking runs, or a fused retrieval score such as RRF/CombSUM/CombMNZ depending
on configuration, rather than only RRF. Also revise the `top1_vec_cosine` note
to state that `--retrieval bm25` only disables the query-time cosine probe,
while `run_eval()` may still use embeddings during ingest on cache misses, so
the eval is not fully embedder-free.

In `@tests/test_eval_runner.py`:
- Around line 365-370: Clarify the bm25 test in run_eval() so it only asserts
query-time embedder-free behavior, not full eval embedder independence. Update
the test around _run_queries and api.ingest(..., embedder=effective_embedder) to
reflect that mode="bm25" skips vector probing and OOD cosine checks, but
ingestion still uses the embedder. Keep the assertion focused on zero
top_vector_cosine/top1_vec_cosine-related calls and remove wording that implies
a bm25 eval cannot fail on an unreachable embedder.
📥 Commits

Reviewing files that changed from the base of the PR and between c4e6bb4 and 9295e6d.

📒 Files selected for processing (7)
  • evals/BASELINES.md
  • evals/README.md
  • evals/datasets/mvp/queries.yaml
  • src/dikw_core/domains/info/search.py
  • src/dikw_core/eval/runner.py
  • tests/test_eval_runner.py
  • tests/test_search.py

Comment thread evals/README.md
Comment on lines +65 to +75
- `top1_vec_cosine` — the cosine of the **best text-vector match anywhere in
the corpus** (max cosine, not necessarily the top-*ranked* hit), so it
answers "is any chunk close to this query?". This is the OOD signal: with a
**real** embedder a covered query scores high (~0.7) and an off-corpus one
low (~0.2). Caveats: text-vector only (the multimodal/asset leg isn't
probed); `None` for a pure `--retrieval bm25` run (kept embedder-free); and
under the hermetic `FakeEmbeddings` the cosine is lexical bag-of-words and
won't separate OOD cleanly — use a real embedder to calibrate.
- `top1_score` — the **top-ranked** hit's fused `Hit.score` (the cross-encoder
rerank score when the rerank leg ran; otherwise the rank-based RRF score,
which is ~constant across queries and does *not* encode absolute relevance).

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clarify that these caveats are fusion- and phase-dependent.

top1_score is only an RRF score under the default fusion; without reranking it can also be a CombSUM/CombMNZ fused score. Also, pure --retrieval bm25 disables the query-time cosine probe, but run_eval() still embeds during ingest on cache misses, so the whole eval is not inherently embedder-free.

Comment thread tests/test_eval_runner.py
Comment on lines +365 to +370
"""#249 / bm25 purity: a pure ``--retrieval bm25`` eval must stay
embedder-free — `search(mode="bm25")` makes no embed call, and the OOD cosine
probe is gated off too, so it issues zero extra embeds and a real-embedder
bm25 ablation can't crash on an unreachable embedder. Rows get
``top1_vec_cosine=None`` while ``top1_score`` (the fused bm25 score) still
surfaces.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Narrow the bm25 “embedder-free” claim to the query path.

run_eval() still calls api.ingest(..., embedder=effective_embedder) before _run_queries, so mode="bm25" only skips query-time vector probing here. This test proves “no top_vector_cosine calls,” not “a bm25 eval can’t fail on an unreachable embedder.”

@helebest helebest merged commit c15b4ee into main Jun 28, 2026
11 checks passed
@helebest helebest deleted the worktree-feat+eval-surface-relevance-scores branch June 28, 2026 14:10

Labels

no-baseline-needed Bypass Eval gate: change is genuinely non-functional w.r.t. retrieval/synth metrics

Development

Successfully merging this pull request may close these issues.

Surface absolute relevance scores in eval rows to make expect_none / OOD robustness measurable

1 participant