OpenDIKW · helebest · Jun 28, 2026 · coderabbitai
diff --git a/evals/BASELINES.md b/evals/BASELINES.md
@@ -7,6 +7,36 @@ regression from a re-run variance.
 Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml`
 are calibrated ~2-3 % below the most recent canonical-mode run.
 
+## 2026-06-28 — eval rows: surface absolute relevance scores (#249)
+
+**Change under test:** every eval report row (`PerQueryRow` + `NegativeRow`) now
+carries `top1_score` (the top-*ranked* hit's fused `Hit.score` — the cross-encoder
+rerank score when the rerank leg ran) and `top1_vec_cosine` (the cosine of the
+*best text-vector match in the corpus*, via a new `HybridSearcher.top_vector_cosine`
+probe — a different chunk than rank-0 once a reranker reorders, by design), so
+`expect_none` / OOD robustness is measurable. The probe is gated to vector-using
+modes, so a pure `--retrieval bm25` ablation stays embedder-free.
+`domains/info/search.py` gains only the **additive, read-only**
+`top_vector_cosine` method — `search()` and the ranking path are byte-untouched —
+so this is `no-baseline-needed`.
+
+**No-regression proof (hermetic, `FakeEmbeddings`, `cache_mode="off"`, packaged
+`mvp`):** retrieval metrics are unchanged — `hit@3 = hit@10 = mrr = ndcg@10 =
+recall@100 = 1.0`, identical to before the change (the new method never runs
+inside `search`). The new fields populate on every row.
+
+**Honest caveat: the cosine only discriminates with a real embedder.** On the
+same hermetic mvp run the positive and negative score ranges *overlap* (min
+positive `top1_vec_cosine` 0.334 < max negative 0.445) because `FakeEmbeddings`
+is a lexical bag-of-words embedder: off-corpus negatives ("weather in Tokyo")
+still share function words with the corpus. This is expected, not a defect: the
+feature *surfaces* the signal, and its discriminating power is the embedder's
+job, which is exactly what #249 makes measurable. With a real embedder, covered
+queries score ~0.7 and OOD ~0.2. The mechanism is pinned deterministically by
+`tests/test_eval_runner.py::test_eval_rows_surface_absolute_relevance_scores`
+(zero-lexical-overlap OOD query → covered cosine > negative cosine) and
+`tests/test_search.py::test_top_vector_cosine_discriminates_covered_vs_ood`.
+
 ## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250)
 
 **Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now

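Reviewer note: the overlap described in the BASELINES caveat (min positive 0.334 < max negative 0.445) can be checked mechanically from the report rows. The following is a minimal sketch, assuming the rows are the plain dicts produced by `to_dict()`, so each carries a `top1_vec_cosine` key that may be `None`. `ood_separation_margin` is a hypothetical helper name and is not part of this PR.

```python
from typing import Any, Mapping, Optional, Sequence


def ood_separation_margin(
    positives: Sequence[Mapping[str, Any]],
    negatives: Sequence[Mapping[str, Any]],
    key: str = "top1_vec_cosine",
) -> Optional[float]:
    """Return min(positive score) - max(negative score).

    A positive margin means every covered query out-scores every
    off-corpus query; a negative margin means the two ranges overlap.
    Rows whose score is None (e.g. a pure bm25 run) are ignored, and
    None is returned when either side has no usable score.
    """
    pos = [row[key] for row in positives if row.get(key) is not None]
    neg = [row[key] for row in negatives if row.get(key) is not None]
    if not pos or not neg:
        return None
    return min(pos) - max(neg)


# The two extreme values quoted in the BASELINES entry for the hermetic run.
positives = [{"q": "covered query", "top1_vec_cosine": 0.334}]
negatives = [{"q": "off-corpus query", "top1_vec_cosine": 0.445}]
margin = ood_separation_margin(positives, negatives)
print(f"margin={margin:.3f} overlap={margin < 0}")
```

A negative margin here reproduces the documented `FakeEmbeddings` overlap. With a real embedder the same check would be expected to turn positive, though that depends on the embedder and the dataset.
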
diff --git a/evals/README.md b/evals/README.md
@@ -56,6 +56,24 @@ Semantics: a query is a "hit at k" if *any* stem in `expect_any` appears
 in the top-k retrieval result. Paraphrases often live in multiple docs,
 and requiring *all* stems would be artificially punitive.
 
+Mark an out-of-domain query with `expect_none: true` (and no `expect_any`).
+Doc-level retrieval never abstains, so rank order carries no robustness
+signal for these — instead each report row (positive **and** negative)
+exposes two absolute-relevance numbers (which describe *different* chunks
+once a reranker is active, by design):
+
+- `top1_vec_cosine` — the cosine of the **best text-vector match anywhere in
+  the corpus** (max cosine, not necessarily the top-*ranked* hit), so it
+  answers "is any chunk close to this query?". This is the OOD signal: with a
+  **real** embedder a covered query scores high (~0.7) and an off-corpus one
+  low (~0.2). Caveats: text-vector only (the multimodal/asset leg isn't
+  probed); `None` for a pure `--retrieval bm25` run (kept embedder-free); and
+  under the hermetic `FakeEmbeddings` the cosine is lexical bag-of-words and
+  won't separate OOD cleanly — use a real embedder to calibrate.
+- `top1_score` — the **top-ranked** hit's fused `Hit.score` (the cross-encoder
+  rerank score when the rerank leg ran; otherwise the rank-based RRF score,
+  which is ~constant across queries and does *not* encode absolute relevance).
+
 ## Usage
 
 ```bash

diff --git a/evals/datasets/mvp/queries.yaml b/evals/datasets/mvp/queries.yaml
@@ -11,8 +11,11 @@
 # Stems:  karpathy-gist, karpathy-software-2-0, karpathy-recipe
 #
 # Mix: 3 in-domain (Karpathy essays) + 3 negatives = 6.
-# Negatives are observational — a Phase B answer-level judge (not Phase A
-# retrieval-only) would be the right place to gate them.
+# Negatives don't gate hit@k/MRR, but every report row (positive and
+# negative) now carries absolute relevance scores, top1_vec_cosine and
+# top1_score (#249), so OOD robustness is measurable: with a real embedder a
+# covered query out-scores an off-corpus one. (Under hermetic FakeEmbeddings the
+# cosine is lexical, so these particular negatives may not separate cleanly.)
 
 queries:
   # ---- In-domain (Karpathy essays) -----------------------------------------

diff --git a/src/dikw_core/domains/info/search.py b/src/dikw_core/domains/info/search.py
@@ -915,6 +915,49 @@ async def _embed_query_text(self, q: str) -> list[float] | None:
         vectors = await self._embedder.embed([q], model=self._embedding_model)
         return vectors[0] if vectors else None
 
+    async def top_vector_cosine(self, q: str) -> float | None:
+        """Eval-only: the absolute cosine similarity of the **best text-vector
+        match** in the corpus for ``q`` — i.e. ``1 - min(distance)`` over the
+        text vector index — independent of fusion mode and reranking.
+
+        This is NOT "the top-ranked hit's score": it is the single nearest chunk
+        by vector distance, which a reranker may have demoted below rank 0. That
+        is deliberate — it answers "is ANY chunk semantically close to this
+        query?", the discriminating signal for OOD / ``expect_none`` (covered
+        queries have a close match, off-corpus queries do not). The fused
+        ``Hit.score`` cannot carry it: RRF is rank-based and CombSUM/CombMNZ
+        min-max normalise each leg, so the top fused score is ~1.0 for *any*
+        query. The eval runner surfaces this per query; ``search`` itself never
+        calls it.
+
+        Text leg only — the multimodal/asset leg (which ``search`` also fuses on
+        a multimodal base) is intentionally out of scope for this text-OOD probe.
+
+        Similarity is ``1 - distance``: both shipped storage adapters score
+        ``vec_search`` by cosine distance unconditionally (sqlite pins a cosine
+        vec table; postgres uses ``<=>``), regardless of the
+        ``embedding_distance`` config field, so this holds for any base. Returns
+        ``None`` for a blank query, when no text vector leg is wired, when the
+        backend has no vector search / no active text version, or when the index
+        is empty.
+        """
+        if not q.strip():
+            return None
+        if self._embedder is None or self._embedding_model is None:
+            return None
+        q_vec = await self._embed_query_text(q)
+        if q_vec is None:  # pragma: no cover - defensive: embed() of a non-blank query never returns []
+            return None
+        try:
+            hits = await self._storage.vec_search(
+                q_vec, version_id=self._text_version_id, limit=1, layer=None
+            )
+        except NotSupported:
+            # Mirror search()'s vec leg (graceful_notsupported): a backend with
+            # no vector search / no active text version has no cosine to report.
+            return None
+        return None if not hits else 1.0 - hits[0].distance
+
     async def _embed_query_multimodal(self, q: str) -> list[float] | None:
         assert self._mm is not None
         vectors = await self._mm.embedder.embed(

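Reviewer note: the docstring's `1 - min(distance)` identity can be illustrated without any storage backend. Below is a brute-force sketch in plain Python, assuming cosine distance is defined as `1 - cosine similarity`, as the docstring states for both shipped adapters. `cosine_distance` and `best_cosine` are hypothetical names, and treating a zero-norm vector as orthogonal (distance 1.0) is an assumption made here only to avoid a division by zero.

```python
import math
from typing import Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, i.e. 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Assumption for this sketch: a zero vector counts as orthogonal.
        return 1.0
    return 1.0 - dot / norm


def best_cosine(
    query: Sequence[float], corpus: Sequence[Sequence[float]]
) -> Optional[float]:
    """Similarity of the single nearest corpus vector, or None if empty.

    Mirrors the shape of a limit=1 nearest-neighbour lookup: take the
    smallest distance and convert it back to a similarity.
    """
    if not corpus:
        return None
    return 1.0 - min(cosine_distance(query, v) for v in corpus)


corpus = [[1.0, 0.0], [0.0, 1.0]]
print(best_cosine([1.0, 0.0], corpus))  # covered: an identical vector exists
print(best_cosine([1.0, 0.0], [[0.0, 1.0]]))  # off-corpus: only an orthogonal one
print(best_cosine([1.0, 0.0], []))  # empty index
```

Because the probe takes the minimum distance over the whole index, its result does not depend on how fusion or a reranker later orders the hits, which is the property the docstring relies on.
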
diff --git a/src/dikw_core/eval/runner.py b/src/dikw_core/eval/runner.py
@@ -227,12 +227,25 @@ class PerQueryRow:
     ranked_docs: tuple[str, ...]
     ranked_chunks: tuple[str, ...]
     ranked_assets: tuple[str, ...]
+    # Absolute relevance signals for OOD / expect_none calibration (#249). These
+    # describe two DIFFERENT chunks once a reranker is active, by design:
+    # ``top1_score`` is the top-RANKED hit's fused ``Hit.score`` (the cross-encoder
+    # rerank score when the rerank leg ran; otherwise the rank-based fusion score,
+    # which is ~constant across queries and so non-discriminating on its own).
+    # ``top1_vec_cosine`` is the best vector match anywhere in the corpus (max
+    # cosine = ``1 - min distance``), fusion/rerank-independent — the signal that
+    # actually discriminates covered vs off-corpus queries ("is any chunk close?").
+    # ``None`` when there is no text vector leg (e.g. a pure ``bm25`` run).
+    top1_score: float | None = None
+    top1_vec_cosine: float | None = None
 
     def to_dict(self) -> dict[str, Any]:
         d: dict[str, Any] = {
             "q": self.q,
             "expect_any": list(self.expect_doc_any),
             "ranked": list(self.ranked_docs),
+            "top1_score": self.top1_score,
+            "top1_vec_cosine": self.top1_vec_cosine,
         }
         if self.q_id is not None:
             d["id"] = self.q_id
@@ -247,17 +260,27 @@ def to_dict(self) -> dict[str, Any]:
 
 @dataclass(frozen=True)
 class NegativeRow:
-    """One ``expect_none=True`` query's observed retrieval. Diagnostic only —
-    retrieval always returns something from a non-empty corpus, so there's no
-    pass/fail here yet; scoring negatives requires a score threshold or an
-    answer-level judge that we don't have in Phase A.
+    """One ``expect_none=True`` query's observed retrieval. Retrieval always
+    returns something from a non-empty corpus, so rank order alone carries no
+    pass/fail. Absolute relevance does, though: ``top1_score`` (the top-ranked
+    hit's fused ``Hit.score``; the rerank score when the rerank leg ran) and
+    ``top1_vec_cosine`` (best corpus-wide vector cosine, fusion/rerank-independent)
+    make OOD robustness measurable (a healthy engine scores off-corpus queries low).
+    A score-based ``negative_satisfaction`` metric can consume these later (#249).
     """
 
     q: str
     ranked: tuple[str, ...]  # doc stems, top SEARCH_LIMIT
+    top1_score: float | None = None
+    top1_vec_cosine: float | None = None
 
     def to_dict(self) -> dict[str, Any]:
-        return {"q": self.q, "ranked": list(self.ranked)}
+        return {
+            "q": self.q,
+            "ranked": list(self.ranked),
+            "top1_score": self.top1_score,
+            "top1_vec_cosine": self.top1_vec_cosine,
+        }
 
 
 class EvalReport(BaseModel):
@@ -347,15 +370,37 @@ def diagnostics_table(self) -> str:
             verdict = "—" if thr is None else ("pass" if v >= thr else "FAIL")
             thr_str = f"{thr:.3f}" if thr is not None else "    -"
             lines.append(f"{key:18s}  {v:6.3f}  {thr_str:9s}  {verdict}")
+        def _fmt(v: object) -> str:
+            return f"{v:.3f}" if isinstance(v, int | float) else "—"
+
+        def _short(q: str) -> str:
+            return q if len(q) <= 60 else q[:57] + "..."
+
         lines.append("")
         lines.append("per-query top-5 (doc view):")
         for row in self.per_query:
-            q_short = row["q"] if len(row["q"]) <= 60 else row["q"][:57] + "..."
             top5 = row["ranked"][:5]
             mark = "✓" if any(e in top5 for e in row["expect_any"]) else "✗"
-            lines.append(f"  {mark} {q_short}")
+            lines.append(f"  {mark} {_short(row['q'])}")
             lines.append(f"       expected: {row['expect_any']}")
             lines.append(f"       top-5:    {top5}")
+            lines.append(
+                f"       score:    top1={_fmt(row.get('top1_score'))} "
+                f"vec_cos={_fmt(row.get('top1_vec_cosine'))}"
+            )
+        # Negatives carry no rank verdict (retrieval never abstains), so the
+        # absolute top vector cosine IS the signal: a healthy engine has no chunk
+        # close to an off-corpus query. (``top1`` is the rerank score when a
+        # reranker ran — meaningful then, ~constant otherwise.) Surface both so a
+        # human can eyeball OOD separation (#249).
+        if self.negative_diagnostics:
+            lines.append("")
+            lines.append("expect_none / OOD negatives (lower vec_cos = better):")
+            for row in self.negative_diagnostics:
+                lines.append(
+                    f"  · {_short(row['q'])}  vec_cos={_fmt(row.get('top1_vec_cosine'))} "
+                    f"top1={_fmt(row.get('top1_score'))}"
+                )
         return "\n".join(lines)
 
 
@@ -993,13 +1038,34 @@ async def _run_queries(
             RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]]
         ] = {}
         total_q = len(spec.queries)
+        # The top vector cosine is fusion/mode-independent (it reads the raw vec
+        # leg, not the fused result), so compute it once per distinct query text
+        # and reuse across modes — one extra query-embed per distinct query for
+        # the whole run, not per mode. It is the absolute OOD / expect_none signal
+        # the fused ``Hit.score`` cannot carry (RRF is rank-based; CombSUM/CombMNZ
+        # normalise per leg) (#249). Gate it on a vector-using mode so a PURE
+        # ``bm25`` ablation stays embedder-free (its whole point is FTS-only,
+        # comparable to published BM25 baselines, and it must not crash if a real
+        # embedder is unreachable). When the only mode is ``bm25``, every row's
+        # ``top1_vec_cosine`` is ``None``.
+        vec_probe_active = any(m in ("vector", "hybrid") for m in modes)
+        vec_cosine_by_q: dict[str, float | None] = {}
         for m in modes:
             positives: list[PerQueryRow] = []
             negatives: list[NegativeRow] = []
             for q_idx, q in enumerate(spec.queries, start=1):
                 if reporter is not None:
                     reporter.cancel_token().raise_if_cancelled()
                 hits = await searcher.search(q.q, limit=SEARCH_LIMIT, mode=m)
+                if vec_probe_active and q.q not in vec_cosine_by_q:
+                    vec_cosine_by_q[q.q] = await searcher.top_vector_cosine(q.q)
+                # ``top1_score`` is the top-RANKED hit's fused Hit.score (the
+                # cross-encoder rerank score when the rerank leg ran);
+                # ``top1_vec_cosine`` is the best vector match in the corpus
+                # (max cosine) — a different chunk than rank-0 once a reranker
+                # reorders, by design (it answers "is any chunk close?").
+                top1_score = hits[0].score if hits else None
+                top1_vec_cosine = vec_cosine_by_q.get(q.q)
                 ranked_docs = _project_doc_view(hits)
                 ranked_chunks = _project_chunk_view(
                     hits, chunk_runtime_to_named
@@ -1009,7 +1075,12 @@ async def _run_queries(
                 )
                 if q.expect_none:
                     negatives.append(
-                        NegativeRow(q=q.q, ranked=tuple(ranked_docs))
+                        NegativeRow(
+                            q=q.q,
+                            ranked=tuple(ranked_docs),
+                            top1_score=top1_score,
+                            top1_vec_cosine=top1_vec_cosine,
+                        )
                     )
                 else:
                     positives.append(
@@ -1022,6 +1093,8 @@ async def _run_queries(
                             ranked_docs=tuple(ranked_docs),
                             ranked_chunks=tuple(ranked_chunks),
                             ranked_assets=tuple(ranked_assets),
+                            top1_score=top1_score,
+                            top1_vec_cosine=top1_vec_cosine,
                         )
                     )
                 if reporter is not None:

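Reviewer note: the `NegativeRow` docstring mentions that a score-based `negative_satisfaction` metric could consume these fields later, but the PR does not define it. One possible shape is sketched below, purely as an assumption: the fraction of negatives whose `top1_vec_cosine` falls below a threshold. The function signature and the threshold value are hypothetical and would need calibrating against a real embedder.

```python
from typing import Any, Mapping, Optional, Sequence


def negative_satisfaction(
    negatives: Sequence[Mapping[str, Any]],
    threshold: float,
    key: str = "top1_vec_cosine",
) -> Optional[float]:
    """Fraction of expect_none rows scoring strictly below `threshold`.

    Rows without a score (None, e.g. from a pure bm25 run) are skipped.
    Returns None when no row has a usable score, so callers can tell
    "nothing to measure" apart from "0% satisfied".
    """
    scores = [row[key] for row in negatives if row.get(key) is not None]
    if not scores:
        return None
    satisfied = sum(1 for s in scores if s < threshold)
    return satisfied / len(scores)


# Hypothetical rows shaped like NegativeRow.to_dict() output.
rows = [
    {"q": "off-corpus A", "ranked": [], "top1_vec_cosine": 0.2},
    {"q": "off-corpus B", "ranked": [], "top1_vec_cosine": 0.445},
    {"q": "bm25-only row", "ranked": [], "top1_vec_cosine": None},
]
print(negative_satisfaction(rows, threshold=0.3))
```

Skipping `None` rows keeps a mixed report from being penalized for rows where the probe was gated off. Whether that is the right policy is a design choice this sketch does not settle.
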
diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py
@@ -315,6 +315,82 @@ async def test_run_eval_exposes_negative_diagnostics(tmp_path: Path) -> None:
     assert isinstance(neg["ranked"], list)
 
 
+@pytest.mark.asyncio
+async def test_eval_rows_surface_absolute_relevance_scores(tmp_path: Path) -> None:
+    """#249: per-query AND negative rows carry an absolute relevance score —
+    ``top1_score`` (the top hit's ``Hit.score``) and ``top1_vec_cosine`` (the
+    raw top-1 vector cosine) — so ``expect_none`` / OOD robustness becomes
+    measurable. The key signal: a covered query's ``top1_vec_cosine`` exceeds an
+    off-corpus query's, which the fused ``Hit.score`` (RRF rank-based) cannot
+    show on its own.
+    """
+    ds = _write_dataset(tmp_path, queries=[("foo and bar topics", ["alpha"])])
+    (ds / "queries.yaml").write_text(
+        "queries:\n"
+        "  - q: foo and bar topics\n"
+        "    expect_any: [alpha]\n"
+        "  - q: totally unrelated quantum gibberish\n"
+        "    expect_none: true\n",
+        encoding="utf-8",
+    )
+    spec = load_dataset(ds)
+    report = await run_eval(spec)
+
+    assert len(report.per_query) == 1
+    pos = report.per_query[0]
+    assert isinstance(pos["top1_score"], float)
+    assert isinstance(pos["top1_vec_cosine"], float)
+
+    assert len(report.negative_diagnostics) == 1
+    neg = report.negative_diagnostics[0]
+    assert isinstance(neg["top1_score"], float)
+    assert isinstance(neg["top1_vec_cosine"], float)
+
+    # OOD separation: the covered query scores a higher absolute vector cosine
+    # than the off-corpus negative — the whole point of surfacing the score.
+    assert pos["top1_vec_cosine"] > neg["top1_vec_cosine"]
+
+    # The human-readable breakdown renders both scores per positive and a
+    # dedicated negatives section (so OOD separation is eyeball-able).
+    table = report.diagnostics_table()
+    assert "vec_cos=" in table
+    assert "expect_none / OOD negatives" in table
+    assert "totally unrelated quantum gibberish" in table
+
+
+@pytest.mark.asyncio
+async def test_eval_bm25_mode_skips_vector_cosine_probe(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    """#249 / bm25 purity: a pure ``--retrieval bm25`` eval must stay
+    embedder-free. ``search(mode="bm25")`` makes no embed call, and the OOD cosine
+    probe is gated off too, so it issues zero extra embeds and a real-embedder
+    bm25 ablation can't crash on an unreachable embedder. Rows get
+    ``top1_vec_cosine=None`` while ``top1_score`` (the fused bm25 score) still
+    surfaces.
+    """
+    from dikw_core.eval import runner as runner_mod
+
+    ds = _write_dataset(tmp_path, queries=[("foo and bar topics", ["alpha"])])
+    spec = load_dataset(ds)
+
+    calls = 0
+    real_probe = runner_mod.HybridSearcher.top_vector_cosine
+
+    async def spy_probe(self, q):  # type: ignore[no-untyped-def]
+        nonlocal calls
+        calls += 1
+        return await real_probe(self, q)
+
+    monkeypatch.setattr(runner_mod.HybridSearcher, "top_vector_cosine", spy_probe)
+
+    report = await run_eval(spec, mode="bm25")
+    assert calls == 0, "pure bm25 eval must not invoke the vector-cosine probe"
+    assert report.per_query
+    assert all(r["top1_vec_cosine"] is None for r in report.per_query)
+    assert all(isinstance(r["top1_score"], float) for r in report.per_query)
+
+
 @pytest.mark.asyncio
 async def test_run_eval_all_negative_dataset_metrics_empty(tmp_path: Path) -> None:
     """A dataset of only negatives produces no hit@k/MRR values — nothing