diff --git a/evals/BASELINES.md b/evals/BASELINES.md index eecc447..1bc1ca9 100644 --- a/evals/BASELINES.md +++ b/evals/BASELINES.md @@ -7,6 +7,36 @@ regression from a re-run variance. Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml` are calibrated ~2-3 % below the most recent canonical-mode run. +## 2026-06-28 — eval rows: surface absolute relevance scores (#249) + +**Change under test:** every eval report row (`PerQueryRow` + `NegativeRow`) now +carries `top1_score` (the top-*ranked* hit's fused `Hit.score` — the cross-encoder +rerank score when the rerank leg ran) and `top1_vec_cosine` (the cosine of the +*best text-vector match in the corpus*, via a new `HybridSearcher.top_vector_cosine` +probe — a different chunk than rank-0 once a reranker reorders, by design), so +`expect_none` / OOD robustness is measurable. The probe is gated to vector-using +modes, so a pure `--retrieval bm25` ablation stays embedder-free. +`domains/info/search.py` gains only the **additive, read-only** +`top_vector_cosine` method — `search()` and the ranking path are byte-untouched — +so this is `no-baseline-needed`. + +**No-regression proof (hermetic, `FakeEmbeddings`, `cache_mode="off"`, packaged +`mvp`):** retrieval metrics are unchanged — `hit@3 = hit@10 = mrr = ndcg@10 = +recall@100 = 1.0`, identical to before the change (the new method never runs +inside `search`). The new fields populate on every row. + +**Honest caveat — the cosine only discriminates with a real embedder.** On the +same hermetic mvp run the OOD separation *overlaps* (min positive +`top1_vec_cosine` 0.334 < max negative 0.445) because `FakeEmbeddings` is a +lexical bag-of-words: off-corpus negatives ("weather in Tokyo") still share +function words with the corpus. This is expected, not a defect — the feature +*surfaces* the signal; its discriminating power is the embedder's job, which is +exactly what #249 makes measurable. A real embedder gives covered ~0.7 vs OOD +~0.2. 
The mechanism is pinned deterministically by +`tests/test_eval_runner.py::test_eval_rows_surface_absolute_relevance_scores` +(zero-lexical-overlap OOD query → covered cosine > negative cosine) and +`tests/test_search.py::test_top_vector_cosine_discriminates_covered_vs_ood`. + ## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250) **Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now diff --git a/evals/README.md b/evals/README.md index 0adaafa..1f6a2fc 100644 --- a/evals/README.md +++ b/evals/README.md @@ -56,6 +56,24 @@ Semantics: a query is a "hit at k" if *any* stem in `expect_any` appears in the top-k retrieval result. Paraphrases often live in multiple docs, and requiring *all* stems would be artificially punitive. +Mark an out-of-domain query with `expect_none: true` (and no `expect_any`). +Doc-level retrieval never abstains, so rank order carries no robustness +signal for these — instead each report row (positive **and** negative) +exposes two absolute-relevance numbers (which describe *different* chunks +once a reranker is active, by design): + +- `top1_vec_cosine` — the cosine of the **best text-vector match anywhere in + the corpus** (max cosine, not necessarily the top-*ranked* hit), so it + answers "is any chunk close to this query?". This is the OOD signal: with a + **real** embedder a covered query scores high (~0.7) and an off-corpus one + low (~0.2). Caveats: text-vector only (the multimodal/asset leg isn't + probed); `None` for a pure `--retrieval bm25` run (kept embedder-free); and + under the hermetic `FakeEmbeddings` the cosine is lexical bag-of-words and + won't separate OOD cleanly — use a real embedder to calibrate. +- `top1_score` — the **top-ranked** hit's fused `Hit.score` (the cross-encoder + rerank score when the rerank leg ran; otherwise the rank-based RRF score, + which is ~constant across queries and does *not* encode absolute relevance). 
+ ## Usage ```bash diff --git a/evals/datasets/mvp/queries.yaml b/evals/datasets/mvp/queries.yaml index d1f190b..99e6fb5 100644 --- a/evals/datasets/mvp/queries.yaml +++ b/evals/datasets/mvp/queries.yaml @@ -11,8 +11,11 @@ # Stems: karpathy-gist, karpathy-software-2-0, karpathy-recipe # # Mix: 3 in-domain (Karpathy essays) + 3 negatives = 6. -# Negatives are observational — a Phase B answer-level judge (not Phase A -# retrieval-only) would be the right place to gate them. +# Negatives don't gate hit@k/MRR, but every report row (positive and +# negative) now carries an absolute relevance — top1_vec_cosine + top1_score +# (#249) — so OOD robustness is measurable: with a real embedder a covered +# query out-scores an off-corpus one. (Under hermetic FakeEmbeddings the +# cosine is lexical, so these particular negatives may not separate cleanly.) queries: # ---- In-domain (Karpathy essays) ----------------------------------------- diff --git a/src/dikw_core/domains/info/search.py b/src/dikw_core/domains/info/search.py index 450e7dc..a9f2a6e 100644 --- a/src/dikw_core/domains/info/search.py +++ b/src/dikw_core/domains/info/search.py @@ -915,6 +915,49 @@ async def _embed_query_text(self, q: str) -> list[float] | None: vectors = await self._embedder.embed([q], model=self._embedding_model) return vectors[0] if vectors else None + async def top_vector_cosine(self, q: str) -> float | None: + """Eval-only: the absolute cosine similarity of the **best text-vector + match** in the corpus for ``q`` — i.e. ``1 - min(distance)`` over the + text vector index — independent of fusion mode and reranking. + + This is NOT "the top-ranked hit's score": it is the single nearest chunk + by vector distance, which a reranker may have demoted below rank 0. That + is deliberate — it answers "is ANY chunk semantically close to this + query?", the discriminating signal for OOD / ``expect_none`` (covered + queries have a close match, off-corpus queries do not). 
The fused + ``Hit.score`` cannot carry it: RRF is rank-based and CombSUM/CombMNZ + min-max normalise each leg, so the top fused score is ~1.0 for *any* + query. The eval runner surfaces this per query; ``search`` itself never + calls it. + + Text leg only — the multimodal/asset leg (which ``search`` also fuses on + a multimodal base) is intentionally out of scope for this text-OOD probe. + + Similarity is ``1 - distance``: both shipped storage adapters score + ``vec_search`` by cosine distance unconditionally (sqlite pins a cosine + vec table; postgres uses ``<=>``), regardless of the + ``embedding_distance`` config field, so this holds for any base. Returns + ``None`` for a blank query, when no text vector leg is wired, when the + backend has no vector search / no active text version, or when the index + is empty. + """ + if not q.strip(): + return None + if self._embedder is None or self._embedding_model is None: + return None + q_vec = await self._embed_query_text(q) + if q_vec is None: # pragma: no cover - defensive: embed() of a non-blank query never returns [] + return None + try: + hits = await self._storage.vec_search( + q_vec, version_id=self._text_version_id, limit=1, layer=None + ) + except NotSupported: + # Mirror search()'s vec leg (graceful_notsupported): a backend with + # no vector search / no active text version has no cosine to report. + return None + return None if not hits else 1.0 - hits[0].distance + async def _embed_query_multimodal(self, q: str) -> list[float] | None: assert self._mm is not None vectors = await self._mm.embedder.embed( diff --git a/src/dikw_core/eval/runner.py b/src/dikw_core/eval/runner.py index 1aa7685..1ddce9c 100644 --- a/src/dikw_core/eval/runner.py +++ b/src/dikw_core/eval/runner.py @@ -227,12 +227,25 @@ class PerQueryRow: ranked_docs: tuple[str, ...] ranked_chunks: tuple[str, ...] ranked_assets: tuple[str, ...] + # Absolute relevance signals for OOD / expect_none calibration (#249). 
These + # describe two DIFFERENT chunks once a reranker is active, by design: + # ``top1_score`` is the top-RANKED hit's fused ``Hit.score`` (the cross-encoder + # rerank score when the rerank leg ran; otherwise the rank-based fusion score, + # which is ~constant across queries and so non-discriminating on its own). + # ``top1_vec_cosine`` is the best vector match anywhere in the corpus (max + # cosine = ``1 - min distance``), fusion/rerank-independent — the signal that + # actually discriminates covered vs off-corpus queries ("is any chunk close?"). + # ``None`` when there is no text vector leg (e.g. a pure ``bm25`` run). + top1_score: float | None = None + top1_vec_cosine: float | None = None def to_dict(self) -> dict[str, Any]: d: dict[str, Any] = { "q": self.q, "expect_any": list(self.expect_doc_any), "ranked": list(self.ranked_docs), + "top1_score": self.top1_score, + "top1_vec_cosine": self.top1_vec_cosine, } if self.q_id is not None: d["id"] = self.q_id @@ -247,17 +260,27 @@ def to_dict(self) -> dict[str, Any]: @dataclass(frozen=True) class NegativeRow: - """One ``expect_none=True`` query's observed retrieval. Diagnostic only — - retrieval always returns something from a non-empty corpus, so there's no - pass/fail here yet; scoring negatives requires a score threshold or an - answer-level judge that we don't have in Phase A. + """One ``expect_none=True`` query's observed retrieval. Retrieval always + returns something from a non-empty corpus, so rank order alone carries no + pass/fail. The absolute relevance of the top hit does, though: ``top1_score`` + (top fused ``Hit.score`` — the rerank score when the rerank leg ran) and + ``top1_vec_cosine`` (raw top-1 vector cosine, fusion/rerank-independent) make + OOD robustness measurable (a healthy engine scores an off-corpus query low). + A score-based ``negative_satisfaction`` metric can consume these later (#249). """ q: str ranked: tuple[str, ...] 
# doc stems, top SEARCH_LIMIT + top1_score: float | None = None + top1_vec_cosine: float | None = None def to_dict(self) -> dict[str, Any]: - return {"q": self.q, "ranked": list(self.ranked)} + return { + "q": self.q, + "ranked": list(self.ranked), + "top1_score": self.top1_score, + "top1_vec_cosine": self.top1_vec_cosine, + } class EvalReport(BaseModel): @@ -347,15 +370,37 @@ def diagnostics_table(self) -> str: verdict = "—" if thr is None else ("pass" if v >= thr else "FAIL") thr_str = f"{thr:.3f}" if thr is not None else " -" lines.append(f"{key:18s} {v:6.3f} {thr_str:9s} {verdict}") + def _fmt(v: object) -> str: + return f"{v:.3f}" if isinstance(v, int | float) else "—" + + def _short(q: str) -> str: + return q if len(q) <= 60 else q[:57] + "..." + lines.append("") lines.append("per-query top-5 (doc view):") for row in self.per_query: - q_short = row["q"] if len(row["q"]) <= 60 else row["q"][:57] + "..." top5 = row["ranked"][:5] mark = "✓" if any(e in top5 for e in row["expect_any"]) else "✗" - lines.append(f" {mark} {q_short}") + lines.append(f" {mark} {_short(row['q'])}") lines.append(f" expected: {row['expect_any']}") lines.append(f" top-5: {top5}") + lines.append( + f" score: top1={_fmt(row.get('top1_score'))} " + f"vec_cos={_fmt(row.get('top1_vec_cosine'))}" + ) + # Negatives carry no rank verdict (retrieval never abstains), so the + # absolute top vector cosine IS the signal: a healthy engine has no chunk + # close to an off-corpus query. (``top1`` is the rerank score when a + # reranker ran — meaningful then, ~constant otherwise.) Surface both so a + # human can eyeball OOD separation (#249). 
+ if self.negative_diagnostics: + lines.append("") + lines.append("expect_none / OOD negatives (lower vec_cos = better):") + for row in self.negative_diagnostics: + lines.append( + f" · {_short(row['q'])} vec_cos={_fmt(row.get('top1_vec_cosine'))} " + f"top1={_fmt(row.get('top1_score'))}" + ) return "\n".join(lines) @@ -993,6 +1038,18 @@ async def _run_queries( RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]] ] = {} total_q = len(spec.queries) + # The top vector cosine is fusion/mode-independent (it reads the raw vec + # leg, not the fused result), so compute it once per distinct query text + # and reuse across modes — one extra query-embed per distinct query for + # the whole run, not per mode. It is the absolute OOD / expect_none signal + # the fused ``Hit.score`` cannot carry (RRF is rank-based; CombSUM/CombMNZ + # normalise per leg) (#249). Gate it on a vector-using mode so a PURE + # ``bm25`` ablation stays embedder-free (its whole point is FTS-only, + # comparable to published BM25 baselines, and it must not crash if a real + # embedder is unreachable). When the only mode is ``bm25``, every row's + # ``top1_vec_cosine`` is ``None``. + vec_probe_active = any(m in ("vector", "hybrid") for m in modes) + vec_cosine_by_q: dict[str, float | None] = {} for m in modes: positives: list[PerQueryRow] = [] negatives: list[NegativeRow] = [] @@ -1000,6 +1057,15 @@ async def _run_queries( if reporter is not None: reporter.cancel_token().raise_if_cancelled() hits = await searcher.search(q.q, limit=SEARCH_LIMIT, mode=m) + if vec_probe_active and q.q not in vec_cosine_by_q: + vec_cosine_by_q[q.q] = await searcher.top_vector_cosine(q.q) + # ``top1_score`` is the top-RANKED hit's fused Hit.score (the + # cross-encoder rerank score when the rerank leg ran); + # ``top1_vec_cosine`` is the best vector match in the corpus + # (max cosine) — a different chunk than rank-0 once a reranker + # reorders, by design (it answers "is any chunk close?"). 
+ top1_score = hits[0].score if hits else None + top1_vec_cosine = vec_cosine_by_q.get(q.q) ranked_docs = _project_doc_view(hits) ranked_chunks = _project_chunk_view( hits, chunk_runtime_to_named @@ -1009,7 +1075,12 @@ async def _run_queries( ) if q.expect_none: negatives.append( - NegativeRow(q=q.q, ranked=tuple(ranked_docs)) + NegativeRow( + q=q.q, + ranked=tuple(ranked_docs), + top1_score=top1_score, + top1_vec_cosine=top1_vec_cosine, + ) ) else: positives.append( @@ -1022,6 +1093,8 @@ async def _run_queries( ranked_docs=tuple(ranked_docs), ranked_chunks=tuple(ranked_chunks), ranked_assets=tuple(ranked_assets), + top1_score=top1_score, + top1_vec_cosine=top1_vec_cosine, ) ) if reporter is not None: diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py index 22ca6ab..cdc4532 100644 --- a/tests/test_eval_runner.py +++ b/tests/test_eval_runner.py @@ -315,6 +315,82 @@ async def test_run_eval_exposes_negative_diagnostics(tmp_path: Path) -> None: assert isinstance(neg["ranked"], list) +@pytest.mark.asyncio +async def test_eval_rows_surface_absolute_relevance_scores(tmp_path: Path) -> None: + """#249: per-query AND negative rows carry an absolute relevance score — + ``top1_score`` (the top hit's ``Hit.score``) and ``top1_vec_cosine`` (the + raw top-1 vector cosine) — so ``expect_none`` / OOD robustness becomes + measurable. The key signal: a covered query's ``top1_vec_cosine`` exceeds an + off-corpus query's, which the fused ``Hit.score`` (RRF rank-based) cannot + show on its own. 
+ """ + ds = _write_dataset(tmp_path, queries=[("foo and bar topics", ["alpha"])]) + (ds / "queries.yaml").write_text( + "queries:\n" + " - q: foo and bar topics\n" + " expect_any: [alpha]\n" + " - q: totally unrelated quantum gibberish\n" + " expect_none: true\n", + encoding="utf-8", + ) + spec = load_dataset(ds) + report = await run_eval(spec) + + assert len(report.per_query) == 1 + pos = report.per_query[0] + assert isinstance(pos["top1_score"], float) + assert isinstance(pos["top1_vec_cosine"], float) + + assert len(report.negative_diagnostics) == 1 + neg = report.negative_diagnostics[0] + assert isinstance(neg["top1_score"], float) + assert isinstance(neg["top1_vec_cosine"], float) + + # OOD separation: the covered query scores a higher absolute vector cosine + # than the off-corpus negative — the whole point of surfacing the score. + assert pos["top1_vec_cosine"] > neg["top1_vec_cosine"] + + # The human-readable breakdown renders both scores per positive and a + # dedicated negatives section (so OOD separation is eyeball-able). + table = report.diagnostics_table() + assert "vec_cos=" in table + assert "expect_none / OOD negatives" in table + assert "totally unrelated quantum gibberish"[:57] in table + + +@pytest.mark.asyncio +async def test_eval_bm25_mode_skips_vector_cosine_probe( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """#249 / bm25 purity: a pure ``--retrieval bm25`` eval must stay + embedder-free — `search(mode="bm25")` makes no embed call, and the OOD cosine + probe is gated off too, so it issues zero extra embeds and a real-embedder + bm25 ablation can't crash on an unreachable embedder. Rows get + ``top1_vec_cosine=None`` while ``top1_score`` (the fused bm25 score) still + surfaces. 
+ """ + from dikw_core.eval import runner as runner_mod + + ds = _write_dataset(tmp_path, queries=[("foo and bar topics", ["alpha"])]) + spec = load_dataset(ds) + + calls = 0 + real_probe = runner_mod.HybridSearcher.top_vector_cosine + + async def spy_probe(self, q): # type: ignore[no-untyped-def] + nonlocal calls + calls += 1 + return await real_probe(self, q) + + monkeypatch.setattr(runner_mod.HybridSearcher, "top_vector_cosine", spy_probe) + + report = await run_eval(spec, mode="bm25") + assert calls == 0, "pure bm25 eval must not invoke the vector-cosine probe" + assert report.per_query + assert all(r["top1_vec_cosine"] is None for r in report.per_query) + assert all(isinstance(r["top1_score"], float) for r in report.per_query) + + @pytest.mark.asyncio async def test_run_eval_all_negative_dataset_metrics_empty(tmp_path: Path) -> None: """A dataset of only negatives produces no hit@k/MRR values — nothing diff --git a/tests/test_search.py b/tests/test_search.py index 8d20580..2bf79e0 100644 --- a/tests/test_search.py +++ b/tests/test_search.py @@ -437,6 +437,59 @@ async def _populate_fixture_corpus(storage): return embedder, version_id +# ---- top_vector_cosine (eval OOD probe, #249) ------------------------------- + + +@pytest.mark.asyncio +async def test_top_vector_cosine_discriminates_covered_vs_ood(tmp_path) -> None: + """``top_vector_cosine`` surfaces the absolute top-1 cosine similarity — a + covered query scores higher than an out-of-corpus one. This is the OOD / + expect_none signal the fused ``Hit.score`` cannot carry (RRF is rank-based; + CombSUM/CombMNZ min-max normalise per leg), so #249's eval rows read it + straight from the vector leg. 
+ """ + storage = SQLiteStorage(tmp_path / "idx.sqlite") + await storage.connect() + await storage.migrate() + embedder, version_id = await _populate_fixture_corpus(storage) + searcher = HybridSearcher( + storage, embedder, embedding_model="fake", text_version_id=version_id + ) + covered = await searcher.top_vector_cosine( + "DIKW pyramid data information knowledge wisdom" + ) + ood = await searcher.top_vector_cosine("zzzqqq nonexistent gibberish token") + assert covered is not None and ood is not None + assert covered > ood, f"covered ({covered}) must out-score OOD ({ood})" + # Blank / whitespace-only query → None (mirrors search()'s empty-query guard). + assert await searcher.top_vector_cosine(" ") is None + await storage.close() + + +@pytest.mark.asyncio +async def test_top_vector_cosine_none_without_text_vector_leg(tmp_path) -> None: + """No text vector leg wired (no embedder) → ``None``, gracefully.""" + storage = SQLiteStorage(tmp_path / "idx.sqlite") + await storage.connect() + await storage.migrate() + searcher = HybridSearcher(storage, None) + assert await searcher.top_vector_cosine("anything") is None + await storage.close() + + +@pytest.mark.asyncio +async def test_top_vector_cosine_none_when_no_text_version(tmp_path) -> None: + """Embedder wired but no text embeddings indexed yet → ``vec_search`` raises + ``NotSupported``; the probe degrades to ``None`` (mirrors ``search``'s + graceful vec leg) rather than propagating.""" + storage = SQLiteStorage(tmp_path / "idx.sqlite") + await storage.connect() + await storage.migrate() + searcher = HybridSearcher(storage, FakeEmbeddings(), embedding_model="fake") + assert await searcher.top_vector_cosine("anything") is None + await storage.close() + + @pytest.mark.asyncio async def test_mode_bm25_skips_vector_leg(tmp_path) -> None: storage = SQLiteStorage(tmp_path / "idx.sqlite")
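
Reviewer sketch (not part of this diff): the `NegativeRow` docstring says a score-based `negative_satisfaction` metric "can consume these later", and BASELINES.md describes OOD separation as min positive `top1_vec_cosine` versus max negative. The snippet below is a minimal illustration of how the row dicts emitted by `to_dict()` could feed such a check. The function names, their signatures, and the `threshold` parameter are hypothetical, not anything the PR defines; only the `top1_vec_cosine` key and its `None`-for-bm25 behaviour come from the diff.

```python
from __future__ import annotations

from typing import Any, Mapping, Sequence

Row = Mapping[str, Any]


def _cosines(rows: Sequence[Row]) -> list[float]:
    """Collect the non-None ``top1_vec_cosine`` values.

    ``None`` means the probe did not run (e.g. a pure bm25 eval), so those
    rows are skipped rather than treated as zero.
    """
    return [
        r["top1_vec_cosine"] for r in rows if r.get("top1_vec_cosine") is not None
    ]


def ood_separation_margin(
    per_query: Sequence[Row], negatives: Sequence[Row]
) -> float | None:
    """Hypothetical helper: min positive cosine minus max negative cosine.

    A positive margin means every covered query out-scores every off-corpus
    one; a negative margin means the two groups overlap.
    """
    pos, neg = _cosines(per_query), _cosines(negatives)
    if not pos or not neg:
        return None
    return min(pos) - max(neg)


def negative_satisfaction(
    negatives: Sequence[Row], threshold: float
) -> float | None:
    """Hypothetical metric: fraction of negatives whose best vector cosine
    falls below ``threshold`` (an assumed, embedder-specific cut-off)."""
    neg = _cosines(negatives)
    if not neg:
        return None
    return sum(c < threshold for c in neg) / len(neg)
```

With the hermetic numbers quoted in BASELINES.md (min positive 0.334, max negative 0.445) the margin comes out negative, which matches the stated overlap under `FakeEmbeddings`. Any threshold would have to be calibrated against a real embedder, as the README caveat already notes.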