Merged
30 changes: 30 additions & 0 deletions evals/BASELINES.md
@@ -7,6 +7,36 @@ regression from a re-run variance.
Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml`
are calibrated ~2-3 % below the most recent canonical-mode run.

## 2026-06-28 — eval rows: surface absolute relevance scores (#249)

**Change under test:** every eval report row (`PerQueryRow` + `NegativeRow`) now
carries `top1_score` (the top-*ranked* hit's fused `Hit.score` — the cross-encoder
rerank score when the rerank leg ran) and `top1_vec_cosine` (the cosine of the
*best text-vector match in the corpus*, via a new `HybridSearcher.top_vector_cosine`
probe — a different chunk than rank-0 once a reranker reorders, by design), so
`expect_none` / OOD robustness is measurable. The probe is gated to vector-using
modes, so a pure `--retrieval bm25` ablation stays embedder-free.
`domains/info/search.py` gains only the **additive, read-only**
`top_vector_cosine` method — `search()` and the ranking path are byte-untouched —
so this is `no-baseline-needed`.

**No-regression proof (hermetic, `FakeEmbeddings`, `cache_mode="off"`, packaged
`mvp`):** retrieval metrics are unchanged — `hit@3 = hit@10 = mrr = ndcg@10 =
recall@100 = 1.0`, identical to before the change (the new method never runs
inside `search`). The new fields populate on every row.

**Honest caveat — the cosine only discriminates with a real embedder.** On the
same hermetic mvp run the OOD separation *overlaps* (min positive
`top1_vec_cosine` 0.334 < max negative 0.445) because `FakeEmbeddings` is a
lexical bag-of-words: off-corpus negatives ("weather in Tokyo") still share
function words with the corpus. This is expected, not a defect — the feature
*surfaces* the signal; its discriminating power is the embedder's job, which is
exactly what #249 makes measurable. A real embedder gives covered ~0.7 vs OOD
~0.2. The mechanism is pinned deterministically by
`tests/test_eval_runner.py::test_eval_rows_surface_absolute_relevance_scores`
(zero-lexical-overlap OOD query → covered cosine > negative cosine) and
`tests/test_search.py::test_top_vector_cosine_discriminates_covered_vs_ood`.
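The overlap described in the caveat can be reproduced with a toy bag-of-words cosine. This is only a sketch: `bow_cosine` and the sample strings are hypothetical, and it is not the `FakeEmbeddings` implementation, just an illustration of why a lexical embedding gives an off-corpus query a non-zero cosine as soon as it shares a function word with a chunk.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between two lowercase, whitespace-tokenised bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical corpus chunk and queries, for illustration only.
chunk = "the recipe for training neural networks in practice"
covered = "recipe for training neural networks"
ood_shared = "weather in Tokyo"      # shares the function word "in" with the chunk
ood_disjoint = "quantum gibberish"   # zero lexical overlap with the chunk

print(bow_cosine(covered, chunk))
print(bow_cosine(ood_shared, chunk))
print(bow_cosine(ood_disjoint, chunk))
```

Only the zero-overlap query scores exactly zero, which is presumably why the pinned tests use a zero-lexical-overlap OOD query.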

## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250)

**Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now
18 changes: 18 additions & 0 deletions evals/README.md
@@ -56,6 +56,24 @@ Semantics: a query is a "hit at k" if *any* stem in `expect_any` appears
in the top-k retrieval result. Paraphrases often live in multiple docs,
and requiring *all* stems would be artificially punitive.
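As a minimal sketch of the "hit at k" semantics above (the helper name `hit_at_k` is hypothetical and not taken from the eval runner; the stems come from the mvp dataset, the ranking is made up):

```python
def hit_at_k(ranked: list[str], expect_any: list[str], k: int) -> bool:
    """True if any expected doc stem appears among the top-k ranked stems."""
    top_k = ranked[:k]
    return any(stem in top_k for stem in expect_any)


# Stems taken from the mvp dataset; the ranking itself is invented.
ranked = ["karpathy-recipe", "karpathy-gist", "karpathy-software-2-0"]

print(hit_at_k(ranked, ["karpathy-gist", "karpathy-software-2-0"], k=3))  # True
print(hit_at_k(ranked, ["karpathy-software-2-0"], k=2))                   # False
```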

Mark an out-of-domain query with `expect_none: true` (and no `expect_any`).
Doc-level retrieval never abstains, so rank order carries no robustness
signal for these — instead each report row (positive **and** negative)
exposes two absolute-relevance numbers (which describe *different* chunks
once a reranker is active, by design):

- `top1_vec_cosine` — the cosine of the **best text-vector match anywhere in
the corpus** (max cosine, not necessarily the top-*ranked* hit), so it
answers "is any chunk close to this query?". This is the OOD signal: with a
**real** embedder a covered query scores high (~0.7) and an off-corpus one
low (~0.2). Caveats: text-vector only (the multimodal/asset leg isn't
probed); `None` for a pure `--retrieval bm25` run (kept embedder-free); and
under the hermetic `FakeEmbeddings` the cosine is lexical bag-of-words and
won't separate OOD cleanly — use a real embedder to calibrate.
- `top1_score` — the **top-ranked** hit's fused `Hit.score` (the cross-encoder
rerank score when the rerank leg ran; otherwise the rank-based RRF score,
which is ~constant across queries and does *not* encode absolute relevance).
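One way these two fields might be consumed when calibrating, sketched under assumptions: the rows are plain dicts carrying the `top1_vec_cosine` key described above, `ood_gap` is a hypothetical helper that does not exist in the repository, and the cosine values are invented.

```python
from __future__ import annotations


def ood_gap(per_query: list[dict], negatives: list[dict]) -> float | None:
    """Min positive cosine minus max negative cosine.

    A positive gap means a single cosine threshold separates covered from
    off-corpus queries; a negative gap means the two ranges overlap.
    Returns None when either side has no cosine (e.g. a pure bm25 run).
    """
    pos = [r["top1_vec_cosine"] for r in per_query if r.get("top1_vec_cosine") is not None]
    neg = [r["top1_vec_cosine"] for r in negatives if r.get("top1_vec_cosine") is not None]
    if not pos or not neg:
        return None
    return min(pos) - max(neg)


# Illustrative rows only; the cosine values are invented for the example.
per_query = [
    {"q": "covered a", "top1_vec_cosine": 0.71},
    {"q": "covered b", "top1_vec_cosine": 0.64},
]
negatives = [{"q": "off-corpus", "top1_vec_cosine": 0.22}]

gap = ood_gap(per_query, negatives)
print(f"gap = {gap:.2f}")
```

A threshold placed anywhere inside a positive gap would separate the two groups; with a lexical fake embedder the gap can be negative, in which case no threshold does.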
Comment on lines +65 to +75

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clarify that these caveats are fusion- and phase-dependent.

`top1_score` is only an RRF score under the default fusion; without reranking it can also be a CombSUM/CombMNZ fused score. Also, pure `--retrieval bm25` disables the query-time cosine probe, but `run_eval()` still embeds during ingest on cache misses, so the whole eval is not inherently embedder-free.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/README.md` around lines 65 - 75, Clarify the caveats for
`top1_vec_cosine` and `top1_score` in the eval README by making them explicitly
phase- and fusion-dependent. Update the `top1_score` description to mention that
it may be a reranker score when reranking runs, or a fused retrieval score such
as RRF/CombSUM/CombMNZ depending on configuration, rather than only RRF. Also
revise the `top1_vec_cosine` note to state that `--retrieval bm25` only disables
the query-time cosine probe, while `run_eval()` may still use embeddings during
ingest on cache misses, so the eval is not fully embedder-free.


## Usage

```bash
7 changes: 5 additions & 2 deletions evals/datasets/mvp/queries.yaml
@@ -11,8 +11,11 @@
# Stems: karpathy-gist, karpathy-software-2-0, karpathy-recipe
#
# Mix: 3 in-domain (Karpathy essays) + 3 negatives = 6.
-# Negatives are observational — a Phase B answer-level judge (not Phase A
-# retrieval-only) would be the right place to gate them.
# Negatives don't gate hit@k/MRR, but every report row (positive and
# negative) now carries an absolute relevance — top1_vec_cosine + top1_score
# (#249) — so OOD robustness is measurable: with a real embedder a covered
# query out-scores an off-corpus one. (Under hermetic FakeEmbeddings the
# cosine is lexical, so these particular negatives may not separate cleanly.)

queries:
# ---- In-domain (Karpathy essays) -----------------------------------------
43 changes: 43 additions & 0 deletions src/dikw_core/domains/info/search.py
@@ -915,6 +915,49 @@ async def _embed_query_text(self, q: str) -> list[float] | None:
        vectors = await self._embedder.embed([q], model=self._embedding_model)
        return vectors[0] if vectors else None

    async def top_vector_cosine(self, q: str) -> float | None:
        """Eval-only: the absolute cosine similarity of the **best text-vector
        match** in the corpus for ``q`` — i.e. ``1 - min(distance)`` over the
        text vector index — independent of fusion mode and reranking.

        This is NOT "the top-ranked hit's score": it is the single nearest chunk
        by vector distance, which a reranker may have demoted below rank 0. That
        is deliberate — it answers "is ANY chunk semantically close to this
        query?", the discriminating signal for OOD / ``expect_none`` (covered
        queries have a close match, off-corpus queries do not). The fused
        ``Hit.score`` cannot carry it: RRF is rank-based and CombSUM/CombMNZ
        min-max normalise each leg, so the top fused score is ~1.0 for *any*
        query. The eval runner surfaces this per query; ``search`` itself never
        calls it.

        Text leg only — the multimodal/asset leg (which ``search`` also fuses on
        a multimodal base) is intentionally out of scope for this text-OOD probe.

        Similarity is ``1 - distance``: both shipped storage adapters score
        ``vec_search`` by cosine distance unconditionally (sqlite pins a cosine
        vec table; postgres uses ``<=>``), regardless of the
        ``embedding_distance`` config field, so this holds for any base. Returns
        ``None`` for a blank query, when no text vector leg is wired, when the
        backend has no vector search / no active text version, or when the index
        is empty.
        """
        if not q.strip():
            return None
        if self._embedder is None or self._embedding_model is None:
            return None
        q_vec = await self._embed_query_text(q)
        if q_vec is None:  # pragma: no cover - defensive: embed() of a non-blank query never returns []
            return None
        try:
            hits = await self._storage.vec_search(
                q_vec, version_id=self._text_version_id, limit=1, layer=None
            )
        except NotSupported:
            # Mirror search()'s vec leg (graceful_notsupported): a backend with
            # no vector search / no active text version has no cosine to report.
            return None
        return None if not hits else 1.0 - hits[0].distance

    async def _embed_query_multimodal(self, q: str) -> list[float] | None:
        assert self._mm is not None
        vectors = await self._mm.embedder.embed(
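The docstring's claim that a fused score cannot carry absolute relevance can be illustrated with a small sketch. Everything here is an assumption made for illustration: `rrf` and `min_max` are textbook versions of reciprocal rank fusion and per-leg min-max normalisation (not the project's own fusion code), `k = 60` is simply the conventional RRF constant, and the cosines are invented.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal rank fusion: each leg contributes 1 / (k + rank)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one leg's scores to [0, 1]; the best hit always maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


# Invented raw cosines for a covered and an off-corpus query.
covered = {"a": 0.71, "b": 0.40, "c": 0.10}
off_corpus = {"a": 0.22, "b": 0.20, "c": 0.05}

for name, raw in (("covered", covered), ("off-corpus", off_corpus)):
    order = sorted(raw, key=raw.get, reverse=True)
    top_rrf = max(rrf([order, order]).values())   # same ranking in both legs
    top_norm = max(min_max(raw).values())
    print(name, round(top_rrf, 4), top_norm, max(raw.values()))
```

Both queries end up with the same top RRF score and the same normalised top score; only the raw maximum cosine differs, which is the quantity `top_vector_cosine` reports.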
89 changes: 81 additions & 8 deletions src/dikw_core/eval/runner.py
@@ -227,12 +227,25 @@ class PerQueryRow:
    ranked_docs: tuple[str, ...]
    ranked_chunks: tuple[str, ...]
    ranked_assets: tuple[str, ...]
    # Absolute relevance signals for OOD / expect_none calibration (#249). These
    # describe two DIFFERENT chunks once a reranker is active, by design:
    # ``top1_score`` is the top-RANKED hit's fused ``Hit.score`` (the cross-encoder
    # rerank score when the rerank leg ran; otherwise the rank-based fusion score,
    # which is ~constant across queries and so non-discriminating on its own).
    # ``top1_vec_cosine`` is the best vector match anywhere in the corpus (max
    # cosine = ``1 - min distance``), fusion/rerank-independent — the signal that
    # actually discriminates covered vs off-corpus queries ("is any chunk close?").
    # ``None`` when there is no text vector leg (e.g. a pure ``bm25`` run).
    top1_score: float | None = None
    top1_vec_cosine: float | None = None

    def to_dict(self) -> dict[str, Any]:
        d: dict[str, Any] = {
            "q": self.q,
            "expect_any": list(self.expect_doc_any),
            "ranked": list(self.ranked_docs),
            "top1_score": self.top1_score,
            "top1_vec_cosine": self.top1_vec_cosine,
        }
        if self.q_id is not None:
            d["id"] = self.q_id
@@ -247,17 +260,27 @@ def to_dict(self) -> dict[str, Any]:

@dataclass(frozen=True)
class NegativeRow:
-    """One ``expect_none=True`` query's observed retrieval. Diagnostic only —
-    retrieval always returns something from a non-empty corpus, so there's no
-    pass/fail here yet; scoring negatives requires a score threshold or an
-    answer-level judge that we don't have in Phase A.
    """One ``expect_none=True`` query's observed retrieval. Retrieval always
    returns something from a non-empty corpus, so rank order alone carries no
    pass/fail. The absolute relevance of the top hit does, though: ``top1_score``
    (top fused ``Hit.score`` — the rerank score when the rerank leg ran) and
    ``top1_vec_cosine`` (raw top-1 vector cosine, fusion/rerank-independent) make
    OOD robustness measurable (a healthy engine scores an off-corpus query low).
    A score-based ``negative_satisfaction`` metric can consume these later (#249).
    """

    q: str
    ranked: tuple[str, ...]  # doc stems, top SEARCH_LIMIT
    top1_score: float | None = None
    top1_vec_cosine: float | None = None

    def to_dict(self) -> dict[str, Any]:
-        return {"q": self.q, "ranked": list(self.ranked)}
        return {
            "q": self.q,
            "ranked": list(self.ranked),
            "top1_score": self.top1_score,
            "top1_vec_cosine": self.top1_vec_cosine,
        }


class EvalReport(BaseModel):
@@ -347,15 +370,37 @@ def diagnostics_table(self) -> str:
verdict = "—" if thr is None else ("pass" if v >= thr else "FAIL")
thr_str = f"{thr:.3f}" if thr is not None else " -"
lines.append(f"{key:18s} {v:6.3f} {thr_str:9s} {verdict}")
        def _fmt(v: object) -> str:
            return f"{v:.3f}" if isinstance(v, int | float) else "—"

        def _short(q: str) -> str:
            return q if len(q) <= 60 else q[:57] + "..."

        lines.append("")
        lines.append("per-query top-5 (doc view):")
        for row in self.per_query:
-            q_short = row["q"] if len(row["q"]) <= 60 else row["q"][:57] + "..."
            top5 = row["ranked"][:5]
            mark = "✓" if any(e in top5 for e in row["expect_any"]) else "✗"
-            lines.append(f" {mark} {q_short}")
            lines.append(f" {mark} {_short(row['q'])}")
            lines.append(f" expected: {row['expect_any']}")
            lines.append(f" top-5: {top5}")
            lines.append(
                f" score: top1={_fmt(row.get('top1_score'))} "
                f"vec_cos={_fmt(row.get('top1_vec_cosine'))}"
            )
        # Negatives carry no rank verdict (retrieval never abstains), so the
        # absolute top vector cosine IS the signal: a healthy engine has no chunk
        # close to an off-corpus query. (``top1`` is the rerank score when a
        # reranker ran — meaningful then, ~constant otherwise.) Surface both so a
        # human can eyeball OOD separation (#249).
        if self.negative_diagnostics:
            lines.append("")
            lines.append("expect_none / OOD negatives (lower vec_cos = better):")
            for row in self.negative_diagnostics:
                lines.append(
                    f" · {_short(row['q'])} vec_cos={_fmt(row.get('top1_vec_cosine'))} "
                    f"top1={_fmt(row.get('top1_score'))}"
                )
        return "\n".join(lines)


@@ -993,13 +1038,34 @@ async def _run_queries(
RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]]
] = {}
    total_q = len(spec.queries)
    # The top vector cosine is fusion/mode-independent (it reads the raw vec
    # leg, not the fused result), so compute it once per distinct query text
    # and reuse across modes — one extra query-embed per distinct query for
    # the whole run, not per mode. It is the absolute OOD / expect_none signal
    # the fused ``Hit.score`` cannot carry (RRF is rank-based; CombSUM/CombMNZ
    # normalise per leg) (#249). Gate it on a vector-using mode so a PURE
    # ``bm25`` ablation stays embedder-free (its whole point is FTS-only,
    # comparable to published BM25 baselines, and it must not crash if a real
    # embedder is unreachable). When the only mode is ``bm25``, every row's
    # ``top1_vec_cosine`` is ``None``.
    vec_probe_active = any(m in ("vector", "hybrid") for m in modes)
    vec_cosine_by_q: dict[str, float | None] = {}
    for m in modes:
        positives: list[PerQueryRow] = []
        negatives: list[NegativeRow] = []
        for q_idx, q in enumerate(spec.queries, start=1):
            if reporter is not None:
                reporter.cancel_token().raise_if_cancelled()
            hits = await searcher.search(q.q, limit=SEARCH_LIMIT, mode=m)
            if vec_probe_active and q.q not in vec_cosine_by_q:
                vec_cosine_by_q[q.q] = await searcher.top_vector_cosine(q.q)
            # ``top1_score`` is the top-RANKED hit's fused Hit.score (the
            # cross-encoder rerank score when the rerank leg ran);
            # ``top1_vec_cosine`` is the best vector match in the corpus
            # (max cosine) — a different chunk than rank-0 once a reranker
            # reorders, by design (it answers "is any chunk close?").
            top1_score = hits[0].score if hits else None
            top1_vec_cosine = vec_cosine_by_q.get(q.q)
            ranked_docs = _project_doc_view(hits)
            ranked_chunks = _project_chunk_view(
                hits, chunk_runtime_to_named
@@ -1009,7 +1075,12 @@ async def _run_queries(
            )
            if q.expect_none:
                negatives.append(
-                    NegativeRow(q=q.q, ranked=tuple(ranked_docs))
                    NegativeRow(
                        q=q.q,
                        ranked=tuple(ranked_docs),
                        top1_score=top1_score,
                        top1_vec_cosine=top1_vec_cosine,
                    )
                )
            else:
                positives.append(
@@ -1022,6 +1093,8 @@ async def _run_queries(
                        ranked_docs=tuple(ranked_docs),
                        ranked_chunks=tuple(ranked_chunks),
                        ranked_assets=tuple(ranked_assets),
                        top1_score=top1_score,
                        top1_vec_cosine=top1_vec_cosine,
                    )
                )
            if reporter is not None:
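The probe-once-per-query gating in `_run_queries` can be isolated into a small runnable sketch. `StubSearcher` and `probe_cosines` are hypothetical names invented for this illustration; only the gating condition and the per-query cache mirror the hunk above.

```python
import asyncio


class StubSearcher:
    """Hypothetical stand-in for the real searcher; it only counts probe calls."""

    def __init__(self) -> None:
        self.probe_calls = 0

    async def top_vector_cosine(self, q: str) -> float:
        self.probe_calls += 1
        return 0.5  # placeholder value, not a real cosine


async def probe_cosines(
    searcher: StubSearcher, queries: list[str], modes: list[str]
) -> dict[str, float]:
    # Probe only if at least one requested mode uses the vector leg.
    vec_probe_active = any(m in ("vector", "hybrid") for m in modes)
    cache: dict[str, float] = {}
    for _mode in modes:
        for q in queries:
            # One probe per distinct query text, reused across all modes.
            if vec_probe_active and q not in cache:
                cache[q] = await searcher.top_vector_cosine(q)
    return cache


searcher = StubSearcher()
cache = asyncio.run(probe_cosines(searcher, ["foo", "bar", "foo"], ["bm25", "hybrid"]))
print(searcher.probe_calls, sorted(cache))  # 2 ['bar', 'foo']
```

Note that the gate is per run rather than per mode: in a mixed `bm25` plus `hybrid` run the cache is populated, so `bm25` rows would also read a cosine from it; only a run whose sole mode is `bm25` leaves every cosine as `None`.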
76 changes: 76 additions & 0 deletions tests/test_eval_runner.py
@@ -315,6 +315,82 @@ async def test_run_eval_exposes_negative_diagnostics(tmp_path: Path) -> None:
assert isinstance(neg["ranked"], list)


@pytest.mark.asyncio
async def test_eval_rows_surface_absolute_relevance_scores(tmp_path: Path) -> None:
    """#249: per-query AND negative rows carry an absolute relevance score —
    ``top1_score`` (the top hit's ``Hit.score``) and ``top1_vec_cosine`` (the
    raw top-1 vector cosine) — so ``expect_none`` / OOD robustness becomes
    measurable. The key signal: a covered query's ``top1_vec_cosine`` exceeds an
    off-corpus query's, which the fused ``Hit.score`` (RRF rank-based) cannot
    show on its own.
    """
    ds = _write_dataset(tmp_path, queries=[("foo and bar topics", ["alpha"])])
    (ds / "queries.yaml").write_text(
        "queries:\n"
        "  - q: foo and bar topics\n"
        "    expect_any: [alpha]\n"
        "  - q: totally unrelated quantum gibberish\n"
        "    expect_none: true\n",
        encoding="utf-8",
    )
    spec = load_dataset(ds)
    report = await run_eval(spec)

    assert len(report.per_query) == 1
    pos = report.per_query[0]
    assert isinstance(pos["top1_score"], float)
    assert isinstance(pos["top1_vec_cosine"], float)

    assert len(report.negative_diagnostics) == 1
    neg = report.negative_diagnostics[0]
    assert isinstance(neg["top1_score"], float)
    assert isinstance(neg["top1_vec_cosine"], float)

    # OOD separation: the covered query scores a higher absolute vector cosine
    # than the off-corpus negative — the whole point of surfacing the score.
    assert pos["top1_vec_cosine"] > neg["top1_vec_cosine"]

    # The human-readable breakdown renders both scores per positive and a
    # dedicated negatives section (so OOD separation is eyeball-able).
    table = report.diagnostics_table()
    assert "vec_cos=" in table
    assert "expect_none / OOD negatives" in table
    assert "totally unrelated quantum gibberish"[:57] in table


@pytest.mark.asyncio
async def test_eval_bm25_mode_skips_vector_cosine_probe(
    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
) -> None:
    """#249 / bm25 purity: a pure ``--retrieval bm25`` eval must stay
    embedder-free — `search(mode="bm25")` makes no embed call, and the OOD cosine
    probe is gated off too, so it issues zero extra embeds and a real-embedder
    bm25 ablation can't crash on an unreachable embedder. Rows get
    ``top1_vec_cosine=None`` while ``top1_score`` (the fused bm25 score) still
    surfaces.
Comment on lines +365 to +370

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Narrow the bm25 “embedder-free” claim to the query path.

`run_eval()` still calls `api.ingest(..., embedder=effective_embedder)` before `_run_queries`, so `mode="bm25"` only skips query-time vector probing here. This test proves “no `top_vector_cosine` calls,” not “a bm25 eval can’t fail on an unreachable embedder.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval_runner.py` around lines 365 - 370, Clarify the bm25 test in
run_eval() so it only asserts query-time embedder-free behavior, not full eval
embedder independence. Update the test around _run_queries and api.ingest(...,
embedder=effective_embedder) to reflect that mode="bm25" skips vector probing
and OOD cosine checks, but ingestion still uses the embedder. Keep the assertion
focused on zero top_vector_cosine/top1_vec_cosine-related calls and remove
wording that implies a bm25 eval cannot fail on an unreachable embedder.

"""
    from dikw_core.eval import runner as runner_mod

    ds = _write_dataset(tmp_path, queries=[("foo and bar topics", ["alpha"])])
    spec = load_dataset(ds)

    calls = 0
    real_probe = runner_mod.HybridSearcher.top_vector_cosine

    async def spy_probe(self, q):  # type: ignore[no-untyped-def]
        nonlocal calls
        calls += 1
        return await real_probe(self, q)

    monkeypatch.setattr(runner_mod.HybridSearcher, "top_vector_cosine", spy_probe)

    report = await run_eval(spec, mode="bm25")
    assert calls == 0, "pure bm25 eval must not invoke the vector-cosine probe"
    assert report.per_query
    assert all(r["top1_vec_cosine"] is None for r in report.per_query)
    assert all(isinstance(r["top1_score"], float) for r in report.per_query)


@pytest.mark.asyncio
async def test_run_eval_all_negative_dataset_metrics_empty(tmp_path: Path) -> None:
"""A dataset of only negatives produces no hit@k/MRR values — nothing