Merged
19 changes: 13 additions & 6 deletions docs/eval-plan.md
@@ -242,12 +242,19 @@ or non-destructiveness. Two gates:
embedder**: the hermetic `FakeEmbeddings` is lexical bag-of-words and can't
exercise the semantically-close-but-wrong failure mode rerank targets, so a
fake-embedder run is non-informative for this change. scifact (large recall
-pool) is the strongest signal; mvp is the dogfood sanity check. **Run each
-arm with its own snapshot** (`cache_root`, or `cache_mode="off"`): the eval
-cache key is `(corpus, embedder)` only and `_run_queries` reloads retrieval
-config from the cached base's `dikw.yml`, so a shared `cache_root` makes the
-second arm silently reuse the first arm's `rerank_enabled` — same reason RRF
-weight A/B uses the offline re-fusion tools, not the `run_eval` cache.
+pool) is the strongest signal; mvp is the dogfood sanity check. **Query-time
+knobs are read live, so a shared `cache_root` is safe for these arms** — the
+snapshot cache key covers only ingest-time inputs (corpus, embedding
+model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal
+identity), while everything applied at query time — the search-time
+`RetrievalConfig` knobs (`rrf_k`, weights, `fusion`, `graph_*`,
+`rerank_enabled`, `rerank_candidate_k`) **and** the rerank provider wiring
+(`provider.rerank_model` / base_url) — is read from the caller's live config
+in `_run_queries`, not the baked snapshot. So flipping `rerank_enabled`, the
+rerank *model*, or any fusion knob between arms under the default
+`cache_mode="read_write"` is both fast (no re-embed) and correct. Only an
+**ingest-time** change (the embedding model/dim, or `cjk_tokenizer`) forces a
+fresh snapshot — and the key handles that automatically.
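A minimal sketch of the split described in this paragraph, using a toy in-memory cache rather than the real `run_eval` machinery. `IngestIdentity`, `QueryConfig`, `ToySnapshotCache`, and `run_arm` are hypothetical names introduced for illustration, and the tokenizer values and the `rrf_k` default are made up; only the field names come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IngestIdentity:
    """Inputs that shape the on-disk snapshot (hypothetical container)."""
    corpus_digest: str
    embedding_model: str
    embedding_dim: int
    cjk_tokenizer: str


@dataclass(frozen=True)
class QueryConfig:
    """Knobs applied at query time only (hypothetical container)."""
    rerank_enabled: bool = False
    rrf_k: int = 60  # illustrative default, not taken from the project


@dataclass
class ToySnapshotCache:
    """Toy stand-in for the snapshot cache; counts how often it rebuilds."""
    snapshots: dict = field(default_factory=dict)
    builds: int = 0

    def get_or_build(self, identity: IngestIdentity) -> str:
        if identity not in self.snapshots:
            self.builds += 1  # stands in for the expensive ingest + embed step
            self.snapshots[identity] = f"snapshot-{self.builds}"
        return self.snapshots[identity]


def run_arm(cache: ToySnapshotCache, identity: IngestIdentity, query_cfg: QueryConfig):
    """Return (snapshot used, query config applied) for one eval arm."""
    snapshot = cache.get_or_build(identity)
    # The query config comes from the caller, never from the snapshot.
    return snapshot, query_cfg


cache = ToySnapshotCache()
base = IngestIdentity("abc12345", "toy-embedder", 8, "tok-a")

arm_off = run_arm(cache, base, QueryConfig(rerank_enabled=False))
arm_on = run_arm(cache, base, QueryConfig(rerank_enabled=True))
builds_after_rerank_ab = cache.builds

# An ingest-time change yields a different identity, hence a new snapshot.
other = IngestIdentity("abc12345", "toy-embedder", 8, "tok-b")
arm_cjk = run_arm(cache, other, QueryConfig())
```

In this toy, the two rerank arms share one snapshot while each keeps its own query-time setting, and only the tokenizer change triggers a second build.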

The Stage A K-layer fan-out + atomicity-lint baseline (2026-05-08), the
wikilink graph leg ablation (2026-05-08), and the rerank-stage SciFact
35 changes: 35 additions & 0 deletions evals/BASELINES.md
@@ -7,6 +7,41 @@ regression from a re-run variance.
Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml`
are calibrated ~2-3 % below the most recent canonical-mode run.

+## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250)
+
+**Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now
+includes the one **ingest-time** `RetrievalConfig` field, `cjk_tokenizer` (baked
+into the FTS index), and `_run_queries` reads every **query-time** knob from the
+caller's **live** config instead of the baked snapshot's `dikw.yml`: the
+search-time `RetrievalConfig` fields (`rrf_k`, `bm25_weight`, `vector_weight`,
+`fusion`, `same_doc_penalty_alpha`, `graph_*`, `rerank_enabled`,
+`rerank_candidate_k`) **and** the rerank provider wiring (`provider.rerank_model`
+/ base_url) — the reranker scores `(query, chunk)` at query time and never
+touches the stored index, so it is read live, not keyed. Eval-infra only —
+`domains/info/search.py` and the retrieval algorithm are untouched, so this is
+`no-baseline-needed` (a new benchmark row would measure nothing new).

+**Consequence for the older entries below:** the "one snapshot per arm" /
+"DROP the snapshot between arms" footgun they describe (e.g. the 2026-06-25 rerank
+entry) **no longer applies to any query-time change**. Flipping `rerank_enabled`,
+the rerank *model*, `rrf_k`, weights, `fusion`, or `graph_*` between arms under
+the default `cache_mode="read_write"` now takes effect correctly and reuses the
+embeddings (fast). Only an **ingest-time** change (the embedding model/dim, or
+`cjk_tokenizer`) forces a fresh snapshot — and the key does that automatically.
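The rule in this entry can be sketched as a comparison of two arm configs. This is a hedged illustration only: the two sets and both helpers are hypothetical, the configs are flattened into plain dicts for brevity, and the values (`"rrf"`, `"tok-a"`, `"tok-b"`) are made up. The field names come from the entry; `graph_*` is left out because the entry does not spell those names out.

```python
# Hypothetical classification, mirroring the field names listed in the entry.
INGEST_TIME_FIELDS = {"embedding_model", "embedding_dim", "cjk_tokenizer"}
QUERY_TIME_FIELDS = {
    "rrf_k", "bm25_weight", "vector_weight", "fusion",
    "same_doc_penalty_alpha", "rerank_enabled", "rerank_candidate_k",
    "rerank_model",
}


def changed_fields(old: dict, new: dict) -> set:
    """Names whose values differ between two flat config dicts."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}


def needs_fresh_snapshot(old: dict, new: dict) -> bool:
    """True only when at least one ingest-time field changed."""
    return bool(changed_fields(old, new) & INGEST_TIME_FIELDS)


def query_time_changes(old: dict, new: dict) -> set:
    """Changed fields that should be picked up live on a cache hit."""
    return changed_fields(old, new) & QUERY_TIME_FIELDS


arm_a = {"embedding_model": "toy", "cjk_tokenizer": "tok-a",
         "fusion": "rrf", "rerank_enabled": False}
arm_b = {**arm_a, "fusion": "combsum", "rerank_enabled": True}
arm_c = {**arm_a, "cjk_tokenizer": "tok-b"}
```

Under these assumptions, `arm_a` versus `arm_b` can share a snapshot, while `arm_a` versus `arm_c` cannot.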

+**Verification:** the cache-**hit** regression this PR fixes is covered by the
+unit tests `tests/test_eval_runner.py::test_eval_cache_search_time_config_read_live_on_hit`
+(asserts the warm run builds the searcher from the live `rrf_k`, not the baked
+one — goes red on pre-fix code, which threaded `cfg.retrieval`) and
+`::test_eval_cache_cjk_tokenizer_change_forces_reingest`; a tripwire test
+(`::test_retrieval_config_fields_are_classified_for_eval_cache`) fails if a new
+`RetrievalConfig` field is added unclassified. A hermetic `mvp` run
+(`FakeEmbeddings`, `cache_mode="off"`) is only a flow sanity check — it
+fresh-builds a base each run, so it does NOT exercise the cache-hit path and
+would pass on pre-fix code too; it merely confirms the live config is honored at
+all (default config vs `fusion="combsum"` → `mrr` 1.0 vs 0.833, `ndcg@10` 1.0 vs
+0.877) and that the default-config numbers are unchanged (algorithm untouched).
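One way a tripwire of the kind mentioned above could be written, as a sketch rather than the project's actual test. It assumes the config is a dataclass (the real `RetrievalConfig` may be built differently); `ToyRetrievalConfig`, `ToyRetrievalConfigV2`, `new_knob`, and the two sets are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyRetrievalConfig:
    """Hypothetical stand-in for RetrievalConfig with a few fields."""
    rrf_k: int = 60
    fusion: str = "rrf"
    rerank_enabled: bool = False
    cjk_tokenizer: str = "tok-a"


# Every config field must appear in exactly one of these sets.
INGEST_TIME = {"cjk_tokenizer"}
SEARCH_TIME = {"rrf_k", "fusion", "rerank_enabled"}


def unclassified_fields(config_cls) -> set:
    """Fields declared on the config but missing from both sets."""
    declared = {f.name for f in fields(config_cls)}
    return declared - INGEST_TIME - SEARCH_TIME


def check_all_fields_classified(config_cls) -> None:
    """Raise if someone added a field without deciding how the cache treats it."""
    missing = unclassified_fields(config_cls)
    if missing:
        raise ValueError(f"classify for the eval cache: {sorted(missing)}")


@dataclass
class ToyRetrievalConfigV2(ToyRetrievalConfig):
    """A later revision that adds a knob without classifying it."""
    new_knob: float = 0.5
```

The check passes for `ToyRetrievalConfig` and raises for `ToyRetrievalConfigV2`, which forces an explicit ingest-time or search-time decision for each new field.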

## 2026-06-25 — rerank stage: post-fusion cross-encoder (opt-in), SciFact off-vs-on

**Change under test:** the new optional rerank stage in `HybridSearcher.search`
7 changes: 6 additions & 1 deletion evals/README.md
@@ -167,7 +167,12 @@ embedder for benchmark-scale work).

Tune the fusion weights by editing the `retrieval:` block in your
base's `dikw.yml`, then re-run `dikw client eval --dataset <name>
---retrieval all` to compare. Pin the winning combination:
+--retrieval all` to compare. These are **search-time** knobs, read live
+each run, so re-running under the default `--cache read_write` is safe and
+fast (the snapshot is reused, only the ranking is recomputed). The one
+exception is `cjk_tokenizer`: it is **ingest-time** (baked into the FTS
+index), so it is part of the snapshot cache key — changing it
+automatically forces a fresh snapshot. Pin the winning combination:

```yaml
retrieval:
100 changes: 79 additions & 21 deletions src/dikw_core/eval/runner.py
@@ -152,15 +152,17 @@ def _corpus_cache_key(
model: str,
dim: int | None,
*,
+cjk_tokenizer: str,
mm_fingerprint: str | None = None,
) -> str:
-"""Stable cache key combining dataset name, model, dim, corpus hash, schema.
+"""Stable cache key combining dataset name, model, dim, corpus hash,
+tokenizer, schema.

Algorithm:
1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir
2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable;
``cache_mode="rebuild"`` is the escape hatch)
-3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}``
+3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}``
Comment on lines +158 to +165

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Key snapshots by the full embedding version identity.

The cache key still only includes embedding_model and embedding_dim, but ProviderConfig treats embedding_revision, embedding_normalize, and embedding_distance as part of the stored vector version identity. Under read_write, changing one of those fields can reuse an old snapshot while querying with the live embedder/config, mixing fresh query vectors with stale indexed vectors.

Suggested direction
 def _corpus_cache_key(
     spec: DatasetSpec,
     model: str,
     dim: int | None,
     *,
+    embedding_revision: str,
+    embedding_normalize: bool,
+    embedding_distance: str,
     cjk_tokenizer: str,
     mm_fingerprint: str | None = None,
 ) -> str:
 key = _corpus_cache_key(
     spec,
     effective_provider_cfg.embedding_model,
     effective_provider_cfg.embedding_dim,
+    embedding_revision=effective_provider_cfg.embedding_revision,
+    embedding_normalize=effective_provider_cfg.embedding_normalize,
+    embedding_distance=effective_provider_cfg.embedding_distance,
     cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer,
     mm_fingerprint=mm_fingerprint,
 )

Also applies to: 203-204, 518-523

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dikw_core/eval/runner.py` around lines 158 - 165, The cache key logic in
runner.py is missing parts of the embedding version identity, so snapshots can
be reused across incompatible provider settings. Update the key construction in
the cache-key helper used by the eval runner to include all fields that
ProviderConfig treats as identity, especially embedding_revision,
embedding_normalize, and embedding_distance, alongside embedding_model and
embedding_dim. Make the same change wherever the key format is documented or
assembled so the cache snapshot cannot mix stale indexed vectors with live query
vectors.
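The review suggestion above threads three extra parameters into the key helper. An alternative, sketched here purely as an illustration and not as what the PR or the reviewer proposes, is to fold the whole embedding identity into one short digest that the key can embed. The function name, the JSON payload shape, and the example values are assumptions; the field names come from the review comment.

```python
import hashlib
import json
from typing import Optional


def embedding_identity_fingerprint(
    model: str,
    dim: Optional[int],
    revision: str,
    normalize: bool,
    distance: str,
) -> str:
    """Short, stable digest over every field that defines vector identity."""
    payload = json.dumps(
        {
            "model": model,
            "dim": dim or 0,  # mirror the key's "None renders as 0" convention
            "revision": revision,
            "normalize": normalize,
            "distance": distance,
        },
        sort_keys=True,  # key order must not affect the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


fp_r1 = embedding_identity_fingerprint("toy-model", 8, "r1", True, "cosine")
fp_r2 = embedding_identity_fingerprint("toy-model", 8, "r2", True, "cosine")
fp_r1_again = embedding_identity_fingerprint("toy-model", 8, "r1", True, "cosine")
```

A single digest keeps the key format short, at the cost of making the key less readable than explicit `__rev…` style segments would be.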


``as_posix()`` keeps the key cross-platform (Windows / Linux yield
the same digest). Embedding ``dim=None`` is rendered as ``0`` so the
@@ -171,11 +173,14 @@ def _corpus_cache_key(
index can't be silently reused by a real-vector multimodal eval.
``None`` is rendered as ``0`` for back-compat with pre-mm caches.

-NOTE: ``RetrievalConfig`` (rrf_k, weights, fusion, graph_*, …) is
-NOT in the key. Re-running with a different retrieval config under
-``cache_mode="read_write"`` silently reuses the first run's
-``dikw.yml`` — the snapshot's base carries the old block. Retrieval
-ablations must use ``cache_mode="off"`` until this is fixed.
+``cjk_tokenizer`` is the one **ingest-time** ``RetrievalConfig`` field —
+it's baked into the FTS index (and drives the chunker's token budget) at
+ingest, so it MUST be in the key: changing it forces a fresh snapshot.
+Every other ``RetrievalConfig`` knob (rrf_k, weights, fusion, graph_*,
+rerank_*) is **search-time** and is read live in ``_run_queries`` from
+the caller's config — never from the baked snapshot — so a retrieval
+ablation under ``cache_mode="read_write"`` is both fast (no re-embed)
+and correct.
"""
h = hashlib.sha256()
for path in sorted(spec.corpus_dir.rglob("*")):
@@ -195,7 +200,8 @@ def _corpus_cache_key(
# distinct from the legacy ``mig`` prefix so caches stamped before
# the per-migration-counter framework was deleted never match.
return (
-f"{spec.name}/{model}__{dim_str}__{digest}__mm{mm_str}__sf{SCHEMA_VERSION}"
+f"{spec.name}/{model}__{dim_str}__{digest}"
+f"__mm{mm_str}__cjk{cjk_tokenizer}__sf{SCHEMA_VERSION}"
)
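For readers who want to try the key scheme in isolation, here is a self-contained sketch that follows the docstring's three steps. It is not the project's function: `corpus_digest` and `toy_cache_key` are hypothetical, `SCHEMA_VERSION = 1` is a placeholder, and the exact bytes fed to the hash (path, then file contents, with no separator) are an assumption, since the diff does not show the loop body.

```python
import hashlib
import tempfile
from pathlib import Path

SCHEMA_VERSION = 1  # placeholder; the real constant lives in the project


def corpus_digest(corpus_dir: Path) -> str:
    """sha256 over sorted (relative POSIX path, file bytes) pairs, first 8 hex chars."""
    h = hashlib.sha256()
    for path in sorted(corpus_dir.rglob("*")):
        if not path.is_file():
            continue
        # as_posix() keeps the hashed path identical across Windows and Linux.
        h.update(path.relative_to(corpus_dir).as_posix().encode("utf-8"))
        h.update(path.read_bytes())
    return h.hexdigest()[:8]


def toy_cache_key(name, model, dim, digest, cjk_tokenizer, mm_fingerprint=None):
    """Assemble the key in the format documented in the docstring."""
    dim_str = "0" if dim is None else str(dim)
    mm_str = "0" if mm_fingerprint is None else mm_fingerprint
    return (
        f"{name}/{model}__{dim_str}__{digest}"
        f"__mm{mm_str}__cjk{cjk_tokenizer}__sf{SCHEMA_VERSION}"
    )


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / "a.md").write_text("alpha", encoding="utf-8")
    (corpus / "sub").mkdir()
    (corpus / "sub" / "b.md").write_text("beta", encoding="utf-8")

    digest_1 = corpus_digest(corpus)
    digest_2 = corpus_digest(corpus)  # unchanged corpus, same digest

    (corpus / "a.md").write_text("alpha, edited", encoding="utf-8")
    digest_3 = corpus_digest(corpus)  # a content change alters the digest

key_a = toy_cache_key("mvp", "toy-model", None, digest_1, "tok-a")
key_b = toy_cache_key("mvp", "toy-model", None, digest_1, "tok-b")
```

Because the tokenizer is a segment of the key, `key_a` and `key_b` resolve to different cache directories even though the corpus digest is the same.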


@@ -502,6 +508,8 @@ async def _build(target: Path) -> Path:
embedder=effective_embedder,
embedding_model=effective_provider_cfg.embedding_model,
modes=modes,
+retrieval_config=effective_retrieval_cfg,
+provider_config=effective_provider_cfg,
reporter=_reporter,
)
else:
@@ -511,6 +519,7 @@ async def _build(target: Path) -> Path:
spec,
effective_provider_cfg.embedding_model,
effective_provider_cfg.embedding_dim,
+cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer,
mm_fingerprint=mm_fingerprint,
)
cache_dir = root / key
@@ -549,6 +558,8 @@ async def _build(target: Path) -> Path:
embedder=effective_embedder,
embedding_model=effective_provider_cfg.embedding_model,
modes=modes,
+retrieval_config=effective_retrieval_cfg,
+provider_config=effective_provider_cfg,
reporter=_reporter,
)

@@ -615,11 +626,19 @@ def _materialise_base(
"""Scaffold a throwaway base + dikw.yml that matches ``spec``.

``provider_cfg`` / ``retrieval_cfg`` / ``assets_cfg`` are copied
-verbatim into the written ``dikw.yml``. Downstream ``api.ingest``
-reads the provider + assets blocks; ``_run_queries`` re-loads the
-whole file to build ``HybridSearcher``, which picks up the retrieval
-+ multimodal blocks. This means eval reproducibly measures whatever
-fusion knobs the caller passed, including the asset-vector leg.
+verbatim into the written ``dikw.yml``. Downstream ``api.ingest`` reads
+the provider + assets blocks plus the ingest-time
+``retrieval.cjk_tokenizer`` (which is baked into the FTS index). The
+query-time knobs (the search-time ``RetrievalConfig`` fields rrf_k,
+weights, fusion, graph_*, rerank_*, AND the rerank provider wiring) are
+NOT read back from this file at query time — ``_run_queries`` receives
+the caller's live ``RetrievalConfig`` + ``ProviderConfig`` directly
+(#250). On a **cache hit** this file is the one the FIRST (snapshot-
+building) arm wrote, so its query-time blocks are STALE relative to a
+later arm's live config: trust it only for the ingest-time identity
+(provider model+dim, ``cjk_tokenizer``, ``multimodal`` — the last is
+still reloaded for the asset-vector leg), not to reproduce which fusion /
+rerank knobs produced a given arm's numbers.

``schema_cfg`` is optional and only used by ``run_synth_eval`` so a
synth-mode dataset can pin its own ``categories`` taxonomy into the
@@ -861,14 +880,46 @@ async def _run_queries(
embedder: EmbeddingProvider,
embedding_model: str,
modes: list[RetrievalMode],
+retrieval_config: RetrievalConfig,
+provider_config: ProviderConfig,
reporter: ProgressReporter | None = None,
) -> dict[RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]]]:
"""Run every query in ``spec`` once per mode against a single storage
connection. Returns a dict keyed by mode.

+Everything that shaped the on-disk snapshot is keyed into the cache and
+read back from the baked ``cfg`` (corpus, embedding model+dim, the
+ingest-time ``retrieval.cjk_tokenizer``, multimodal identity). Everything
+that runs **at query time** is taken from the caller's **live** config so
+an ablation under ``cache_mode="read_write"`` takes effect on a cache hit
+instead of silently reusing the first run's config (#250):
+
+* ``retrieval_config`` — search-time fusion knobs (rrf_k, weights, fusion,
+graph_*, ``rerank_enabled``, ``rerank_candidate_k``, same_doc_penalty_alpha).
+* ``provider_config`` — the rerank **provider** wiring (``rerank_model`` /
+base_url / key). The reranker scores ``(query, chunk)`` at query time and
+never touches the stored index, so it is read live (not keyed) — a
+rerank-model A/B is correct on a cache hit. The text query embedder is
+likewise the caller's live ``embedder`` + ``embedding_model``; its
+identity is keyed so live == baked on a hit.
+
+The one ingest-time retrieval field, ``cjk_tokenizer``, is pinned into the
+snapshot cache key (``_corpus_cache_key``), so on a hit it is guaranteed to
+match the baked FTS index.
"""
cfg, _root = api.load_base(base)
+# The snapshot's ingest-time tokenizer is pinned into the cache key, so a
+# cache hit always matches the caller's live tokenizer. Assert it
+# (defence-in-depth against a future key-format regression) before reusing
+# the FTS index that was built with that tokenizer.
+if cfg.retrieval.cjk_tokenizer != retrieval_config.cjk_tokenizer:
+raise EvalError(
+f"snapshot cjk_tokenizer {cfg.retrieval.cjk_tokenizer!r} != "
+f"requested {retrieval_config.cjk_tokenizer!r}; the cache key should "
+f"have forced a fresh snapshot (this indicates a _corpus_cache_key bug)"
+)
storage = build_storage(
-cfg.storage, root=base, cjk_tokenizer=cfg.retrieval.cjk_tokenizer
+cfg.storage, root=base, cjk_tokenizer=retrieval_config.cjk_tokenizer
)
await storage.connect()
await storage.migrate()
@@ -913,19 +964,26 @@ async def _run_queries(
# happened to carry in ``Hit.asset_refs``.
mm_search = await _build_multimodal_search(cfg, storage)
# Wire the rerank leg the same way ``_retrieve_inner`` does so eval
-# measures rerank-on vs rerank-off at one config (flip
-# ``retrieval.rerank_enabled`` between runs). ``None`` when the base
-# configured no reranker or disabled it. Closed in the ``finally``.
-if cfg.retrieval.rerank_enabled:
-reranker = build_reranker(cfg.provider)
+# measures rerank-on vs rerank-off at one config. Both halves are read
+# from the caller's **live** config, not the baked snapshot, so flipping
+# either between runs takes effect even on a cache hit: ``rerank_enabled``
+# (the gate) from ``retrieval_config`` and the rerank provider wiring
+# (``rerank_model`` / base_url / key) from ``provider_config``. The
+# reranker scores ``(query, chunk)`` at query time and never touches the
+# stored index, so it is correctly read live rather than keyed into the
+# snapshot — a rerank-model A/B under ``read_write`` is correct.
+# ``None`` when the caller's config has no reranker or disabled it.
+# Closed in the ``finally``.
+if retrieval_config.rerank_enabled:
+reranker = build_reranker(provider_config)
searcher = HybridSearcher.from_config(
storage,
embedder,
-cfg.retrieval,
+retrieval_config,
embedding_model=embedding_model,
multimodal=mm_search,
reranker=reranker,
-rerank_model=cfg.provider.rerank_model,
+rerank_model=provider_config.rerank_model,
# Eval must fail loud on a transient query-embed blip, not silently
# degrade to FTS-only — a degraded query would bias the measured
# hybrid metric and could false-flag the ``--against`` gate.
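The behavioural difference this hunk makes on a cache hit can be shown with a toy that contrasts the two wiring paths. Everything here is hypothetical (`ToyRetrieval`, `ToyProvider`, `wire_searcher`, the `read_live` flag, and the model names); it only mirrors the idea that the gate and the reranker model now come from the caller's live config rather than from the baked snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRetrieval:
    """Hypothetical subset of the search-time knobs."""
    rrf_k: int
    rerank_enabled: bool


@dataclass(frozen=True)
class ToyProvider:
    """Hypothetical subset of the provider wiring."""
    rerank_model: str


def wire_searcher(baked_retrieval, baked_provider, live_retrieval, live_provider,
                  *, read_live: bool):
    """Return the settings a toy searcher would be built with.

    read_live=False mimics the pre-fix path (everything from the baked
    snapshot config); read_live=True mimics the post-fix path.
    """
    retrieval = live_retrieval if read_live else baked_retrieval
    provider = live_provider if read_live else baked_provider
    reranker = provider.rerank_model if retrieval.rerank_enabled else None
    return {"rrf_k": retrieval.rrf_k, "reranker": reranker}


# The first arm baked the snapshot with rerank off; the second arm turns it on.
baked_r, baked_p = ToyRetrieval(60, False), ToyProvider("model-a")
live_r, live_p = ToyRetrieval(20, True), ToyProvider("model-b")

pre_fix = wire_searcher(baked_r, baked_p, live_r, live_p, read_live=False)
post_fix = wire_searcher(baked_r, baked_p, live_r, live_p, read_live=True)
```

In the pre-fix path the second arm silently runs with the first arm's settings; in the post-fix path it runs with its own.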