OpenDIKW · helebest · Jun 28, 2026 · coderabbitai
diff --git a/docs/eval-plan.md b/docs/eval-plan.md
@@ -242,12 +242,19 @@ or non-destructiveness. Two gates:
    embedder**: the hermetic `FakeEmbeddings` is lexical bag-of-words and can't
    exercise the semantically-close-but-wrong failure mode rerank targets, so a
    fake-embedder run is non-informative for this change. scifact (large recall
-   pool) is the strongest signal; mvp is the dogfood sanity check. **Run each
-   arm with its own snapshot** (`cache_root`, or `cache_mode="off"`): the eval
-   cache key is `(corpus, embedder)` only and `_run_queries` reloads retrieval
-   config from the cached base's `dikw.yml`, so a shared `cache_root` makes the
-   second arm silently reuse the first arm's `rerank_enabled` — same reason RRF
-   weight A/B uses the offline re-fusion tools, not the `run_eval` cache.
+   pool) is the strongest signal; mvp is the dogfood sanity check. **Query-time
+   knobs are read live, so a shared `cache_root` is safe for these arms** — the
+   snapshot cache key covers only ingest-time inputs (corpus, embedding
+   model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal
+   identity), while everything applied at query time — the search-time
+   `RetrievalConfig` knobs (`rrf_k`, weights, `fusion`, `graph_*`,
+   `rerank_enabled`, `rerank_candidate_k`) **and** the rerank provider wiring
+   (`provider.rerank_model` / base_url) — is read from the caller's live config
+   in `_run_queries`, not the baked snapshot. So flipping `rerank_enabled`, the
+   rerank *model*, or any fusion knob between arms under the default
+   `cache_mode="read_write"` is both fast (no re-embed) and correct. Only an
+   **ingest-time** change (the embedding model/dim, or `cjk_tokenizer`) forces a
+   fresh snapshot — and the key handles that automatically.
 
 The Stage A K-layer fan-out + atomicity-lint baseline (2026-05-08), the
 wikilink graph leg ablation (2026-05-08), and the rerank-stage SciFact

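Review note: the two-level idea in the hunk above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the project's code: `RetrievalKnobs`, `SnapshotCache`, `snapshot_key`, and `run_arm` are hypothetical names, the tokenizer values `tok_a` / `tok_b` are placeholders, and the real key also covers the multimodal identity and schema version. It shows that only ingest-time inputs enter the key, while query-time knobs travel with the caller.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalKnobs:
    """Hypothetical stand-in for RetrievalConfig; only three fields shown."""

    cjk_tokenizer: str = "tok_a"   # ingest-time: baked into the FTS index
    rrf_k: int = 60                # query-time
    rerank_enabled: bool = False   # query-time


def snapshot_key(corpus_digest: str, model: str, dim: int, knobs: RetrievalKnobs) -> str:
    """Key only the inputs that shape the on-disk snapshot."""
    return f"{model}__{dim}__{corpus_digest}__cjk{knobs.cjk_tokenizer}"


class SnapshotCache:
    """Toy cache that counts how often the expensive ingest step runs."""

    def __init__(self) -> None:
        self.snapshots: dict[str, str] = {}
        self.builds = 0

    def get_or_build(self, key: str) -> str:
        if key not in self.snapshots:
            self.builds += 1  # stands in for chunking + embedding the corpus
            self.snapshots[key] = f"snapshot-{self.builds}"
        return self.snapshots[key]


def run_arm(cache: SnapshotCache, knobs: RetrievalKnobs) -> tuple[str, bool, int]:
    """Reuse (or build) the snapshot, then apply the caller's live knobs."""
    snapshot = cache.get_or_build(snapshot_key("abcd1234", "fake-embed", 8, knobs))
    # Query-time knobs come from the live `knobs`, never from the snapshot.
    return snapshot, knobs.rerank_enabled, knobs.rrf_k
```

In this model, running a rerank-off arm and then a rerank-on arm against one `SnapshotCache` builds a single snapshot, and each arm still sees its own `rerank_enabled`. Changing `cjk_tokenizer` produces a different key and therefore a second build.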
diff --git a/evals/BASELINES.md b/evals/BASELINES.md
@@ -7,6 +7,41 @@ regression from a re-run variance.
 Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml`
 are calibrated ~2-3 % below the most recent canonical-mode run.
 
+## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250)
+
+**Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now
+includes the one **ingest-time** `RetrievalConfig` field, `cjk_tokenizer` (baked
+into the FTS index), and `_run_queries` reads every **query-time** knob from the
+caller's **live** config instead of the baked snapshot's `dikw.yml`: the
+search-time `RetrievalConfig` fields (`rrf_k`, `bm25_weight`, `vector_weight`,
+`fusion`, `same_doc_penalty_alpha`, `graph_*`, `rerank_enabled`,
+`rerank_candidate_k`) **and** the rerank provider wiring (`provider.rerank_model`
+/ base_url) — the reranker scores `(query, chunk)` at query time and never
+touches the stored index, so it is read live, not keyed. Eval-infra only —
+`domains/info/search.py` and the retrieval algorithm are untouched, so this is
+`no-baseline-needed` (a new benchmark row would measure nothing new).
+
+**Consequence for the older entries below:** the "one snapshot per arm" /
+"DROP the snapshot between arms" footgun they describe (e.g. the 2026-06-25 rerank
+entry) **no longer applies to any query-time change**. Flipping `rerank_enabled`,
+the rerank *model*, `rrf_k`, weights, `fusion`, or `graph_*` between arms under
+the default `cache_mode="read_write"` now takes effect correctly and reuses the
+embeddings (fast). Only an **ingest-time** change (the embedding model/dim, or
+`cjk_tokenizer`) forces a fresh snapshot — and the key does that automatically.
+
+**Verification:** the cache-**hit** regression this PR fixes is covered by the
+unit tests `tests/test_eval_runner.py::test_eval_cache_search_time_config_read_live_on_hit`
+(asserts the warm run builds the searcher from the live `rrf_k`, not the baked
+one — goes red on pre-fix code, which threaded `cfg.retrieval`) and
+`::test_eval_cache_cjk_tokenizer_change_forces_reingest`; a tripwire test
+(`::test_retrieval_config_fields_are_classified_for_eval_cache`) fails if a new
+`RetrievalConfig` field is added unclassified. A hermetic `mvp` run
+(`FakeEmbeddings`, `cache_mode="off"`) is only a flow sanity check — it
+fresh-builds a base each run, so it does NOT exercise the cache-hit path and
+would pass on pre-fix code too; it merely confirms the live config is honored at
+all (default config vs `fusion="combsum"` → `mrr` 1.0 vs 0.833, `ndcg@10` 1.0 vs
+0.877) and that the default-config numbers are unchanged (algorithm untouched).
+
 ## 2026-06-25 — rerank stage: post-fusion cross-encoder (opt-in), SciFact off-vs-on
 
 **Change under test:** the new optional rerank stage in `HybridSearcher.search`

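Review note: the tripwire test named in the Verification paragraph above is not shown in this diff. One way such a guard could look is sketched here, assuming a dataclass-style config. `RetrievalConfigSketch`, `INGEST_TIME_FIELDS`, `QUERY_TIME_FIELDS`, and `unclassified_fields` are hypothetical names, and the field list is a small illustrative subset, so the real test may be structured differently.

```python
from dataclasses import dataclass, fields


@dataclass
class RetrievalConfigSketch:
    """Hypothetical subset of RetrievalConfig, for illustration only."""

    cjk_tokenizer: str = "tok_a"   # placeholder value
    rrf_k: int = 60                # placeholder default
    fusion: str = "combsum"
    rerank_enabled: bool = False


# Every field must appear in exactly one of these two sets.
INGEST_TIME_FIELDS = frozenset({"cjk_tokenizer"})
QUERY_TIME_FIELDS = frozenset({"rrf_k", "fusion", "rerank_enabled"})


def unclassified_fields(config_cls) -> set[str]:
    """Return config fields that nobody has classified yet."""
    declared = {f.name for f in fields(config_cls)}
    return declared - (INGEST_TIME_FIELDS | QUERY_TIME_FIELDS)


def test_fields_are_classified() -> None:
    # A field cannot be both keyed into the snapshot and read live.
    assert not (INGEST_TIME_FIELDS & QUERY_TIME_FIELDS)
    # Adding a field without classifying it makes this assertion fail.
    assert unclassified_fields(RetrievalConfigSketch) == set()
```

The point of this shape is that a new knob forces an explicit decision: either it changes the stored index and belongs in the cache key, or it is applied at query time and must be read from the live config.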
diff --git a/evals/README.md b/evals/README.md
@@ -167,7 +167,12 @@ embedder for benchmark-scale work).
 
 Tune the fusion weights by editing the `retrieval:` block in your
 base's `dikw.yml`, then re-run `dikw client eval --dataset <name>
---retrieval all` to compare. Pin the winning combination:
+--retrieval all` to compare. These are **search-time** knobs, read live
+each run, so re-running under the default `--cache read_write` is safe and
+fast (the snapshot is reused, only the ranking is recomputed). The one
+exception is `cjk_tokenizer`: it is **ingest-time** (baked into the FTS
+index), so it is part of the snapshot cache key — changing it
+automatically forces a fresh snapshot. Pin the winning combination:
 
 ```yaml
 retrieval:

diff --git a/src/dikw_core/eval/runner.py b/src/dikw_core/eval/runner.py
@@ -152,15 +152,17 @@ def _corpus_cache_key(
     model: str,
     dim: int | None,
     *,
+    cjk_tokenizer: str,
     mm_fingerprint: str | None = None,
 ) -> str:
-    """Stable cache key combining dataset name, model, dim, corpus hash, schema.
+    """Stable cache key combining dataset name, model, dim, corpus hash,
+    tokenizer, schema.
 
     Algorithm:
       1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir
       2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable;
          ``cache_mode="rebuild"`` is the escape hatch)
-      3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}``
+      3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}``
 
     ``as_posix()`` keeps the key cross-platform (Windows / Linux yield
     the same digest). Embedding ``dim=None`` is rendered as ``0`` so the
@@ -171,11 +173,14 @@ def _corpus_cache_key(
     index can't be silently reused by a real-vector multimodal eval.
     ``None`` is rendered as ``0`` for back-compat with pre-mm caches.
 
-    NOTE: ``RetrievalConfig`` (rrf_k, weights, fusion, graph_*, …) is
-    NOT in the key. Re-running with a different retrieval config under
-    ``cache_mode="read_write"`` silently reuses the first run's
-    ``dikw.yml`` — the snapshot's base carries the old block. Retrieval
-    ablations must use ``cache_mode="off"`` until this is fixed.
+    ``cjk_tokenizer`` is the one **ingest-time** ``RetrievalConfig`` field —
+    it's baked into the FTS index (and drives the chunker's token budget) at
+    ingest, so it MUST be in the key: changing it forces a fresh snapshot.
+    Every other ``RetrievalConfig`` knob (rrf_k, weights, fusion, graph_*,
+    rerank_*) is **search-time** and is read live in ``_run_queries`` from
+    the caller's config — never from the baked snapshot — so a retrieval
+    ablation under ``cache_mode="read_write"`` is both fast (no re-embed)
+    and correct.
     """
     h = hashlib.sha256()
     for path in sorted(spec.corpus_dir.rglob("*")):
@@ -195,7 +200,8 @@ def _corpus_cache_key(
     # distinct from the legacy ``mig`` prefix so caches stamped before
     # the per-migration-counter framework was deleted never match.
     return (
-        f"{spec.name}/{model}__{dim_str}__{digest}__mm{mm_str}__sf{SCHEMA_VERSION}"
+        f"{spec.name}/{model}__{dim_str}__{digest}"
+        f"__mm{mm_str}__cjk{cjk_tokenizer}__sf{SCHEMA_VERSION}"
     )
 
 
@@ -502,6 +508,8 @@ async def _build(target: Path) -> Path:
                 embedder=effective_embedder,
                 embedding_model=effective_provider_cfg.embedding_model,
                 modes=modes,
+                retrieval_config=effective_retrieval_cfg,
+                provider_config=effective_provider_cfg,
                 reporter=_reporter,
             )
     else:
@@ -511,6 +519,7 @@ async def _build(target: Path) -> Path:
             spec,
             effective_provider_cfg.embedding_model,
             effective_provider_cfg.embedding_dim,
+            cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer,
             mm_fingerprint=mm_fingerprint,
         )
         cache_dir = root / key
@@ -549,6 +558,8 @@ async def _build(target: Path) -> Path:
             embedder=effective_embedder,
             embedding_model=effective_provider_cfg.embedding_model,
             modes=modes,
+            retrieval_config=effective_retrieval_cfg,
+            provider_config=effective_provider_cfg,
             reporter=_reporter,
         )
 
@@ -615,11 +626,19 @@ def _materialise_base(
     """Scaffold a throwaway base + dikw.yml that matches ``spec``.
 
     ``provider_cfg`` / ``retrieval_cfg`` / ``assets_cfg`` are copied
-    verbatim into the written ``dikw.yml``. Downstream ``api.ingest``
-    reads the provider + assets blocks; ``_run_queries`` re-loads the
-    whole file to build ``HybridSearcher``, which picks up the retrieval
-    + multimodal blocks. This means eval reproducibly measures whatever
-    fusion knobs the caller passed, including the asset-vector leg.
+    verbatim into the written ``dikw.yml``. Downstream ``api.ingest`` reads
+    the provider + assets blocks plus the ingest-time
+    ``retrieval.cjk_tokenizer`` (which is baked into the FTS index). The
+    query-time knobs (the search-time ``RetrievalConfig`` fields rrf_k,
+    weights, fusion, graph_*, rerank_*, AND the rerank provider wiring) are
+    NOT read back from this file at query time — ``_run_queries`` receives
+    the caller's live ``RetrievalConfig`` + ``ProviderConfig`` directly
+    (#250). On a **cache hit** this file is the one the FIRST (snapshot-
+    building) arm wrote, so its query-time blocks are STALE relative to a
+    later arm's live config: trust it only for the ingest-time identity
+    (provider model+dim, ``cjk_tokenizer``, ``multimodal`` — the last is
+    still reloaded for the asset-vector leg), not to reproduce which fusion /
+    rerank knobs produced a given arm's numbers.
 
     ``schema_cfg`` is optional and only used by ``run_synth_eval`` so a
     synth-mode dataset can pin its own ``categories`` taxonomy into the
@@ -861,14 +880,46 @@ async def _run_queries(
     embedder: EmbeddingProvider,
     embedding_model: str,
     modes: list[RetrievalMode],
+    retrieval_config: RetrievalConfig,
+    provider_config: ProviderConfig,
     reporter: ProgressReporter | None = None,
 ) -> dict[RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]]]:
     """Run every query in ``spec`` once per mode against a single storage
     connection. Returns a dict keyed by mode.
+
+    Everything that shaped the on-disk snapshot is keyed into the cache and
+    read back from the baked ``cfg`` (corpus, embedding model+dim, the
+    ingest-time ``retrieval.cjk_tokenizer``, multimodal identity). Everything
+    that runs **at query time** is taken from the caller's **live** config so
+    an ablation under ``cache_mode="read_write"`` takes effect on a cache hit
+    instead of silently reusing the first run's config (#250):
+
+    * ``retrieval_config`` — search-time fusion knobs (rrf_k, weights, fusion,
+      graph_*, ``rerank_enabled``, ``rerank_candidate_k``, same_doc_penalty_alpha).
+    * ``provider_config`` — the rerank **provider** wiring (``rerank_model`` /
+      base_url / key). The reranker scores ``(query, chunk)`` at query time and
+      never touches the stored index, so it is read live (not keyed) — a
+      rerank-model A/B is correct on a cache hit. The text query embedder is
+      likewise the caller's live ``embedder`` + ``embedding_model``; its
+      identity is keyed so live == baked on a hit.
+
+    The one ingest-time retrieval field, ``cjk_tokenizer``, is pinned into the
+    snapshot cache key (``_corpus_cache_key``), so on a hit it is guaranteed to
+    match the baked FTS index.
     """
     cfg, _root = api.load_base(base)
+    # The snapshot's ingest-time tokenizer is pinned into the cache key, so a
+    # cache hit always matches the caller's live tokenizer. Assert it
+    # (defence-in-depth against a future key-format regression) before reusing
+    # the FTS index that was built with that tokenizer.
+    if cfg.retrieval.cjk_tokenizer != retrieval_config.cjk_tokenizer:
+        raise EvalError(
+            f"snapshot cjk_tokenizer {cfg.retrieval.cjk_tokenizer!r} != "
+            f"requested {retrieval_config.cjk_tokenizer!r}; the cache key should "
+            "have forced a fresh snapshot (this indicates a _corpus_cache_key bug)"
+        )
     storage = build_storage(
-        cfg.storage, root=base, cjk_tokenizer=cfg.retrieval.cjk_tokenizer
+        cfg.storage, root=base, cjk_tokenizer=retrieval_config.cjk_tokenizer
     )
     await storage.connect()
     await storage.migrate()
@@ -913,19 +964,26 @@ async def _run_queries(
         # happened to carry in ``Hit.asset_refs``.
         mm_search = await _build_multimodal_search(cfg, storage)
         # Wire the rerank leg the same way ``_retrieve_inner`` does so eval
-        # measures rerank-on vs rerank-off at one config (flip
-        # ``retrieval.rerank_enabled`` between runs). ``None`` when the base
-        # configured no reranker or disabled it. Closed in the ``finally``.
-        if cfg.retrieval.rerank_enabled:
-            reranker = build_reranker(cfg.provider)
+        # measures rerank-on vs rerank-off at one config. Both halves are read
+        # from the caller's **live** config, not the baked snapshot, so flipping
+        # either between runs takes effect even on a cache hit: ``rerank_enabled``
+        # (the gate) from ``retrieval_config`` and the rerank provider wiring
+        # (``rerank_model`` / base_url / key) from ``provider_config``. The
+        # reranker scores ``(query, chunk)`` at query time and never touches the
+        # stored index, so it is correctly read live rather than keyed into the
+        # snapshot — a rerank-model A/B under ``read_write`` is correct.
+        # ``None`` when the caller's config has no reranker or disabled it.
+        # Closed in the ``finally``.
+        if retrieval_config.rerank_enabled:
+            reranker = build_reranker(provider_config)
         searcher = HybridSearcher.from_config(
             storage,
             embedder,
-            cfg.retrieval,
+            retrieval_config,
             embedding_model=embedding_model,
             multimodal=mm_search,
             reranker=reranker,
-            rerank_model=cfg.provider.rerank_model,
+            rerank_model=provider_config.rerank_model,
             # Eval must fail loud on a transient query-embed blip, not silently
             # degrade to FTS-only — a degraded query would bias the measured
             # hybrid metric and could false-flag the ``--against`` gate.