From a09598afc71378f44fe2692b77db0646213a55bc Mon Sep 17 00:00:00 2001
From: holo <helebest@gmail.com>
Date: Sun, 28 Jun 2026 20:52:58 +0800
Subject: [PATCH 1/2] fix(eval): key cjk_tokenizer + read query-time retrieval
 config live (#250)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The eval snapshot cache key omitted RetrievalConfig, so under the default
`--cache read_write`, changing a retrieval setting silently reused the
previous run's baked config (no error, wrong numbers) — biting the
retrieval-tuning / ablation workflow exactly where config changes most.

Two-level cache: the snapshot is keyed on ingest-time inputs only (corpus,
embedding model+dim, the one ingest-time retrieval field `cjk_tokenizer`,
multimodal identity). Everything applied at query time is read from the
caller's live config in `_run_queries`, not the baked snapshot:
- search-time RetrievalConfig knobs (rrf_k, weights, fusion, graph_*,
  rerank_enabled, rerank_candidate_k) via a new `retrieval_config` param;
- the rerank provider wiring (rerank_model / base_url) via a new
  `provider_config` param — the reranker scores (query, chunk) at query
  time and never touches the stored index, so it is read live, not keyed.
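
A minimal sketch of that split, assuming hypothetical stand-ins
(`SketchRetrievalConfig`, `sketch_cache_key`) rather than the real
`RetrievalConfig` / `_corpus_cache_key`. The digest is a placeholder, and the
multimodal and schema segments of the real key are omitted for brevity:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRetrievalConfig:
    """Hypothetical stand-in for RetrievalConfig (illustration only)."""

    cjk_tokenizer: str = "jieba"  # ingest-time: baked into the FTS index
    rrf_k: int = 60  # search-time: read live at query time


def sketch_cache_key(
    dataset: str, model: str, dim: int, digest: str, cfg: SketchRetrievalConfig
) -> str:
    # Only ingest-time inputs enter the key; rrf_k is deliberately absent.
    return f"{dataset}/{model}__{dim}__{digest}__cjk{cfg.cjk_tokenizer}"


base = SketchRetrievalConfig()
tuned = SketchRetrievalConfig(rrf_k=11)  # search-time change only
retok = SketchRetrievalConfig(cjk_tokenizer="none")  # ingest-time change

key_base = sketch_cache_key("toy", "fake", 64, "deadbeef", base)
key_tuned = sketch_cache_key("toy", "fake", 64, "deadbeef", tuned)
key_retok = sketch_cache_key("toy", "fake", 64, "deadbeef", retok)
```

Because `rrf_k` never enters the key, `key_base` and `key_tuned` are equal
(the snapshot is reused), while `key_retok` differs (a fresh snapshot).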

A defensive check in `_run_queries` raises `EvalError` if the baked and live
cjk_tokenizer ever differ (the key already guarantees they match); a tripwire
test fails if the RetrievalConfig field set changes without being classified,
so the hand-split can't silently rot.
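
The tripwire idea can be sketched as follows, assuming a hypothetical
dataclass (`SketchConfig`) in place of the real `RetrievalConfig`. The test in
this patch compares the whole field set for equality, so it also fires on
removals. This simplified version only reports unclassified additions:

```python
from dataclasses import dataclass, fields


@dataclass
class SketchConfig:
    """Hypothetical config; the real RetrievalConfig has more fields."""

    cjk_tokenizer: str = "jieba"
    rrf_k: int = 60
    fusion: str = "combsum"


# Hand-maintained classification, mirroring the split in this patch.
INGEST_TIME = {"cjk_tokenizer"}
SEARCH_TIME = {"rrf_k", "fusion"}


def unclassified_fields(config_cls: type) -> set[str]:
    """Return declared fields that appear in neither hand-maintained set."""
    declared = {f.name for f in fields(config_cls)}
    return declared - (INGEST_TIME | SEARCH_TIME)


@dataclass
class SketchConfigWithNewField(SketchConfig):
    chunk_overlap: int = 0  # hypothetical new field nobody classified yet
```

A test asserting `unclassified_fields(...) == set()` passes for `SketchConfig`
and fails for `SketchConfigWithNewField` until someone classifies
`chunk_overlap`.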

Eval-infra only — domains/info/search.py and the retrieval algorithm are
untouched (no-baseline-needed). Docs (eval-plan.md, README.md, BASELINES.md)
updated: retrieval ablations are now both correct and fast under read_write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/eval-plan.md            |  19 +++--
 evals/BASELINES.md           |  35 +++++++++
 evals/README.md              |   7 +-
 src/dikw_core/eval/runner.py | 100 ++++++++++++++++++-----
 tests/test_eval_runner.py    | 148 +++++++++++++++++++++++++++++++++--
 5 files changed, 276 insertions(+), 33 deletions(-)

diff --git a/docs/eval-plan.md b/docs/eval-plan.md
index 2104a2b..cc4fe38 100644
--- a/docs/eval-plan.md
+++ b/docs/eval-plan.md
@@ -242,12 +242,19 @@ or non-destructiveness. Two gates:
    embedder**: the hermetic `FakeEmbeddings` is lexical bag-of-words and can't
    exercise the semantically-close-but-wrong failure mode rerank targets, so a
    fake-embedder run is non-informative for this change. scifact (large recall
-   pool) is the strongest signal; mvp is the dogfood sanity check. **Run each
-   arm with its own snapshot** (`cache_root`, or `cache_mode="off"`): the eval
-   cache key is `(corpus, embedder)` only and `_run_queries` reloads retrieval
-   config from the cached base's `dikw.yml`, so a shared `cache_root` makes the
-   second arm silently reuse the first arm's `rerank_enabled` — same reason RRF
-   weight A/B uses the offline re-fusion tools, not the `run_eval` cache.
+   pool) is the strongest signal; mvp is the dogfood sanity check. **Query-time
+   knobs are read live, so a shared `cache_root` is safe for these arms** — the
+   snapshot cache key covers only ingest-time inputs (corpus, embedding
+   model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal
+   identity), while everything applied at query time — the search-time
+   `RetrievalConfig` knobs (`rrf_k`, weights, `fusion`, `graph_*`,
+   `rerank_enabled`, `rerank_candidate_k`) **and** the rerank provider wiring
+   (`provider.rerank_model` / base_url) — is read from the caller's live config
+   in `_run_queries`, not the baked snapshot. So flipping `rerank_enabled`, the
+   rerank *model*, or any fusion knob between arms under the default
+   `cache_mode="read_write"` is both fast (no re-embed) and correct. Only an
+   **ingest-time** change (the embedding model/dim, or `cjk_tokenizer`) forces a
+   fresh snapshot — and the key handles that automatically.
 
 The Stage A K-layer fan-out + atomicity-lint baseline (2026-05-08), the
 wikilink graph leg ablation (2026-05-08), and the rerank-stage SciFact
diff --git a/evals/BASELINES.md b/evals/BASELINES.md
index 7ddc603..eecc447 100644
--- a/evals/BASELINES.md
+++ b/evals/BASELINES.md
@@ -7,6 +7,41 @@ regression from a re-run variance.
 Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml`
 are calibrated ~2-3 % below the most recent canonical-mode run.
 
+## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250)
+
+**Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now
+includes the one **ingest-time** `RetrievalConfig` field, `cjk_tokenizer` (baked
+into the FTS index), and `_run_queries` reads every **query-time** knob from the
+caller's **live** config instead of the baked snapshot's `dikw.yml`: the
+search-time `RetrievalConfig` fields (`rrf_k`, `bm25_weight`, `vector_weight`,
+`fusion`, `same_doc_penalty_alpha`, `graph_*`, `rerank_enabled`,
+`rerank_candidate_k`) **and** the rerank provider wiring (`provider.rerank_model`
+/ base_url) — the reranker scores `(query, chunk)` at query time and never
+touches the stored index, so it is read live, not keyed. Eval-infra only —
+`domains/info/search.py` and the retrieval algorithm are untouched, so this is
+`no-baseline-needed` (a new benchmark row would measure nothing new).
+
+**Consequence for the older entries below:** the "one snapshot per arm" /
+"DROP the snapshot between arms" footgun they describe (e.g. the 2026-06-25 rerank
+entry) **no longer applies to any query-time change**. Flipping `rerank_enabled`,
+the rerank *model*, `rrf_k`, weights, `fusion`, or `graph_*` between arms under
+the default `cache_mode="read_write"` now takes effect correctly and reuses the
+embeddings (fast). Only an **ingest-time** change (the embedding model/dim, or
+`cjk_tokenizer`) forces a fresh snapshot — and the key does that automatically.
+
+**Verification:** the cache-**hit** regression this PR fixes is covered by the
+unit tests `tests/test_eval_runner.py::test_eval_cache_search_time_config_read_live_on_hit`
+(asserts the warm run builds the searcher from the live `rrf_k`, not the baked
+one — goes red on pre-fix code, which threaded `cfg.retrieval`) and
+`::test_eval_cache_cjk_tokenizer_change_forces_reingest`; a tripwire test
+(`::test_retrieval_config_fields_are_classified_for_eval_cache`) fails if a new
+`RetrievalConfig` field is added unclassified. A hermetic `mvp` run
+(`FakeEmbeddings`, `cache_mode="off"`) is only a flow sanity check — it
+fresh-builds a base each run, so it does NOT exercise the cache-hit path and
+would pass on pre-fix code too; it merely confirms the live config is honored at
+all (default config vs `fusion="combsum"` → `mrr` 1.0 vs 0.833, `ndcg@10` 1.0 vs
+0.877) and that the default-config numbers are unchanged (algorithm untouched).
+
 ## 2026-06-25 — rerank stage: post-fusion cross-encoder (opt-in), SciFact off-vs-on
 
 **Change under test:** the new optional rerank stage in `HybridSearcher.search`
diff --git a/evals/README.md b/evals/README.md
index 82eee8b..0adaafa 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -167,7 +167,12 @@ embedder for benchmark-scale work).
 
 Tune the fusion weights by editing the `retrieval:` block in your
 base's `dikw.yml`, then re-run `dikw client eval --dataset <name>
---retrieval all` to compare. Pin the winning combination:
+--retrieval all` to compare. These are **search-time** knobs, read live
+each run, so re-running under the default `--cache read_write` is safe and
+fast (the snapshot is reused, only the ranking is recomputed). The one
+exception is `cjk_tokenizer`: it is **ingest-time** (baked into the FTS
+index), so it is part of the snapshot cache key — changing it
+automatically forces a fresh snapshot. Pin the winning combination:
 
 ```yaml
 retrieval:
diff --git a/src/dikw_core/eval/runner.py b/src/dikw_core/eval/runner.py
index c151ad6..1aa7685 100644
--- a/src/dikw_core/eval/runner.py
+++ b/src/dikw_core/eval/runner.py
@@ -152,15 +152,17 @@ def _corpus_cache_key(
     model: str,
     dim: int | None,
     *,
+    cjk_tokenizer: str,
     mm_fingerprint: str | None = None,
 ) -> str:
-    """Stable cache key combining dataset name, model, dim, corpus hash, schema.
+    """Stable cache key combining dataset name, model, dim, corpus hash,
+    tokenizer, schema.
 
     Algorithm:
       1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir
       2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable;
          ``cache_mode="rebuild"`` is the escape hatch)
-      3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}``
+      3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}``
 
     ``as_posix()`` keeps the key cross-platform (Windows / Linux yield
     the same digest). Embedding ``dim=None`` is rendered as ``0`` so the
@@ -171,11 +173,14 @@ def _corpus_cache_key(
     index can't be silently reused by a real-vector multimodal eval.
     ``None`` is rendered as ``0`` for back-compat with pre-mm caches.
 
-    NOTE: ``RetrievalConfig`` (rrf_k, weights, fusion, graph_*, …) is
-    NOT in the key. Re-running with a different retrieval config under
-    ``cache_mode="read_write"`` silently reuses the first run's
-    ``dikw.yml`` — the snapshot's base carries the old block. Retrieval
-    ablations must use ``cache_mode="off"`` until this is fixed.
+    ``cjk_tokenizer`` is the one **ingest-time** ``RetrievalConfig`` field —
+    it's baked into the FTS index (and drives the chunker's token budget) at
+    ingest, so it MUST be in the key: changing it forces a fresh snapshot.
+    Every other ``RetrievalConfig`` knob (rrf_k, weights, fusion, graph_*,
+    rerank_*) is **search-time** and is read live in ``_run_queries`` from
+    the caller's config — never from the baked snapshot — so a retrieval
+    ablation under ``cache_mode="read_write"`` is both fast (no re-embed)
+    and correct.
     """
     h = hashlib.sha256()
     for path in sorted(spec.corpus_dir.rglob("*")):
@@ -195,7 +200,8 @@ def _corpus_cache_key(
     # distinct from the legacy ``mig`` prefix so caches stamped before
     # the per-migration-counter framework was deleted never match.
     return (
-        f"{spec.name}/{model}__{dim_str}__{digest}__mm{mm_str}__sf{SCHEMA_VERSION}"
+        f"{spec.name}/{model}__{dim_str}__{digest}"
+        f"__mm{mm_str}__cjk{cjk_tokenizer}__sf{SCHEMA_VERSION}"
     )
 
 
@@ -502,6 +508,8 @@ async def _build(target: Path) -> Path:
                 embedder=effective_embedder,
                 embedding_model=effective_provider_cfg.embedding_model,
                 modes=modes,
+                retrieval_config=effective_retrieval_cfg,
+                provider_config=effective_provider_cfg,
                 reporter=_reporter,
             )
     else:
@@ -511,6 +519,7 @@ async def _build(target: Path) -> Path:
             spec,
             effective_provider_cfg.embedding_model,
             effective_provider_cfg.embedding_dim,
+            cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer,
             mm_fingerprint=mm_fingerprint,
         )
         cache_dir = root / key
@@ -549,6 +558,8 @@ async def _build(target: Path) -> Path:
             embedder=effective_embedder,
             embedding_model=effective_provider_cfg.embedding_model,
             modes=modes,
+            retrieval_config=effective_retrieval_cfg,
+            provider_config=effective_provider_cfg,
             reporter=_reporter,
         )
 
@@ -615,11 +626,19 @@ def _materialise_base(
     """Scaffold a throwaway base + dikw.yml that matches ``spec``.
 
     ``provider_cfg`` / ``retrieval_cfg`` / ``assets_cfg`` are copied
-    verbatim into the written ``dikw.yml``. Downstream ``api.ingest``
-    reads the provider + assets blocks; ``_run_queries`` re-loads the
-    whole file to build ``HybridSearcher``, which picks up the retrieval
-    + multimodal blocks. This means eval reproducibly measures whatever
-    fusion knobs the caller passed, including the asset-vector leg.
+    verbatim into the written ``dikw.yml``. Downstream ``api.ingest`` reads
+    the provider + assets blocks plus the ingest-time
+    ``retrieval.cjk_tokenizer`` (which is baked into the FTS index). The
+    query-time knobs (the search-time ``RetrievalConfig`` fields rrf_k,
+    weights, fusion, graph_*, rerank_*, AND the rerank provider wiring) are
+    NOT read back from this file at query time — ``_run_queries`` receives
+    the caller's live ``RetrievalConfig`` + ``ProviderConfig`` directly
+    (#250). On a **cache hit** this file is the one the FIRST (snapshot-
+    building) arm wrote, so its query-time blocks are STALE relative to a
+    later arm's live config: trust it only for the ingest-time identity
+    (provider model+dim, ``cjk_tokenizer``, ``multimodal`` — the last is
+    still reloaded for the asset-vector leg), not to reproduce which fusion /
+    rerank knobs produced a given arm's numbers.
 
     ``schema_cfg`` is optional and only used by ``run_synth_eval`` so a
     synth-mode dataset can pin its own ``categories`` taxonomy into the
@@ -861,14 +880,46 @@ async def _run_queries(
     embedder: EmbeddingProvider,
     embedding_model: str,
     modes: list[RetrievalMode],
+    retrieval_config: RetrievalConfig,
+    provider_config: ProviderConfig,
     reporter: ProgressReporter | None = None,
 ) -> dict[RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]]]:
     """Run every query in ``spec`` once per mode against a single storage
     connection. Returns a dict keyed by mode.
+
+    Everything that shaped the on-disk snapshot is keyed into the cache and
+    read back from the baked ``cfg`` (corpus, embedding model+dim, the
+    ingest-time ``retrieval.cjk_tokenizer``, multimodal identity). Everything
+    that runs **at query time** is taken from the caller's **live** config so
+    an ablation under ``cache_mode="read_write"`` takes effect on a cache hit
+    instead of silently reusing the first run's config (#250):
+
+    * ``retrieval_config`` — search-time fusion knobs (rrf_k, weights, fusion,
+      graph_*, ``rerank_enabled``, ``rerank_candidate_k``, same_doc_penalty_alpha).
+    * ``provider_config`` — the rerank **provider** wiring (``rerank_model`` /
+      base_url / key). The reranker scores ``(query, chunk)`` at query time and
+      never touches the stored index, so it is read live (not keyed) — a
+      rerank-model A/B is correct on a cache hit. The text query embedder is
+      likewise the caller's live ``embedder`` + ``embedding_model``; its
+      identity is keyed so live == baked on a hit.
+
+    The one ingest-time retrieval field, ``cjk_tokenizer``, is pinned into the
+    snapshot cache key (``_corpus_cache_key``), so on a hit it is guaranteed to
+    match the baked FTS index.
     """
     cfg, _root = api.load_base(base)
+    # The snapshot's ingest-time tokenizer is pinned into the cache key, so a
+    # cache hit always matches the caller's live tokenizer. Assert it
+    # (defence-in-depth against a future key-format regression) before reusing
+    # the FTS index that was built with that tokenizer.
+    if cfg.retrieval.cjk_tokenizer != retrieval_config.cjk_tokenizer:
+        raise EvalError(
+            f"snapshot cjk_tokenizer {cfg.retrieval.cjk_tokenizer!r} != "
+            f"requested {retrieval_config.cjk_tokenizer!r}; the cache key should "
+            f"have forced a fresh snapshot (this indicates a _corpus_cache_key bug)"
+        )
     storage = build_storage(
-        cfg.storage, root=base, cjk_tokenizer=cfg.retrieval.cjk_tokenizer
+        cfg.storage, root=base, cjk_tokenizer=retrieval_config.cjk_tokenizer
     )
     await storage.connect()
     await storage.migrate()
@@ -913,19 +964,26 @@ async def _run_queries(
         # happened to carry in ``Hit.asset_refs``.
         mm_search = await _build_multimodal_search(cfg, storage)
         # Wire the rerank leg the same way ``_retrieve_inner`` does so eval
-        # measures rerank-on vs rerank-off at one config (flip
-        # ``retrieval.rerank_enabled`` between runs). ``None`` when the base
-        # configured no reranker or disabled it. Closed in the ``finally``.
-        if cfg.retrieval.rerank_enabled:
-            reranker = build_reranker(cfg.provider)
+        # measures rerank-on vs rerank-off at one config. Both halves are read
+        # from the caller's **live** config, not the baked snapshot, so flipping
+        # either between runs takes effect even on a cache hit: ``rerank_enabled``
+        # (the gate) from ``retrieval_config`` and the rerank provider wiring
+        # (``rerank_model`` / base_url / key) from ``provider_config``. The
+        # reranker scores ``(query, chunk)`` at query time and never touches the
+        # stored index, so it is correctly read live rather than keyed into the
+        # snapshot — a rerank-model A/B under ``read_write`` is correct.
+        # ``None`` when the caller's config has no reranker or disabled it.
+        # Closed in the ``finally``.
+        if retrieval_config.rerank_enabled:
+            reranker = build_reranker(provider_config)
         searcher = HybridSearcher.from_config(
             storage,
             embedder,
-            cfg.retrieval,
+            retrieval_config,
             embedding_model=embedding_model,
             multimodal=mm_search,
             reranker=reranker,
-            rerank_model=cfg.provider.rerank_model,
+            rerank_model=provider_config.rerank_model,
             # Eval must fail loud on a transient query-embed blip, not silently
             # degrade to FTS-only — a degraded query would bias the measured
             # hybrid metric and could false-flag the ``--against`` gate.
diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py
index 66bfc9b..36235b7 100644
--- a/tests/test_eval_runner.py
+++ b/tests/test_eval_runner.py
@@ -741,19 +741,19 @@ async def test_corpus_cache_key_is_deterministic_and_path_stable(
         queries=[("foo", ["alpha"])],
     )
     spec = load_dataset(ds)
-    k1 = _corpus_cache_key(spec, "fake", 64)
-    k2 = _corpus_cache_key(spec, "fake", 64)
+    k1 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="jieba")
+    k2 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="jieba")
     assert k1 == k2
 
     # Mutate one file's bytes — key must change.
     (ds / "corpus" / "alpha.md").write_text(
         "# Alpha\n\nDifferent body now.\n", encoding="utf-8"
     )
-    k3 = _corpus_cache_key(spec, "fake", 64)
+    k3 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="jieba")
     assert k3 != k1
 
     # Different model → different key.
-    k4 = _corpus_cache_key(spec, "other-model", 64)
+    k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba")
     assert k4 != k1
     # Format sanity.
     assert k1.startswith("toy/fake__64__")
@@ -762,10 +762,20 @@ async def test_corpus_cache_key_is_deterministic_and_path_stable(
     # Different multimodal fingerprint → different key (text-only and
     # multimodal eval runs against the same corpus must NOT share a
     # snapshot — the asset index lives in different vec tables).
-    k5 = _corpus_cache_key(spec, "fake", 64, mm_fingerprint="qwen3-vl-8b@4096")
+    k5 = _corpus_cache_key(
+        spec, "fake", 64, cjk_tokenizer="jieba", mm_fingerprint="qwen3-vl-8b@4096"
+    )
     assert k5 != k1
     assert "__mmqwen3-vl-8b@4096__" in k5
 
+    # #250: cjk_tokenizer is the one INGEST-TIME RetrievalConfig field — it's
+    # baked into the FTS index, so it must be in the key. Changing it must
+    # produce a distinct key (a fresh snapshot), never a stale reuse.
+    k6 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="none")
+    assert k6 != k1
+    assert "__cjkjieba__" in k1
+    assert "__cjknone__" in k6
+
 
 async def test_eval_snapshot_cache_hit_skips_ingest(
     tmp_path: Path, monkeypatch: pytest.MonkeyPatch
@@ -899,6 +909,134 @@ async def boom_ingest(*args: object, **kwargs: object) -> object:
     partial_dirs = list((cache_root / "toy").glob("fake__*.partial"))
     assert len(partial_dirs) == 1
 
+
+async def test_eval_cache_search_time_config_read_live_on_hit(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    """#250: a SEARCH-TIME retrieval knob (rrf_k) changed under read_write takes
+    effect on a cache HIT — the searcher is built from the caller's LIVE config,
+    not the baked snapshot — and the change does NOT re-ingest (fast + correct).
+    """
+    from dikw_core.config import RetrievalConfig
+    from dikw_core.eval import runner as runner_mod
+
+    ds = _write_dataset(tmp_path / "ds", queries=[("alpha", ["alpha"])])
+    spec = load_dataset(ds)
+    cache_root = tmp_path / "cache"
+
+    ingest_calls = 0
+    real_ingest = runner_mod.api.ingest
+
+    async def spy_ingest(*args: object, **kwargs: object) -> object:
+        nonlocal ingest_calls
+        ingest_calls += 1
+        return await real_ingest(*args, **kwargs)
+
+    monkeypatch.setattr(runner_mod.api, "ingest", spy_ingest)
+
+    # Capture the rrf_k the searcher is actually built with each run. The spy
+    # replaces the ``from_config`` classmethod with a plain function; the runner
+    # calls it on the class (``HybridSearcher.from_config(storage, embedder,
+    # cfg, ...)``), so no ``cls`` is bound and the spy sees the same args.
+    seen_rrf_k: list[int] = []
+    real_from_config = runner_mod.HybridSearcher.from_config
+
+    def spy_from_config(storage, embedder, retrieval_cfg, **kwargs):  # type: ignore[no-untyped-def]
+        seen_rrf_k.append(retrieval_cfg.rrf_k)
+        return real_from_config(storage, embedder, retrieval_cfg, **kwargs)
+
+    monkeypatch.setattr(runner_mod.HybridSearcher, "from_config", spy_from_config)
+
+    await run_eval(
+        spec, cache_root=cache_root, retrieval_config=RetrievalConfig(rrf_k=60)
+    )
+    await run_eval(
+        spec, cache_root=cache_root, retrieval_config=RetrievalConfig(rrf_k=11)
+    )
+
+    assert ingest_calls == 1, "search-time change must reuse the snapshot (no re-ingest)"
+    assert seen_rrf_k == [60, 11], (
+        "warm run must build the searcher from the LIVE rrf_k (11), not the "
+        f"baked 60; saw {seen_rrf_k}"
+    )
+
+
+async def test_eval_cache_cjk_tokenizer_change_forces_reingest(
+    tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    """#250: cjk_tokenizer is INGEST-TIME — it's pinned into the cache key, so
+    changing it produces a fresh snapshot (re-ingest), never a stale reuse of an
+    index built with the old tokenizer.
+    """
+    from dikw_core.config import RetrievalConfig
+    from dikw_core.eval import runner as runner_mod
+
+    ds = _write_dataset(tmp_path / "ds", queries=[("alpha", ["alpha"])])
+    spec = load_dataset(ds)
+    cache_root = tmp_path / "cache"
+
+    ingest_calls = 0
+    real_ingest = runner_mod.api.ingest
+
+    async def spy_ingest(*args: object, **kwargs: object) -> object:
+        nonlocal ingest_calls
+        ingest_calls += 1
+        return await real_ingest(*args, **kwargs)
+
+    monkeypatch.setattr(runner_mod.api, "ingest", spy_ingest)
+
+    await run_eval(
+        spec, cache_root=cache_root, retrieval_config=RetrievalConfig(cjk_tokenizer="jieba")
+    )
+    assert ingest_calls == 1
+    await run_eval(
+        spec, cache_root=cache_root, retrieval_config=RetrievalConfig(cjk_tokenizer="none")
+    )
+    assert ingest_calls == 2, "ingest-time tokenizer change must re-ingest"
+    # Two distinct snapshots coexist under the dataset's cache subdir.
+    snapshot_dirs = [
+        d for d in (cache_root / "toy").glob("fake__*") if not d.name.endswith(".partial")
+    ]
+    assert len(snapshot_dirs) == 2
+
+
+def test_retrieval_config_fields_are_classified_for_eval_cache() -> None:
+    """#250 tripwire: the eval snapshot cache hand-splits ``RetrievalConfig`` into
+    ONE ingest-time field (``cjk_tokenizer`` — baked into the FTS index, so keyed
+    in ``_corpus_cache_key``) and the rest search-time (read live in
+    ``_run_queries``, never keyed). That split is encoded by hand, so adding a new
+    ``RetrievalConfig`` field would silently default it to "search-time, read
+    live" — which is WRONG and reintroduces the #250 stale-snapshot bug if the new
+    field actually affects ingest. This test fails on any field-set change, forcing
+    the contributor to classify the new field:
+      * ingest-time (changes chunking / the FTS index / stored vectors) → add it
+        to ``_corpus_cache_key`` (and the ``cjk_tokenizer`` guard in
+        ``_run_queries``);
+      * search-time (query-time ranking only) → it is already read live; no key
+        change needed.
+    """
+    from dikw_core.config import RetrievalConfig
+
+    ingest_time = {"cjk_tokenizer"}
+    search_time = {
+        "rrf_k",
+        "bm25_weight",
+        "vector_weight",
+        "fusion",
+        "same_doc_penalty_alpha",
+        "graph_enabled",
+        "graph_seed_top_k",
+        "graph_weight",
+        "rerank_enabled",
+        "rerank_candidate_k",
+    }
+    assert set(RetrievalConfig.model_fields) == ingest_time | search_time, (
+        "RetrievalConfig fields changed — classify the new/removed field for the "
+        "eval snapshot cache (src/dikw_core/eval/runner.py): ingest-time → add to "
+        "_corpus_cache_key (+ the cjk_tokenizer assert in _run_queries); "
+        "search-time → it is read live, no key change needed. See #250."
+    )
+
 # ---- check_thresholds direction-aware ---------------------------------------
 
 

From a9ca5dcb0f916ff8bf5f550f8c91b19e02542b31 Mon Sep 17 00:00:00 2001
From: holo <helebest@gmail.com>
Date: Sun, 28 Jun 2026 21:04:43 +0800
Subject: [PATCH 2/2] test(eval): cover the cjk_tokenizer mismatch guard in
 _run_queries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Exercise the #250 defence-in-depth guard directly: build a jieba snapshot, then
call _run_queries with a mismatched live tokenizer and expect EvalError. This
covers the otherwise-unexecuted raise so codecov/patch reflects the guard's
intent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/test_eval_runner.py | 44 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py
index 36235b7..22ca6ab 100644
--- a/tests/test_eval_runner.py
+++ b/tests/test_eval_runner.py
@@ -1037,6 +1037,50 @@ def test_retrieval_config_fields_are_classified_for_eval_cache() -> None:
         "search-time → it is read live, no key change needed. See #250."
     )
 
+
+async def test_run_queries_guards_against_tokenizer_mismatch(
+    tmp_path: Path,
+) -> None:
+    """#250 defence-in-depth: if a future ``_corpus_cache_key`` regression ever let
+    a snapshot built with one ``cjk_tokenizer`` be reused under a different live
+    tokenizer, ``_run_queries`` fails loud rather than querying an FTS index built
+    with the wrong tokenizer. The cache key makes this impossible in normal
+    operation, so it's exercised here by calling ``_run_queries`` directly on a
+    jieba-built snapshot with a mismatched live ``cjk_tokenizer``.
+    """
+    from dikw_core.config import ProviderConfig, RetrievalConfig
+    from dikw_core.eval.fake_embedder import FakeEmbeddings
+    from dikw_core.eval.runner import _run_queries
+
+    ds = _write_dataset(tmp_path / "ds", queries=[("alpha", ["alpha"])])
+    spec = load_dataset(ds)
+    cache_root = tmp_path / "cache"
+    await run_eval(
+        spec, cache_root=cache_root, retrieval_config=RetrievalConfig(cjk_tokenizer="jieba")
+    )
+    base = next((cache_root / "toy").glob("fake__*")) / "base"
+    assert base.is_dir()
+
+    provider = ProviderConfig(
+        llm_api_key_env="ANTHROPIC_API_KEY",
+        embedding_model="fake",
+        embedding_dim=64,
+        embedding_revision="",
+        embedding_normalize=True,
+        embedding_distance="cosine",
+        embedding_api_key_env="OPENAI_API_KEY",
+    )
+    with pytest.raises(EvalError, match="snapshot cjk_tokenizer"):
+        await _run_queries(
+            base,
+            spec,
+            embedder=FakeEmbeddings(),
+            embedding_model="fake",
+            modes=["hybrid"],
+            retrieval_config=RetrievalConfig(cjk_tokenizer="none"),
+            provider_config=provider,
+        )
+
 # ---- check_thresholds direction-aware ---------------------------------------