From a09598afc71378f44fe2692b77db0646213a55bc Mon Sep 17 00:00:00 2001 From: holo Date: Sun, 28 Jun 2026 20:52:58 +0800 Subject: [PATCH 1/2] fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The eval snapshot cache key omitted RetrievalConfig, so under the default `--cache read_write`, changing a retrieval setting silently reused the previous run's baked config (no error, wrong numbers) — biting the retrieval-tuning / ablation workflow exactly where config changes most. Two-level cache: the snapshot is keyed on ingest-time inputs only (corpus, embedding model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal identity). Everything applied at query time is read from the caller's live config in `_run_queries`, not the baked snapshot: - search-time RetrievalConfig knobs (rrf_k, weights, fusion, graph_*, rerank_enabled, rerank_candidate_k) via a new `retrieval_config` param; - the rerank provider wiring (rerank_model / base_url) via a new `provider_config` param — the reranker scores (query, chunk) at query time and never touches the stored index, so it is read live, not keyed. A defensive assert guards the (key-guaranteed) baked==live cjk_tokenizer match; a tripwire test fails if a new RetrievalConfig field is added unclassified, so the hand-split can't silently rot. Eval-infra only — domains/info/search.py and the retrieval algorithm are untouched (no-baseline-needed). Docs (eval-plan.md, README.md, BASELINES.md) updated: retrieval ablations are now both correct and fast under read_write. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/eval-plan.md | 19 +++-- evals/BASELINES.md | 35 +++++++++ evals/README.md | 7 +- src/dikw_core/eval/runner.py | 100 ++++++++++++++++++----- tests/test_eval_runner.py | 148 +++++++++++++++++++++++++++++++++-- 5 files changed, 276 insertions(+), 33 deletions(-) diff --git a/docs/eval-plan.md b/docs/eval-plan.md index 2104a2b..cc4fe38 100644 --- a/docs/eval-plan.md +++ b/docs/eval-plan.md @@ -242,12 +242,19 @@ or non-destructiveness. Two gates: embedder**: the hermetic `FakeEmbeddings` is lexical bag-of-words and can't exercise the semantically-close-but-wrong failure mode rerank targets, so a fake-embedder run is non-informative for this change. scifact (large recall - pool) is the strongest signal; mvp is the dogfood sanity check. **Run each - arm with its own snapshot** (`cache_root`, or `cache_mode="off"`): the eval - cache key is `(corpus, embedder)` only and `_run_queries` reloads retrieval - config from the cached base's `dikw.yml`, so a shared `cache_root` makes the - second arm silently reuse the first arm's `rerank_enabled` — same reason RRF - weight A/B uses the offline re-fusion tools, not the `run_eval` cache. + pool) is the strongest signal; mvp is the dogfood sanity check. **Query-time + knobs are read live, so a shared `cache_root` is safe for these arms** — the + snapshot cache key covers only ingest-time inputs (corpus, embedding + model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal + identity), while everything applied at query time — the search-time + `RetrievalConfig` knobs (`rrf_k`, weights, `fusion`, `graph_*`, + `rerank_enabled`, `rerank_candidate_k`) **and** the rerank provider wiring + (`provider.rerank_model` / base_url) — is read from the caller's live config + in `_run_queries`, not the baked snapshot. 
So flipping `rerank_enabled`, the + rerank *model*, or any fusion knob between arms under the default + `cache_mode="read_write"` is both fast (no re-embed) and correct. Only an + **ingest-time** change (the embedding model/dim, or `cjk_tokenizer`) forces a + fresh snapshot — and the key handles that automatically. The Stage A K-layer fan-out + atomicity-lint baseline (2026-05-08), the wikilink graph leg ablation (2026-05-08), and the rerank-stage SciFact diff --git a/evals/BASELINES.md b/evals/BASELINES.md index 7ddc603..eecc447 100644 --- a/evals/BASELINES.md +++ b/evals/BASELINES.md @@ -7,6 +7,41 @@ regression from a re-run variance. Newest first. `dikw client eval` thresholds in each dataset's `dataset.yaml` are calibrated ~2-3 % below the most recent canonical-mode run. +## 2026-06-28 — eval-infra: two-level snapshot cache (fixes the RetrievalConfig footgun, #250) + +**Change under test:** `run_eval`'s snapshot cache key (`_corpus_cache_key`) now +includes the one **ingest-time** `RetrievalConfig` field, `cjk_tokenizer` (baked +into the FTS index), and `_run_queries` reads every **query-time** knob from the +caller's **live** config instead of the baked snapshot's `dikw.yml`: the +search-time `RetrievalConfig` fields (`rrf_k`, `bm25_weight`, `vector_weight`, +`fusion`, `same_doc_penalty_alpha`, `graph_*`, `rerank_enabled`, +`rerank_candidate_k`) **and** the rerank provider wiring (`provider.rerank_model` +/ base_url) — the reranker scores `(query, chunk)` at query time and never +touches the stored index, so it is read live, not keyed. Eval-infra only — +`domains/info/search.py` and the retrieval algorithm are untouched, so this is +`no-baseline-needed` (a new benchmark row would measure nothing new). + +**Consequence for the older entries below:** the "one snapshot per arm" / +"DROP the snapshot between arms" footgun they describe (e.g. the 2026-06-25 rerank +entry) **no longer applies to any query-time change**. 
Flipping `rerank_enabled`, +the rerank *model*, `rrf_k`, weights, `fusion`, or `graph_*` between arms under +the default `cache_mode="read_write"` now takes effect correctly and reuses the +embeddings (fast). Only an **ingest-time** change (the embedding model/dim, or +`cjk_tokenizer`) forces a fresh snapshot — and the key does that automatically. + +**Verification:** the cache-**hit** regression this PR fixes is covered by the +unit tests `tests/test_eval_runner.py::test_eval_cache_search_time_config_read_live_on_hit` +(asserts the warm run builds the searcher from the live `rrf_k`, not the baked +one — goes red on pre-fix code, which threaded `cfg.retrieval`) and +`::test_eval_cache_cjk_tokenizer_change_forces_reingest`; a tripwire test +(`::test_retrieval_config_fields_are_classified_for_eval_cache`) fails if a new +`RetrievalConfig` field is added unclassified. A hermetic `mvp` run +(`FakeEmbeddings`, `cache_mode="off"`) is only a flow sanity check — it +fresh-builds a base each run, so it does NOT exercise the cache-hit path and +would pass on pre-fix code too; it merely confirms the live config is honored at +all (default config vs `fusion="combsum"` → `mrr` 1.0 vs 0.833, `ndcg@10` 1.0 vs +0.877) and that the default-config numbers are unchanged (algorithm untouched). + ## 2026-06-25 — rerank stage: post-fusion cross-encoder (opt-in), SciFact off-vs-on **Change under test:** the new optional rerank stage in `HybridSearcher.search` diff --git a/evals/README.md b/evals/README.md index 82eee8b..0adaafa 100644 --- a/evals/README.md +++ b/evals/README.md @@ -167,7 +167,12 @@ embedder for benchmark-scale work). Tune the fusion weights by editing the `retrieval:` block in your base's `dikw.yml`, then re-run `dikw client eval --dataset ---retrieval all` to compare. Pin the winning combination: +--retrieval all` to compare. 
These are **search-time** knobs, read live +each run, so re-running under the default `--cache read_write` is safe and +fast (the snapshot is reused, only the ranking is recomputed). The one +exception is `cjk_tokenizer`: it is **ingest-time** (baked into the FTS +index), so it is part of the snapshot cache key — changing it +automatically forces a fresh snapshot. Pin the winning combination: ```yaml retrieval: diff --git a/src/dikw_core/eval/runner.py b/src/dikw_core/eval/runner.py index c151ad6..1aa7685 100644 --- a/src/dikw_core/eval/runner.py +++ b/src/dikw_core/eval/runner.py @@ -152,15 +152,17 @@ def _corpus_cache_key( model: str, dim: int | None, *, + cjk_tokenizer: str, mm_fingerprint: str | None = None, ) -> str: - """Stable cache key combining dataset name, model, dim, corpus hash, schema. + """Stable cache key combining dataset name, model, dim, corpus hash, + tokenizer, schema. Algorithm: 1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir 2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable; ``cache_mode="rebuild"`` is the escape hatch) - 3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}`` + 3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}`` ``as_posix()`` keeps the key cross-platform (Windows / Linux yield the same digest). Embedding ``dim=None`` is rendered as ``0`` so the @@ -171,11 +173,14 @@ def _corpus_cache_key( index can't be silently reused by a real-vector multimodal eval. ``None`` is rendered as ``0`` for back-compat with pre-mm caches. - NOTE: ``RetrievalConfig`` (rrf_k, weights, fusion, graph_*, …) is - NOT in the key. Re-running with a different retrieval config under - ``cache_mode="read_write"`` silently reuses the first run's - ``dikw.yml`` — the snapshot's base carries the old block. Retrieval - ablations must use ``cache_mode="off"`` until this is fixed. 
+ ``cjk_tokenizer`` is the one **ingest-time** ``RetrievalConfig`` field — + it's baked into the FTS index (and drives the chunker's token budget) at + ingest, so it MUST be in the key: changing it forces a fresh snapshot. + Every other ``RetrievalConfig`` knob (rrf_k, weights, fusion, graph_*, + rerank_*) is **search-time** and is read live in ``_run_queries`` from + the caller's config — never from the baked snapshot — so a retrieval + ablation under ``cache_mode="read_write"`` is both fast (no re-embed) + and correct. """ h = hashlib.sha256() for path in sorted(spec.corpus_dir.rglob("*")): @@ -195,7 +200,8 @@ def _corpus_cache_key( # distinct from the legacy ``mig`` prefix so caches stamped before # the per-migration-counter framework was deleted never match. return ( - f"{spec.name}/{model}__{dim_str}__{digest}__mm{mm_str}__sf{SCHEMA_VERSION}" + f"{spec.name}/{model}__{dim_str}__{digest}" + f"__mm{mm_str}__cjk{cjk_tokenizer}__sf{SCHEMA_VERSION}" ) @@ -502,6 +508,8 @@ async def _build(target: Path) -> Path: embedder=effective_embedder, embedding_model=effective_provider_cfg.embedding_model, modes=modes, + retrieval_config=effective_retrieval_cfg, + provider_config=effective_provider_cfg, reporter=_reporter, ) else: @@ -511,6 +519,7 @@ async def _build(target: Path) -> Path: spec, effective_provider_cfg.embedding_model, effective_provider_cfg.embedding_dim, + cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer, mm_fingerprint=mm_fingerprint, ) cache_dir = root / key @@ -549,6 +558,8 @@ async def _build(target: Path) -> Path: embedder=effective_embedder, embedding_model=effective_provider_cfg.embedding_model, modes=modes, + retrieval_config=effective_retrieval_cfg, + provider_config=effective_provider_cfg, reporter=_reporter, ) @@ -615,11 +626,19 @@ def _materialise_base( """Scaffold a throwaway base + dikw.yml that matches ``spec``. ``provider_cfg`` / ``retrieval_cfg`` / ``assets_cfg`` are copied - verbatim into the written ``dikw.yml``. 
Downstream ``api.ingest`` - reads the provider + assets blocks; ``_run_queries`` re-loads the - whole file to build ``HybridSearcher``, which picks up the retrieval - + multimodal blocks. This means eval reproducibly measures whatever - fusion knobs the caller passed, including the asset-vector leg. + verbatim into the written ``dikw.yml``. Downstream ``api.ingest`` reads + the provider + assets blocks plus the ingest-time + ``retrieval.cjk_tokenizer`` (which is baked into the FTS index). The + query-time knobs (the search-time ``RetrievalConfig`` fields rrf_k, + weights, fusion, graph_*, rerank_*, AND the rerank provider wiring) are + NOT read back from this file at query time — ``_run_queries`` receives + the caller's live ``RetrievalConfig`` + ``ProviderConfig`` directly + (#250). On a **cache hit** this file is the one the FIRST (snapshot- + building) arm wrote, so its query-time blocks are STALE relative to a + later arm's live config: trust it only for the ingest-time identity + (provider model+dim, ``cjk_tokenizer``, ``multimodal`` — the last is + still reloaded for the asset-vector leg), not to reproduce which fusion / + rerank knobs produced a given arm's numbers. ``schema_cfg`` is optional and only used by ``run_synth_eval`` so a synth-mode dataset can pin its own ``categories`` taxonomy into the @@ -861,14 +880,46 @@ async def _run_queries( embedder: EmbeddingProvider, embedding_model: str, modes: list[RetrievalMode], + retrieval_config: RetrievalConfig, + provider_config: ProviderConfig, reporter: ProgressReporter | None = None, ) -> dict[RetrievalMode, tuple[list[PerQueryRow], list[NegativeRow]]]: """Run every query in ``spec`` once per mode against a single storage connection. Returns a dict keyed by mode. + + Everything that shaped the on-disk snapshot is keyed into the cache and + read back from the baked ``cfg`` (corpus, embedding model+dim, the + ingest-time ``retrieval.cjk_tokenizer``, multimodal identity). 
Everything + that runs **at query time** is taken from the caller's **live** config so + an ablation under ``cache_mode="read_write"`` takes effect on a cache hit + instead of silently reusing the first run's config (#250): + + * ``retrieval_config`` — search-time fusion knobs (rrf_k, weights, fusion, + graph_*, ``rerank_enabled``, ``rerank_candidate_k``, same_doc_penalty_alpha). + * ``provider_config`` — the rerank **provider** wiring (``rerank_model`` / + base_url / key). The reranker scores ``(query, chunk)`` at query time and + never touches the stored index, so it is read live (not keyed) — a + rerank-model A/B is correct on a cache hit. The text query embedder is + likewise the caller's live ``embedder`` + ``embedding_model``; its + identity is keyed so live == baked on a hit. + + The one ingest-time retrieval field, ``cjk_tokenizer``, is pinned into the + snapshot cache key (``_corpus_cache_key``), so on a hit it is guaranteed to + match the baked FTS index. """ cfg, _root = api.load_base(base) + # The snapshot's ingest-time tokenizer is pinned into the cache key, so a + # cache hit always matches the caller's live tokenizer. Assert it + # (defence-in-depth against a future key-format regression) before reusing + # the FTS index that was built with that tokenizer. + if cfg.retrieval.cjk_tokenizer != retrieval_config.cjk_tokenizer: + raise EvalError( + f"snapshot cjk_tokenizer {cfg.retrieval.cjk_tokenizer!r} != " + f"requested {retrieval_config.cjk_tokenizer!r}; the cache key should " + f"have forced a fresh snapshot (this indicates a _corpus_cache_key bug)" + ) storage = build_storage( - cfg.storage, root=base, cjk_tokenizer=cfg.retrieval.cjk_tokenizer + cfg.storage, root=base, cjk_tokenizer=retrieval_config.cjk_tokenizer ) await storage.connect() await storage.migrate() @@ -913,19 +964,26 @@ async def _run_queries( # happened to carry in ``Hit.asset_refs``. 
mm_search = await _build_multimodal_search(cfg, storage) # Wire the rerank leg the same way ``_retrieve_inner`` does so eval - # measures rerank-on vs rerank-off at one config (flip - # ``retrieval.rerank_enabled`` between runs). ``None`` when the base - # configured no reranker or disabled it. Closed in the ``finally``. - if cfg.retrieval.rerank_enabled: - reranker = build_reranker(cfg.provider) + # measures rerank-on vs rerank-off at one config. Both halves are read + # from the caller's **live** config, not the baked snapshot, so flipping + # either between runs takes effect even on a cache hit: ``rerank_enabled`` + # (the gate) from ``retrieval_config`` and the rerank provider wiring + # (``rerank_model`` / base_url / key) from ``provider_config``. The + # reranker scores ``(query, chunk)`` at query time and never touches the + # stored index, so it is correctly read live rather than keyed into the + # snapshot — a rerank-model A/B under ``read_write`` is correct. + # ``None`` when the caller's config has no reranker or disabled it. + # Closed in the ``finally``. + if retrieval_config.rerank_enabled: + reranker = build_reranker(provider_config) searcher = HybridSearcher.from_config( storage, embedder, - cfg.retrieval, + retrieval_config, embedding_model=embedding_model, multimodal=mm_search, reranker=reranker, - rerank_model=cfg.provider.rerank_model, + rerank_model=provider_config.rerank_model, # Eval must fail loud on a transient query-embed blip, not silently # degrade to FTS-only — a degraded query would bias the measured # hybrid metric and could false-flag the ``--against`` gate. 
diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py index 66bfc9b..36235b7 100644 --- a/tests/test_eval_runner.py +++ b/tests/test_eval_runner.py @@ -741,19 +741,19 @@ async def test_corpus_cache_key_is_deterministic_and_path_stable( queries=[("foo", ["alpha"])], ) spec = load_dataset(ds) - k1 = _corpus_cache_key(spec, "fake", 64) - k2 = _corpus_cache_key(spec, "fake", 64) + k1 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="jieba") + k2 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="jieba") assert k1 == k2 # Mutate one file's bytes — key must change. (ds / "corpus" / "alpha.md").write_text( "# Alpha\n\nDifferent body now.\n", encoding="utf-8" ) - k3 = _corpus_cache_key(spec, "fake", 64) + k3 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="jieba") assert k3 != k1 # Different model → different key. - k4 = _corpus_cache_key(spec, "other-model", 64) + k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba") assert k4 != k1 # Format sanity. assert k1.startswith("toy/fake__64__") @@ -762,10 +762,20 @@ async def test_corpus_cache_key_is_deterministic_and_path_stable( # Different multimodal fingerprint → different key (text-only and # multimodal eval runs against the same corpus must NOT share a # snapshot — the asset index lives in different vec tables). - k5 = _corpus_cache_key(spec, "fake", 64, mm_fingerprint="qwen3-vl-8b@4096") + k5 = _corpus_cache_key( + spec, "fake", 64, cjk_tokenizer="jieba", mm_fingerprint="qwen3-vl-8b@4096" + ) assert k5 != k1 assert "__mmqwen3-vl-8b@4096__" in k5 + # #250: cjk_tokenizer is the one INGEST-TIME RetrievalConfig field — it's + # baked into the FTS index, so it must be in the key. Changing it must + # produce a distinct key (a fresh snapshot), never a stale reuse. 
+ k6 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="none") + assert k6 != k1 + assert "__cjkjieba__" in k1 + assert "__cjknone__" in k6 + async def test_eval_snapshot_cache_hit_skips_ingest( tmp_path: Path, monkeypatch: pytest.MonkeyPatch @@ -899,6 +909,134 @@ async def boom_ingest(*args: object, **kwargs: object) -> object: partial_dirs = list((cache_root / "toy").glob("fake__*.partial")) assert len(partial_dirs) == 1 + +async def test_eval_cache_search_time_config_read_live_on_hit( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """#250: a SEARCH-TIME retrieval knob (rrf_k) changed under read_write takes + effect on a cache HIT — the searcher is built from the caller's LIVE config, + not the baked snapshot — and the change does NOT re-ingest (fast + correct). + """ + from dikw_core.config import RetrievalConfig + from dikw_core.eval import runner as runner_mod + + ds = _write_dataset(tmp_path / "ds", queries=[("alpha", ["alpha"])]) + spec = load_dataset(ds) + cache_root = tmp_path / "cache" + + ingest_calls = 0 + real_ingest = runner_mod.api.ingest + + async def spy_ingest(*args: object, **kwargs: object) -> object: + nonlocal ingest_calls + ingest_calls += 1 + return await real_ingest(*args, **kwargs) + + monkeypatch.setattr(runner_mod.api, "ingest", spy_ingest) + + # Capture the rrf_k the searcher is actually built with each run. The + # ``from_config`` classmethod, set as a plain attr on the class, is called + # unbound (``HybridSearcher.from_config(storage, embedder, cfg, ...)``), so + # the spy sees the same positional args. 
+ seen_rrf_k: list[int] = [] + real_from_config = runner_mod.HybridSearcher.from_config + + def spy_from_config(storage, embedder, retrieval_cfg, **kwargs): # type: ignore[no-untyped-def] + seen_rrf_k.append(retrieval_cfg.rrf_k) + return real_from_config(storage, embedder, retrieval_cfg, **kwargs) + + monkeypatch.setattr(runner_mod.HybridSearcher, "from_config", spy_from_config) + + await run_eval( + spec, cache_root=cache_root, retrieval_config=RetrievalConfig(rrf_k=60) + ) + await run_eval( + spec, cache_root=cache_root, retrieval_config=RetrievalConfig(rrf_k=11) + ) + + assert ingest_calls == 1, "search-time change must reuse the snapshot (no re-ingest)" + assert seen_rrf_k == [60, 11], ( + "warm run must build the searcher from the LIVE rrf_k (11), not the " + f"baked 60; saw {seen_rrf_k}" + ) + + +async def test_eval_cache_cjk_tokenizer_change_forces_reingest( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """#250: cjk_tokenizer is INGEST-TIME — it's pinned into the cache key, so + changing it produces a fresh snapshot (re-ingest), never a stale reuse of an + index built with the old tokenizer. 
+ """ + from dikw_core.config import RetrievalConfig + from dikw_core.eval import runner as runner_mod + + ds = _write_dataset(tmp_path / "ds", queries=[("alpha", ["alpha"])]) + spec = load_dataset(ds) + cache_root = tmp_path / "cache" + + ingest_calls = 0 + real_ingest = runner_mod.api.ingest + + async def spy_ingest(*args: object, **kwargs: object) -> object: + nonlocal ingest_calls + ingest_calls += 1 + return await real_ingest(*args, **kwargs) + + monkeypatch.setattr(runner_mod.api, "ingest", spy_ingest) + + await run_eval( + spec, cache_root=cache_root, retrieval_config=RetrievalConfig(cjk_tokenizer="jieba") + ) + assert ingest_calls == 1 + await run_eval( + spec, cache_root=cache_root, retrieval_config=RetrievalConfig(cjk_tokenizer="none") + ) + assert ingest_calls == 2, "ingest-time tokenizer change must re-ingest" + # Two distinct snapshots coexist under the dataset's cache subdir. + snapshot_dirs = [ + d for d in (cache_root / "toy").glob("fake__*") if not d.name.endswith(".partial") + ] + assert len(snapshot_dirs) == 2 + + +def test_retrieval_config_fields_are_classified_for_eval_cache() -> None: + """#250 tripwire: the eval snapshot cache hand-splits ``RetrievalConfig`` into + ONE ingest-time field (``cjk_tokenizer`` — baked into the FTS index, so keyed + in ``_corpus_cache_key``) and the rest search-time (read live in + ``_run_queries``, never keyed). That split is encoded by hand, so adding a new + ``RetrievalConfig`` field would silently default it to "search-time, read + live" — which is WRONG and reintroduces the #250 stale-snapshot bug if the new + field actually affects ingest. This test fails on any field-set change, forcing + the contributor to classify the new field: + * ingest-time (changes chunking / the FTS index / stored vectors) → add it + to ``_corpus_cache_key`` (and the ``cjk_tokenizer`` guard in + ``_run_queries``); + * search-time (query-time ranking only) → it is already read live; no key + change needed. 
+ """ + from dikw_core.config import RetrievalConfig + + ingest_time = {"cjk_tokenizer"} + search_time = { + "rrf_k", + "bm25_weight", + "vector_weight", + "fusion", + "same_doc_penalty_alpha", + "graph_enabled", + "graph_seed_top_k", + "graph_weight", + "rerank_enabled", + "rerank_candidate_k", + } + assert set(RetrievalConfig.model_fields) == ingest_time | search_time, ( + "RetrievalConfig fields changed — classify the new/removed field for the " + "eval snapshot cache (src/dikw_core/eval/runner.py): ingest-time → add to " + "_corpus_cache_key (+ the cjk_tokenizer assert in _run_queries); " + "search-time → it is read live, no key change needed. See #250." + ) + # ---- check_thresholds direction-aware --------------------------------------- From a9ca5dcb0f916ff8bf5f550f8c91b19e02542b31 Mon Sep 17 00:00:00 2001 From: holo Date: Sun, 28 Jun 2026 21:04:43 +0800 Subject: [PATCH 2/2] test(eval): cover the cjk_tokenizer mismatch guard in _run_queries MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Exercise the #250 defence-in-depth assert directly (build a jieba snapshot, call _run_queries with a mismatched live tokenizer → EvalError), covering the otherwise-unexecuted raise so codecov/patch reflects the guard's intent. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/test_eval_runner.py | 44 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+) diff --git a/tests/test_eval_runner.py b/tests/test_eval_runner.py index 36235b7..22ca6ab 100644 --- a/tests/test_eval_runner.py +++ b/tests/test_eval_runner.py @@ -1037,6 +1037,50 @@ def test_retrieval_config_fields_are_classified_for_eval_cache() -> None: "search-time → it is read live, no key change needed. See #250." 
) + +async def test_run_queries_guards_against_tokenizer_mismatch( + tmp_path: Path, +) -> None: + """#250 defence-in-depth: if a future ``_corpus_cache_key`` regression ever let + a snapshot built with one ``cjk_tokenizer`` be reused under a different live + tokenizer, ``_run_queries`` fails loud rather than querying an FTS index built + with the wrong tokenizer. The cache key makes this impossible in normal + operation, so it's exercised here by calling ``_run_queries`` directly on a + jieba-built snapshot with a mismatched live ``cjk_tokenizer``. + """ + from dikw_core.config import ProviderConfig, RetrievalConfig + from dikw_core.eval.fake_embedder import FakeEmbeddings + from dikw_core.eval.runner import _run_queries + + ds = _write_dataset(tmp_path / "ds", queries=[("alpha", ["alpha"])]) + spec = load_dataset(ds) + cache_root = tmp_path / "cache" + await run_eval( + spec, cache_root=cache_root, retrieval_config=RetrievalConfig(cjk_tokenizer="jieba") + ) + base = next((cache_root / "toy").glob("fake__*")) / "base" + assert base.is_dir() + + provider = ProviderConfig( + llm_api_key_env="ANTHROPIC_API_KEY", + embedding_model="fake", + embedding_dim=64, + embedding_revision="", + embedding_normalize=True, + embedding_distance="cosine", + embedding_api_key_env="OPENAI_API_KEY", + ) + with pytest.raises(EvalError, match="snapshot cjk_tokenizer"): + await _run_queries( + base, + spec, + embedder=FakeEmbeddings(), + embedding_model="fake", + modes=["hybrid"], + retrieval_config=RetrievalConfig(cjk_tokenizer="none"), + provider_config=provider, + ) + # ---- check_thresholds direction-aware ---------------------------------------
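
Reviewer note (not part of the patch): the two-level split described in the commit message can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the code in `src/dikw_core/eval/runner.py`. `ToyRetrievalConfig`, `ToySnapshotCache`, `snapshot_key` and `run_queries` are hypothetical stand-ins that model only a few of the real fields, and the key format is simplified (no dataset name, multimodal fingerprint, or schema version).

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToyRetrievalConfig:
    """Hypothetical stand-in for RetrievalConfig (subset of fields only)."""

    cjk_tokenizer: str = "jieba"  # ingest-time: baked into the index
    rrf_k: int = 60               # search-time: read live
    fusion: str = "rrf"           # search-time: read live
    rerank_enabled: bool = False  # search-time: read live


# Hand-maintained classification, mirroring the tripwire idea in the patch.
INGEST_TIME = {"cjk_tokenizer"}
SEARCH_TIME = {"rrf_k", "fusion", "rerank_enabled"}


def snapshot_key(corpus: bytes, model: str, dim: int, cfg: ToyRetrievalConfig) -> str:
    """Key the snapshot on ingest-time inputs only (simplified format)."""
    digest = hashlib.sha256(corpus).hexdigest()[:8]
    return f"{model}__{dim}__{digest}__cjk{cfg.cjk_tokenizer}"


class ToySnapshotCache:
    """In-memory stand-in for the on-disk snapshot cache."""

    def __init__(self) -> None:
        self._snapshots: dict[str, dict[str, object]] = {}
        self.builds = 0  # counts simulated (expensive) ingests

    def get_or_build(
        self, corpus: bytes, model: str, dim: int, cfg: ToyRetrievalConfig
    ) -> dict[str, object]:
        key = snapshot_key(corpus, model, dim, cfg)
        if key not in self._snapshots:
            self.builds += 1
            # The snapshot records the config it was built with; the
            # query-time part of that record goes stale on later hits.
            self._snapshots[key] = {"tokenizer": cfg.cjk_tokenizer, "baked_rrf_k": cfg.rrf_k}
        return self._snapshots[key]


def run_queries(snapshot: dict[str, object], live_cfg: ToyRetrievalConfig) -> int:
    """Return the rrf_k a search would use: always the caller's live value."""
    # Defensive guard: the key should already make a mismatch impossible.
    if snapshot["tokenizer"] != live_cfg.cjk_tokenizer:
        raise RuntimeError("snapshot tokenizer does not match the live config")
    return live_cfg.rrf_k


def check_fields_classified() -> None:
    """Tripwire: fail if a field is added without being classified."""
    names = {f.name for f in fields(ToyRetrievalConfig)}
    assert names == INGEST_TIME | SEARCH_TIME, "classify the new field"
```

In this sketch, two runs that differ only in `rrf_k` share one snapshot and each sees its own live value, while a `cjk_tokenizer` change produces a different key and therefore a second build. The guard in `run_queries` is only reachable by handing it a snapshot built under a different tokenizer, which is how the second commit's test exercises the real guard.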