fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250) by helebest · Pull Request #252 · OpenDIKW/dikw-core

helebest · 2026-06-28T12:53:32Z

Closes #250.

Problem

The eval snapshot cache key (_corpus_cache_key) was built from corpus + embedding model + dim only — it omitted RetrievalConfig. So under the default --cache read_write, changing any retrieval setting (rrf_k, weights, fusion, cjk_tokenizer, rerank_enabled, graph_*) and re-running produced a silent cache hit that reused the previous run's config — no error, wrong numbers. This bit the retrieval-tuning / ablation workflow exactly where config changes most (and was already self-acknowledged in-code: "until this is fixed").

Fix — two-level cache

The snapshot is keyed on ingest-time inputs only (corpus, embedding model+dim, the one ingest-time retrieval field cjk_tokenizer — baked into the FTS index + chunker token budget, multimodal identity). Everything applied at query time is read from the caller's live config in _run_queries, not the baked snapshot:

search-time RetrievalConfig knobs (rrf_k, bm25_weight, vector_weight, fusion, same_doc_penalty_alpha, graph_*, rerank_enabled, rerank_candidate_k) — via a new retrieval_config param;
rerank provider wiring (rerank_model / base_url) — via a new provider_config param. The reranker scores (query, chunk) at query time and never touches the stored index, so it is read live, not keyed (a rerank-model A/B is now correct on a cache hit too).

Result: a retrieval ablation under the default read_write is now both fast (no re-embed) and correct. Only an ingest-time change (embedding model/dim, or cjk_tokenizer) forces a fresh snapshot — and the key handles that automatically.

Guards: a defensive assert verifies the (key-guaranteed) baked==live cjk_tokenizer; a tripwire test fails if a new RetrievalConfig field is added unclassified, so the hand-split can't silently rot.

Scope: eval-infra only — domains/info/search.py and the retrieval algorithm are untouched, hence no-baseline-needed.

Tests

test_eval_cache_search_time_config_read_live_on_hit — a search-time knob change takes effect on a cache hit (searcher built from live rrf_k) and does not re-ingest. Goes red on pre-fix code (which threaded cfg.retrieval).
test_eval_cache_cjk_tokenizer_change_forces_reingest — the ingest-time tokenizer change produces a distinct key → fresh snapshot.
test_retrieval_config_fields_are_classified_for_eval_cache — tripwire against unclassified new fields.

Delivery receipt

step 3 — verify: tools/check.py green (ruff + mypy + 2403 passed / 138 skipped, Postgres skips w/o DSN). Targeted cache+tripwire tests: 9 passed. Hermetic mvp flow sanity (cache_mode=off): default vs fusion=combsum → mrr 1.0 vs 0.833, ndcg@10 1.0 vs 0.877 (live config honored); default numbers unchanged.

step 5 — review: fresh-agent invariant/design review → PASS (ingest/search split correct & complete; cache-key sound; defensive assert harmless; scope confined to eval-infra). Workflow-backed /code-review (xhigh, 28 agents) → 5 root-cause findings, all resolved:

(dominant) rerank provider wiring was read from the baked snapshot and not keyed → fixed by reading it live (provider_config);
baked search-time block is stale on cache-hit arms → docstring now warns;
hand-encoded split fragile → tripwire test added;
BASELINES off-mode citation non-discriminating → reworded;
test spy boilerplate duplication → rejected (matches established local idiom; out-of-scope refactor).

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Retrieval eval runs now reuse snapshots more safely while still applying the latest search-time settings.
- Changes to ingest-time tokenizer settings now correctly trigger a fresh snapshot, avoiding stale results.
- Reranking and fusion options now update on rerun even when cached data is reused.
Documentation
- Updated eval guidance to explain which retrieval settings are safe to change without rebuilding the corpus snapshot.

…250) The eval snapshot cache key omitted RetrievalConfig, so under the default `--cache read_write`, changing a retrieval setting silently reused the previous run's baked config (no error, wrong numbers) — biting the retrieval-tuning / ablation workflow exactly where config changes most. Two-level cache: the snapshot is keyed on ingest-time inputs only (corpus, embedding model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal identity). Everything applied at query time is read from the caller's live config in `_run_queries`, not the baked snapshot: - search-time RetrievalConfig knobs (rrf_k, weights, fusion, graph_*, rerank_enabled, rerank_candidate_k) via a new `retrieval_config` param; - the rerank provider wiring (rerank_model / base_url) via a new `provider_config` param — the reranker scores (query, chunk) at query time and never touches the stored index, so it is read live, not keyed. A defensive assert guards the (key-guaranteed) baked==live cjk_tokenizer match; a tripwire test fails if a new RetrievalConfig field is added unclassified, so the hand-split can't silently rot. Eval-infra only — domains/info/search.py and the retrieval algorithm are untouched (no-baseline-needed). Docs (eval-plan.md, README.md, BASELINES.md) updated: retrieval ablations are now both correct and fast under read_write. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-28T12:53:45Z

Warning

Review limit reached

@helebest, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 48 minutes and 46 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 4c81f423-9e36-4c9a-b6ea-3c71d5b93119

📥 Commits

Reviewing files that changed from the base of the PR and between a09598a and a9ca5dc.

📒 Files selected for processing (1)

tests/test_eval_runner.py

📝 Walkthrough

Walkthrough

Fixes the silent stale-results bug (#250) in the eval runner by splitting RetrievalConfig into ingest-time fields (only cjk_tokenizer, baked into the snapshot cache key) and query-time fields (rrf_k, fusion, rerank, etc., read live in _run_queries). The cache key gains __cjk{...}, _run_queries accepts and uses live retrieval_config/provider_config, and tests plus docs lock in the field classification.

Changes

Two-level eval snapshot cache fix

Layer / File(s)	Summary
`_corpus_cache_key` gains `cjk_tokenizer`; `_materialise_base` doc clarified `src/dikw_core/eval/runner.py`	`_corpus_cache_key` adds `cjk_tokenizer` as an explicit ingest-time parameter, extends the rendered key with `__cjk{...}`, and revises comments to document the ingest-time vs query-time split. `_materialise_base` docstring reflects that only ingest-time identity is baked into the snapshot.
`run_eval` passes live configs to `_run_queries` at all call sites `src/dikw_core/eval/runner.py`	Both `_run_queries` call sites (cache-off and cache-hit paths) are updated to forward `retrieval_config` and `provider_config`. The cache-key computation passes `cjk_tokenizer` from the effective retrieval config.
`_run_queries`: live `retrieval_config`/`provider_config` for searcher and reranker `src/dikw_core/eval/runner.py`	`_run_queries` signature expands to accept `retrieval_config` and `provider_config`. It asserts the snapshot's baked `cjk_tokenizer` matches the requested one, and instantiates `HybridSearcher` and the reranker from the live configs rather than the stale baked snapshot.
Tests: cache-key differentiation, live-config read, cjk re-ingest, field partition `tests/test_eval_runner.py`	Cache-key tests gain `cjk_tokenizer` argument and key-differentiation assertions. Three new tests verify: search-time `rrf_k` is read live on cache hits without re-ingest; `cjk_tokenizer` change forces re-ingest; and `RetrievalConfig` fields are exactly partitioned into ingest-time and search-time sets as a guardrail.
Docs updated: BASELINES, README, eval-plan `evals/BASELINES.md`, `evals/README.md`, `docs/eval-plan.md`	BASELINES records the infra fix with ingest-time/query-time semantics. README expands RRF tuning guidance. eval-plan revises the acceptance-gate procedure to allow shared `cache_root` for retrieval ablations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 The cache was a trap, a silent deceit,
Old rerank configs would silently repeat.
Now cjk bakes in, query knobs read fresh,
Two levels of caching — correctness refreshed!
No more stale numbers to lead you astray,
The rabbit fixed footguns — hip hip hooray! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	Specific and aligned with the main eval-cache fix: cjk_tokenizer keying plus live query-time retrieval config.
Description check	✅ Passed	It covers the problem, fix, tests, scope, and notes, so the PR intent is clear even though it doesn't use the repo template headings.
Linked Issues check	✅ Passed	The changes match `#250` by keying ingest-time inputs, reading search-time retrieval knobs live, and invalidating on cjk_tokenizer changes.
Out of Scope Changes check	✅ Passed	Changes stay within eval infra plus supporting docs and tests; no unrelated behavior or public API expansion is introduced.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch worktree-fix+eval-cache-retrieval-config

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

codecov · 2026-06-28T12:55:12Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/test_eval_runner.py (1)

752-778: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Compare key variants against a same-corpus baseline.

After Line 749 mutates the corpus, k4, k5, and k6 differ from k1 even if model/mm/tokenizer were ignored. Reuse k3 as the post-mutation baseline so each assertion isolates the intended key dimension.

Suggested adjustment

-    k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba")
-    assert k4 != k1
+    k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba")
+    assert k4 != k3
@@
     k5 = _corpus_cache_key(
         spec, "fake", 64, cjk_tokenizer="jieba", mm_fingerprint="qwen3-vl-8b@4096"
     )
-    assert k5 != k1
+    assert k5 != k3
@@
     k6 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="none")
-    assert k6 != k1
-    assert "__cjkjieba__" in k1
+    assert k6 != k3
+    assert "__cjkjieba__" in k3

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval_runner.py` around lines 752 - 778, The cache-key assertions
in the _corpus_cache_key test are using k1 as the baseline after the corpus has
already been mutated, so later checks mix corpus changes with model, multimodal,
and tokenizer differences. Update the test to use k3 as the post-mutation
same-corpus baseline and compare k4, k5, and k6 against it, keeping the existing
checks in test_eval_runner.py focused on the intended key dimensions.

🧹 Nitpick comments (1)

tests/test_eval_runner.py (1)
950-961: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add a cache-hit check for live rerank provider wiring.

This covers live rrf_k, but the PR also moves rerank provider settings to live provider_config. Capture rerank_model from the HybridSearcher.from_config(..., rerank_model=...) kwargs across two cached runs so stale baked provider config can’t regress unnoticed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval_runner.py` around lines 950 - 961, Add a cache-hit assertion
for live rerank provider wiring in the eval runner test: besides validating live
rrf_k, also capture the rerank_model passed into HybridSearcher.from_config via
its rerank_model keyword across two cached runs. Update the test around run_eval
and the HybridSearcher.from_config path so the second warm run proves provider
settings come from live provider_config, not a baked snapshot, and assert the
captured rerank_model stays aligned with the expected live config on both runs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dikw_core/eval/runner.py`:
- Around line 158-165: The cache key logic in runner.py is missing parts of the
embedding version identity, so snapshots can be reused across incompatible
provider settings. Update the key construction in the cache-key helper used by
the eval runner to include all fields that ProviderConfig treats as identity,
especially embedding_revision, embedding_normalize, and embedding_distance,
alongside embedding_model and embedding_dim. Make the same change wherever the
key format is documented or assembled so the cache snapshot cannot mix stale
indexed vectors with live query vectors.

---

Outside diff comments:
In `@tests/test_eval_runner.py`:
- Around line 752-778: The cache-key assertions in the _corpus_cache_key test
are using k1 as the baseline after the corpus has already been mutated, so later
checks mix corpus changes with model, multimodal, and tokenizer differences.
Update the test to use k3 as the post-mutation same-corpus baseline and compare
k4, k5, and k6 against it, keeping the existing checks in test_eval_runner.py
focused on the intended key dimensions.

---

Nitpick comments:
In `@tests/test_eval_runner.py`:
- Around line 950-961: Add a cache-hit assertion for live rerank provider wiring
in the eval runner test: besides validating live rrf_k, also capture the
rerank_model passed into HybridSearcher.from_config via its rerank_model keyword
across two cached runs. Update the test around run_eval and the
HybridSearcher.from_config path so the second warm run proves provider settings
come from live provider_config, not a baked snapshot, and assert the captured
rerank_model stays aligned with the expected live config on both runs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a96efed0-47be-466e-8de4-9f4cba517aab

📥 Commits

Reviewing files that changed from the base of the PR and between 91b530a and a09598a.

📒 Files selected for processing (5)

docs/eval-plan.md
evals/BASELINES.md
evals/README.md
src/dikw_core/eval/runner.py
tests/test_eval_runner.py

coderabbitai · 2026-06-28T12:58:52Z

+    """Stable cache key combining dataset name, model, dim, corpus hash,
+    tokenizer, schema.

    Algorithm:
      1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir
      2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable;
         ``cache_mode="rebuild"`` is the escape hatch)
-      3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}``
+      3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}``


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Key snapshots by the full embedding version identity.

The cache key still only includes embedding_model and embedding_dim, but ProviderConfig treats embedding_revision, embedding_normalize, and embedding_distance as part of the stored vector version identity. Under read_write, changing one of those fields can reuse an old snapshot while querying with the live embedder/config, mixing fresh query vectors with stale indexed vectors.

Suggested direction

def _corpus_cache_key( spec: DatasetSpec, model: str, dim: int | None, *, + embedding_revision: str, + embedding_normalize: bool, + embedding_distance: str, cjk_tokenizer: str, mm_fingerprint: str | None = None, ) -> str:

key = _corpus_cache_key( spec, effective_provider_cfg.embedding_model, effective_provider_cfg.embedding_dim, + embedding_revision=effective_provider_cfg.embedding_revision, + embedding_normalize=effective_provider_cfg.embedding_normalize, + embedding_distance=effective_provider_cfg.embedding_distance, cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer, mm_fingerprint=mm_fingerprint, )

Also applies to: 203-204, 518-523

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/dikw_core/eval/runner.py` around lines 158 - 165, The cache key logic in runner.py is missing parts of the embedding version identity, so snapshots can be reused across incompatible provider settings. Update the key construction in the cache-key helper used by the eval runner to include all fields that ProviderConfig treats as identity, especially embedding_revision, embedding_normalize, and embedding_distance, alongside embedding_model and embedding_dim. Make the same change wherever the key format is documented or assembled so the cache snapshot cannot mix stale indexed vectors with live query vectors.

Exercise the #250 defence-in-depth assert directly (build a jieba snapshot, call _run_queries with a mismatched live tokenizer → EvalError), covering the otherwise-unexecuted raise so codecov/patch reflects the guard's intent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

helebest added the no-baseline-needed Bypass Eval gate: change is genuinely non-functional w.r.t. retrieval/synth metrics label Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

helebest merged commit c4e6bb4 into main Jun 28, 2026
10 checks passed

helebest deleted the worktree-fix+eval-cache-retrieval-config branch June 28, 2026 13:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250)#252

fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250)#252
helebest merged 2 commits into
mainfrom
worktree-fix+eval-cache-retrieval-config

helebest commented Jun 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

helebest commented Jun 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix — two-level cache

Tests

Delivery receipt

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

helebest commented Jun 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

codecov Bot commented Jun 28, 2026 •

edited

Loading