Skip to content

fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250)#252

Merged
helebest merged 2 commits into
mainfrom
worktree-fix+eval-cache-retrieval-config
Jun 28, 2026
Merged

fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250)#252
helebest merged 2 commits into
mainfrom
worktree-fix+eval-cache-retrieval-config

Conversation

@helebest

@helebest helebest commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

Closes #250.

Problem

The eval snapshot cache key (_corpus_cache_key) was built from corpus + embedding model + dim only — it omitted RetrievalConfig. So under the default --cache read_write, changing any retrieval setting (rrf_k, weights, fusion, cjk_tokenizer, rerank_enabled, graph_*) and re-running produced a silent cache hit that reused the previous run's config — no error, wrong numbers. This bit the retrieval-tuning / ablation workflow exactly where config changes most (and was already self-acknowledged in-code: "until this is fixed").

Fix — two-level cache

The snapshot is keyed on ingest-time inputs only (corpus, embedding model+dim, the one ingest-time retrieval field cjk_tokenizer — baked into the FTS index + chunker token budget, multimodal identity). Everything applied at query time is read from the caller's live config in _run_queries, not the baked snapshot:

  • search-time RetrievalConfig knobs (rrf_k, bm25_weight, vector_weight, fusion, same_doc_penalty_alpha, graph_*, rerank_enabled, rerank_candidate_k) — via a new retrieval_config param;
  • rerank provider wiring (rerank_model / base_url) — via a new provider_config param. The reranker scores (query, chunk) at query time and never touches the stored index, so it is read live, not keyed (a rerank-model A/B is now correct on a cache hit too).

Result: a retrieval ablation under the default read_write is now both fast (no re-embed) and correct. Only an ingest-time change (embedding model/dim, or cjk_tokenizer) forces a fresh snapshot — and the key handles that automatically.

Guards: a defensive assert verifies the (key-guaranteed) baked==live cjk_tokenizer; a tripwire test fails if a new RetrievalConfig field is added unclassified, so the hand-split can't silently rot.

Scope: eval-infra only — domains/info/search.py and the retrieval algorithm are untouched, hence no-baseline-needed.

Tests

  • test_eval_cache_search_time_config_read_live_on_hit — a search-time knob change takes effect on a cache hit (searcher built from live rrf_k) and does not re-ingest. Goes red on pre-fix code (which threaded cfg.retrieval).
  • test_eval_cache_cjk_tokenizer_change_forces_reingest — the ingest-time tokenizer change produces a distinct key → fresh snapshot.
  • test_retrieval_config_fields_are_classified_for_eval_cache — tripwire against unclassified new fields.

Delivery receipt

step 3 — verify: tools/check.py green (ruff + mypy + 2403 passed / 138 skipped, Postgres skips w/o DSN). Targeted cache+tripwire tests: 9 passed. Hermetic mvp flow sanity (cache_mode=off): default vs fusion=combsummrr 1.0 vs 0.833, ndcg@10 1.0 vs 0.877 (live config honored); default numbers unchanged.

step 5 — review: fresh-agent invariant/design review → PASS (ingest/search split correct & complete; cache-key sound; defensive assert harmless; scope confined to eval-infra). Workflow-backed /code-review (xhigh, 28 agents) → 5 root-cause findings, all resolved:

  • (dominant) rerank provider wiring was read from the baked snapshot and not keyed → fixed by reading it live (provider_config);
  • baked search-time block is stale on cache-hit arms → docstring now warns;
  • hand-encoded split fragile → tripwire test added;
  • BASELINES off-mode citation non-discriminating → reworded;
  • test spy boilerplate duplication → rejected (matches established local idiom; out-of-scope refactor).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Retrieval eval runs now reuse snapshots more safely while still applying the latest search-time settings.
    • Changes to ingest-time tokenizer settings now correctly trigger a fresh snapshot, avoiding stale results.
    • Reranking and fusion options now update on rerun even when cached data is reused.
  • Documentation

    • Updated eval guidance to explain which retrieval settings are safe to change without rebuilding the corpus snapshot.

…250)

The eval snapshot cache key omitted RetrievalConfig, so under the default
`--cache read_write`, changing a retrieval setting silently reused the
previous run's baked config (no error, wrong numbers) — biting the
retrieval-tuning / ablation workflow exactly where config changes most.

Two-level cache: the snapshot is keyed on ingest-time inputs only (corpus,
embedding model+dim, the one ingest-time retrieval field `cjk_tokenizer`,
multimodal identity). Everything applied at query time is read from the
caller's live config in `_run_queries`, not the baked snapshot:
- search-time RetrievalConfig knobs (rrf_k, weights, fusion, graph_*,
  rerank_enabled, rerank_candidate_k) via a new `retrieval_config` param;
- the rerank provider wiring (rerank_model / base_url) via a new
  `provider_config` param — the reranker scores (query, chunk) at query
  time and never touches the stored index, so it is read live, not keyed.

A defensive assert guards the (key-guaranteed) baked==live cjk_tokenizer
match; a tripwire test fails if a new RetrievalConfig field is added
unclassified, so the hand-split can't silently rot.

Eval-infra only — domains/info/search.py and the retrieval algorithm are
untouched (no-baseline-needed). Docs (eval-plan.md, README.md, BASELINES.md)
updated: retrieval ablations are now both correct and fast under read_write.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@helebest helebest added the no-baseline-needed Bypass Eval gate: change is genuinely non-functional w.r.t. retrieval/synth metrics label Jun 28, 2026
@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@helebest, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 48 minutes and 46 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 4c81f423-9e36-4c9a-b6ea-3c71d5b93119

📥 Commits

Reviewing files that changed from the base of the PR and between a09598a and a9ca5dc.

📒 Files selected for processing (1)
  • tests/test_eval_runner.py
📝 Walkthrough

Walkthrough

Fixes the silent stale-results bug (#250) in the eval runner by splitting RetrievalConfig into ingest-time fields (only cjk_tokenizer, baked into the snapshot cache key) and query-time fields (rrf_k, fusion, rerank, etc., read live in _run_queries). The cache key gains __cjk{...}, _run_queries accepts and uses live retrieval_config/provider_config, and tests plus docs lock in the field classification.

Changes

Two-level eval snapshot cache fix

Layer / File(s) Summary
_corpus_cache_key gains cjk_tokenizer; _materialise_base doc clarified
src/dikw_core/eval/runner.py
_corpus_cache_key adds cjk_tokenizer as an explicit ingest-time parameter, extends the rendered key with __cjk{...}, and revises comments to document the ingest-time vs query-time split. _materialise_base docstring reflects that only ingest-time identity is baked into the snapshot.
run_eval passes live configs to _run_queries at all call sites
src/dikw_core/eval/runner.py
Both _run_queries call sites (cache-off and cache-hit paths) are updated to forward retrieval_config and provider_config. The cache-key computation passes cjk_tokenizer from the effective retrieval config.
_run_queries: live retrieval_config/provider_config for searcher and reranker
src/dikw_core/eval/runner.py
_run_queries signature expands to accept retrieval_config and provider_config. It asserts the snapshot's baked cjk_tokenizer matches the requested one, and instantiates HybridSearcher and the reranker from the live configs rather than the stale baked snapshot.
Tests: cache-key differentiation, live-config read, cjk re-ingest, field partition
tests/test_eval_runner.py
Cache-key tests gain cjk_tokenizer argument and key-differentiation assertions. Three new tests verify: search-time rrf_k is read live on cache hits without re-ingest; cjk_tokenizer change forces re-ingest; and RetrievalConfig fields are exactly partitioned into ingest-time and search-time sets as a guardrail.
Docs updated: BASELINES, README, eval-plan
evals/BASELINES.md, evals/README.md, docs/eval-plan.md
BASELINES records the infra fix with ingest-time/query-time semantics. README expands RRF tuning guidance. eval-plan revises the acceptance-gate procedure to allow shared cache_root for retrieval ablations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 The cache was a trap, a silent deceit,
Old rerank configs would silently repeat.
Now cjk bakes in, query knobs read fresh,
Two levels of caching — correctness refreshed!
No more stale numbers to lead you astray,
The rabbit fixed footguns — hip hip hooray! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Specific and aligned with the main eval-cache fix: cjk_tokenizer keying plus live query-time retrieval config.
Description check ✅ Passed It covers the problem, fix, tests, scope, and notes, so the PR intent is clear even though it doesn't use the repo template headings.
Linked Issues check ✅ Passed The changes match #250 by keying ingest-time inputs, reading search-time retrieval knobs live, and invalidating on cjk_tokenizer changes.
Out of Scope Changes check ✅ Passed Changes stay within eval infra plus supporting docs and tests; no unrelated behavior or public API expansion is introduced.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch worktree-fix+eval-cache-retrieval-config

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codecov

codecov Bot commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_eval_runner.py (1)

752-778: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Compare key variants against a same-corpus baseline.

After Line 749 mutates the corpus, k4, k5, and k6 differ from k1 even if model/mm/tokenizer were ignored. Reuse k3 as the post-mutation baseline so each assertion isolates the intended key dimension.

Suggested adjustment
-    k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba")
-    assert k4 != k1
+    k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba")
+    assert k4 != k3
@@
     k5 = _corpus_cache_key(
         spec, "fake", 64, cjk_tokenizer="jieba", mm_fingerprint="qwen3-vl-8b@4096"
     )
-    assert k5 != k1
+    assert k5 != k3
@@
     k6 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="none")
-    assert k6 != k1
-    assert "__cjkjieba__" in k1
+    assert k6 != k3
+    assert "__cjkjieba__" in k3
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval_runner.py` around lines 752 - 778, The cache-key assertions
in the _corpus_cache_key test are using k1 as the baseline after the corpus has
already been mutated, so later checks mix corpus changes with model, multimodal,
and tokenizer differences. Update the test to use k3 as the post-mutation
same-corpus baseline and compare k4, k5, and k6 against it, keeping the existing
checks in test_eval_runner.py focused on the intended key dimensions.
🧹 Nitpick comments (1)
tests/test_eval_runner.py (1)

950-961: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Add a cache-hit check for live rerank provider wiring.

This covers live rrf_k, but the PR also moves rerank provider settings to live provider_config. Capture rerank_model from the HybridSearcher.from_config(..., rerank_model=...) kwargs across two cached runs so stale baked provider config can’t regress unnoticed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_eval_runner.py` around lines 950 - 961, Add a cache-hit assertion
for live rerank provider wiring in the eval runner test: besides validating live
rrf_k, also capture the rerank_model passed into HybridSearcher.from_config via
its rerank_model keyword across two cached runs. Update the test around run_eval
and the HybridSearcher.from_config path so the second warm run proves provider
settings come from live provider_config, not a baked snapshot, and assert the
captured rerank_model stays aligned with the expected live config on both runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/dikw_core/eval/runner.py`:
- Around line 158-165: The cache key logic in runner.py is missing parts of the
embedding version identity, so snapshots can be reused across incompatible
provider settings. Update the key construction in the cache-key helper used by
the eval runner to include all fields that ProviderConfig treats as identity,
especially embedding_revision, embedding_normalize, and embedding_distance,
alongside embedding_model and embedding_dim. Make the same change wherever the
key format is documented or assembled so the cache snapshot cannot mix stale
indexed vectors with live query vectors.

---

Outside diff comments:
In `@tests/test_eval_runner.py`:
- Around line 752-778: The cache-key assertions in the _corpus_cache_key test
are using k1 as the baseline after the corpus has already been mutated, so later
checks mix corpus changes with model, multimodal, and tokenizer differences.
Update the test to use k3 as the post-mutation same-corpus baseline and compare
k4, k5, and k6 against it, keeping the existing checks in test_eval_runner.py
focused on the intended key dimensions.

---

Nitpick comments:
In `@tests/test_eval_runner.py`:
- Around line 950-961: Add a cache-hit assertion for live rerank provider wiring
in the eval runner test: besides validating live rrf_k, also capture the
rerank_model passed into HybridSearcher.from_config via its rerank_model keyword
across two cached runs. Update the test around run_eval and the
HybridSearcher.from_config path so the second warm run proves provider settings
come from live provider_config, not a baked snapshot, and assert the captured
rerank_model stays aligned with the expected live config on both runs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a96efed0-47be-466e-8de4-9f4cba517aab

📥 Commits

Reviewing files that changed from the base of the PR and between 91b530a and a09598a.

📒 Files selected for processing (5)
  • docs/eval-plan.md
  • evals/BASELINES.md
  • evals/README.md
  • src/dikw_core/eval/runner.py
  • tests/test_eval_runner.py

Comment on lines +158 to +165
"""Stable cache key combining dataset name, model, dim, corpus hash,
tokenizer, schema.

Algorithm:
1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir
2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable;
``cache_mode="rebuild"`` is the escape hatch)
3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}``
3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}``

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Key snapshots by the full embedding version identity.

The cache key still only includes embedding_model and embedding_dim, but ProviderConfig treats embedding_revision, embedding_normalize, and embedding_distance as part of the stored vector version identity. Under read_write, changing one of those fields can reuse an old snapshot while querying with the live embedder/config, mixing fresh query vectors with stale indexed vectors.

Suggested direction
 def _corpus_cache_key(
     spec: DatasetSpec,
     model: str,
     dim: int | None,
     *,
+    embedding_revision: str,
+    embedding_normalize: bool,
+    embedding_distance: str,
     cjk_tokenizer: str,
     mm_fingerprint: str | None = None,
 ) -> str:
 key = _corpus_cache_key(
     spec,
     effective_provider_cfg.embedding_model,
     effective_provider_cfg.embedding_dim,
+    embedding_revision=effective_provider_cfg.embedding_revision,
+    embedding_normalize=effective_provider_cfg.embedding_normalize,
+    embedding_distance=effective_provider_cfg.embedding_distance,
     cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer,
     mm_fingerprint=mm_fingerprint,
 )

Also applies to: 203-204, 518-523

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/dikw_core/eval/runner.py` around lines 158 - 165, The cache key logic in
runner.py is missing parts of the embedding version identity, so snapshots can
be reused across incompatible provider settings. Update the key construction in
the cache-key helper used by the eval runner to include all fields that
ProviderConfig treats as identity, especially embedding_revision,
embedding_normalize, and embedding_distance, alongside embedding_model and
embedding_dim. Make the same change wherever the key format is documented or
assembled so the cache snapshot cannot mix stale indexed vectors with live query
vectors.

Exercise the #250 defence-in-depth assert directly (build a jieba snapshot,
call _run_queries with a mismatched live tokenizer → EvalError), covering
the otherwise-unexecuted raise so codecov/patch reflects the guard's intent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@helebest helebest merged commit c4e6bb4 into main Jun 28, 2026
10 checks passed
@helebest helebest deleted the worktree-fix+eval-cache-retrieval-config branch June 28, 2026 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-baseline-needed Bypass Eval gate: change is genuinely non-functional w.r.t. retrieval/synth metrics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Eval snapshot cache key omits RetrievalConfig — changing retrieval config under --cache read_write silently reuses stale results

1 participant