fix(eval): key cjk_tokenizer + read query-time retrieval config live (#250)#252
Conversation
…250) The eval snapshot cache key omitted RetrievalConfig, so under the default `--cache read_write`, changing a retrieval setting silently reused the previous run's baked config (no error, wrong numbers) — biting the retrieval-tuning / ablation workflow exactly where config changes most. Two-level cache: the snapshot is keyed on ingest-time inputs only (corpus, embedding model+dim, the one ingest-time retrieval field `cjk_tokenizer`, multimodal identity). Everything applied at query time is read from the caller's live config in `_run_queries`, not the baked snapshot: - search-time RetrievalConfig knobs (rrf_k, weights, fusion, graph_*, rerank_enabled, rerank_candidate_k) via a new `retrieval_config` param; - the rerank provider wiring (rerank_model / base_url) via a new `provider_config` param — the reranker scores (query, chunk) at query time and never touches the stored index, so it is read live, not keyed. A defensive assert guards the (key-guaranteed) baked==live cjk_tokenizer match; a tripwire test fails if a new RetrievalConfig field is added unclassified, so the hand-split can't silently rot. Eval-infra only — domains/info/search.py and the retrieval algorithm are untouched (no-baseline-needed). Docs (eval-plan.md, README.md, BASELINES.md) updated: retrieval ablations are now both correct and fast under read_write. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 48 minutes and 46 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughFixes the silent stale-results bug ( ChangesTwo-level eval snapshot cache fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_eval_runner.py (1)
752-778: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winCompare key variants against a same-corpus baseline.
After Line 749 mutates the corpus,
k4,k5, andk6differ fromk1even if model/mm/tokenizer were ignored. Reusek3as the post-mutation baseline so each assertion isolates the intended key dimension.Suggested adjustment
- k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba") - assert k4 != k1 + k4 = _corpus_cache_key(spec, "other-model", 64, cjk_tokenizer="jieba") + assert k4 != k3 @@ k5 = _corpus_cache_key( spec, "fake", 64, cjk_tokenizer="jieba", mm_fingerprint="qwen3-vl-8b@4096" ) - assert k5 != k1 + assert k5 != k3 @@ k6 = _corpus_cache_key(spec, "fake", 64, cjk_tokenizer="none") - assert k6 != k1 - assert "__cjkjieba__" in k1 + assert k6 != k3 + assert "__cjkjieba__" in k3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_eval_runner.py` around lines 752 - 778, The cache-key assertions in the _corpus_cache_key test are using k1 as the baseline after the corpus has already been mutated, so later checks mix corpus changes with model, multimodal, and tokenizer differences. Update the test to use k3 as the post-mutation same-corpus baseline and compare k4, k5, and k6 against it, keeping the existing checks in test_eval_runner.py focused on the intended key dimensions.
🧹 Nitpick comments (1)
tests/test_eval_runner.py (1)
950-961: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winAdd a cache-hit check for live rerank provider wiring.
This covers live
rrf_k, but the PR also moves rerank provider settings to liveprovider_config. Capturererank_modelfrom theHybridSearcher.from_config(..., rerank_model=...)kwargs across two cached runs so stale baked provider config can’t regress unnoticed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_eval_runner.py` around lines 950 - 961, Add a cache-hit assertion for live rerank provider wiring in the eval runner test: besides validating live rrf_k, also capture the rerank_model passed into HybridSearcher.from_config via its rerank_model keyword across two cached runs. Update the test around run_eval and the HybridSearcher.from_config path so the second warm run proves provider settings come from live provider_config, not a baked snapshot, and assert the captured rerank_model stays aligned with the expected live config on both runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/dikw_core/eval/runner.py`:
- Around line 158-165: The cache key logic in runner.py is missing parts of the
embedding version identity, so snapshots can be reused across incompatible
provider settings. Update the key construction in the cache-key helper used by
the eval runner to include all fields that ProviderConfig treats as identity,
especially embedding_revision, embedding_normalize, and embedding_distance,
alongside embedding_model and embedding_dim. Make the same change wherever the
key format is documented or assembled so the cache snapshot cannot mix stale
indexed vectors with live query vectors.
---
Outside diff comments:
In `@tests/test_eval_runner.py`:
- Around line 752-778: The cache-key assertions in the _corpus_cache_key test
are using k1 as the baseline after the corpus has already been mutated, so later
checks mix corpus changes with model, multimodal, and tokenizer differences.
Update the test to use k3 as the post-mutation same-corpus baseline and compare
k4, k5, and k6 against it, keeping the existing checks in test_eval_runner.py
focused on the intended key dimensions.
---
Nitpick comments:
In `@tests/test_eval_runner.py`:
- Around line 950-961: Add a cache-hit assertion for live rerank provider wiring
in the eval runner test: besides validating live rrf_k, also capture the
rerank_model passed into HybridSearcher.from_config via its rerank_model keyword
across two cached runs. Update the test around run_eval and the
HybridSearcher.from_config path so the second warm run proves provider settings
come from live provider_config, not a baked snapshot, and assert the captured
rerank_model stays aligned with the expected live config on both runs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: a96efed0-47be-466e-8de4-9f4cba517aab
📒 Files selected for processing (5)
docs/eval-plan.mdevals/BASELINES.mdevals/README.mdsrc/dikw_core/eval/runner.pytests/test_eval_runner.py
| """Stable cache key combining dataset name, model, dim, corpus hash, | ||
| tokenizer, schema. | ||
|
|
||
| Algorithm: | ||
| 1. sha256 over sorted (rel_posix_path, file_bytes) pairs in corpus_dir | ||
| 2. take first 8 hex chars (collision ~1/4B per dataset+model — acceptable; | ||
| ``cache_mode="rebuild"`` is the escape hatch) | ||
| 3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__sf{N}`` | ||
| 3. format: ``{dataset}/{model}__{dim}__{digest}__mm{mm}__cjk{tok}__sf{N}`` |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Key snapshots by the full embedding version identity.
The cache key still only includes embedding_model and embedding_dim, but ProviderConfig treats embedding_revision, embedding_normalize, and embedding_distance as part of the stored vector version identity. Under read_write, changing one of those fields can reuse an old snapshot while querying with the live embedder/config, mixing fresh query vectors with stale indexed vectors.
Suggested direction
def _corpus_cache_key(
spec: DatasetSpec,
model: str,
dim: int | None,
*,
+ embedding_revision: str,
+ embedding_normalize: bool,
+ embedding_distance: str,
cjk_tokenizer: str,
mm_fingerprint: str | None = None,
) -> str: key = _corpus_cache_key(
spec,
effective_provider_cfg.embedding_model,
effective_provider_cfg.embedding_dim,
+ embedding_revision=effective_provider_cfg.embedding_revision,
+ embedding_normalize=effective_provider_cfg.embedding_normalize,
+ embedding_distance=effective_provider_cfg.embedding_distance,
cjk_tokenizer=effective_retrieval_cfg.cjk_tokenizer,
mm_fingerprint=mm_fingerprint,
)Also applies to: 203-204, 518-523
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/dikw_core/eval/runner.py` around lines 158 - 165, The cache key logic in
runner.py is missing parts of the embedding version identity, so snapshots can
be reused across incompatible provider settings. Update the key construction in
the cache-key helper used by the eval runner to include all fields that
ProviderConfig treats as identity, especially embedding_revision,
embedding_normalize, and embedding_distance, alongside embedding_model and
embedding_dim. Make the same change wherever the key format is documented or
assembled so the cache snapshot cannot mix stale indexed vectors with live query
vectors.
Exercise the #250 defence-in-depth assert directly (build a jieba snapshot, call _run_queries with a mismatched live tokenizer → EvalError), covering the otherwise-unexecuted raise so codecov/patch reflects the guard's intent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes #250.
Problem
The eval snapshot cache key (
_corpus_cache_key) was built from corpus + embedding model + dim only — it omittedRetrievalConfig. So under the default--cache read_write, changing any retrieval setting (rrf_k, weights,fusion,cjk_tokenizer,rerank_enabled,graph_*) and re-running produced a silent cache hit that reused the previous run's config — no error, wrong numbers. This bit the retrieval-tuning / ablation workflow exactly where config changes most (and was already self-acknowledged in-code: "until this is fixed").Fix — two-level cache
The snapshot is keyed on ingest-time inputs only (corpus, embedding model+dim, the one ingest-time retrieval field
cjk_tokenizer— baked into the FTS index + chunker token budget, multimodal identity). Everything applied at query time is read from the caller's live config in_run_queries, not the baked snapshot:RetrievalConfigknobs (rrf_k,bm25_weight,vector_weight,fusion,same_doc_penalty_alpha,graph_*,rerank_enabled,rerank_candidate_k) — via a newretrieval_configparam;rerank_model/ base_url) — via a newprovider_configparam. The reranker scores(query, chunk)at query time and never touches the stored index, so it is read live, not keyed (a rerank-model A/B is now correct on a cache hit too).Result: a retrieval ablation under the default
read_writeis now both fast (no re-embed) and correct. Only an ingest-time change (embedding model/dim, orcjk_tokenizer) forces a fresh snapshot — and the key handles that automatically.Guards: a defensive assert verifies the (key-guaranteed) baked==live
cjk_tokenizer; a tripwire test fails if a newRetrievalConfigfield is added unclassified, so the hand-split can't silently rot.Scope: eval-infra only —
domains/info/search.pyand the retrieval algorithm are untouched, henceno-baseline-needed.Tests
test_eval_cache_search_time_config_read_live_on_hit— a search-time knob change takes effect on a cache hit (searcher built from liverrf_k) and does not re-ingest. Goes red on pre-fix code (which threadedcfg.retrieval).test_eval_cache_cjk_tokenizer_change_forces_reingest— the ingest-time tokenizer change produces a distinct key → fresh snapshot.test_retrieval_config_fields_are_classified_for_eval_cache— tripwire against unclassified new fields.Delivery receipt
step 3 — verify:
tools/check.pygreen (ruff + mypy + 2403 passed / 138 skipped, Postgres skips w/o DSN). Targeted cache+tripwire tests: 9 passed. Hermeticmvpflow sanity (cache_mode=off): default vsfusion=combsum→mrr1.0 vs 0.833,ndcg@101.0 vs 0.877 (live config honored); default numbers unchanged.step 5 — review: fresh-agent invariant/design review → PASS (ingest/search split correct & complete; cache-key sound; defensive assert harmless; scope confined to eval-infra). Workflow-backed
/code-review(xhigh, 28 agents) → 5 root-cause findings, all resolved:provider_config);🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Documentation