feat(tools): build-ahead analyzers for #249 (negatives separation) + #250 (retrieval ablation) #10
Merged
While dikw-core fixes #249 (surface absolute relevance scores) and #250 (snapshot cache key omits RetrievalConfig), add the two analysis tools that consume those fixes, written against the proposed output so they run the moment the fixes land.
- tools/negatives_separation.py: pos-vs-neg top-1 relevance-score separation for an eval NDJSON. Probes the candidate score-field names #249 may add (scores / ranked_scores / top1_score / ...) so it is name-agnostic; degrades to a rank-only observation with scores_available=false (never a bogus zero) until #249 ships. Computes separation margin + expect_none satisfaction at a cutoff.
- scripts/ablate_retrieval.py: retrieval-config ablation harness (rrf_k / weights / fusion / rerank). Merges each variant's overrides into a temp base's retrieval: block and FORCES --cache off (the documented #250 workaround for the stale-baked-config cache bug), capturing per-variant metrics into a comparison table. FORCED_CACHE flips to "rebuild" once #250 lands.
Both reuse existing seams (split_metrics_by_lang's NDJSON read; run_eval's build_eval_command / parse_eval_report / ensure_base). Offline-testable: 12 new tests. ruff + mypy src + pytest green (87 passed).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Two analysis tools that consume the dikw-core fixes currently in flight, written
against the proposed output so they run the moment #249/#250 land — no code change
needed when they do.
- tools/negatives_separation.py: pos-vs-neg top-1 relevance-score
  separation for an eval NDJSON. Probes the candidate score-field names #249 may
  add (scores / ranked_scores / top1_score / …) so it is name-agnostic;
  degrades to a rank-only observation with scores_available=false (never a bogus
  zero) until #249 ships. Computes the separation margin + expect_none
  satisfaction at a cutoff.
- scripts/ablate_retrieval.py: retrieval-config ablation harness
  (rrf_k / weights / fusion / rerank). Merges each variant's overrides into a
  temp base's retrieval: block and forces --cache off, the documented
  #250 workaround for the stale-baked-config cache bug, capturing per-variant
  metrics into a comparison table. FORCED_CACHE flips to "rebuild" once #250 lands.
Why
Per the parallel-work plan: #249/#250 block the measurement/gating refinement,
not the tooling that consumes them. Building these now means the post-fix
calibration is a single pass rather than a scramble.
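For reviewers who want a concrete picture of the name-agnostic probing and the scores_available=false fallback described under What, here is a minimal sketch. It is not the code in tools/negatives_separation.py. The candidate list only holds the three names quoted in this PR (the real list is elided), and several things are assumptions made for illustration: a per-record boolean expect_none marks a negative query, a list-valued score field is sorted best-first, the margin is the lowest positive top-1 score minus the highest negative top-1 score, and expect_none is "satisfied" when a negative's top-1 score falls below the cutoff.

```python
from __future__ import annotations

import json
from typing import Any, Iterable

# Names quoted in the PR text; the real tool may probe more (the list is elided).
CANDIDATE_SCORE_FIELDS = ("scores", "ranked_scores", "top1_score")


def top1_score(record: dict[str, Any]) -> float | None:
    """Return the top-1 score from the first candidate field found, else None."""
    for name in CANDIDATE_SCORE_FIELDS:
        value = record.get(name)
        if isinstance(value, bool):
            continue  # a bool is not a score, even though it is an int subclass
        if isinstance(value, (int, float)):
            return float(value)
        # Assumption: a list-valued field is ordered best-first.
        if isinstance(value, list) and value and isinstance(value[0], (int, float)):
            return float(value[0])
    return None


def separation(lines: Iterable[str], cutoff: float) -> dict[str, Any]:
    """Summarise pos-vs-neg top-1 separation for NDJSON lines."""
    pos: list[float] = []
    neg: list[float] = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        score = top1_score(record)
        if score is None:
            continue  # no score field yet: contributes nothing, never a zero
        # Hypothetical schema: expect_none=true marks a negative query.
        (neg if record.get("expect_none") else pos).append(score)
    return {
        "scores_available": bool(pos or neg),
        # Assumed definition: worst positive minus best negative.
        "margin": min(pos) - max(neg) if pos and neg else None,
        # Assumed definition: share of negatives scoring below the cutoff.
        "expect_none_satisfied": (
            sum(s < cutoff for s in neg) / len(neg) if neg else None
        ),
    }
```

Reporting None rather than 0.0 when no score field is present keeps a missing measurement from being read as a measured zero margin, which is the behaviour the description calls "never a bogus zero".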
Reuse
- negatives_separation mirrors tools/split_metrics_by_lang.py's NDJSON read.
- ablate_retrieval reuses run_eval's build_eval_command / parse_eval_report /
  ensure_base / merge_env.
Verification
- ruff + mypy src + pytest green (87 passed).
- ablate_retrieval --dry-run builds correct per-variant --cache off commands.
🤖 Generated with Claude Code
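The per-variant merge and the forced cache mode that the dry run checks can be pictured with the sketch below. It is an illustration under stated assumptions, not the harness in scripts/ablate_retrieval.py. merge_retrieval and sketch_eval_command are hypothetical helpers (the real harness reuses run_eval's build_eval_command, whose signature is not shown in this PR), and the eval-cli executable name and --config flag are placeholders. Only the retrieval block, the --cache off flag, and the FORCED_CACHE constant come from the description.

```python
from __future__ import annotations

import copy
from typing import Any

# Per the PR, this flips to "rebuild" once #250 lands.
FORCED_CACHE = "off"


def merge_retrieval(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with overrides merged into its retrieval block.

    Hypothetical helper: the merge is shallow, so a nested value such as a
    weights mapping is replaced as a whole rather than merged key by key.
    """
    merged = copy.deepcopy(base)  # never mutate the shared base config
    merged.setdefault("retrieval", {}).update(overrides)
    return merged


def sketch_eval_command(config_path: str) -> list[str]:
    """Build an argv list that always carries the forced cache mode.

    The executable name and --config flag are placeholders for illustration.
    """
    return ["eval-cli", "--config", config_path, "--cache", FORCED_CACHE]


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the PR.
    base = {"retrieval": {"rrf_k": 60, "fusion": "rrf"}}
    variants = {"rrf_k_20": {"rrf_k": 20}, "no_rerank": {"rerank": False}}
    for name, overrides in variants.items():
        config = merge_retrieval(base, overrides)
        print(name, config["retrieval"], sketch_eval_command(f"{name}.yml"))
```

Keeping the cache mode in a single constant means the switch from "off" to "rebuild" is a one-line change that every variant command picks up.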