feat(tools): build-ahead analyzers for #249 (negatives separation) + #250 (retrieval ablation) by helebest · Pull Request #10 · OpenDIKW/dikw-data

helebest · 2026-06-28T13:08:56Z

What

Two analysis tools that consume the dikw-core fixes currently in flight, written
against the proposed output so they run the moment #249/#250 land — no code change
needed when they do.

- tools/negatives_separation.py: pos-vs-neg top-1 relevance-score
  separation for an eval NDJSON. Probes the candidate score-field names #249 may
  add (scores / ranked_scores / top1_score / …) so it is name-agnostic;
  degrades to a rank-only observation with scores_available=false (never a bogus
  zero) until #249 ships. Computes the separation margin + expect_none
  satisfaction at a cutoff.
- scripts/ablate_retrieval.py: retrieval-config ablation harness
  (rrf_k / weights / fusion / rerank). Merges each variant's overrides into a
  temp base's retrieval: block and forces --cache off (the documented
  #250 workaround for the stale-baked-config cache bug), capturing per-variant
  metrics into a comparison table. FORCED_CACHE flips to "rebuild" once #250 lands.
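
For readers unfamiliar with the approach, the name-agnostic probing in negatives_separation can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the field names come from the list above, while the boolean per-record expect_none flag, the best-first ordering of list-valued score fields, and the mean-difference definition of the margin are all assumptions.

```python
from __future__ import annotations

import json
from statistics import mean
from typing import Any, Iterable

# Candidate names taken from the PR text; the schema #249 finally ships may differ.
SCALAR_FIELDS = ("top1_score",)
LIST_FIELDS = ("scores", "ranked_scores")


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def top1_score(record: dict[str, Any]) -> float | None:
    """Return the top-1 score under any known field name, or None if absent."""
    for name in SCALAR_FIELDS:
        if _is_number(record.get(name)):
            return float(record[name])
    for name in LIST_FIELDS:
        value = record.get(name)
        # Assumption: list-valued fields are ordered best-first.
        if isinstance(value, list) and value and _is_number(value[0]):
            return float(value[0])
    return None


def separation(lines: Iterable[str], cutoff: float) -> dict[str, Any]:
    """Summarise pos-vs-neg top-1 separation for NDJSON lines."""
    pos: list[float] = []
    neg: list[float] = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        score = top1_score(record)
        if score is None:
            continue  # rank-only record: nothing to measure yet
        # Assumption: negatives are marked with a truthy "expect_none" field.
        (neg if record.get("expect_none") else pos).append(score)
    return {
        # False rather than a fabricated 0.0 when no record carried a score.
        "scores_available": bool(pos or neg),
        "margin": mean(pos) - mean(neg) if pos and neg else None,
        # Share of negatives whose top-1 score falls below the cutoff.
        "expect_none_satisfied": (
            sum(s < cutoff for s in neg) / len(neg) if neg else None
        ),
    }
```

Returning None instead of 0.0 when no scores are present keeps a pre-#249 run distinguishable from a genuinely zero margin.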

Why

Per the parallel-work plan: #249/#250 block the measurement/gating refinement,
not the tooling that consumes them. Building these now means the post-fix
calibration is a single pass rather than a scramble.

Reuse

- negatives_separation mirrors tools/split_metrics_by_lang.py's NDJSON read.
- ablate_retrieval reuses run_eval's build_eval_command / parse_eval_report /
  ensure_base / merge_env.

Verification

- 12 new offline tests (both score schemas for #249; pure-helper coverage for #250).
- ruff + mypy src + pytest green (87 passed).
- ablate_retrieval --dry-run builds correct per-variant --cache off commands.
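
As a rough illustration of what the dry run exercises, the per-variant merge and the forced cache flag could look like the sketch below. It is a sketch under stated assumptions rather than the harness itself: apply_variant and cache_args are hypothetical helper names, the recursive merge strategy is a guess (the real tool may merge shallowly), and the bm25 / vector weight keys in the usage example are made up for illustration. Only FORCED_CACHE, the retrieval: block, rrf_k, weights, and --cache off come from this PR.

```python
from __future__ import annotations

import copy
from typing import Any

# Per the PR description this is "off" today and flips to "rebuild" once #250 lands.
FORCED_CACHE = "off"


def _deep_merge(target: dict[str, Any], patch: dict[str, Any]) -> None:
    """Recursively merge patch into target in place (assumed merge strategy)."""
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _deep_merge(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def apply_variant(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with overrides merged into its retrieval block."""
    merged = copy.deepcopy(base)  # never mutate the shared base config
    _deep_merge(merged.setdefault("retrieval", {}), overrides)
    return merged


def cache_args() -> list[str]:
    """CLI arguments that pin the cache mode for every variant run."""
    return ["--cache", FORCED_CACHE]


# Usage: one override dict per variant, each applied to a fresh copy of the base.
base = {"retrieval": {"rrf_k": 60, "weights": {"bm25": 1.0, "vector": 1.0}}}
variant = apply_variant(base, {"rrf_k": 20, "weights": {"vector": 2.0}})
```

Copying the base before merging keeps variants independent, so one variant's overrides cannot leak into the next run.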

🤖 Generated with Claude Code

While dikw-core fixes #249 (surface absolute relevance scores) and #250 (snapshot cache key omits RetrievalConfig), add the two analysis tools that consume those fixes, written against the proposed output so they run the moment the fixes land.

- tools/negatives_separation.py: pos-vs-neg top-1 relevance-score separation for an eval NDJSON. Probes the candidate score-field names #249 may add (scores / ranked_scores / top1_score / ...) so it is name-agnostic; degrades to a rank-only observation with scores_available=false (never a bogus zero) until #249 ships. Computes separation margin + expect_none satisfaction at a cutoff.
- scripts/ablate_retrieval.py: retrieval-config ablation harness (rrf_k / weights / fusion / rerank). Merges each variant's overrides into a temp base's retrieval: block and FORCES --cache off (the documented #250 workaround for the stale-baked-config cache bug), capturing per-variant metrics into a comparison table. FORCED_CACHE flips to "rebuild" once #250 lands.

Both reuse existing seams (split_metrics_by_lang's NDJSON read; run_eval's build_eval_command / parse_eval_report / ensure_base). Offline-testable: 12 new tests. ruff + mypy src + pytest green (87 passed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

helebest merged commit 727c187 into main Jun 28, 2026
2 checks passed

helebest deleted the eval/issue-249-250-tooling branch June 28, 2026 13:20
