feat(tools): build-ahead analyzers for #249 (negatives separation) + #250 (retrieval ablation) #10

Merged
helebest merged 1 commit into main from eval/issue-249-250-tooling on Jun 28, 2026
@helebest

Collaborator

What

Two analysis tools that consume the dikw-core fixes currently in flight, written
against the proposed output so they run the moment #249/#250 land — no code change
needed when they do.

  • tools/negatives_separation.py — pos-vs-neg top-1 relevance-score
    separation for an eval NDJSON. Probes the candidate score-field names #249 may
    add (scores / ranked_scores / top1_score / …) so it is name-agnostic;
    degrades to a rank-only observation with scores_available=false (never a bogus
    zero) until #249 ships. Computes the separation margin + expect_none
    satisfaction at a cutoff.
  • scripts/ablate_retrieval.py — retrieval-config ablation harness
    (rrf_k / weights / fusion / rerank). Merges each variant's overrides into a
    temp base's retrieval: block and forces --cache off — the documented
    #250 workaround for the stale-baked-config cache bug — capturing per-variant
    metrics into a comparison table. FORCED_CACHE flips to "rebuild" once #250 lands.
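
A minimal sketch of the name-agnostic probing and the margin computation described above, not the merged implementation. It assumes each NDJSON record is a JSON object, that negatives are flagged by a boolean `expect_none` field, and that the margin is the difference of mean top-1 scores; the helper names `top1_score` and `separation` and the exact candidate list are hypothetical.

```python
from __future__ import annotations

import json
from statistics import mean
from typing import Any, Iterable

# Assumed candidate names, taken from the list in the description; #249 may pick others.
CANDIDATE_FIELDS = ("scores", "ranked_scores", "top1_score")


def top1_score(record: dict[str, Any]) -> float | None:
    """Return the top-1 score from the first usable candidate field, else None."""
    for name in CANDIDATE_FIELDS:
        value = record.get(name)
        if isinstance(value, bool):
            continue  # bool is an int subclass; never treat it as a score
        if isinstance(value, (int, float)):
            return float(value)
        if isinstance(value, list) and value and isinstance(value[0], (int, float)):
            return float(value[0])  # assumes the list is ordered best-first
    return None


def separation(lines: Iterable[str], cutoff: float) -> dict[str, Any]:
    """Summarise pos-vs-neg top-1 separation; report None rather than a fake zero."""
    pos: list[float] = []
    neg: list[float] = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        score = top1_score(record)
        if score is None:
            continue
        (neg if record.get("expect_none") else pos).append(score)
    if not pos and not neg:
        # No score field found anywhere: rank-only observation.
        return {"scores_available": False, "margin": None, "expect_none_satisfied": None}
    return {
        "scores_available": True,
        "margin": mean(pos) - mean(neg) if pos and neg else None,
        # Share of negatives whose top-1 score falls below the cutoff.
        "expect_none_satisfied": sum(s < cutoff for s in neg) / len(neg) if neg else None,
    }
```

Returning `None` instead of `0.0` when no score field is present keeps a missing measurement distinguishable from a genuinely zero margin.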

Why

Per the parallel-work plan: #249/#250 block the measurement/gating refinement,
not the tooling that consumes them. Building these now means the post-fix
calibration is a single pass rather than a scramble.

Reuse

  • negatives_separation mirrors the NDJSON reader in tools/split_metrics_by_lang.py.
  • ablate_retrieval reuses run_eval's build_eval_command / parse_eval_report
    / ensure_base / merge_env.

Verification

  • 12 new offline tests (both score schemas for #249; pure-helper coverage for #250).
  • ruff + mypy src + pytest green (87 passed).
  • ablate_retrieval --dry-run builds correct per-variant --cache off commands.
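
As a rough illustration of what the dry run checks, the per-variant step can be sketched as below. This is not the code in scripts/ablate_retrieval.py: it assumes the base config is a plain dict with a `retrieval` mapping, and `merge_retrieval` and `variant_command` are hypothetical helpers. Only the `--cache off` flag and the `FORCED_CACHE` name come from this PR.

```python
from __future__ import annotations

import copy
from typing import Any

# Per the PR description this is "off" until #250 lands, then flips to "rebuild".
FORCED_CACHE = "off"


def merge_retrieval(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the base config with overrides merged into its retrieval block."""
    merged = copy.deepcopy(base)  # never mutate the shared base between variants
    merged.setdefault("retrieval", {}).update(overrides)
    return merged


def variant_command(base_cmd: list[str]) -> list[str]:
    """Append the forced cache flag to an already-built eval command (hypothetical shape)."""
    return [*base_cmd, "--cache", FORCED_CACHE]


if __name__ == "__main__":
    base = {"retrieval": {"rrf_k": 60, "fusion": "rrf"}}
    variants = {"rrf_k_20": {"rrf_k": 20}, "no_rerank": {"rerank": False}}
    for name, overrides in variants.items():
        print(name, merge_retrieval(base, overrides)["retrieval"])
```

Deep-copying the base per variant keeps one variant's overrides from leaking into the next, which matters when the comparison table is meant to isolate a single knob at a time.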

🤖 Generated with Claude Code

While dikw-core fixes #249 (surface absolute relevance scores) and #250 (snapshot
cache key omits RetrievalConfig), add the two analysis tools that consume those
fixes, written against the proposed output so they run the moment the fixes land.

- tools/negatives_separation.py: pos-vs-neg top-1 relevance-score separation for
  an eval NDJSON. Probes the candidate score-field names #249 may add
  (scores / ranked_scores / top1_score / ...) so it is name-agnostic; degrades to
  a rank-only observation with scores_available=false (never a bogus zero) until
  #249 ships. Computes separation margin + expect_none satisfaction at a cutoff.
- scripts/ablate_retrieval.py: retrieval-config ablation harness (rrf_k / weights
  / fusion / rerank). Merges each variant's overrides into a temp base's
  retrieval: block and FORCES --cache off — the documented #250 workaround for the
  stale-baked-config cache bug — capturing per-variant metrics into a comparison
  table. FORCED_CACHE flips to "rebuild" once #250 lands.

Both reuse existing seams (split_metrics_by_lang's NDJSON read; run_eval's
build_eval_command / parse_eval_report / ensure_base). Offline-testable: 12 new
tests. ruff + mypy src + pytest green (87 passed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@helebest helebest merged commit 727c187 into main Jun 28, 2026
2 checks passed
@helebest helebest deleted the eval/issue-249-250-tooling branch June 28, 2026 13:20