eval: record scifact + cmteb public-anchor calibration (Milestone B)#6

Closed
helebest wants to merge 1 commit into ci/lint-and-eval-gates from eval/anchor-calibration

@helebest (Collaborator)

What

First real-vector eval baselines, recorded in reports/BASELINES.md. Validates the eval chain end-to-end on non-saturated, literature-anchored data — the documented Phase 0→1 advance criterion (docs/dikw-eval-plan.md §3). This is Milestone B, following the CI floor in #5.

Stacked on #5 (ci/lint-and-eval-gates). The diff here is just the reports/BASELINES.md entry; GitHub will retarget this PR to main once #5 merges. Review/merge #5 first.

Results (doc/hybrid canonical, --retrieval all, 1 run each)

| dataset | lang | ndcg_at_10 | hit_at_3 | recall_at_100 | passed |
|---|---|---|---|---|---|
| scifact | en | 0.689 | 0.723 | 0.947 | |
| cmteb-t2-subset | zh | 0.946 | 0.987 | 0.988 | |
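For readers unfamiliar with the three metric columns, here is a minimal per-query sketch. It assumes binary relevance and a ranked list without duplicates; the function names are illustrative and are not taken from dikw-core, whose actual implementation may differ (for example in graded relevance or averaging).

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query (illustrative helper)."""
    # Discounted gain of each relevant hit in the top k (ranks are 1-based).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ordering places every relevant document first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def hit_at_k(ranked_ids, relevant_ids, k=3):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents retrieved in the top k."""
    if not relevant_ids:
        return 0.0
    found = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return found / len(relevant_ids)

# Toy query: the single relevant document is ranked second.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
scores = (ndcg_at_k(ranked, relevant), hit_at_k(ranked, relevant), recall_at_k(ranked, relevant))
```

Dataset-level figures like those in the table would then be the mean of these per-query values over all queries.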
  • scifact matches the BEIR literature value of ndcg_at_10 ≈ 0.67 (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. The RRF lift is real: bm25 0.651 < vector 0.673 < hybrid 0.689.
  • cmteb-t2-subset reproduces dikw-core's calibrated subset baseline within noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector 0.943 vs 0.942). This is a curated 300-query subset (distractor-padded), not the full CMTEB leaderboard (~0.50); see its dataset.yaml. jieba CJK tokenization confirmed (bm25 0.840, not the degenerate unicode61 0.031).
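The hybrid numbers above come from fusing the bm25 and vector rankings with RRF (reciprocal rank fusion). A minimal sketch of the technique follows. The function name and the constant `k=60` (the conventional default for RRF) are assumptions for illustration; the constant and tie-breaking that dikw-core actually uses are not stated in this PR.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with 1-based ranks. k=60 is an assumed default, not dikw-core's setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy rankings standing in for the bm25 and vector legs.
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

In this toy case the documents that both legs rank highly ("a" and "b") rise to the top of the fused list, which is the mechanism behind a hybrid score exceeding either single leg.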

Cross-cutting checks

  • Multi-batch embedding confirmed (5183 / 5000 chunks ≫ the 16/batch Gitee limit) — closes the Phase 0 single-batch gap.
  • Read-only held: scifact corpus/ + queries.yaml materialized into gitignored paths; its tracked dataset.yaml backed up and restored; git -C dikw-core status stays clean.
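The multi-batch check can be pictured with the sketch below, which splits a chunk list into groups of at most 16 and calls an embedding function once per group. `embed_batch` is a hypothetical callable standing in for the real provider client, and the helper names are illustrative rather than dikw-core's own.

```python
def batched(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks, embed_batch, batch_size=16):
    """Embed every chunk, calling embed_batch once per batch.

    embed_batch is a hypothetical callable: list[str] -> list[list[float]].
    """
    vectors = []
    for batch in batched(chunks, batch_size):
        result = embed_batch(batch)
        # Guard against a provider silently dropping or adding items.
        if len(result) != len(batch):
            raise ValueError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors

# Fake embedder that records batch sizes, used only for this illustration.
call_sizes = []

def fake_embed_batch(batch):
    call_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

chunks = [f"chunk {i}" for i in range(5183)]
vectors = embed_all(chunks, fake_embed_batch)
```

With 5183 chunks and 16 per batch this makes 324 calls, the last one carrying 15 chunks, so a run of this size necessarily exercises the multi-batch path.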

Not in this PR

No gates set — calibration only. Per-language thresholds get set at observed − margin once the in-house domain-bilingual-v1 / negatives-ood-v1 sets exist (rest of Phase 1).
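As a sketch of what "observed − margin" could look like once gates are set: the margin value, the function names, and the use of this PR's public-anchor scores as stand-in observations are all assumptions for illustration. The real thresholds are meant to come from the in-house sets, which do not exist yet.

```python
def gate_threshold(observed, margin):
    """Threshold = observed score minus a safety margin, floored at 0."""
    return max(0.0, observed - margin)

def passes_gate(score, threshold):
    """A run passes when its score is at or above the threshold."""
    return score >= threshold

# Stand-in observations: hybrid ndcg_at_10 from this PR, keyed by language.
observed_ndcg = {"en": 0.689, "zh": 0.946}
MARGIN = 0.05  # hypothetical margin, not a value chosen in this PR

thresholds = {lang: gate_threshold(score, MARGIN) for lang, score in observed_ndcg.items()}
```

Keeping the margin explicit makes it easy to widen per language if run-to-run noise turns out to be larger than a single calibration run suggests.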

🤖 Generated with Claude Code

First real-vector baselines, validating the eval chain end-to-end on
non-saturated, literature-anchored data (the Phase 0→1 advance criterion).

- scifact (en): doc/hybrid ndcg_at_10 0.689 — matches BEIR literature 0.67
  (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. RRF lift real
  (bm25 0.651 < vector 0.673 < hybrid 0.689).
- cmteb-t2-subset (zh): reproduces dikw-core's calibrated subset baseline within
  noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector 0.943 vs
  0.942); clears all dataset thresholds. jieba CJK confirmed (0.840, not the
  degenerate 0.031). NB: this is a curated 300-q subset, not the full CMTEB
  leaderboard (~0.50) — see its dataset.yaml.
- Cross-cutting: multi-batch embedding confirmed (5183 / 5000 chunks ≫ 16/batch,
  the Phase 0 gap); read-only held (scifact materialized into gitignored paths,
  dataset.yaml restored, dikw-core tree clean).

No gates set — calibration only; per-language thresholds wait for the in-house
domain-bilingual-v1 / negatives-ood-v1 sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@helebest helebest force-pushed the ci/lint-and-eval-gates branch from 72527bd to 6d62374 Compare June 28, 2026 12:57
@helebest helebest force-pushed the eval/anchor-calibration branch from 8cbc031 to 1af56be Compare June 28, 2026 12:57
@helebest helebest deleted the branch ci/lint-and-eval-gates June 28, 2026 12:57
@helebest helebest closed this Jun 28, 2026