eval: record scifact + cmteb public-anchor calibration (Milestone B) by helebest · Pull Request #6 · OpenDIKW/dikw-data

helebest · 2026-06-25T15:03:23Z

What

First real-vector eval baselines, recorded in reports/BASELINES.md. Validates the eval chain end-to-end on non-saturated, literature-anchored data — the documented Phase 0→1 advance criterion (docs/dikw-eval-plan.md §3). This is Milestone B, following the CI floor in #5.

Stacked on #5 (ci/lint-and-eval-gates). The diff here is just the reports/BASELINES.md entry; GitHub will retarget this PR to main once #5 merges. Review/merge #5 first.

Results (doc/hybrid canonical, `--retrieval all`, 1 run each)

| dataset | lang | ndcg_at_10 | hit_at_3 | recall_at_100 | passed |
| --- | --- | --- | --- | --- | --- |
| scifact | en | 0.689 | 0.723 | 0.947 | ✅ |
| cmteb-t2-subset | zh | 0.946 | 0.987 | 0.988 | ✅ |

- scifact matches the BEIR literature ndcg_at_10 ≈ 0.67 (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. The RRF lift is real: bm25 0.651 < vector 0.673 < hybrid 0.689.
- cmteb-t2-subset reproduces dikw-core's calibrated subset baseline within noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector 0.943 vs 0.942). This is a curated 300-query subset (distractor-padded), not the full CMTEB leaderboard (~0.50); see its dataset.yaml. jieba CJK tokenization is confirmed: bm25 scores 0.840, not the degenerate 0.031 that unicode61 yields.
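
For readers unfamiliar with the two checks above, here is a minimal sketch of reciprocal rank fusion and of the anchor comparison. It is not dikw-core's implementation: `rrf_fuse`, `anchor_delta`, the constant `k=60` (the commonly used default), and the toy doc ids are all assumptions for illustration. Only the scifact numbers come from this PR.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    k=60 is an assumed default; the value used by dikw-core may differ.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def anchor_delta(observed, anchor, tolerance=0.10):
    """Return (|observed - anchor| rounded to 3 places, within-tolerance flag)."""
    delta = round(abs(observed - anchor), 3)
    return delta, delta <= tolerance


# Toy rankings with hypothetical doc ids, not taken from the eval run.
fused = rrf_fuse([["d1", "d2", "d3"], ["d3", "d1", "d4"]])

# Numbers reported in this PR for scifact.
delta, ok = anchor_delta(observed=0.689, anchor=0.67)
lift_is_monotonic = 0.651 < 0.673 < 0.689
```

In the toy case, `d1` and `d3` appear in both lists, so they outrank `d2` and `d4`, which each appear in only one. That is the effect the hybrid row is expected to benefit from.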

Cross-cutting checks

- Multi-batch embedding confirmed: 5183 / 5000 chunks, far above the 16/batch Gitee limit. This closes the Phase 0 single-batch gap.
- Read-only held: scifact corpus/ + queries.yaml materialized into gitignored paths; its tracked dataset.yaml backed up and restored; `git -C dikw-core status` stays clean.
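
As a rough illustration of what "multi-batch" means here, the sketch below splits a chunk list into batches of at most 16 and counts the calls. `batched`, `embed_all`, and the stand-in `fake_embed_batch` are hypothetical names, not dikw-core or Gitee APIs. Only the 16/batch limit and the 5183 chunk count come from this PR.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def embed_all(texts, embed_batch, batch_size=16):
    """Embed every text by calling `embed_batch` once per batch."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


calls = []


def fake_embed_batch(batch):
    # Stand-in for a real embedding call; it records the batch size
    # and returns one dummy 1-d vector per text.
    assert len(batch) <= 16
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


chunks = [f"chunk {i}" for i in range(5183)]
vectors = embed_all(chunks, fake_embed_batch)
```

With 5183 chunks and a limit of 16, this makes 324 calls: 323 full batches plus a final batch of 15.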

Not in this PR

No gates set — calibration only. Per-language thresholds get set at observed − margin once the in-house domain-bilingual-v1 / negatives-ood-v1 sets exist (rest of Phase 1).
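
The "observed − margin" rule can be sketched as below. The helper `gate_threshold`, the margin of 0.05, and the observed values are placeholders chosen only to show the arithmetic. No margin or in-house result is stated in this PR.

```python
def gate_threshold(observed, margin):
    """Return the gate floor: observed score minus a safety margin."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return round(observed - margin, 3)


# Placeholder observed ndcg_at_10 values per language (hypothetical).
example_observed = {"en": 0.80, "zh": 0.90}

thresholds = {
    lang: gate_threshold(score, margin=0.05)
    for lang, score in example_observed.items()
}
```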

🤖 Generated with Claude Code

First real-vector baselines, validating the eval chain end-to-end on non-saturated, literature-anchored data (the Phase 0→1 advance criterion).

- scifact (en): doc/hybrid ndcg_at_10 0.689; matches BEIR literature 0.67 (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. RRF lift real (bm25 0.651 < vector 0.673 < hybrid 0.689).
- cmteb-t2-subset (zh): reproduces dikw-core's calibrated subset baseline within noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector 0.943 vs 0.942); clears all dataset thresholds. jieba CJK confirmed (0.840, not the degenerate 0.031). NB: this is a curated 300-q subset, not the full CMTEB leaderboard (~0.50); see its dataset.yaml.
- Cross-cutting: multi-batch embedding confirmed (5183 / 5000 chunks ≫ 16/batch, the Phase 0 gap); read-only held (scifact materialized into gitignored paths, dataset.yaml restored, dikw-core tree clean).

No gates set, calibration only; per-language thresholds wait for the in-house domain-bilingual-v1 / negatives-ood-v1 sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

helebest mentioned this pull request Jun 26, 2026

feat(datasets): Phase 1 in-house sets — domain-bilingual-v1 + negatives-ood-v1 #7

Merged

helebest force-pushed the ci/lint-and-eval-gates branch from 72527bd to 6d62374 Compare June 28, 2026 12:57

helebest force-pushed the eval/anchor-calibration branch from 8cbc031 to 1af56be Compare June 28, 2026 12:57

helebest deleted the branch ci/lint-and-eval-gates June 28, 2026 12:57

helebest closed this Jun 28, 2026

helebest mentioned this pull request Jun 28, 2026

eval: record scifact + cmteb public-anchor calibration (Milestone B) #9

Merged

helebest wants to merge 1 commit into ci/lint-and-eval-gates from eval/anchor-calibration
