From 3d46bf4d1b66001b73c7be0133bd5a56e29e259d Mon Sep 17 00:00:00 2001
From: Le He
Date: Thu, 25 Jun 2026 23:02:56 +0800
Subject: [PATCH] eval: record scifact + cmteb public-anchor calibration (Milestone B)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

First real-vector baselines, validating the eval chain end-to-end on
non-saturated, literature-anchored data (the Phase 0→1 advance criterion).

- scifact (en): doc/hybrid ndcg_at_10 0.689 — matches BEIR literature 0.67
  (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. RRF lift
  real (bm25 0.651 < vector 0.673 < hybrid 0.689).
- cmteb-t2-subset (zh): reproduces dikw-core's calibrated subset baseline
  within noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector
  0.943 vs 0.942); clears all dataset thresholds. jieba CJK confirmed (0.840,
  not the degenerate 0.031). NB: this is a curated 300-q subset, not the full
  CMTEB leaderboard (~0.50) — see its dataset.yaml.
- Cross-cutting: multi-batch embedding confirmed (5183 / 5000 chunks ≫
  16/batch, the Phase 0 gap); read-only held (scifact materialized into
  gitignored paths, dataset.yaml restored, dikw-core tree clean).

No gates set — calibration only; per-language thresholds wait for the
in-house domain-bilingual-v1 / negatives-ood-v1 sets.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 reports/BASELINES.md | 57 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/reports/BASELINES.md b/reports/BASELINES.md
index eab1ecf..25ed4ef 100644
--- a/reports/BASELINES.md
+++ b/reports/BASELINES.md
@@ -24,7 +24,56 @@ recorded, reviewable outcome.
 
 ## Entries
 
-_None yet._ The first real entries come from the Phase 0→1 public-anchor
-calibration (`scifact` + `cmteb-t2-subset`); see `docs/dikw-eval-plan.md` §2.3 and
-`docs/phase0-smoke-results.md`. Phase 0 set **no gates** — the synthetic sets
-saturate at 1.0, so thresholds wait for non-saturated, anchored data.
+## 2026-06-25 — scifact + cmteb-t2-subset public-anchor calibration (Phase 0→1)
+
+Run from `dikw-data` against a read-only `dikw-core` v0.6.1 (editable, `[cjk]`).
+Provider: **MiniMax-M2.7** (LLM, unused — retrieval-only) + **Gitee
+Qwen3-Embedding-0.6B@1024** (embeddings) + sqlite. `--retrieval all`, `--eval
+retrieval`, `--cache read_write`, `serve-and-run`, 1 run each. Datasets handed in
+by absolute path from `dikw-core/evals/datasets/` (out-of-tree). Canonical view is
+`doc/hybrid`.
+
+**scifact** (en, 300 queries / 5183 docs) — `passed: True`, exit 0.
+
+| mode | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 |
+|---|---|---|---|---|---|
+| bm25 | 0.700 | 0.790 | 0.622 | 0.651 | 0.855 |
+| vector | 0.700 | 0.813 | 0.639 | 0.673 | 0.903 |
+| **hybrid** | **0.723** | **0.843** | **0.655** | **0.689** | **0.947** |
+
+- Clears dikw-core's committed floor (`ndcg_at_10 0.67 / hit_at_3 0.71 / hit_at_10
+  0.82 / mrr 0.64 / recall_at_100 0.92`).
+- Matches BEIR literature `ndcg_at_10 ≈ 0.67` (observed 0.689, Δ 0.019 — well within
+  the ±0.10 advance criterion).
+- RRF lift is real: hybrid ndcg_at_10 (0.689) > vector (0.673) > bm25 (0.651).
+
+**cmteb-t2-subset** (zh, 300 queries / 5000 docs) — `passed: True`, exit 0.
+
+| mode | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 |
+|---|---|---|---|---|---|
+| bm25 | 0.933 | 0.967 | 0.924 | 0.840 | 0.908 |
+| vector | 0.973 | 0.990 | 0.967 | 0.943 | 0.980 |
+| **hybrid** | **0.987** | **0.987** | **0.979** | **0.946** | **0.988** |
+
+- Clears the dataset's calibrated thresholds (`ndcg_at_10 ≥ 0.93`, etc.) and
+  **reproduces dikw-core's own committed numbers** to within noise: bm25
+  `ndcg_at_10 0.840` (exact), vector `0.943` (vs 0.942), hybrid `0.946` (vs 0.952).
+- **Not comparable to the CMTEB leaderboard (~0.50).** This is a 300-query curated
+  subset with distractor padding, intentionally easier than the full 118K-passage
+  benchmark — its `dataset.yaml` says so. The anchor for the subset is dikw-core's
+  committed baseline, which we reproduced, not the leaderboard figure.
+- **jieba CJK confirmed**: zh `bm25 ndcg_at_10 0.840`, not the degenerate
+  unicode61 per-character `0.031`. RRF hybrid edges vector (0.946 vs 0.943).
+
+**Cross-cutting checks**
+
+- Multi-batch embedding confirmed (5183 / 5000 chunks ≫ the 16/batch Gitee limit) —
+  the Phase 0 gap (single batch only) is now covered.
+- Read-only held: scifact `corpus/` + `queries.yaml` materialized into gitignored
+  paths; its tracked `dataset.yaml` was backed up and restored, so
+  `git -C dikw-core status` stays clean.
+
+**Gates: still none.** This is Phase 0→1 calibration, not a gate. Per-language
+thresholds get set at `observed − margin` once the in-house sets
+(`domain-bilingual-v1`, `negatives-ood-v1`) exist — see `docs/dikw-eval-plan.md`
+§2.3 and §3.
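
Reviewer note (not part of the diff): the calibration claims recorded above are simple arithmetic, and the sketch below re-derives them from the reported numbers. The function names are hypothetical, and the `margin` value of 0.02 is only a placeholder, since the entry does not state one. Nothing here is taken from dikw-core's code.

```python
import math


def within_advance_criterion(observed, literature, tolerance=0.10):
    """True when the observed score is within +/- tolerance of the literature anchor."""
    return abs(observed - literature) <= tolerance


def clears_floor(observed, floor):
    """True when every metric named in the floor is met or exceeded."""
    return all(observed[metric] >= minimum for metric, minimum in floor.items())


def gate_threshold(observed, margin):
    """Future gate value, following the `observed - margin` rule in the entry."""
    return round(observed - margin, 3)


def batch_count(chunks, batch_size=16):
    """Number of embedding requests needed at the stated 16/batch limit."""
    return math.ceil(chunks / batch_size)


# scifact doc/hybrid row and the committed floor, copied from the entry.
scifact_hybrid = {
    "hit_at_3": 0.723, "hit_at_10": 0.843, "mrr": 0.655,
    "ndcg_at_10": 0.689, "recall_at_100": 0.947,
}
scifact_floor = {
    "ndcg_at_10": 0.67, "hit_at_3": 0.71, "hit_at_10": 0.82,
    "mrr": 0.64, "recall_at_100": 0.92,
}

delta = round(scifact_hybrid["ndcg_at_10"] - 0.67, 3)   # gap to BEIR literature
anchored = within_advance_criterion(scifact_hybrid["ndcg_at_10"], 0.67)
floor_ok = clears_floor(scifact_hybrid, scifact_floor)
scifact_batches = batch_count(5183)
cmteb_batches = batch_count(5000)
example_gate = gate_threshold(0.689, margin=0.02)        # margin is a placeholder
```

With the reported figures, `delta` comes out at 0.019 and both corpora need several hundred embedding batches (324 and 313), which is what "multi-batch confirmed" relies on.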
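
Reviewer note (not part of the diff): for readers unfamiliar with the "RRF lift" mentioned in both result sets, here is a minimal sketch of reciprocal rank fusion over a bm25 ranking and a vector ranking. It assumes the conventional smoothing constant `k = 60` and toy document ids; whether dikw-core uses this constant or this tie-breaking is not stated in the entry.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse best-first rankings of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the rankings it appears in,
    with rank starting at 1. k=60 is a common default and an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings (hypothetical ids), best match first.
bm25_ranking = ["d3", "d1", "d7"]
vector_ranking = ["d1", "d9", "d3"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
```

Documents that both retrievers rank highly (`d1`, `d3`) rise above documents seen by only one of them, which is the mechanism behind hybrid outscoring either single mode.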