From 3d46bf4d1b66001b73c7be0133bd5a56e29e259d Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Thu, 25 Jun 2026 23:02:56 +0800
Subject: [PATCH] eval: record scifact + cmteb public-anchor calibration
 (Milestone B)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

First real-vector baselines, validating the eval chain end-to-end on
non-saturated, literature-anchored data (the Phase 0→1 advance criterion).

- scifact (en): doc/hybrid ndcg_at_10 0.689 — matches BEIR literature 0.67
  (Δ 0.019, within ±0.10) and clears dikw-core's committed floor. RRF lift real
  (bm25 0.651 < vector 0.673 < hybrid 0.689).
- cmteb-t2-subset (zh): reproduces dikw-core's calibrated subset baseline within
  noise (bm25 ndcg_at_10 0.840 exact; hybrid 0.946 vs 0.952; vector 0.943 vs
  0.942); clears all dataset thresholds. jieba CJK confirmed (0.840, not the
  degenerate 0.031). NB: this is a curated 300-q subset, not the full CMTEB
  leaderboard (~0.50) — see its dataset.yaml.
- Cross-cutting: multi-batch embedding confirmed (5183 / 5000 chunks ≫ 16/batch,
  the Phase 0 gap); read-only held (scifact materialized into gitignored paths,
  dataset.yaml restored, dikw-core tree clean).

No gates set — calibration only; per-language thresholds wait for the in-house
domain-bilingual-v1 / negatives-ood-v1 sets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
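Reviewer note (below the cut line, so not part of the commit message): the
scifact floor and anchor claims can be re-checked with a minimal Python sketch.
The dictionary and function names are hypothetical, not dikw-core or dikw-data
APIs; the numbers are copied from the scifact table and bullets in the diff.

```python
# Hypothetical names throughout; values are copied from the scifact entry.
FLOOR = {"ndcg_at_10": 0.67, "hit_at_3": 0.71, "hit_at_10": 0.82,
         "mrr": 0.64, "recall_at_100": 0.92}
SCIFACT_HYBRID = {"hit_at_3": 0.723, "hit_at_10": 0.843, "mrr": 0.655,
                  "ndcg_at_10": 0.689, "recall_at_100": 0.947}
SCIFACT_VECTOR = {"hit_at_3": 0.700, "hit_at_10": 0.813, "mrr": 0.639,
                  "ndcg_at_10": 0.673, "recall_at_100": 0.903}
BEIR_NDCG_AT_10 = 0.67      # literature anchor quoted in the entry
ADVANCE_TOLERANCE = 0.10    # the +/-0.10 advance criterion


def failing_metrics(observed, floor):
    """Return the sorted names of metrics that fall below their floor."""
    return sorted(name for name, minimum in floor.items()
                  if observed[name] < minimum)


def clears_floor(observed, floor):
    """True when every floored metric is at or above its floor."""
    return not failing_metrics(observed, floor)


def within_anchor(observed, anchor, tolerance):
    """True when the observed value is within +/-tolerance of the anchor."""
    return abs(observed - anchor) <= tolerance


if __name__ == "__main__":
    print("hybrid clears floor:", clears_floor(SCIFACT_HYBRID, FLOOR))
    print("hybrid within anchor:",
          within_anchor(SCIFACT_HYBRID["ndcg_at_10"], BEIR_NDCG_AT_10,
                        ADVANCE_TOLERANCE))
    print("vector misses:", failing_metrics(SCIFACT_VECTOR, FLOOR))
```

Running the same check on the vector row lists the metrics that fall below the
floor, which suggests the floor claim applies to the canonical `doc/hybrid`
view rather than to every mode.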
 reports/BASELINES.md | 57 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/reports/BASELINES.md b/reports/BASELINES.md
index eab1ecf..25ed4ef 100644
--- a/reports/BASELINES.md
+++ b/reports/BASELINES.md
@@ -24,7 +24,56 @@ recorded, reviewable outcome.
 
 ## Entries
 
-_None yet._ The first real entries come from the Phase 0→1 public-anchor
-calibration (`scifact` + `cmteb-t2-subset`); see `docs/dikw-eval-plan.md` §2.3 and
-`docs/phase0-smoke-results.md`. Phase 0 set **no gates** — the synthetic sets
-saturate at 1.0, so thresholds wait for non-saturated, anchored data.
+### 2026-06-25 — scifact + cmteb-t2-subset public-anchor calibration (Phase 0→1)
+
+Run from `dikw-data` against a read-only `dikw-core` v0.6.1 (editable install,
+`[cjk]` extra). Providers: **MiniMax-M2.7** (LLM, unused because the run is
+retrieval-only), **Gitee Qwen3-Embedding-0.6B@1024** (embeddings), and sqlite.
+Run settings: `--retrieval all`, `--eval retrieval`, `--cache read_write`,
+`serve-and-run`, 1 run each. Datasets handed in by absolute path from
+`dikw-core/evals/datasets/` (out-of-tree). Canonical view is `doc/hybrid`.
+
+**scifact** (en, 300 queries / 5183 docs) — `passed: True`, exit 0.
+
+| mode | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 |
+|---|---|---|---|---|---|
+| bm25 | 0.700 | 0.790 | 0.622 | 0.651 | 0.855 |
+| vector | 0.700 | 0.813 | 0.639 | 0.673 | 0.903 |
+| **hybrid** | **0.723** | **0.843** | **0.655** | **0.689** | **0.947** |
+
+- `doc/hybrid` clears dikw-core's committed floor (`ndcg_at_10 0.67 / hit_at_3 0.71 /
+  hit_at_10 0.82 / mrr 0.64 / recall_at_100 0.92`); bm25 and vector alone do not.
+- Matches BEIR literature `ndcg_at_10 ≈ 0.67` (observed 0.689, Δ 0.019 — well within
+  the ±0.10 advance criterion).
+- RRF lift is real: hybrid ndcg_at_10 (0.689) > vector (0.673) > bm25 (0.651).
+
+**cmteb-t2-subset** (zh, 300 queries / 5000 docs) — `passed: True`, exit 0.
+
+| mode | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 |
+|---|---|---|---|---|---|
+| bm25 | 0.933 | 0.967 | 0.924 | 0.840 | 0.908 |
+| vector | 0.973 | 0.990 | 0.967 | 0.943 | 0.980 |
+| **hybrid** | **0.987** | **0.987** | **0.979** | **0.946** | **0.988** |
+
+- Clears the dataset's calibrated thresholds (`ndcg_at_10 ≥ 0.93`, etc.) and
+  **reproduces dikw-core's own committed numbers** to within noise: bm25
+  `ndcg_at_10 0.840` (exact), vector `0.943` (vs 0.942), hybrid `0.946` (vs 0.952).
+- **Not comparable to the CMTEB leaderboard (~0.50).** This is a 300-query curated
+  subset with distractor padding, intentionally easier than the full 118K-passage
+  benchmark — its `dataset.yaml` says so. The anchor for the subset is dikw-core's
+  committed baseline, which we reproduced, not the leaderboard figure.
+- **jieba CJK confirmed**: zh `bm25 ndcg_at_10 0.840`, not the degenerate
+  unicode61 per-character `0.031`. RRF hybrid edges vector on ndcg_at_10 (0.946 vs 0.943).
+
+**Cross-cutting checks**
+
+- Multi-batch embedding confirmed (5183 / 5000 chunks ≫ the 16/batch Gitee limit) —
+  the Phase 0 gap (single batch only) is now covered.
+- Read-only held: scifact `corpus/` + `queries.yaml` materialized into gitignored
+  paths; its tracked `dataset.yaml` was backed up and restored, so
+  `git -C dikw-core status` stays clean.
+
+**Gates: still none.** This is Phase 0→1 calibration, not a gate. Per-language
+thresholds get set at `observed − margin` once the in-house sets
+(`domain-bilingual-v1`, `negatives-ood-v1`) exist — see `docs/dikw-eval-plan.md`
+§2.3 and §3.