From b1b53168633ff9d291a7bb6ef7c9bd9f17add9e0 Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 20:32:21 +0800
Subject: [PATCH 1/7] docs: design for Phase 1 in-house datasets
 (domain-bilingual-v1 + negatives-ood-v1)

Approved design implementing eval-plan datasets ii/iii: single bilingual
dataset with engine-gated blended thresholds plus an offline zh/en split
recorded in BASELINES.md, a reused 24-doc corpus, LLM-generated-then-verified
queries, and an observe-only off-corpus negatives set. Calibrate thresholds at
observed - margin from one real-vector run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/phase1-inhouse-datasets-design.md | 187 +++++++++++++++++++++++++
 1 file changed, 187 insertions(+)
 create mode 100644 docs/phase1-inhouse-datasets-design.md

diff --git a/docs/phase1-inhouse-datasets-design.md b/docs/phase1-inhouse-datasets-design.md
new file mode 100644
index 0000000..368194c
--- /dev/null
+++ b/docs/phase1-inhouse-datasets-design.md
@@ -0,0 +1,187 @@
+# Phase 1 in-house retrieval datasets — design
+
+> **Status:** approved design (2026-06-26); implementation pending.
+> Implements datasets **ii** (`domain-bilingual-v1`) and **iii** (`negatives-ood-v1`)
+> from [`dikw-eval-plan.md`](dikw-eval-plan.md) §2.2, following the threshold
+> methodology in §2.3 and the bilingual-split rule in §2.4. `dikw-core` stays
+> **read-only**.
+
+## 1. Context
+
+Phase 0 (smoke) and the public-anchor calibration (scifact + cmteb-t2-subset,
+recorded in [`reports/BASELINES.md`](../reports/BASELINES.md)) are done: the eval
+chain is validated end-to-end on real vectors against a read-only `dikw-core`
+v0.6.1. The Phase 0→1 advance criterion (public anchors land within ±0.10) is met.
+
+The remaining Phase 1 work is the **in-house retrieval gate that matters**: a
+bilingual domain set plus an off-corpus negatives set. The existing synthetic sets
+saturate at 1.0 and carry no committed gate; `domain-bilingual-v1` is the first in-house
+set gated at `observed − margin`, with `negatives-ood-v1` recorded observe-only (§5).
+
+## 2. Decisions (settled)
+
+1. **Per-language gating = single dataset + recorded split.** One
+   `domain-bilingual-v1`; the engine gates a single blended flat-threshold set.
+   zh/en metrics are computed offline from the run's NDJSON `per_query` rows and
+   recorded in `reports/BASELINES.md`. Rationale: `dataset.yaml thresholds:` is a
+   flat map (no per-language nesting — see `scripts/validate_dataset.py`), one
+   shared corpus avoids 3× duplication, and ~40 blended queries are statistically
+   steadier than 2×~20. §2.4's "gate zh/en separately" is honored as a *review*
+   gate (the recorded split + the eval-gate CI that forces a baseline entry on any
+   `datasets/**` change), not an engine threshold.
+2. **Reuse the existing 24-doc corpus** from `synthetic-diverse-v2` (12 zh / 12 en)
+   as-is. Dataset "size" lives in ~40 multi-angle queries, not in new corpus docs.
+   §2.3 already says recalibrate when a corpus crosses ~50 docs, so 24 is fine for
+   P1 kickoff.
+3. **LLM-generate queries + verify.** Use the existing MiniMax factory
+   (`scripts/generate_candidates.py` → `RetryingMiniMaxClient`) to draft candidates,
+   then verify/curate before materializing `queries.yaml`.
+
+## 3. Dataset specs
+
+Both datasets are self-contained packages conforming to the `dikw-core` dataset
+contract (`dataset.yaml` + `corpus/` + `queries.yaml`).
+
+### `domain-bilingual-v1`
+
+```
+datasets/domain-bilingual-v1/
+  corpus/        # copy of synthetic-diverse-v2's 24 .md (12 zh / 12 en)
+  queries.yaml   # ~40 positives; ids prefixed zh-/en-; ~20 each; expect_any: [stem]
+  dataset.yaml   # blended flat thresholds, set at observed − margin (§5)
+```
+
+- **Query ids carry the language tag** as a leading segment, `zh-…` / `en-…`, so
+  the offline splitter can bucket them. (The current `synthetic-diverse-v2` uses a
+  `_zh`/`_en` *suffix*; we standardize on a **prefix** here because the splitter
+  keys on `id.startswith("zh-"|"en-")`.)
+- **A query's language matches its gold doc's language** (the corpus doc's
+  frontmatter `language:`), so the zh slice exercises the jieba/CJK path and the en
+  slice does not — per §2.4.
+- **Coverage:** every corpus doc gets ≥ 1 positive query; zh/en query counts stay
+  balanced (~20/~20).
+
+### `negatives-ood-v1`
+
+```
+datasets/negatives-ood-v1/
+  corpus/        # same 24 .md (copied; the contract requires a corpus dir)
+  queries.yaml   # ~25 expect_none: true (zh + en), domain-adjacent but uncovered
+  dataset.yaml   # thresholds: {}  (see §5 — expect_none is not a thresholdable key)
+```
+
+- Negatives are **plausible-but-unanswerable**: topics adjacent to the domain
+  (history/science/finance/…) that no corpus doc actually covers, so a healthy
+  engine returns nothing relevant. Mixed zh/en, ids prefixed `zh-`/`en-`.
+- The negatives ride the **same** corpus as `domain-bilingual-v1` so "no
+  hallucinated relevance on off-corpus queries" is measured against the real
+  domain index.
+
+## 4. Generation + verification workflow
+
+1. **Generate.** `scripts/generate_candidates.py --dataset <name> --queries N`
+   reads `datasets/<name>/corpus/` and prompts MiniMax for a JSON array of
+   candidates (`q`, `type`, `expect_any`, `evidence`, `confidence`, `rationale`;
+   `expect_none=true` for negatives), driven through the factory (retries, JSON
+   repair, audit, `--resume`). Generate with **coverage control**: ensure each doc
+   gets ≥ 1 positive and the zh/en balance holds (run per-language or per-cluster
+   prompts as needed).
+   - **Key wiring note:** the factory's default transport reads
+     `get_required_env("ANTHROPIC_API_KEY")` from `.env` (not `.env.eval`'s
+     `MINIMAX_API_KEY`). Generation provides the MiniMax key value as
+     `ANTHROPIC_API_KEY` **in-process / in a gitignored `.env`, never echoed**.
+     The `llm_base_url` already defaults to MiniMax. This is the one piece of
+     wiring the generation step must set up; no change to `llm_client.py` defaults.
+2. **Verify (the "human-verify" step).** Export candidates and curate each one:
+   confirm the gold stem is correct and the query is genuinely answerable from that
+   doc (positives) / genuinely uncovered (negatives); check the language tag; drop
+   low-confidence or ambiguous items; dedup; rebalance zh/en. `scripts/llm_review.py`
+   may add a second LLM opinion (`pass`/`fail`/`rewrite` + risk flags); it is optional,
+   costs extra, and is off by default. **The final human verifier is the maintainer**, via
+   the committed `queries.yaml` diff in the PR.
+3. **Materialize.** Write curated `queries.yaml` (+ `dataset.yaml`) and copy the
+   corpus into each dataset dir. `scripts/validate_dataset.py datasets/<name>` must
+   pass ($0, before any spend).
+
+## 5. Threshold calibration (§2.3 methodology)
+
+1. **One real-vector run** over both datasets: `scripts/run_eval.py --datasets
+   "$PWD/datasets/domain-bilingual-v1,$PWD/datasets/negatives-ood-v1" --retrieval
+   all` (`--cache read_write`). Real Gitee embedding spend on the cold run; warm
+   reruns hit the snapshot cache. Confirm that `summary.json` reports
+   `worst_exit_code == 0`, then inspect the per-mode `bm25/vector/hybrid` views for RRF lift.
+2. **`domain-bilingual-v1` gate** = blended `observed − margin`, written into
+   `dataset.yaml`: **−0.03** absolute for `ndcg_at_10`/`recall_at_100`, **−0.05**
+   for `hit_at_3`/`hit_at_10` (hit@k is noisier on small sets). Gate is a
+   regression detector, not an aspiration.
+3. **Per-language split** via a new tool `tools/split_metrics_by_lang.py`: read the
+   run's NDJSON `per_query` rows (`id`, `ranked` top-100, `expect_any`), bucket by
+   `id` prefix `zh-`/`en-`, and recompute `hit_at_3`/`hit_at_10`/`mrr`/
+   `ndcg_at_10`/`recall_at_100` per language. Output goes into the `BASELINES.md`
+   entry. Pure function (bucket+compute) is unit-tested; a thin CLI reads the
+   NDJSON. (Feasibility confirmed: `dikw-core/src/dikw_core/eval/runner.py:206-288`
+   emits `per_query` with `id`/`ranked`/`expect_any`.)
+4. **`negatives-ood-v1` is observe-only.** `expect_none` satisfaction is **not** a
+   valid `dataset.yaml` threshold key (only `hit_at_3`/`hit_at_10`/`mrr`/
+   `ndcg_at_10`/`recall_at_100` are), and the engine treats `expect_none` queries as
+   **diagnostic only** (`runner.py:244`, no exit-1). So `dataset.yaml` carries
+   `thresholds: {}`; the value is the recorded `negative_diagnostics` (how many
+   negatives leaked a relevant-looking hit) plus pos-vs-neg top-1 score separation,
+   logged in `BASELINES.md` for Phase 1. A future gate would need an engine feature,
+   out of scope here.
+
+## 6. Deliverables + PR strategy
+
+- **Branch** `eval/phase1-inhouse-datasets`, stacked on `eval/anchor-calibration`
+  (#6); new PR (#7) with base `eval/anchor-calibration`. Merge order stays
+  #5 → #6 → #7 (GitHub auto-retargets on each merge).
+- **New/changed files:**
+  - `datasets/domain-bilingual-v1/**` (corpus copy + queries.yaml + dataset.yaml)
+  - `datasets/negatives-ood-v1/**` (corpus copy + queries.yaml + dataset.yaml)
+  - `tools/split_metrics_by_lang.py` + `tests/test_split_metrics_by_lang.py`
+  - `reports/BASELINES.md` — new dated entry (blended + zh/en split + negatives
+    diagnostics); satisfies the eval-gate content check
+  - possibly a small candidates-export helper if AuditStore→curation needs one
+  - docs touch-ups (`dikw-eval-plan.md` phase note; this spec)
+
+## 7. Constraints
+
+- **`dikw-core` read-only.** Materialize/copy only into `dikw-data` paths; never
+  modify a tracked `dikw-core` file (verify `git -C ../dikw-core status` clean).
+- **Secrets.** `.env`/`.env.eval` keys load in-process only; values are never
+  echoed or committed. `.env` (if created for the factory) is already gitignored.
+- **Real spend.** The calibration run incurs real Gitee embedding cost; LLM
+  generation + optional `llm_review` incur real MiniMax cost. Both are expected and
+  bounded: one cold eval run, a few batched generation calls (per language/cluster),
+  and — if enabled — one review call per candidate.
+
+## 8. Risks
+
+1. **Saturation.** 24 docs across 10 very distinct topics retrieve easily, so even
+   paraphrastic queries may push metrics near 1.0 — weakening "the gate that
+   matters." Mitigation (no corpus change): deliberately write a subset of
+   **intra-cluster-confusable** queries within the multi-doc clusters
+   (`chinese-history`×4, `world-history`×4) so ranking (`ndcg`/`mrr`) has signal.
+   Then follow §2.3 — record observed, gate at `observed − margin`, do **not**
+   engineer to a target. Residual saturation is logged as a known limitation; a
+   denser, deliberately-confusable corpus is a `domain-bilingual-v2` follow-up.
+2. **`expect_none` ungateable** — handled in §5.4 (observe-only).
+3. **Factory key wiring** — handled in §4.1.
+
+## 9. Acceptance criteria
+
+- `scripts/validate_dataset.py` passes for both datasets; `uv run ruff check .`,
+  `uv run mypy src`, `uv run pytest` all green (incl. the new splitter test).
+- Calibration run: `summary.json worst_exit_code == 0`; per-mode views recorded.
+- `reports/BASELINES.md` has a new entry with blended metrics, the zh/en split, and
+  negatives diagnostics; `domain-bilingual-v1` thresholds = `observed − margin`.
+- Cloud CI (incl. `eval-gate`) green on PR #7.
+
+## 10. Out of scope (Phase 1 follow-ups / later phases)
+
+- `domain-bilingual-v2` with a denser, deliberately-confusable corpus (if v1
+  saturates).
+- `mm-asset-v1` (multimodal, Phase 3), `synth-quality-v1` (K-layer, Phase 2) — per
+  `dikw-eval-plan.md` §2.2.
+- Promoting any threshold to a hard gate beyond the first calibrated floor (needs
+  ≥ 2 stable runs per §2.3).

From c897cf0ed80532e3a0c4ebd48f33cfd4641eab4d Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 20:38:26 +0800
Subject: [PATCH 2/7] docs: implementation plan for Phase 1 in-house datasets

8-task plan: per-language metric splitter (TDD), corpus scaffold, LLM query
generation via the MiniMax factory, curation/materialization, real-vector
calibration with observed-margin thresholds, zh/en split into BASELINES.md, and
the stacked PR. dikw-core read-only throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-26-phase1-inhouse-datasets.md     | 562 ++++++++++++++++++
 1 file changed, 562 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-06-26-phase1-inhouse-datasets.md

diff --git a/docs/superpowers/plans/2026-06-26-phase1-inhouse-datasets.md b/docs/superpowers/plans/2026-06-26-phase1-inhouse-datasets.md
new file mode 100644
index 0000000..7876d48
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-26-phase1-inhouse-datasets.md
@@ -0,0 +1,562 @@
+# Phase 1 In-House Datasets Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Build and calibrate two in-house retrieval datasets — `domain-bilingual-v1` (the gate that matters) and `negatives-ood-v1` (off-corpus robustness) — per `docs/dikw-eval-plan.md` §2.2, with thresholds set at `observed − margin` from one real-vector run.
+
+**Architecture:** Reuse `synthetic-diverse-v2`'s 24-doc corpus (12 zh / 12 en). LLM-generate queries through the existing MiniMax factory, curate them, materialize two contract-conformant dataset packages. Calibrate with one `run_eval.py` pass; the engine gates one blended flat-threshold set while a new offline tool recovers the zh/en split for `reports/BASELINES.md`. `dikw-core` stays read-only.
+
+**Tech Stack:** Python 3.12/3.13, uv, ruff, mypy (strict, `src` only), pytest; `dikw-core` v0.6.1 (editable); MiniMax (LLM) + Gitee `Qwen3-Embedding-0.6B`@1024 (embeddings).
+
+## Global Constraints
+
+- `dikw-core` is **read-only** — never modify a tracked `dikw-core` file; verify `git -C ../dikw-core status` is clean after every step that touches it.
+- Secrets load **in-process only**; key **values are never echoed or committed**. `.env` and `.env.eval` are gitignored.
+- Dataset contract (`scripts/validate_dataset.py`): each dataset dir has `dataset.yaml` + `queries.yaml` + `corpus/*.md`; every query needs `id` + `q` + exactly one of `expect_any: [stem]` / `expect_none: true`; `thresholds:` is a **flat** map; only `hit_at_3`, `hit_at_10`, `mrr`, `ndcg_at_10`, `recall_at_100` are valid threshold keys.
+- Query ids are **language-prefixed** `zh-…` / `en-…` (a query's language matches its gold doc's frontmatter `language:`).
+- Threshold margins (§2.3): gate at `observed − 0.03` for `ndcg_at_10`/`recall_at_100`, `observed − 0.05` for `hit_at_3`/`hit_at_10`.
+- Branch `eval/phase1-inhouse-datasets` is stacked on `eval/anchor-calibration` (#6); the new PR (#7) bases on `eval/anchor-calibration`.
+
+---
+
+### Task 1: `split_metrics_by_lang` tool (offline per-language metric split)
+
+**Files:**
+- Create: `tools/split_metrics_by_lang.py`
+- Test: `tests/test_split_metrics_by_lang.py`
+
+**Interfaces:**
+- Produces: `split_metrics(rows: list[dict]) -> dict` returning `{"all": {<5 metrics>}, "zh": {...}, "en": {...}, "counts": {...}}` (plus an `"other"` bucket when any id lacks a `zh-`/`en-` prefix); `lang_of(qid: str) -> str`; `aggregate(rows) -> dict[str,float]`; `per_query_rows(ndjson_path: str) -> list[dict]`; `format_markdown(name, split) -> str`; CLI `python tools/split_metrics_by_lang.py <ndjson> --name <n>`.
+- Metric formulas mirror `dikw-core/src/dikw_core/eval/metrics.py` exactly (binary `expect_any` relevance), so `split_metrics(all_rows)["all"]` reconciles with the engine's blended doc metrics.
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/test_split_metrics_by_lang.py
+"""Unit tests for the offline per-language metric splitter."""
+
+from __future__ import annotations
+
+import math
+
+import pytest
+
+from tools.split_metrics_by_lang import aggregate, lang_of, split_metrics
+
+# Four queries; ranked lists are gold-stem placements at known ranks.
+ROWS = [
+    {"id": "zh-a", "expect_any": ["doc_a"], "ranked": ["doc_a", "x", "y"]},      # rank 1
+    {"id": "zh-b", "expect_any": ["doc_b"], "ranked": ["x", "doc_b", "y"]},      # rank 2
+    {"id": "en-c", "expect_any": ["doc_c"], "ranked": ["x", "y", "z", "doc_c"]},  # rank 4
+    {"id": "en-d", "expect_any": ["doc_d"], "ranked": ["x", "y", "z"]},           # miss
+]
+
+
+def test_lang_of_prefix():
+    assert lang_of("zh-a") == "zh"
+    assert lang_of("en-c") == "en"
+    assert lang_of("scifact_q1") == "other"
+
+
+def test_zh_bucket_metrics():
+    m = split_metrics(ROWS)["zh"]
+    assert m["hit_at_3"] == pytest.approx(1.0)
+    assert m["mrr"] == pytest.approx((1.0 + 0.5) / 2)
+    assert m["ndcg_at_10"] == pytest.approx((1.0 + 1.0 / math.log2(3)) / 2)
+    assert m["recall_at_100"] == pytest.approx(1.0)
+
+
+def test_en_bucket_metrics():
+    m = split_metrics(ROWS)["en"]
+    assert m["hit_at_3"] == pytest.approx(0.0)
+    assert m["hit_at_10"] == pytest.approx(0.5)
+    assert m["mrr"] == pytest.approx((0.25 + 0.0) / 2)
+    assert m["ndcg_at_10"] == pytest.approx((1.0 / math.log2(5) + 0.0) / 2)
+
+
+def test_all_reconciles_with_full_aggregate():
+    split = split_metrics(ROWS)
+    assert split["all"] == aggregate(ROWS)
+    assert split["counts"] == {"all": 4, "zh": 2, "en": 2}
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `uv run pytest tests/test_split_metrics_by_lang.py -q`
+Expected: FAIL — `ModuleNotFoundError: No module named 'tools.split_metrics_by_lang'`.
+
+- [ ] **Step 3: Write the implementation**
+
+```python
+# tools/split_metrics_by_lang.py
+"""Offline per-language split of an eval NDJSON's per-query rows.
+
+The dataset contract's ``thresholds:`` map is flat (no per-language nesting), so a
+bilingual dataset is gated on one blended set. This tool recovers the zh/en split
+that ``docs/dikw-eval-plan.md`` §2.4 asks for, for ``reports/BASELINES.md``: read an
+eval NDJSON, take the EvalReport's ``per_query`` rows, bucket by query-id prefix
+(``zh-`` / ``en-``), and recompute the five retrieval metrics per bucket.
+
+The metric formulas mirror ``dikw-core/src/dikw_core/eval/metrics.py`` exactly
+(binary ``expect_any`` relevance) so ``split_metrics(rows)["all"]`` reconciles with
+the engine's reported blended doc metrics — the calibration run asserts that
+equality.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+from collections.abc import Iterable, Sequence
+from typing import Any
+
+METRIC_KEYS = ("hit_at_3", "hit_at_10", "mrr", "ndcg_at_10", "recall_at_100")
+
+
+def hit_at_k(ranked: Sequence[str], expected_any: Iterable[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    top = set(ranked[:k])
+    return 1.0 if any(e in top for e in expected_any) else 0.0
+
+
+def reciprocal_rank(ranked: Sequence[str], expected_any: Iterable[str]) -> float:
+    expected = set(expected_any)
+    for idx, key in enumerate(ranked, start=1):
+        if key in expected:
+            return 1.0 / idx
+    return 0.0
+
+
+def ndcg_at_k(ranked: Sequence[str], expected_any: Iterable[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    expected = set(expected_any)
+    if not expected:
+        return 0.0
+    dcg = 0.0
+    for idx, key in enumerate(ranked[:k], start=1):
+        if key in expected:
+            dcg += 1.0 / math.log2(idx + 1)
+    n_rel = min(len(expected), k)
+    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, n_rel + 1))
+    return dcg / idcg if idcg > 0 else 0.0
+
+
+def recall_at_k(ranked: Sequence[str], expected_any: Iterable[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    expected = set(expected_any)
+    if not expected:
+        return 0.0
+    return len(set(ranked[:k]) & expected) / len(expected)
+
+
+def aggregate(rows: list[dict[str, Any]]) -> dict[str, float]:
+    """Mean of each metric across rows. Empty input → all zeros."""
+    if not rows:
+        return dict.fromkeys(METRIC_KEYS, 0.0)
+    n = len(rows)
+    pairs = [(r.get("ranked") or [], r.get("expect_any") or []) for r in rows]
+    return {
+        "hit_at_3": sum(hit_at_k(rk, ex, 3) for rk, ex in pairs) / n,
+        "hit_at_10": sum(hit_at_k(rk, ex, 10) for rk, ex in pairs) / n,
+        "mrr": sum(reciprocal_rank(rk, ex) for rk, ex in pairs) / n,
+        "ndcg_at_10": sum(ndcg_at_k(rk, ex, 10) for rk, ex in pairs) / n,
+        "recall_at_100": sum(recall_at_k(rk, ex, 100) for rk, ex in pairs) / n,
+    }
+
+
+def lang_of(qid: str) -> str:
+    if qid.startswith("zh-"):
+        return "zh"
+    if qid.startswith("en-"):
+        return "en"
+    return "other"
+
+
+def split_metrics(rows: list[dict[str, Any]]) -> dict[str, Any]:
+    buckets: dict[str, list[dict[str, Any]]] = {"zh": [], "en": [], "other": []}
+    for r in rows:
+        buckets[lang_of(str(r.get("id", "")))].append(r)
+    out: dict[str, Any] = {"all": aggregate(rows), "counts": {"all": len(rows)}}
+    for lang in ("zh", "en", "other"):
+        if buckets[lang]:
+            out[lang] = aggregate(buckets[lang])
+            out["counts"][lang] = len(buckets[lang])
+    return out
+
+
+def per_query_rows(ndjson_path: str) -> list[dict[str, Any]]:
+    """Extract the EvalReport ``per_query`` rows from an eval NDJSON file.
+
+    The stream carries progress events plus the final EvalReport; the report is the
+    line with the longest ``per_query`` list (ties resolve to the last such line).
+    """
+    best: list[dict[str, Any]] = []
+    with open(ndjson_path, encoding="utf-8") as fh:
+        for raw in fh:
+            raw = raw.strip()
+            if not raw:
+                continue
+            try:
+                obj = json.loads(raw)
+            except json.JSONDecodeError:
+                continue
+            pq = obj.get("per_query") if isinstance(obj, dict) else None
+            if isinstance(pq, list) and len(pq) >= len(best):
+                best = pq
+    return best
+
+
+def format_markdown(name: str, split: dict[str, Any]) -> str:
+    head = "| lang | n | " + " | ".join(METRIC_KEYS) + " |"
+    sep = "|" + "---|" * (2 + len(METRIC_KEYS))
+    lines = [f"### {name}", "", head, sep]
+    for lang in ("all", "zh", "en", "other"):
+        if lang not in split:
+            continue
+        m = split[lang]
+        n = split["counts"].get(lang, "")
+        cells = " | ".join(f"{m[k]:.3f}" for k in METRIC_KEYS)
+        lines.append(f"| {lang} | {n} | {cells} |")
+    return "\n".join(lines)
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description="Per-language split of an eval NDJSON.")
+    p.add_argument("ndjson", help="path to an eval NDJSON with per_query rows")
+    p.add_argument("--name", default="dataset")
+    args = p.parse_args(argv)
+    rows = per_query_rows(args.ndjson)
+    if not rows:
+        print(f"::error::no per_query rows in {args.ndjson}", file=sys.stderr)
+        return 1
+    print(format_markdown(args.name, split_metrics(rows)))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
+```
+
+- [ ] **Step 4: Run tests + lint to verify they pass**
+
+Run: `uv run pytest tests/test_split_metrics_by_lang.py -q && uv run ruff check tools/split_metrics_by_lang.py tests/test_split_metrics_by_lang.py`
+Expected: tests PASS, ruff clean.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add tools/split_metrics_by_lang.py tests/test_split_metrics_by_lang.py
+git commit -m "feat(tools): per-language metric splitter for bilingual eval NDJSON"
+```
+
+---
+
+### Task 2: Scaffold both dataset packages (corpus copy)
+
+**Files:**
+- Create: `datasets/domain-bilingual-v1/corpus/*.md` (24 files), `datasets/negatives-ood-v1/corpus/*.md` (24 files)
+- Create: `datasets/domain-bilingual-v1/dataset.yaml`, `datasets/negatives-ood-v1/dataset.yaml` (thresholds empty for now)
+
+- [ ] **Step 1: Copy the reused corpus into both datasets**
+
+```bash
+cd /Users/hele/Projects/opendikw/dikw-data/.claude/worktrees/dikw-data-ci-flag
+for d in domain-bilingual-v1 negatives-ood-v1; do
+  mkdir -p "datasets/$d/corpus"
+  cp datasets/synthetic-diverse-v2/corpus/*.md "datasets/$d/corpus/"
+done
+ls datasets/domain-bilingual-v1/corpus/*.md | wc -l   # expect 24
+ls datasets/negatives-ood-v1/corpus/*.md | wc -l       # expect 24
+```
+
+- [ ] **Step 2: Write placeholder `dataset.yaml` (thresholds filled in Task 6)**
+
+`datasets/domain-bilingual-v1/dataset.yaml`:
+
+```yaml
+name: domain-bilingual-v1
+description: >
+  In-house bilingual (50/50 zh+en) domain retrieval set over the diverse-v2
+  corpus. The Phase-1 retrieval gate. Engine gates one blended flat-threshold
+  set; the zh/en split is recorded in reports/BASELINES.md (see
+  docs/dikw-eval-plan.md §2.4). Thresholds set at observed − margin.
+thresholds: {}
+```
+
+`datasets/negatives-ood-v1/dataset.yaml`:
+
+```yaml
+name: negatives-ood-v1
+description: >
+  Off-corpus negatives (expect_none) riding the diverse-v2 corpus: plausible but
+  unanswerable zh+en queries that a healthy engine returns nothing relevant for.
+  expect_none is diagnostic-only in dikw-core (no threshold key), so this set is
+  observe-only — see reports/BASELINES.md negative diagnostics.
+thresholds: {}
+```
+
+- [ ] **Step 3: Verify dikw-core untouched + commit the scaffold**
+
+```bash
+git -C ../../../../dikw-core status --short   # expect empty
+git add datasets/domain-bilingual-v1 datasets/negatives-ood-v1
+git commit -m "feat(datasets): scaffold domain-bilingual-v1 + negatives-ood-v1 (corpus copy)"
+```
+
+> NOTE: `validate_dataset.py` will fail until `queries.yaml` exists (Task 4) — that's expected; do not run it yet.
+
+---
+
+### Task 3: LLM-generate query candidates (real MiniMax spend)
+
+**Files:**
+- Create (gitignored): `.env` (factory key), `generated/<dataset>/…` candidate audit
+- Verify: `configs/minimax.yml` exists
+
+- [ ] **Step 1: Wire the factory key without echoing its value**
+
+The factory transport reads `ANTHROPIC_API_KEY` from `.env` (`src/dikw_data/config.py`), while the eval key lives in `.env.eval` as `MINIMAX_API_KEY`. Copy the value across without printing it:
+
+```bash
+test -f configs/minimax.yml || echo "MISSING configs/minimax.yml"
+grep -q '^ANTHROPIC_API_KEY=' .env 2>/dev/null || \
+  grep '^MINIMAX_API_KEY=' .env.eval | sed 's/^MINIMAX_API_KEY=/ANTHROPIC_API_KEY=/' >> .env
+grep -c '^ANTHROPIC_API_KEY=' .env   # expect 1; value never printed
+```
+
+- [ ] **Step 2: Dry-run the generator (no spend) to confirm wiring**
+
+```bash
+UV_NO_SYNC=1 uv run python scripts/generate_candidates.py --dataset domain-bilingual-v1 --queries 30 --dry-run
+```
+Expected: JSON status lines with `"status": "dry_run"`, exit 0.
+
+- [ ] **Step 3: Generate positives for `domain-bilingual-v1` (real spend)**
+
+```bash
+UV_NO_SYNC=1 uv run python scripts/generate_candidates.py --dataset domain-bilingual-v1 --queries 50 --resume
+```
+Generates ~50 candidates (so curation can keep ~40 with each doc covered). Candidates land in the AuditStore under `generated/domain-bilingual-v1/`. Re-run with `--resume` if it stalls.
+
+- [ ] **Step 4: Generate negatives for `negatives-ood-v1` (real spend)**
+
+```bash
+UV_NO_SYNC=1 uv run python scripts/generate_candidates.py --dataset negatives-ood-v1 --queries 35 --resume
+```
+The generator's prompt already supports `expect_none=true` negatives. Aim for ~35 candidates to curate down to ~25.
+
+- [ ] **Step 5: Export candidates to a readable JSON for curation**
+
+Read the AuditStore results (the generator persists each task's `result_json`). Dump the candidate arrays to `$CLAUDE_JOB_DIR/tmp/<dataset>-candidates.json` for the curation pass in Task 4. (Inspect `src/dikw_data/audit.py` for the read API; the candidates are the JSON array in the successful task's `result_json`.)
+
+> No commit — `.env` and `generated/` are gitignored.
+
+---
+
+### Task 4: Curate + materialize `queries.yaml` (the human-verify gate)
+
+**Files:**
+- Create: `datasets/domain-bilingual-v1/queries.yaml`, `datasets/negatives-ood-v1/queries.yaml`
+
+- [ ] **Step 1: Curate `domain-bilingual-v1` positives**
+
+For each candidate, confirm: the `expect_any` stem exists in the corpus and genuinely answers the query; the query language matches that stem's frontmatter `language:`; drop low-`confidence`/ambiguous/duplicate items. Keep ~40 with **every doc covered ≥ 1** and **zh/en balanced (~20/~20)**. Deliberately retain a handful of **intra-cluster-confusable** queries within `chinese-history`×4 and `world-history`×4 (e.g. a vaguely-phrased "中国某王朝的中央集权改革" that plausibly matches qin/tang/wang-anshi) so ranking has signal. Optionally run `scripts/llm_review.py` for a second opinion.
+
+- [ ] **Step 2: Write `datasets/domain-bilingual-v1/queries.yaml`**
+
+Shape (ids language-prefixed; a query's language = its gold doc's language):
+
+```yaml
+queries:
+  - id: zh-tang-founding-basis
+    q: "唐朝建立的核心政治基础是什么？"
+    expect_any: [chinese-history-tang-founding]
+  - id: en-photosynthesis-chlorophyll
+    q: "What role does chlorophyll play in photosynthesis?"
+    expect_any: [science-photosynthesis]
+  # … ~40 total, ~20 zh + ~20 en, every corpus stem covered ≥ 1
+```
+
+- [ ] **Step 3: Curate + write `datasets/negatives-ood-v1/queries.yaml`**
+
+~25 `expect_none` queries, plausible-but-uncovered, mixed zh/en, ids prefixed:
+
+```yaml
+queries:
+  - id: en-docker-bridge-network
+    q: "How do I configure a Docker bridge network?"
+    expect_none: true
+  - id: zh-bike-derailleur-repair
+    q: "如何修理自行车变速器？"
+    expect_none: true
+  # … ~25 total, ~12 zh + ~13 en
+```
+
+- [ ] **Step 4: Validate both datasets ($0, before any eval spend)**
+
+```bash
+uv run python scripts/validate_dataset.py datasets/domain-bilingual-v1
+uv run python scripts/validate_dataset.py datasets/negatives-ood-v1
+```
+Expected: both report OK / exit 0. Fix any unresolved-stem or duplicate-id errors.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add datasets/domain-bilingual-v1/queries.yaml datasets/negatives-ood-v1/queries.yaml
+git commit -m "feat(datasets): curated bilingual positives + ood negatives queries"
+```
+
+---
+
+### Task 5: Calibration eval run (real Gitee spend) + reconcile splitter
+
+**Files:**
+- Create (gitignored): `reports/<UTC-ts>/…` NDJSON + `summary.json`
+
+- [ ] **Step 1: Run the real-vector eval over both datasets**
+
+```bash
+UV_NO_SYNC=1 uv run python scripts/run_eval.py \
+  --datasets "$PWD/datasets/domain-bilingual-v1,$PWD/datasets/negatives-ood-v1" \
+  --retrieval all
+```
+Cold run pays Gitee embedding; warm reruns hit the snapshot cache. Capture the printed `reports/<UTC-ts>/` path.
+
+- [ ] **Step 2: Confirm exit health + per-mode views**
+
+```bash
+cat reports/<UTC-ts>/summary.json   # worst_exit_code == 0
+```
+Expected: `worst_exit_code == 0`; per-mode `bm25/vector/hybrid` blocks present for `domain-bilingual-v1`.
+
+- [ ] **Step 3: Reconcile the splitter against the engine (catches formula drift)**
+
+Run the splitter on the hybrid NDJSON and confirm its `all` block equals the engine's blended doc metrics in `summary.json` (within 1e-9):
+
+```bash
+uv run python tools/split_metrics_by_lang.py \
+  reports/<UTC-ts>/domain-bilingual-v1__hybrid.ndjson --name domain-bilingual-v1
+```
+Compare the `all` row to the engine's reported `doc/hybrid` metrics. If they diverge, the splitter formulas are wrong — fix before proceeding. (The actual NDJSON filename may differ; use the hybrid/doc report that carries `per_query`.)
+
+> No commit — `reports/<ts>/` is gitignored (only `reports/BASELINES.md` is tracked).
+
+---
+
+### Task 6: Set thresholds at `observed − margin` + re-validate
+
+**Files:**
+- Modify: `datasets/domain-bilingual-v1/dataset.yaml`
+
+- [ ] **Step 1: Fill `domain-bilingual-v1` thresholds from observed blended (doc/hybrid)**
+
+Using the `all` row from Task 5: gate at `observed − 0.03` for `ndcg_at_10`/`recall_at_100`, `observed − 0.05` for `hit_at_3`/`hit_at_10`; `mrr` at `observed − 0.05`. Replace `thresholds: {}` with the computed values, e.g.:
+
+```yaml
+thresholds:
+  hit_at_3: 0.XX        # observed_hit_at_3 − 0.05
+  hit_at_10: 0.XX       # observed_hit_at_10 − 0.05
+  mrr: 0.XX             # observed_mrr − 0.05
+  ndcg_at_10: 0.XX      # observed_ndcg_at_10 − 0.03
+  recall_at_100: 0.XX   # observed_recall_at_100 − 0.03
+```
+`negatives-ood-v1/dataset.yaml` stays `thresholds: {}` (observe-only).
+
+- [ ] **Step 2: Re-validate + re-run to confirm the gate passes**
+
+```bash
+uv run python scripts/validate_dataset.py datasets/domain-bilingual-v1
+UV_NO_SYNC=1 uv run python scripts/run_eval.py \
+  --datasets "$PWD/datasets/domain-bilingual-v1" --retrieval all
+cat reports/<UTC-ts>/summary.json   # worst_exit_code == 0 (warm; cache hit)
+```
+Expected: validation OK; eval exit 0 (observed clears the just-set floor by the margin).
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add datasets/domain-bilingual-v1/dataset.yaml
+git commit -m "feat(datasets): calibrate domain-bilingual-v1 thresholds at observed-margin"
+```
+
+---
+
+### Task 7: Record the baseline entry (satisfies eval-gate)
+
+**Files:**
+- Modify: `reports/BASELINES.md`
+
+- [ ] **Step 1: Append a new dated entry**
+
+Add a `## 2026-06-26 — domain-bilingual-v1 + negatives-ood-v1 (Phase 1 calibration)` entry. Include: dikw-core v0.6.1 + MiniMax+Gitee + `--retrieval all`; the per-mode blended table for `domain-bilingual-v1`; the **zh/en split** table from `tools/split_metrics_by_lang.py`; the chosen `observed − margin` thresholds; `negatives-ood-v1` diagnostics (how many negatives leaked, pos-vs-neg top-1 score separation); and a saturation note if metrics ran high. Name at least one retrieval metric (the eval-gate content check requires a new dated header + a metric token).
+
+- [ ] **Step 2: Verify the eval-gate content check passes locally**
+
+```bash
+uv run python -c "from tools.check_baselines import check_baseline_addition as c; \
+import sys; \
+lines=open('reports/BASELINES.md',encoding='utf-8').read().splitlines(); \
+print(c([l for l in lines if '2026-06-26' in l or 'ndcg' in l], existing_headers=set(), touches_datasets=True))"
+```
+Expected: `[]` (no violations) — a new dated header + a retrieval metric are present.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add reports/BASELINES.md
+git commit -m "eval: record Phase 1 in-house dataset calibration (blended + zh/en split)"
+```
+
+---
+
+### Task 8: Final gates + PR #7
+
+- [ ] **Step 1: Run the full local CI floor**
+
+```bash
+uv run ruff check . && uv run mypy src && uv run pytest -q
+for d in domain-bilingual-v1 negatives-ood-v1; do
+  uv run python scripts/validate_dataset.py "datasets/$d"
+done
+git -C ../../../../dikw-core status --short   # expect empty (read-only held)
+```
+Expected: all green; dikw-core clean.
+
+- [ ] **Step 2: Push the stacked branch**
+
+```bash
+git push -u origin eval/phase1-inhouse-datasets
+```
+
+- [ ] **Step 3: Open PR #7 stacked on #6**
+
+```bash
+gh pr create --base eval/anchor-calibration --head eval/phase1-inhouse-datasets \
+  --title "feat(datasets): Phase 1 in-house sets — domain-bilingual-v1 + negatives-ood-v1" \
+  --body "Implements eval-plan datasets ii/iii. Single bilingual dataset with engine-gated blended thresholds (observed−margin) + offline zh/en split recorded in BASELINES.md; observe-only off-corpus negatives. dikw-core read-only. Merge order #5 → #6 → #7.
+
+🤖 Generated with [Claude Code](https://claude.com/claude-code)"
+```
+
+- [ ] **Step 4: Confirm cloud CI green**
+
+```bash
+gh pr checks   # lint-type-test (3.12 + 3.13) + eval-gate all pass
+```
+Expected: all checks green. If `eval-gate` is red, the likely cause is the shape of the BASELINES.md entry (Task 7); fix it and re-push.
+
+---
+
+## Self-Review
+
+**Spec coverage:** §2 decisions → Tasks 2/3/4 (single dataset, reuse corpus, LLM-gen). §3 dataset specs → Tasks 2/4. §4 generation+verify → Tasks 3/4. §5 calibration (run, splitter, per-language, negatives observe-only) → Tasks 1/5/6/7. §6 deliverables/PR → Task 8. §7 constraints → Global Constraints + read-only checks in Tasks 2/8. §8 risks → Task 4 Step 1 (saturation mitigation), §5.4 (negatives), §4.1 (key wiring → Task 3 Step 1). §9 acceptance → Tasks 6/7/8. No gaps.
+
+**Placeholder scan:** The only `0.XX` placeholders are in Task 6 thresholds, which are *necessarily* computed from the Task-5 run (the methodology forbids guessing them) — each carries the exact formula. `reports/<UTC-ts>/` and the NDJSON filename are runtime-resolved paths, flagged as such. No disallowed "add error handling"/"write tests"-style gaps.
+
+**Type consistency:** `split_metrics`/`aggregate`/`lang_of`/`per_query_rows`/`format_markdown` signatures match across the implementation (Task 1 Step 3), the test (Task 1 Step 1), and the CLI usage (Task 5 Step 3). Metric keys are the single `METRIC_KEYS` tuple throughout.

From 7924f1cbd51f5c543ac8a6f991f9c5dbd3eed8e5 Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 20:39:44 +0800
Subject: [PATCH 3/7] feat(tools): per-language metric splitter for bilingual
 eval NDJSON
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Recovers the zh/en breakdown (docs/dikw-eval-plan.md §2.4) from an eval NDJSON's
per_query rows, since dataset.yaml thresholds are flat. Metric formulas mirror
dikw-core/src/dikw_core/eval/metrics.py exactly so split_metrics(all) reconciles
with the engine's blended doc metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
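Reviewer note (not part of the commit): the reconcile claim in the message above
can be spot-checked with a sketch like the one below. It assumes the engine's
blended doc metrics have already been read out of summary.json into a flat dict
keyed by the same five metric names; that dict shape and the helper name
`reconcile` are illustrative assumptions, not something this patch adds. The
1e-9 tolerance is the one the plan's calibration step uses.

```python
import math

# Same five keys the splitter reports (METRIC_KEYS in tools/split_metrics_by_lang.py).
METRIC_KEYS = ("hit_at_3", "hit_at_10", "mrr", "ndcg_at_10", "recall_at_100")


def reconcile(split_all, engine, tol=1e-9):
    """Return the metric keys that are missing or differ by more than tol.

    split_all: the "all" block from split_metrics(rows).
    engine:    hypothetical flat dict of the engine's blended doc metrics.
    """
    drifted = []
    for key in METRIC_KEYS:
        if key not in split_all or key not in engine:
            drifted.append(key)
        elif not math.isclose(split_all[key], engine[key], rel_tol=0.0, abs_tol=tol):
            drifted.append(key)
    return drifted
```

An empty list means the splitter and the engine agree on every metric; any
returned key points at the formula (or the input file) to inspect first.
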
 tests/test_split_metrics_by_lang.py |  44 ++++++++
 tools/split_metrics_by_lang.py      | 151 ++++++++++++++++++++++++++++
 2 files changed, 195 insertions(+)
 create mode 100644 tests/test_split_metrics_by_lang.py
 create mode 100644 tools/split_metrics_by_lang.py

diff --git a/tests/test_split_metrics_by_lang.py b/tests/test_split_metrics_by_lang.py
new file mode 100644
index 0000000..5cb55f0
--- /dev/null
+++ b/tests/test_split_metrics_by_lang.py
@@ -0,0 +1,44 @@
+"""Unit tests for the offline per-language metric splitter."""
+
+from __future__ import annotations
+
+import math
+
+import pytest
+from tools.split_metrics_by_lang import aggregate, lang_of, split_metrics
+
+# Four queries; ranked lists place the gold stem at known ranks.
+ROWS = [
+    {"id": "zh-a", "expect_any": ["doc_a"], "ranked": ["doc_a", "x", "y"]},       # rank 1
+    {"id": "zh-b", "expect_any": ["doc_b"], "ranked": ["x", "doc_b", "y"]},       # rank 2
+    {"id": "en-c", "expect_any": ["doc_c"], "ranked": ["x", "y", "z", "doc_c"]},  # rank 4
+    {"id": "en-d", "expect_any": ["doc_d"], "ranked": ["x", "y", "z"]},           # miss
+]
+
+
+def test_lang_of_prefix():
+    assert lang_of("zh-a") == "zh"
+    assert lang_of("en-c") == "en"
+    assert lang_of("scifact_q1") == "other"
+
+
+def test_zh_bucket_metrics():
+    m = split_metrics(ROWS)["zh"]
+    assert m["hit_at_3"] == pytest.approx(1.0)
+    assert m["mrr"] == pytest.approx((1.0 + 0.5) / 2)
+    assert m["ndcg_at_10"] == pytest.approx((1.0 + 1.0 / math.log2(3)) / 2)
+    assert m["recall_at_100"] == pytest.approx(1.0)
+
+
+def test_en_bucket_metrics():
+    m = split_metrics(ROWS)["en"]
+    assert m["hit_at_3"] == pytest.approx(0.0)
+    assert m["hit_at_10"] == pytest.approx(0.5)
+    assert m["mrr"] == pytest.approx((0.25 + 0.0) / 2)
+    assert m["ndcg_at_10"] == pytest.approx((1.0 / math.log2(5) + 0.0) / 2)
+
+
+def test_all_reconciles_with_full_aggregate():
+    split = split_metrics(ROWS)
+    assert split["all"] == aggregate(ROWS)
+    assert split["counts"] == {"all": 4, "zh": 2, "en": 2}
diff --git a/tools/split_metrics_by_lang.py b/tools/split_metrics_by_lang.py
new file mode 100644
index 0000000..4e00b31
--- /dev/null
+++ b/tools/split_metrics_by_lang.py
@@ -0,0 +1,151 @@
+"""Offline per-language split of an eval NDJSON's per-query rows.
+
+The dataset contract's ``thresholds:`` map is flat (no per-language nesting), so a
+bilingual dataset is gated on one blended set. This tool recovers the zh/en split
+that ``docs/dikw-eval-plan.md`` §2.4 asks for, for ``reports/BASELINES.md``: read an
+eval NDJSON, take the EvalReport's ``per_query`` rows, bucket by query-id prefix
+(``zh-`` / ``en-``), and recompute the five retrieval metrics per bucket.
+
+The metric formulas mirror ``dikw-core/src/dikw_core/eval/metrics.py`` exactly
+(binary ``expect_any`` relevance) so ``split_metrics(rows)["all"]`` reconciles with
+the engine's reported blended doc metrics — the calibration run asserts that
+equality.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+from collections.abc import Iterable, Sequence
+from typing import Any
+
+METRIC_KEYS = ("hit_at_3", "hit_at_10", "mrr", "ndcg_at_10", "recall_at_100")
+
+
+def hit_at_k(ranked: Sequence[str], expected_any: Iterable[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    top = set(ranked[:k])
+    return 1.0 if any(e in top for e in expected_any) else 0.0
+
+
+def reciprocal_rank(ranked: Sequence[str], expected_any: Iterable[str]) -> float:
+    expected = set(expected_any)
+    for idx, key in enumerate(ranked, start=1):
+        if key in expected:
+            return 1.0 / idx
+    return 0.0
+
+
+def ndcg_at_k(ranked: Sequence[str], expected_any: Iterable[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    expected = set(expected_any)
+    if not expected:
+        return 0.0
+    dcg = 0.0
+    for idx, key in enumerate(ranked[:k], start=1):
+        if key in expected:
+            dcg += 1.0 / math.log2(idx + 1)
+    n_rel = min(len(expected), k)
+    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, n_rel + 1))
+    return dcg / idcg if idcg > 0 else 0.0
+
+
+def recall_at_k(ranked: Sequence[str], expected_any: Iterable[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    expected = set(expected_any)
+    if not expected:
+        return 0.0
+    return len(set(ranked[:k]) & expected) / len(expected)
+
+
+def aggregate(rows: list[dict[str, Any]]) -> dict[str, float]:
+    """Mean of each metric across rows. Empty input → all zeros."""
+    if not rows:
+        return dict.fromkeys(METRIC_KEYS, 0.0)
+    n = len(rows)
+    pairs = [(r.get("ranked", []), r.get("expect_any", [])) for r in rows]
+    return {
+        "hit_at_3": sum(hit_at_k(rk, ex, 3) for rk, ex in pairs) / n,
+        "hit_at_10": sum(hit_at_k(rk, ex, 10) for rk, ex in pairs) / n,
+        "mrr": sum(reciprocal_rank(rk, ex) for rk, ex in pairs) / n,
+        "ndcg_at_10": sum(ndcg_at_k(rk, ex, 10) for rk, ex in pairs) / n,
+        "recall_at_100": sum(recall_at_k(rk, ex, 100) for rk, ex in pairs) / n,
+    }
+
+
+def lang_of(qid: str) -> str:
+    if qid.startswith("zh-"):
+        return "zh"
+    if qid.startswith("en-"):
+        return "en"
+    return "other"
+
+
+def split_metrics(rows: list[dict[str, Any]]) -> dict[str, Any]:
+    buckets: dict[str, list[dict[str, Any]]] = {"zh": [], "en": [], "other": []}
+    for r in rows:
+        buckets[lang_of(str(r.get("id", "")))].append(r)
+    out: dict[str, Any] = {"all": aggregate(rows), "counts": {"all": len(rows)}}
+    for lang in ("zh", "en", "other"):
+        if buckets[lang]:
+            out[lang] = aggregate(buckets[lang])
+            out["counts"][lang] = len(buckets[lang])
+    return out
+
+
+def per_query_rows(ndjson_path: str) -> list[dict[str, Any]]:
+    """Extract the EvalReport ``per_query`` rows from an eval NDJSON file.
+
+    The stream carries progress events plus the final EvalReport; the report is the
+    line with the longest ``per_query`` list (ties resolve to the last such line).
+    """
+    best: list[dict[str, Any]] = []
+    with open(ndjson_path, encoding="utf-8") as fh:
+        for raw in fh:
+            line = raw.strip()
+            if not line:
+                continue
+            try:
+                obj = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            pq = obj.get("per_query") if isinstance(obj, dict) else None
+            if isinstance(pq, list) and len(pq) >= len(best):
+                best = pq
+    return best
+
+
+def format_markdown(name: str, split: dict[str, Any]) -> str:
+    head = "| lang | n | " + " | ".join(METRIC_KEYS) + " |"
+    sep = "|" + "---|" * (2 + len(METRIC_KEYS))
+    lines = [f"### {name}", "", head, sep]
+    for lang in ("all", "zh", "en", "other"):
+        if lang not in split:
+            continue
+        m = split[lang]
+        n = split["counts"].get(lang, "")
+        cells = " | ".join(f"{m[k]:.3f}" for k in METRIC_KEYS)
+        lines.append(f"| {lang} | {n} | {cells} |")
+    return "\n".join(lines)
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description="Per-language split of an eval NDJSON.")
+    p.add_argument("ndjson", help="path to an eval NDJSON with per_query rows")
+    p.add_argument("--name", default="dataset")
+    args = p.parse_args(argv)
+    rows = per_query_rows(args.ndjson)
+    if not rows:
+        print(f"::error::no per_query rows in {args.ndjson}", file=sys.stderr)
+        return 1
+    print(format_markdown(args.name, split_metrics(rows)))
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

From 088c0583887c7f5c79440f60162425d7e8c9e5d6 Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 20:40:24 +0800
Subject: [PATCH 4/7] feat(datasets): scaffold domain-bilingual-v1 +
 negatives-ood-v1 (corpus copy)

Both reuse synthetic-diverse-v2's 24-doc corpus (12 zh / 12 en). dataset.yaml
thresholds left empty; domain-bilingual-v1 is calibrated to observed-margin after
the real-vector run, negatives-ood-v1 stays observe-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../corpus/chinese-history-opium-war.md          | 16 ++++++++++++++++
 .../corpus/chinese-history-qin-unification.md    | 16 ++++++++++++++++
 .../corpus/chinese-history-tang-founding.md      | 16 ++++++++++++++++
 .../corpus/chinese-history-wang-anshi-reform.md  | 16 ++++++++++++++++
 .../corpus/economics-externalities.md            | 16 ++++++++++++++++
 .../corpus/economics-supply-demand.md            | 16 ++++++++++++++++
 .../corpus/finance-compound-interest.md          | 16 ++++++++++++++++
 .../corpus/finance-inflation-bonds.md            | 16 ++++++++++++++++
 .../corpus/geography-monsoon-climate.md          | 16 ++++++++++++++++
 .../corpus/geography-river-deltas.md             | 16 ++++++++++++++++
 .../corpus/law-contract-offer-acceptance.md      | 16 ++++++++++++++++
 .../corpus/law-copyright-fair-use.md             | 16 ++++++++++++++++
 .../corpus/literature-dream-red-chamber.md       | 16 ++++++++++++++++
 .../corpus/literature-shakespeare-macbeth.md     | 16 ++++++++++++++++
 .../corpus/medicine-antibiotic-resistance.md     | 16 ++++++++++++++++
 .../corpus/medicine-vaccination-immunity.md      | 16 ++++++++++++++++
 .../corpus/science-photosynthesis.md             | 16 ++++++++++++++++
 .../corpus/science-plate-tectonics.md            | 16 ++++++++++++++++
 .../corpus/technology-database-indexes.md        | 16 ++++++++++++++++
 .../corpus/technology-public-key-cryptography.md | 16 ++++++++++++++++
 .../corpus/world-history-cold-war-containment.md | 16 ++++++++++++++++
 .../corpus/world-history-french-revolution.md    | 16 ++++++++++++++++
 .../world-history-industrial-revolution.md       | 16 ++++++++++++++++
 .../corpus/world-history-roman-republic.md       | 16 ++++++++++++++++
 datasets/domain-bilingual-v1/dataset.yaml        |  8 ++++++++
 .../corpus/chinese-history-opium-war.md          | 16 ++++++++++++++++
 .../corpus/chinese-history-qin-unification.md    | 16 ++++++++++++++++
 .../corpus/chinese-history-tang-founding.md      | 16 ++++++++++++++++
 .../corpus/chinese-history-wang-anshi-reform.md  | 16 ++++++++++++++++
 .../corpus/economics-externalities.md            | 16 ++++++++++++++++
 .../corpus/economics-supply-demand.md            | 16 ++++++++++++++++
 .../corpus/finance-compound-interest.md          | 16 ++++++++++++++++
 .../corpus/finance-inflation-bonds.md            | 16 ++++++++++++++++
 .../corpus/geography-monsoon-climate.md          | 16 ++++++++++++++++
 .../corpus/geography-river-deltas.md             | 16 ++++++++++++++++
 .../corpus/law-contract-offer-acceptance.md      | 16 ++++++++++++++++
 .../corpus/law-copyright-fair-use.md             | 16 ++++++++++++++++
 .../corpus/literature-dream-red-chamber.md       | 16 ++++++++++++++++
 .../corpus/literature-shakespeare-macbeth.md     | 16 ++++++++++++++++
 .../corpus/medicine-antibiotic-resistance.md     | 16 ++++++++++++++++
 .../corpus/medicine-vaccination-immunity.md      | 16 ++++++++++++++++
 .../corpus/science-photosynthesis.md             | 16 ++++++++++++++++
 .../corpus/science-plate-tectonics.md            | 16 ++++++++++++++++
 .../corpus/technology-database-indexes.md        | 16 ++++++++++++++++
 .../corpus/technology-public-key-cryptography.md | 16 ++++++++++++++++
 .../corpus/world-history-cold-war-containment.md | 16 ++++++++++++++++
 .../corpus/world-history-french-revolution.md    | 16 ++++++++++++++++
 .../world-history-industrial-revolution.md       | 16 ++++++++++++++++
 .../corpus/world-history-roman-republic.md       | 16 ++++++++++++++++
 datasets/negatives-ood-v1/dataset.yaml           |  7 +++++++
 50 files changed, 783 insertions(+)
 create mode 100644 datasets/domain-bilingual-v1/corpus/chinese-history-opium-war.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/chinese-history-qin-unification.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/chinese-history-tang-founding.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/chinese-history-wang-anshi-reform.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/economics-externalities.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/economics-supply-demand.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/finance-compound-interest.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/finance-inflation-bonds.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/geography-monsoon-climate.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/geography-river-deltas.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/law-contract-offer-acceptance.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/law-copyright-fair-use.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/literature-dream-red-chamber.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/literature-shakespeare-macbeth.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/medicine-antibiotic-resistance.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/medicine-vaccination-immunity.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/science-photosynthesis.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/science-plate-tectonics.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/technology-database-indexes.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/technology-public-key-cryptography.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/world-history-cold-war-containment.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/world-history-french-revolution.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/world-history-industrial-revolution.md
 create mode 100644 datasets/domain-bilingual-v1/corpus/world-history-roman-republic.md
 create mode 100644 datasets/domain-bilingual-v1/dataset.yaml
 create mode 100644 datasets/negatives-ood-v1/corpus/chinese-history-opium-war.md
 create mode 100644 datasets/negatives-ood-v1/corpus/chinese-history-qin-unification.md
 create mode 100644 datasets/negatives-ood-v1/corpus/chinese-history-tang-founding.md
 create mode 100644 datasets/negatives-ood-v1/corpus/chinese-history-wang-anshi-reform.md
 create mode 100644 datasets/negatives-ood-v1/corpus/economics-externalities.md
 create mode 100644 datasets/negatives-ood-v1/corpus/economics-supply-demand.md
 create mode 100644 datasets/negatives-ood-v1/corpus/finance-compound-interest.md
 create mode 100644 datasets/negatives-ood-v1/corpus/finance-inflation-bonds.md
 create mode 100644 datasets/negatives-ood-v1/corpus/geography-monsoon-climate.md
 create mode 100644 datasets/negatives-ood-v1/corpus/geography-river-deltas.md
 create mode 100644 datasets/negatives-ood-v1/corpus/law-contract-offer-acceptance.md
 create mode 100644 datasets/negatives-ood-v1/corpus/law-copyright-fair-use.md
 create mode 100644 datasets/negatives-ood-v1/corpus/literature-dream-red-chamber.md
 create mode 100644 datasets/negatives-ood-v1/corpus/literature-shakespeare-macbeth.md
 create mode 100644 datasets/negatives-ood-v1/corpus/medicine-antibiotic-resistance.md
 create mode 100644 datasets/negatives-ood-v1/corpus/medicine-vaccination-immunity.md
 create mode 100644 datasets/negatives-ood-v1/corpus/science-photosynthesis.md
 create mode 100644 datasets/negatives-ood-v1/corpus/science-plate-tectonics.md
 create mode 100644 datasets/negatives-ood-v1/corpus/technology-database-indexes.md
 create mode 100644 datasets/negatives-ood-v1/corpus/technology-public-key-cryptography.md
 create mode 100644 datasets/negatives-ood-v1/corpus/world-history-cold-war-containment.md
 create mode 100644 datasets/negatives-ood-v1/corpus/world-history-french-revolution.md
 create mode 100644 datasets/negatives-ood-v1/corpus/world-history-industrial-revolution.md
 create mode 100644 datasets/negatives-ood-v1/corpus/world-history-roman-republic.md
 create mode 100644 datasets/negatives-ood-v1/dataset.yaml

diff --git a/datasets/domain-bilingual-v1/corpus/chinese-history-opium-war.md b/datasets/domain-bilingual-v1/corpus/chinese-history-opium-war.md
new file mode 100644
index 0000000..c771471
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/chinese-history-opium-war.md
@@ -0,0 +1,16 @@
+---
+title: 鸦片战争与近代中国开端
+language: zh
+source: local-diverse-synthetic
+---
+
+# 鸦片战争与近代中国开端
+
+## 战争原因
+
+19 世纪上半叶，中英贸易失衡、鸦片走私和清政府禁烟政策共同加剧冲突。林则徐虎门销烟后，英国以保护贸易为由发动战争。
+
+## 历史影响
+
+1842 年《南京条约》签订，清政府割让香港岛、开放通商口岸并支付赔款。鸦片战争打破了传统朝贡和闭关体系，使中国被迫进入不平等条约体系，因此常被视为中国近代史的开端。
+
diff --git a/datasets/domain-bilingual-v1/corpus/chinese-history-qin-unification.md b/datasets/domain-bilingual-v1/corpus/chinese-history-qin-unification.md
new file mode 100644
index 0000000..b17e5ae
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/chinese-history-qin-unification.md
@@ -0,0 +1,16 @@
+---
+title: 秦统一六国与郡县制
+language: zh
+source: local-diverse-synthetic
+---
+
+# 秦统一六国与郡县制
+
+## 统一背景
+
+战国后期，秦国通过商鞅变法积累了较强的军事和行政能力。秦王嬴政先后灭韩、赵、魏、楚、燕、齐，于公元前 221 年完成统一，建立中国历史上第一个中央集权王朝。
+
+## 制度变化
+
+秦朝废除分封制，推行郡县制，由中央任命地方官员管理行政事务。分封制依靠宗族和诸侯维系地方秩序，郡县制则强调官僚任免和中央控制。该制度增强了统一国家的行政整合能力，也加重了基层治理压力。
+
diff --git a/datasets/domain-bilingual-v1/corpus/chinese-history-tang-founding.md b/datasets/domain-bilingual-v1/corpus/chinese-history-tang-founding.md
new file mode 100644
index 0000000..e01023a
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/chinese-history-tang-founding.md
@@ -0,0 +1,16 @@
+---
+title: 唐朝建立与李渊集团
+language: zh
+source: local-diverse-synthetic
+---
+
+# 唐朝建立与李渊集团
+
+## 历史背景
+
+隋末社会动荡、徭役沉重，各地起兵削弱了中央控制。李渊原为太原留守，掌握关陇贵族网络和军事资源。617 年，他从太原起兵，进入关中，拥立代王杨侑为帝，以取得政治合法性。
+
+## 关键事实
+
+618 年，李渊接受禅让，建立唐朝，是为唐高祖。唐初政权并非单靠个人军事冒险，而是依托关陇集团、太原兵力和关中地区的行政资源。随后李世民等人继续平定割据势力，唐朝才逐步完成统一。
+
diff --git a/datasets/domain-bilingual-v1/corpus/chinese-history-wang-anshi-reform.md b/datasets/domain-bilingual-v1/corpus/chinese-history-wang-anshi-reform.md
new file mode 100644
index 0000000..e1e473a
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/chinese-history-wang-anshi-reform.md
@@ -0,0 +1,16 @@
+---
+title: 王安石变法的目标与争议
+language: zh
+source: local-diverse-synthetic
+---
+
+# 王安石变法的目标与争议
+
+## 改革目标
+
+北宋中期面临财政压力、边防支出增加和土地兼并等问题。王安石主持的新法包括青苗法、募役法、市易法和方田均税法，目标是增加国家财政能力，减轻部分农户对高利贷和差役的依赖。
+
+## 争议焦点
+
+反对者认为新法执行过急，地方官为了完成指标可能加重百姓负担。支持者则强调国家需要更主动地调节经济资源。王安石变法的争议不只在政策本身，也在于中央集权、官僚执行和社会承受能力之间的张力。
+
diff --git a/datasets/domain-bilingual-v1/corpus/economics-externalities.md b/datasets/domain-bilingual-v1/corpus/economics-externalities.md
new file mode 100644
index 0000000..391ca66
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/economics-externalities.md
@@ -0,0 +1,16 @@
+---
+title: Externalities and Market Failure
+language: en
+source: local-diverse-synthetic
+---
+
+# Externalities and Market Failure
+
+## Private and Social Costs
+
+An externality occurs when an economic activity affects people who are not part of the transaction. Pollution is a negative externality because a factory may not pay the full social cost of dirty air or water.
+
+## Policy Response
+
+When external costs are ignored, markets can produce too much of the harmful activity. Taxes, regulation, tradable permits, or liability rules can push private decisions closer to social costs. Positive externalities, such as vaccination, may justify subsidies.
+
diff --git a/datasets/domain-bilingual-v1/corpus/economics-supply-demand.md b/datasets/domain-bilingual-v1/corpus/economics-supply-demand.md
new file mode 100644
index 0000000..7c0cec7
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/economics-supply-demand.md
@@ -0,0 +1,16 @@
+---
+title: 供给需求与价格变化
+language: zh
+source: local-diverse-synthetic
+---
+
+# 供给需求与价格变化
+
+## 基本关系
+
+需求表示消费者在不同价格下愿意购买的数量，供给表示生产者愿意出售的数量。当需求增加而供给不变时，价格通常上升；当供给增加而需求不变时，价格通常下降。
+
+## 市场调整
+
+价格变化会影响买卖双方行为。高价格鼓励生产者扩大供给，也可能抑制消费者购买。现实市场还会受到税收、补贴、预期和替代品影响，因此供需模型是分析起点，而不是完整解释。
+
diff --git a/datasets/domain-bilingual-v1/corpus/finance-compound-interest.md b/datasets/domain-bilingual-v1/corpus/finance-compound-interest.md
new file mode 100644
index 0000000..7d655d7
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/finance-compound-interest.md
@@ -0,0 +1,16 @@
+---
+title: 复利与长期投资
+language: zh
+source: local-diverse-synthetic
+---
+
+# 复利与长期投资
+
+## 复利机制
+
+复利指收益继续产生收益。若投资收益被重新投入，本金会随时间扩大，下一期收益也以更大的本金为基础计算。因此，在收益率稳定时，时间越长，复利效应越明显。
+
+## 风险提醒
+
+复利不是保证收益。市场波动、费用、税收和错误的资产配置都会影响最终结果。理解复利的意义在于重视时间、成本控制和风险分散，而不是追求短期高收益。
+
diff --git a/datasets/domain-bilingual-v1/corpus/finance-inflation-bonds.md b/datasets/domain-bilingual-v1/corpus/finance-inflation-bonds.md
new file mode 100644
index 0000000..ace7ebd
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/finance-inflation-bonds.md
@@ -0,0 +1,16 @@
+---
+title: Inflation and Bond Prices
+language: en
+source: local-diverse-synthetic
+---
+
+# Inflation and Bond Prices
+
+## Interest Rate Link
+
+When inflation rises, central banks may raise interest rates. New bonds then offer higher yields, making older bonds with lower coupons less attractive. To compete, the market price of existing bonds usually falls.
+
+## Duration Risk
+
+Long-duration bonds are more sensitive to rate changes because their cash flows arrive farther in the future. Inflation-linked bonds can reduce this risk, but they still have price volatility when real yields change.
+
diff --git a/datasets/domain-bilingual-v1/corpus/geography-monsoon-climate.md b/datasets/domain-bilingual-v1/corpus/geography-monsoon-climate.md
new file mode 100644
index 0000000..6464e8d
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/geography-monsoon-climate.md
@@ -0,0 +1,16 @@
+---
+title: 季风气候与降水分布
+language: zh
+source: local-diverse-synthetic
+---
+
+# 季风气候与降水分布
+
+## 形成原因
+
+季风主要由海陆热力差异和气压带季节移动造成。夏季大陆升温快，形成低压，海洋湿润气流向陆地输送水汽；冬季大陆降温快，风向反转，空气较干燥。
+
+## 降水影响
+
+南亚和东亚许多地区受季风影响明显。夏季风强弱会影响农业灌溉、洪涝风险和水库调度。若季风来得晚或偏弱，可能造成旱情；若水汽过强，则可能引发洪水。
+
diff --git a/datasets/domain-bilingual-v1/corpus/geography-river-deltas.md b/datasets/domain-bilingual-v1/corpus/geography-river-deltas.md
new file mode 100644
index 0000000..8b9e286
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/geography-river-deltas.md
@@ -0,0 +1,16 @@
+---
+title: River Deltas and Sediment
+language: en
+source: local-diverse-synthetic
+---
+
+# River Deltas and Sediment
+
+## Formation
+
+A river delta forms when a river carries sediment to a slower body of water such as a sea or lake. As the current slows, sand, silt, and clay settle out and build new land.
+
+## Human Pressure
+
+Dams can trap sediment upstream, reducing delta growth. At the same time, groundwater pumping and sea-level rise can make deltas sink or flood. Many large deltas are fertile and densely populated, so these changes create major planning challenges.
+
diff --git a/datasets/domain-bilingual-v1/corpus/law-contract-offer-acceptance.md b/datasets/domain-bilingual-v1/corpus/law-contract-offer-acceptance.md
new file mode 100644
index 0000000..11869db
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/law-contract-offer-acceptance.md
@@ -0,0 +1,16 @@
+---
+title: 合同成立中的要约与承诺
+language: zh
+source: local-diverse-synthetic
+---
+
+# 合同成立中的要约与承诺
+
+## 基本概念
+
+在合同法中，要约是希望与他人订立合同的明确意思表示，承诺是受要约人同意要约内容的意思表示。二者一致，通常表明当事人对主要条款形成合意。
+
+## 风险边界
+
+如果表达只是邀请对方报价，通常不构成要约。若承诺改变了价格、数量或履行期限等实质内容，可能被视为新的要约。区分这些概念有助于判断合同是否已经成立。
+
diff --git a/datasets/domain-bilingual-v1/corpus/law-copyright-fair-use.md b/datasets/domain-bilingual-v1/corpus/law-copyright-fair-use.md
new file mode 100644
index 0000000..61336aa
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/law-copyright-fair-use.md
@@ -0,0 +1,16 @@
+---
+title: Copyright and Fair Use
+language: en
+source: local-diverse-synthetic
+---
+
+# Copyright and Fair Use
+
+## Four Factors
+
+Fair use analysis often considers the purpose of use, the nature of the copyrighted work, the amount used, and the effect on the potential market. Transformative educational or critical uses are more likely to favor fair use than purely commercial copying.
+
+## Practical Limit
+
+Fair use is context-specific and not a mechanical checklist. Using a short excerpt for commentary may be allowed, while reproducing the core of a creative work can still be risky even if the excerpt is not very long.
+
diff --git a/datasets/domain-bilingual-v1/corpus/literature-dream-red-chamber.md b/datasets/domain-bilingual-v1/corpus/literature-dream-red-chamber.md
new file mode 100644
index 0000000..678f439
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/literature-dream-red-chamber.md
@@ -0,0 +1,16 @@
+---
+title: 红楼梦中的家族兴衰
+language: zh
+source: local-diverse-synthetic
+---
+
+# 红楼梦中的家族兴衰
+
+## 叙事核心
+
+《红楼梦》以贾府为中心，描写贵族家庭的日常生活、人情关系和制度性衰败。大观园既是青春与才情的空间，也是家族繁华的短暂象征。
+
+## 主题表达
+
+小说通过财务亏空、礼法压力、婚姻安排和人物命运，表现盛极而衰的过程。贾宝玉、林黛玉和薛宝钗的关系不只是爱情叙事，也折射家族利益与个人情感之间的冲突。
+
diff --git a/datasets/domain-bilingual-v1/corpus/literature-shakespeare-macbeth.md b/datasets/domain-bilingual-v1/corpus/literature-shakespeare-macbeth.md
new file mode 100644
index 0000000..ceb5427
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/literature-shakespeare-macbeth.md
@@ -0,0 +1,16 @@
+---
+title: Macbeth and Ambition
+language: en
+source: local-diverse-synthetic
+---
+
+# Macbeth and Ambition
+
+## Tragic Desire
+
+Shakespeare's Macbeth presents ambition as a force that can destroy judgment. Macbeth begins as a respected warrior, but the witches' prophecy and Lady Macbeth's pressure turn possibility into obsession.
+
+## Moral Collapse
+
+After Duncan's murder, Macbeth uses more violence to protect his crown. Each crime makes him less secure, not more powerful. The play links unchecked ambition with fear, isolation, and the breakdown of moral order.
+
diff --git a/datasets/domain-bilingual-v1/corpus/medicine-antibiotic-resistance.md b/datasets/domain-bilingual-v1/corpus/medicine-antibiotic-resistance.md
new file mode 100644
index 0000000..d5685b3
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/medicine-antibiotic-resistance.md
@@ -0,0 +1,16 @@
+---
+title: Antibiotic Resistance
+language: en
+source: local-diverse-synthetic
+---
+
+# Antibiotic Resistance
+
+## Selection Pressure
+
+Antibiotics kill susceptible bacteria, but resistant variants may survive. When antibiotics are used unnecessarily or stopped too early, they create selection pressure that lets resistant bacteria multiply.
+
+## Public Health Risk
+
+Resistance can spread through plasmids, hospitals, farms, and community transmission. The result is that ordinary infections become harder to treat. Stewardship programs reduce misuse by matching drug choice, dose, and duration to evidence.
+
diff --git a/datasets/domain-bilingual-v1/corpus/medicine-vaccination-immunity.md b/datasets/domain-bilingual-v1/corpus/medicine-vaccination-immunity.md
new file mode 100644
index 0000000..132410d
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/medicine-vaccination-immunity.md
@@ -0,0 +1,16 @@
+---
+title: 疫苗与免疫记忆
+language: zh
+source: local-diverse-synthetic
+---
+
+# 疫苗与免疫记忆
+
+## 原理
+
+疫苗通过安全方式向免疫系统展示病原体的抗原，例如灭活病原体、蛋白片段或 mRNA 指令。免疫细胞识别抗原后，会产生抗体反应，并形成记忆 B 细胞和记忆 T 细胞。
+
+## 保护作用
+
+当人体之后遇到真实病原体，免疫记忆能让反应更快、更强，从而降低重症风险。疫苗并不保证所有人完全不感染，但通常能显著降低严重疾病和传播风险。
+
diff --git a/datasets/domain-bilingual-v1/corpus/science-photosynthesis.md b/datasets/domain-bilingual-v1/corpus/science-photosynthesis.md
new file mode 100644
index 0000000..eff2e06
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/science-photosynthesis.md
@@ -0,0 +1,16 @@
+---
+title: Photosynthesis and Energy Conversion
+language: en
+source: local-diverse-synthetic
+---
+
+# Photosynthesis and Energy Conversion
+
+## Light Capture
+
+Photosynthesis converts light energy into chemical energy. Chlorophyll molecules in chloroplasts absorb mainly blue and red light, exciting electrons that move through an electron transport chain.
+
+## Sugar Formation
+
+The light reactions produce ATP and NADPH. The Calvin cycle then uses those molecules to fix carbon dioxide into sugars. Chlorophyll does not directly make sugar; it starts the energy conversion that makes sugar production possible.
+
diff --git a/datasets/domain-bilingual-v1/corpus/science-plate-tectonics.md b/datasets/domain-bilingual-v1/corpus/science-plate-tectonics.md
new file mode 100644
index 0000000..f27544a
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/science-plate-tectonics.md
@@ -0,0 +1,16 @@
+---
+title: 板块构造与地震分布
+language: zh
+source: local-diverse-synthetic
+---
+
+# 板块构造与地震分布
+
+## 基本机制
+
+板块构造理论认为，地球岩石圈由多个板块组成，板块在软流圈上缓慢移动。板块边界分为张裂边界、汇聚边界和转换边界，不同边界对应不同的地质活动。
+
+## 地震与火山
+
+在汇聚边界，海洋板块可能俯冲到大陆板块之下，产生深源地震和火山弧。在转换边界，板块水平错动会积累应力，突然释放时形成地震。因此，环太平洋地区成为地震和火山活动密集区域。
+
diff --git a/datasets/domain-bilingual-v1/corpus/technology-database-indexes.md b/datasets/domain-bilingual-v1/corpus/technology-database-indexes.md
new file mode 100644
index 0000000..a62767e
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/technology-database-indexes.md
@@ -0,0 +1,16 @@
+---
+title: Database Indexes and Query Speed
+language: en
+source: local-diverse-synthetic
+---
+
+# Database Indexes and Query Speed
+
+## Read Performance
+
+A database index stores selected column values in a structure such as a B-tree, allowing the database to find matching rows without scanning the whole table. Indexes are especially useful for filters, joins, and ordered results.
+
+## Write Cost
+
+Every insert, update, or delete must also update affected indexes. Too many indexes increase storage use and write latency. Good schema design chooses indexes that match important query patterns rather than indexing every column.
+
diff --git a/datasets/domain-bilingual-v1/corpus/technology-public-key-cryptography.md b/datasets/domain-bilingual-v1/corpus/technology-public-key-cryptography.md
new file mode 100644
index 0000000..83bc1e5
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/technology-public-key-cryptography.md
@@ -0,0 +1,16 @@
+---
+title: 公钥密码学与数字签名
+language: zh
+source: local-diverse-synthetic
+---
+
+# 公钥密码学与数字签名
+
+## 密钥结构
+
+公钥密码学使用一对密钥：公钥可以公开，私钥必须保密。发送方可以用私钥生成数字签名，接收方用对应公钥验证签名是否匹配。
+
+## 应用价值
+
+数字签名能证明消息确实来自私钥持有者，并检测内容是否被篡改。HTTPS 证书、软件包签名和区块链交易都依赖类似机制。风险在于私钥泄露会破坏身份可信度，因此密钥管理和吊销机制同样重要。
+
diff --git a/datasets/domain-bilingual-v1/corpus/world-history-cold-war-containment.md b/datasets/domain-bilingual-v1/corpus/world-history-cold-war-containment.md
new file mode 100644
index 0000000..ad1d1f8
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/world-history-cold-war-containment.md
@@ -0,0 +1,16 @@
+---
+title: Cold War Containment Policy
+language: en
+source: local-diverse-synthetic
+---
+
+# Cold War Containment Policy
+
+## Strategic Goal
+
+Containment was a United States strategy aimed at limiting the spread of Soviet influence after World War II. Rather than directly overthrowing the Soviet Union, policymakers sought to prevent communist expansion through alliances, economic aid, military deterrence, and regional commitments.
+
+## Examples
+
+The Truman Doctrine, Marshall Plan, NATO, and the Korean War all reflected containment thinking. The policy shaped decades of diplomacy and conflict, but it also led the United States into controversial interventions where local politics were interpreted through a Cold War lens.
+
diff --git a/datasets/domain-bilingual-v1/corpus/world-history-french-revolution.md b/datasets/domain-bilingual-v1/corpus/world-history-french-revolution.md
new file mode 100644
index 0000000..33fd650
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/world-history-french-revolution.md
@@ -0,0 +1,16 @@
+---
+title: French Revolution and Political Legitimacy
+language: en
+source: local-diverse-synthetic
+---
+
+# French Revolution and Political Legitimacy
+
+## Old Order
+
+Before 1789, French monarchy drew legitimacy from dynasty, religion, and inherited privilege. The Estates-General reflected social hierarchy rather than equal citizenship.
+
+## Revolutionary Shift
+
+The French Revolution argued that sovereignty belonged to the nation. The Declaration of the Rights of Man and of the Citizen framed rights as universal rather than granted by a king. This change inspired constitutional politics, mass participation, and later conflicts over terror, war, and republican authority.
+
diff --git a/datasets/domain-bilingual-v1/corpus/world-history-industrial-revolution.md b/datasets/domain-bilingual-v1/corpus/world-history-industrial-revolution.md
new file mode 100644
index 0000000..88ef10e
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/world-history-industrial-revolution.md
@@ -0,0 +1,16 @@
+---
+title: Industrial Revolution and Factory Production
+language: en
+source: local-diverse-synthetic
+---
+
+# Industrial Revolution and Factory Production
+
+## Core Change
+
+The Industrial Revolution shifted production from households and small workshops into factories powered by water, steam, and later electricity. Textile manufacturing was an early example: spinning and weaving machines raised output, concentrated labor, and required new forms of supervision.
+
+## Social Effects
+
+Factory production changed the rhythm of work. Labor became tied to clocks, wages, and urban housing rather than seasonal household production. The change increased output but also created problems such as unsafe working conditions, child labor, and crowded industrial cities.
+
diff --git a/datasets/domain-bilingual-v1/corpus/world-history-roman-republic.md b/datasets/domain-bilingual-v1/corpus/world-history-roman-republic.md
new file mode 100644
index 0000000..509d770
--- /dev/null
+++ b/datasets/domain-bilingual-v1/corpus/world-history-roman-republic.md
@@ -0,0 +1,16 @@
+---
+title: Roman Republic and Checks on Power
+language: en
+source: local-diverse-synthetic
+---
+
+# Roman Republic and Checks on Power
+
+## Shared Authority
+
+The Roman Republic divided authority among consuls, the Senate, assemblies, and magistrates. Two consuls served at the same time and could block each other's actions, reducing the chance that one leader could dominate the state.
+
+## Limits and Tensions
+
+Annual terms, veto powers, and mixed institutions were meant to check ambition. Yet social inequality, military loyalty to generals, and civil wars weakened republican safeguards. The later rise of Julius Caesar showed that institutions alone could fail when political norms collapsed.
+
diff --git a/datasets/domain-bilingual-v1/dataset.yaml b/datasets/domain-bilingual-v1/dataset.yaml
new file mode 100644
index 0000000..5dfb4aa
--- /dev/null
+++ b/datasets/domain-bilingual-v1/dataset.yaml
@@ -0,0 +1,8 @@
+name: domain-bilingual-v1
+description: >
+  In-house bilingual (50/50 zh+en) domain retrieval set over the diverse-v2
+  corpus. The Phase-1 retrieval gate. The engine gates one blended flat-threshold
+  set; the zh/en split is recorded in reports/BASELINES.md (see
+  docs/dikw-eval-plan.md §2.4). Thresholds are set at observed − margin from a
+  real-vector calibration run.
+thresholds: {}
diff --git a/datasets/negatives-ood-v1/corpus/chinese-history-opium-war.md b/datasets/negatives-ood-v1/corpus/chinese-history-opium-war.md
new file mode 100644
index 0000000..c771471
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/chinese-history-opium-war.md
@@ -0,0 +1,16 @@
+---
+title: 鸦片战争与近代中国开端
+language: zh
+source: local-diverse-synthetic
+---
+
+# 鸦片战争与近代中国开端
+
+## 战争原因
+
+19 世纪上半叶，中英贸易失衡、鸦片走私和清政府禁烟政策共同加剧冲突。林则徐虎门销烟后，英国以保护贸易为由发动战争。
+
+## 历史影响
+
+1842 年《南京条约》签订，清政府割让香港岛、开放通商口岸并支付赔款。鸦片战争打破了传统朝贡和闭关体系，使中国被迫进入不平等条约体系，因此常被视为中国近代史的开端。
+
diff --git a/datasets/negatives-ood-v1/corpus/chinese-history-qin-unification.md b/datasets/negatives-ood-v1/corpus/chinese-history-qin-unification.md
new file mode 100644
index 0000000..b17e5ae
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/chinese-history-qin-unification.md
@@ -0,0 +1,16 @@
+---
+title: 秦统一六国与郡县制
+language: zh
+source: local-diverse-synthetic
+---
+
+# 秦统一六国与郡县制
+
+## 统一背景
+
+战国后期，秦国通过商鞅变法积累了较强的军事和行政能力。秦王嬴政先后灭韩、赵、魏、楚、燕、齐，于公元前 221 年完成统一，建立中国历史上第一个中央集权王朝。
+
+## 制度变化
+
+秦朝废除分封制，推行郡县制，由中央任命地方官员管理行政事务。分封制依靠宗族和诸侯维系地方秩序，郡县制则强调官僚任免和中央控制。该制度增强了统一国家的行政整合能力，也加重了基层治理压力。
+
diff --git a/datasets/negatives-ood-v1/corpus/chinese-history-tang-founding.md b/datasets/negatives-ood-v1/corpus/chinese-history-tang-founding.md
new file mode 100644
index 0000000..e01023a
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/chinese-history-tang-founding.md
@@ -0,0 +1,16 @@
+---
+title: 唐朝建立与李渊集团
+language: zh
+source: local-diverse-synthetic
+---
+
+# 唐朝建立与李渊集团
+
+## 历史背景
+
+隋末社会动荡、徭役沉重，各地起兵削弱了中央控制。李渊原为太原留守，掌握关陇贵族网络和军事资源。617 年，他从太原起兵，进入关中，拥立代王杨侑为帝，以取得政治合法性。
+
+## 关键事实
+
+618 年，李渊接受禅让，建立唐朝，是为唐高祖。唐初政权并非单靠个人军事冒险，而是依托关陇集团、太原兵力和关中地区的行政资源。随后李世民等人继续平定割据势力，唐朝才逐步完成统一。
+
diff --git a/datasets/negatives-ood-v1/corpus/chinese-history-wang-anshi-reform.md b/datasets/negatives-ood-v1/corpus/chinese-history-wang-anshi-reform.md
new file mode 100644
index 0000000..e1e473a
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/chinese-history-wang-anshi-reform.md
@@ -0,0 +1,16 @@
+---
+title: 王安石变法的目标与争议
+language: zh
+source: local-diverse-synthetic
+---
+
+# 王安石变法的目标与争议
+
+## 改革目标
+
+北宋中期面临财政压力、边防支出增加和土地兼并等问题。王安石主持的新法包括青苗法、募役法、市易法和方田均税法，目标是增加国家财政能力，减轻部分农户对高利贷和差役的依赖。
+
+## 争议焦点
+
+反对者认为新法执行过急，地方官为了完成指标可能加重百姓负担。支持者则强调国家需要更主动地调节经济资源。王安石变法的争议不只在政策本身，也在于中央集权、官僚执行和社会承受能力之间的张力。
+
diff --git a/datasets/negatives-ood-v1/corpus/economics-externalities.md b/datasets/negatives-ood-v1/corpus/economics-externalities.md
new file mode 100644
index 0000000..391ca66
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/economics-externalities.md
@@ -0,0 +1,16 @@
+---
+title: Externalities and Market Failure
+language: en
+source: local-diverse-synthetic
+---
+
+# Externalities and Market Failure
+
+## Private and Social Costs
+
+An externality occurs when an economic activity affects people who are not part of the transaction. Pollution is a negative externality because a factory may not pay the full social cost of dirty air or water.
+
+## Policy Response
+
+When external costs are ignored, markets can produce too much of the harmful activity. Taxes, regulation, tradable permits, or liability rules can push private decisions closer to social costs. Positive externalities, such as vaccination, may justify subsidies.
+
diff --git a/datasets/negatives-ood-v1/corpus/economics-supply-demand.md b/datasets/negatives-ood-v1/corpus/economics-supply-demand.md
new file mode 100644
index 0000000..7c0cec7
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/economics-supply-demand.md
@@ -0,0 +1,16 @@
+---
+title: 供给需求与价格变化
+language: zh
+source: local-diverse-synthetic
+---
+
+# 供给需求与价格变化
+
+## 基本关系
+
+需求表示消费者在不同价格下愿意购买的数量，供给表示生产者愿意出售的数量。当需求增加而供给不变时，价格通常上升；当供给增加而需求不变时，价格通常下降。
+
+## 市场调整
+
+价格变化会影响买卖双方行为。高价格鼓励生产者扩大供给，也可能抑制消费者购买。现实市场还会受到税收、补贴、预期和替代品影响，因此供需模型是分析起点，而不是完整解释。
+
diff --git a/datasets/negatives-ood-v1/corpus/finance-compound-interest.md b/datasets/negatives-ood-v1/corpus/finance-compound-interest.md
new file mode 100644
index 0000000..7d655d7
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/finance-compound-interest.md
@@ -0,0 +1,16 @@
+---
+title: 复利与长期投资
+language: zh
+source: local-diverse-synthetic
+---
+
+# 复利与长期投资
+
+## 复利机制
+
+复利指收益继续产生收益。若投资收益被重新投入，本金会随时间扩大，下一期收益也以更大的本金为基础计算。因此，在收益率稳定时，时间越长，复利效应越明显。
+
+## 风险提醒
+
+复利不是保证收益。市场波动、费用、税收和错误的资产配置都会影响最终结果。理解复利的意义在于重视时间、成本控制和风险分散，而不是追求短期高收益。
+
diff --git a/datasets/negatives-ood-v1/corpus/finance-inflation-bonds.md b/datasets/negatives-ood-v1/corpus/finance-inflation-bonds.md
new file mode 100644
index 0000000..ace7ebd
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/finance-inflation-bonds.md
@@ -0,0 +1,16 @@
+---
+title: Inflation and Bond Prices
+language: en
+source: local-diverse-synthetic
+---
+
+# Inflation and Bond Prices
+
+## Interest Rate Link
+
+When inflation rises, central banks may raise interest rates. New bonds then offer higher yields, making older bonds with lower coupons less attractive. To compete, the market price of existing bonds usually falls.
+
+## Duration Risk
+
+Long-duration bonds are more sensitive to rate changes because their cash flows arrive farther in the future. Inflation-linked bonds can reduce this risk, but they still have price volatility when real yields change.
+
diff --git a/datasets/negatives-ood-v1/corpus/geography-monsoon-climate.md b/datasets/negatives-ood-v1/corpus/geography-monsoon-climate.md
new file mode 100644
index 0000000..6464e8d
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/geography-monsoon-climate.md
@@ -0,0 +1,16 @@
+---
+title: 季风气候与降水分布
+language: zh
+source: local-diverse-synthetic
+---
+
+# 季风气候与降水分布
+
+## 形成原因
+
+季风主要由海陆热力差异和气压带季节移动造成。夏季大陆升温快，形成低压，海洋湿润气流向陆地输送水汽；冬季大陆降温快，风向反转，空气较干燥。
+
+## 降水影响
+
+南亚和东亚许多地区受季风影响明显。夏季风强弱会影响农业灌溉、洪涝风险和水库调度。若季风来得晚或偏弱，可能造成旱情；若水汽过强，则可能引发洪水。
+
diff --git a/datasets/negatives-ood-v1/corpus/geography-river-deltas.md b/datasets/negatives-ood-v1/corpus/geography-river-deltas.md
new file mode 100644
index 0000000..8b9e286
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/geography-river-deltas.md
@@ -0,0 +1,16 @@
+---
+title: River Deltas and Sediment
+language: en
+source: local-diverse-synthetic
+---
+
+# River Deltas and Sediment
+
+## Formation
+
+A river delta forms when a river carries sediment to a slower body of water such as a sea or lake. As the current slows, sand, silt, and clay settle out and build new land.
+
+## Human Pressure
+
+Dams can trap sediment upstream, reducing delta growth. At the same time, groundwater pumping and sea-level rise can make deltas sink or flood. Many large deltas are fertile and densely populated, so these changes create major planning challenges.
+
diff --git a/datasets/negatives-ood-v1/corpus/law-contract-offer-acceptance.md b/datasets/negatives-ood-v1/corpus/law-contract-offer-acceptance.md
new file mode 100644
index 0000000..11869db
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/law-contract-offer-acceptance.md
@@ -0,0 +1,16 @@
+---
+title: 合同成立中的要约与承诺
+language: zh
+source: local-diverse-synthetic
+---
+
+# 合同成立中的要约与承诺
+
+## 基本概念
+
+在合同法中，要约是希望与他人订立合同的明确意思表示，承诺是受要约人同意要约内容的意思表示。二者一致，通常表明当事人对主要条款形成合意。
+
+## 风险边界
+
+如果表达只是邀请对方报价，通常不构成要约。若承诺改变了价格、数量或履行期限等实质内容，可能被视为新的要约。区分这些概念有助于判断合同是否已经成立。
+
diff --git a/datasets/negatives-ood-v1/corpus/law-copyright-fair-use.md b/datasets/negatives-ood-v1/corpus/law-copyright-fair-use.md
new file mode 100644
index 0000000..61336aa
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/law-copyright-fair-use.md
@@ -0,0 +1,16 @@
+---
+title: Copyright and Fair Use
+language: en
+source: local-diverse-synthetic
+---
+
+# Copyright and Fair Use
+
+## Four Factors
+
+Fair use analysis often considers the purpose of use, the nature of the copyrighted work, the amount used, and the effect on the potential market. Transformative educational or critical uses are more likely to favor fair use than purely commercial copying.
+
+## Practical Limit
+
+Fair use is context-specific and not a mechanical checklist. Using a short excerpt for commentary may be allowed, while reproducing the core of a creative work can still be risky even if the excerpt is not very long.
+
diff --git a/datasets/negatives-ood-v1/corpus/literature-dream-red-chamber.md b/datasets/negatives-ood-v1/corpus/literature-dream-red-chamber.md
new file mode 100644
index 0000000..678f439
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/literature-dream-red-chamber.md
@@ -0,0 +1,16 @@
+---
+title: 红楼梦中的家族兴衰
+language: zh
+source: local-diverse-synthetic
+---
+
+# 红楼梦中的家族兴衰
+
+## 叙事核心
+
+《红楼梦》以贾府为中心，描写贵族家庭的日常生活、人情关系和制度性衰败。大观园既是青春与才情的空间，也是家族繁华的短暂象征。
+
+## 主题表达
+
+小说通过财务亏空、礼法压力、婚姻安排和人物命运，表现盛极而衰的过程。贾宝玉、林黛玉和薛宝钗的关系不只是爱情叙事，也折射家族利益与个人情感之间的冲突。
+
diff --git a/datasets/negatives-ood-v1/corpus/literature-shakespeare-macbeth.md b/datasets/negatives-ood-v1/corpus/literature-shakespeare-macbeth.md
new file mode 100644
index 0000000..ceb5427
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/literature-shakespeare-macbeth.md
@@ -0,0 +1,16 @@
+---
+title: Macbeth and Ambition
+language: en
+source: local-diverse-synthetic
+---
+
+# Macbeth and Ambition
+
+## Tragic Desire
+
+Shakespeare's Macbeth presents ambition as a force that can destroy judgment. Macbeth begins as a respected warrior, but the witches' prophecy and Lady Macbeth's pressure turn possibility into obsession.
+
+## Moral Collapse
+
+After Duncan's murder, Macbeth uses more violence to protect his crown. Each crime makes him less secure, not more powerful. The play links unchecked ambition with fear, isolation, and the breakdown of moral order.
+
diff --git a/datasets/negatives-ood-v1/corpus/medicine-antibiotic-resistance.md b/datasets/negatives-ood-v1/corpus/medicine-antibiotic-resistance.md
new file mode 100644
index 0000000..d5685b3
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/medicine-antibiotic-resistance.md
@@ -0,0 +1,16 @@
+---
+title: Antibiotic Resistance
+language: en
+source: local-diverse-synthetic
+---
+
+# Antibiotic Resistance
+
+## Selection Pressure
+
+Antibiotics kill susceptible bacteria, but resistant variants may survive. When antibiotics are used unnecessarily or stopped too early, they create selection pressure that lets resistant bacteria multiply.
+
+## Public Health Risk
+
+Resistance can spread through plasmids, hospitals, farms, and community transmission. The result is that ordinary infections become harder to treat. Stewardship programs reduce misuse by matching drug choice, dose, and duration to evidence.
+
diff --git a/datasets/negatives-ood-v1/corpus/medicine-vaccination-immunity.md b/datasets/negatives-ood-v1/corpus/medicine-vaccination-immunity.md
new file mode 100644
index 0000000..132410d
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/medicine-vaccination-immunity.md
@@ -0,0 +1,16 @@
+---
+title: 疫苗与免疫记忆
+language: zh
+source: local-diverse-synthetic
+---
+
+# 疫苗与免疫记忆
+
+## 原理
+
+疫苗通过安全方式向免疫系统展示病原体的抗原，例如灭活病原体、蛋白片段或 mRNA 指令。免疫细胞识别抗原后，会产生抗体反应，并形成记忆 B 细胞和记忆 T 细胞。
+
+## 保护作用
+
+当人体之后遇到真实病原体，免疫记忆能让反应更快、更强，从而降低重症风险。疫苗并不保证所有人完全不感染，但通常能显著降低严重疾病和传播风险。
+
diff --git a/datasets/negatives-ood-v1/corpus/science-photosynthesis.md b/datasets/negatives-ood-v1/corpus/science-photosynthesis.md
new file mode 100644
index 0000000..eff2e06
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/science-photosynthesis.md
@@ -0,0 +1,16 @@
+---
+title: Photosynthesis and Energy Conversion
+language: en
+source: local-diverse-synthetic
+---
+
+# Photosynthesis and Energy Conversion
+
+## Light Capture
+
+Photosynthesis converts light energy into chemical energy. Chlorophyll molecules in chloroplasts absorb mainly blue and red light, exciting electrons that move through an electron transport chain.
+
+## Sugar Formation
+
+The light reactions produce ATP and NADPH. The Calvin cycle then uses those molecules to fix carbon dioxide into sugars. Chlorophyll does not directly make sugar; it starts the energy conversion that makes sugar production possible.
+
diff --git a/datasets/negatives-ood-v1/corpus/science-plate-tectonics.md b/datasets/negatives-ood-v1/corpus/science-plate-tectonics.md
new file mode 100644
index 0000000..f27544a
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/science-plate-tectonics.md
@@ -0,0 +1,16 @@
+---
+title: 板块构造与地震分布
+language: zh
+source: local-diverse-synthetic
+---
+
+# 板块构造与地震分布
+
+## 基本机制
+
+板块构造理论认为，地球岩石圈由多个板块组成，板块在软流圈上缓慢移动。板块边界分为张裂边界、汇聚边界和转换边界，不同边界对应不同的地质活动。
+
+## 地震与火山
+
+在汇聚边界，海洋板块可能俯冲到大陆板块之下，产生深源地震和火山弧。在转换边界，板块水平错动会积累应力，突然释放时形成地震。因此，环太平洋地区成为地震和火山活动密集区域。
+
diff --git a/datasets/negatives-ood-v1/corpus/technology-database-indexes.md b/datasets/negatives-ood-v1/corpus/technology-database-indexes.md
new file mode 100644
index 0000000..a62767e
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/technology-database-indexes.md
@@ -0,0 +1,16 @@
+---
+title: Database Indexes and Query Speed
+language: en
+source: local-diverse-synthetic
+---
+
+# Database Indexes and Query Speed
+
+## Read Performance
+
+A database index stores selected column values in a structure such as a B-tree, allowing the database to find matching rows without scanning the whole table. Indexes are especially useful for filters, joins, and ordered results.
+
+## Write Cost
+
+Every insert, update, or delete must also update affected indexes. Too many indexes increase storage use and write latency. Good schema design chooses indexes that match important query patterns rather than indexing every column.
+
diff --git a/datasets/negatives-ood-v1/corpus/technology-public-key-cryptography.md b/datasets/negatives-ood-v1/corpus/technology-public-key-cryptography.md
new file mode 100644
index 0000000..83bc1e5
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/technology-public-key-cryptography.md
@@ -0,0 +1,16 @@
+---
+title: 公钥密码学与数字签名
+language: zh
+source: local-diverse-synthetic
+---
+
+# 公钥密码学与数字签名
+
+## 密钥结构
+
+公钥密码学使用一对密钥：公钥可以公开，私钥必须保密。发送方可以用私钥生成数字签名，接收方用对应公钥验证签名是否匹配。
+
+## 应用价值
+
+数字签名能证明消息确实来自私钥持有者，并检测内容是否被篡改。HTTPS 证书、软件包签名和区块链交易都依赖类似机制。风险在于私钥泄露会破坏身份可信度，因此密钥管理和吊销机制同样重要。
+
diff --git a/datasets/negatives-ood-v1/corpus/world-history-cold-war-containment.md b/datasets/negatives-ood-v1/corpus/world-history-cold-war-containment.md
new file mode 100644
index 0000000..ad1d1f8
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/world-history-cold-war-containment.md
@@ -0,0 +1,16 @@
+---
+title: Cold War Containment Policy
+language: en
+source: local-diverse-synthetic
+---
+
+# Cold War Containment Policy
+
+## Strategic Goal
+
+Containment was a United States strategy aimed at limiting the spread of Soviet influence after World War II. Rather than directly overthrowing the Soviet Union, policymakers sought to prevent communist expansion through alliances, economic aid, military deterrence, and regional commitments.
+
+## Examples
+
+The Truman Doctrine, Marshall Plan, NATO, and the Korean War all reflected containment thinking. The policy shaped decades of diplomacy and conflict, but it also led the United States into controversial interventions where local politics were interpreted through a Cold War lens.
+
diff --git a/datasets/negatives-ood-v1/corpus/world-history-french-revolution.md b/datasets/negatives-ood-v1/corpus/world-history-french-revolution.md
new file mode 100644
index 0000000..33fd650
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/world-history-french-revolution.md
@@ -0,0 +1,16 @@
+---
+title: French Revolution and Political Legitimacy
+language: en
+source: local-diverse-synthetic
+---
+
+# French Revolution and Political Legitimacy
+
+## Old Order
+
+Before 1789, French monarchy drew legitimacy from dynasty, religion, and inherited privilege. The Estates-General reflected social hierarchy rather than equal citizenship.
+
+## Revolutionary Shift
+
+The French Revolution argued that sovereignty belonged to the nation. The Declaration of the Rights of Man and of the Citizen framed rights as universal rather than granted by a king. This change inspired constitutional politics, mass participation, and later conflicts over terror, war, and republican authority.
+
diff --git a/datasets/negatives-ood-v1/corpus/world-history-industrial-revolution.md b/datasets/negatives-ood-v1/corpus/world-history-industrial-revolution.md
new file mode 100644
index 0000000..88ef10e
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/world-history-industrial-revolution.md
@@ -0,0 +1,16 @@
+---
+title: Industrial Revolution and Factory Production
+language: en
+source: local-diverse-synthetic
+---
+
+# Industrial Revolution and Factory Production
+
+## Core Change
+
+The Industrial Revolution shifted production from households and small workshops into factories powered by water, steam, and later electricity. Textile manufacturing was an early example: spinning and weaving machines raised output, concentrated labor, and required new forms of supervision.
+
+## Social Effects
+
+Factory production changed the rhythm of work. Labor became tied to clocks, wages, and urban housing rather than seasonal household production. The change increased output but also created problems such as unsafe working conditions, child labor, and crowded industrial cities.
+
diff --git a/datasets/negatives-ood-v1/corpus/world-history-roman-republic.md b/datasets/negatives-ood-v1/corpus/world-history-roman-republic.md
new file mode 100644
index 0000000..509d770
--- /dev/null
+++ b/datasets/negatives-ood-v1/corpus/world-history-roman-republic.md
@@ -0,0 +1,16 @@
+---
+title: Roman Republic and Checks on Power
+language: en
+source: local-diverse-synthetic
+---
+
+# Roman Republic and Checks on Power
+
+## Shared Authority
+
+The Roman Republic divided authority among consuls, the Senate, assemblies, and magistrates. Two consuls served at the same time and could block each other's actions, reducing the chance that one leader could dominate the state.
+
+## Limits and Tensions
+
+Annual terms, veto powers, and mixed institutions were meant to check ambition. Yet social inequality, military loyalty to generals, and civil wars weakened republican safeguards. The later rise of Julius Caesar showed that institutions alone could fail when political norms collapsed.
+
diff --git a/datasets/negatives-ood-v1/dataset.yaml b/datasets/negatives-ood-v1/dataset.yaml
new file mode 100644
index 0000000..a070428
--- /dev/null
+++ b/datasets/negatives-ood-v1/dataset.yaml
@@ -0,0 +1,7 @@
+name: negatives-ood-v1
+description: >
+  Off-corpus negatives (expect_none) riding the diverse-v2 corpus: plausible but
+  unanswerable zh+en queries for which a healthy engine should return nothing
+  relevant. expect_none is diagnostic-only in dikw-core (not a threshold key), so
+  this set is observe-only; see the negative diagnostics in reports/BASELINES.md.
+thresholds: {}

From 4104d728959cb2dc3d3dbd342a60ac58c125bbfa Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 22:04:52 +0800
Subject: [PATCH 5/7] feat(datasets): curated bilingual positives + OOD
 negatives; LLM-gen via factory
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

domain-bilingual-v1: 34 verified positives (18 zh + 16 en) covering all 24 docs.
Some queries in the history clusters are deliberately confusable within their
cluster, so the set carries ranking signal. negatives-ood-v1: 23 verified
expect_none queries (11 zh + 12 en) that are plausible but not covered by the
corpus.

Queries were generated through the MiniMax factory (scripts/generate_candidates.py
with --instruction steering); the gold labels were then human-verified. Fixes:
- generate_candidates.py: add --instruction passthrough for targeted generation.
- llm_client.py: raise the output budget from 4096 to 16000 tokens. MiniMax-M2.7
  reasoning was exhausting the 4096 cap, so the JSON was truncated mid-array
  (stop_reason max_tokens).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
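
Reviewer note (not part of the commit message): the truncation failure described
above can be caught before JSON parsing. The sketch below is illustrative only
and is not the code in llm_client.py. `parse_candidates`, `TruncatedOutputError`,
and the `"end_turn"` value are hypothetical names; the only detail taken from
this patch is that a truncated response reports stop_reason `max_tokens`.

```python
import json


class TruncatedOutputError(RuntimeError):
    """Raised when the model stopped because it hit its output budget."""


def parse_candidates(text, stop_reason):
    """Parse a JSON candidate list, refusing output that was cut off.

    `stop_reason` is assumed to be the string the LLM API reports for why
    generation ended; "max_tokens" is the value named in the commit message.
    """
    if stop_reason == "max_tokens":
        # The JSON array is probably incomplete, so fail with a clear message
        # instead of letting json.loads raise a less helpful decode error.
        raise TruncatedOutputError(
            "output hit the token cap; raise the output budget and retry"
        )
    return json.loads(text)


# A complete response parses normally ("end_turn" is a placeholder value).
print(parse_candidates('[{"q": "a"}]', "end_turn"))

# A truncated response is rejected before parsing.
try:
    parse_candidates('[{"q": "a"}, {"q": ', "max_tokens")
except TruncatedOutputError as exc:
    print("rejected:", exc)
```

Checking the stop reason first keeps the two failure modes apart: a budget
problem is fixed by raising the cap, while malformed but complete JSON points at
the prompt or the model.
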
 datasets/domain-bilingual-v1/queries.yaml | 111 ++++++++++++++++++++++
 datasets/negatives-ood-v1/queries.yaml    |  78 +++++++++++++++
 scripts/generate_candidates.py            |   8 ++
 src/dikw_data/llm_client.py               |  10 +-
 4 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 datasets/domain-bilingual-v1/queries.yaml
 create mode 100644 datasets/negatives-ood-v1/queries.yaml
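
Reviewer note (not part of the commit message): because query ids are
language-prefixed (zh-/en-), a report can split results per language without any
extra metadata. A minimal sketch of that split follows, using three ids copied
from queries.yaml as inline sample data. `split_by_language` is a hypothetical
helper, not a function in this repository or in dikw-core.

```python
from collections import defaultdict


def split_by_language(queries):
    """Group query dicts by the language prefix of their id (zh or en)."""
    groups = defaultdict(list)
    for query in queries:
        # The prefix is everything before the first hyphen in the id.
        prefix, sep, _rest = query["id"].partition("-")
        if not sep or prefix not in ("zh", "en"):
            raise ValueError(f"id without zh-/en- prefix: {query['id']!r}")
        groups[prefix].append(query)
    return dict(groups)


# Sample entries taken from datasets/domain-bilingual-v1/queries.yaml.
sample = [
    {"id": "zh-opium-war-hongkong-cession", "expect_any": ["chinese-history-opium-war"]},
    {"id": "zh-monsoon-seasonal-shift", "expect_any": ["geography-monsoon-climate"]},
    {"id": "en-inflation-bond-prices", "expect_any": ["finance-inflation-bonds"]},
]

by_lang = split_by_language(sample)
counts = {lang: len(items) for lang, items in by_lang.items()}
print(counts)  # {'zh': 2, 'en': 1}
```

Run over the full file, the same count should give 18 zh and 16 en, matching
the totals stated in the commit message.
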

diff --git a/datasets/domain-bilingual-v1/queries.yaml b/datasets/domain-bilingual-v1/queries.yaml
new file mode 100644
index 0000000..ceb5b7b
--- /dev/null
+++ b/datasets/domain-bilingual-v1/queries.yaml
@@ -0,0 +1,111 @@
+# Bilingual domain positives over the diverse-v2 corpus. ids are language-prefixed
+# (zh-/en-) so reports can split per language; a query's language matches its gold
+# doc's frontmatter `language:`. Queries are LLM-generated (MiniMax) then verified:
+# every expect_any stem was checked to be the single correct, answerable doc.
+# Some chinese-history / world-history queries are deliberately under-specified to
+# create intra-cluster ranking difficulty (the gold stays unique).
+queries:
+  # --- zh (18) ---
+  - id: zh-opium-war-military-conflict
+    q: "哪些因素导致了19世纪中期中国与外国之间的军事冲突？"
+    expect_any: [chinese-history-opium-war]
+  - id: zh-opium-war-hongkong-cession
+    q: "香港岛的割让是哪场战争后的条约规定的？"
+    expect_any: [chinese-history-opium-war]
+  - id: zh-qin-territorial-administration
+    q: "秦朝如何在政治上实现对广阔领土的有效管辖？"
+    expect_any: [chinese-history-qin-unification]
+  - id: zh-qin-fengjian-vs-junxian
+    q: "分封制与郡县制在地方治理方式上有何根本区别？"
+    expect_any: [chinese-history-qin-unification]
+  - id: zh-tang-after-sui-collapse
+    q: "隋朝灭亡后，谁在关中地区建立了新的统一王朝？"
+    expect_any: [chinese-history-tang-founding]
+  - id: zh-tang-gaozu-accession
+    q: "唐高祖是通过什么方式获得皇位并建立新政权的？"
+    expect_any: [chinese-history-tang-founding]
+  - id: zh-wanganshi-fiscal-pressure
+    q: "北宋时期的财政困难如何推动了官方对经济制度的调整？"
+    expect_any: [chinese-history-wang-anshi-reform]
+  - id: zh-wanganshi-new-laws-goals
+    q: "青苗法和方田均税法等新法的核心目标是什么？"
+    expect_any: [chinese-history-wang-anshi-reform]
+  - id: zh-supply-demand-price-change
+    q: "需求增加而供给保持不变时，市场价格会怎样变化？"
+    expect_any: [economics-supply-demand]
+  - id: zh-compound-interest-mechanism
+    q: "长期投资中收益继续产生收益的机制是如何运作的？"
+    expect_any: [finance-compound-interest]
+  - id: zh-monsoon-seasonal-shift
+    q: "季风系统是如何在夏季和冬季之间转变的？"
+    expect_any: [geography-monsoon-climate]
+  - id: zh-contract-valid-offer
+    q: "合同订立过程中，什么样的意思表示才能构成有效的要约？"
+    expect_any: [law-contract-offer-acceptance]
+  - id: zh-dream-rise-and-decline
+    q: "贾府的繁华与衰败如何在小说的叙事结构中呈现？"
+    expect_any: [literature-dream-red-chamber]
+  - id: zh-dream-grand-view-garden
+    q: "《红楼梦》中的大观园象征着什么？它与家族命运有何关联？"
+    expect_any: [literature-dream-red-chamber]
+  - id: zh-vaccination-rapid-response
+    q: "人体接种疫苗后如何获得对病原体的快速反应能力？"
+    expect_any: [medicine-vaccination-immunity]
+  - id: zh-plate-tectonics-pacific-ring
+    q: "为什么环太平洋地区地震和火山活动特别频繁？"
+    expect_any: [science-plate-tectonics]
+  - id: zh-public-key-digital-signature
+    q: "数字签名技术如何确保消息的真实性和完整性？"
+    expect_any: [technology-public-key-cryptography]
+  - id: zh-public-key-private-key-leak
+    q: "公钥密码系统中的私钥泄露会产生什么后果？"
+    expect_any: [technology-public-key-cryptography]
+  # --- en (16) ---
+  - id: en-externalities-market-failure
+    q: "How do external costs like pollution lead to market failure, and what policy tools can correct it?"
+    expect_any: [economics-externalities]
+  - id: en-externalities-positive-subsidy
+    q: "Why might subsidies be justified for activities that generate positive externalities?"
+    expect_any: [economics-externalities]
+  - id: en-inflation-bond-prices
+    q: "What happens to the market price of existing bonds when inflation rises?"
+    expect_any: [finance-inflation-bonds]
+  - id: en-river-delta-formation
+    q: "What natural processes create a river delta at the mouth of a river?"
+    expect_any: [geography-river-deltas]
+  - id: en-fair-use-four-factors
+    q: "Which four factors are typically considered in a fair use analysis of copyrighted material?"
+    expect_any: [law-copyright-fair-use]
+  - id: en-macbeth-ambition
+    q: "In Macbeth, how does unchecked ambition drive the protagonist's moral collapse?"
+    expect_any: [literature-shakespeare-macbeth]
+  - id: en-antibiotic-stopping-early
+    q: "Why does stopping antibiotic treatment early increase the risk of antibiotic resistance?"
+    expect_any: [medicine-antibiotic-resistance]
+  - id: en-photosynthesis-chlorophyll
+    q: "How does chlorophyll enable plants to convert light energy into chemical energy?"
+    expect_any: [science-photosynthesis]
+  - id: en-db-index-read-benefit
+    q: "What performance benefit do database indexes provide for read queries?"
+    expect_any: [technology-database-indexes]
+  - id: en-db-index-write-tradeoff
+    q: "What are the trade-offs of using many indexes on a database, particularly regarding write performance?"
+    expect_any: [technology-database-indexes]
+  - id: en-coldwar-containment-strategy
+    q: "What was the US containment strategy during the Cold War and which major policies exemplified it?"
+    expect_any: [world-history-cold-war-containment]
+  - id: en-coldwar-ideological-competition
+    q: "In what ways did ideological competition shape state actions during periods of global conflict?"
+    expect_any: [world-history-cold-war-containment]
+  - id: en-french-rev-legitimacy
+    q: "How did the French Revolution shift the source of political legitimacy from monarchy to the nation?"
+    expect_any: [world-history-french-revolution]
+  - id: en-industrial-household-to-factory
+    q: "What were the social consequences of moving production from households to factories during the Industrial Revolution?"
+    expect_any: [world-history-industrial-revolution]
+  - id: en-industrial-labor-urban
+    q: "How did the shift to factory production alter labor relations and urban living conditions?"
+    expect_any: [world-history-industrial-revolution]
+  - id: en-roman-power-checks
+    q: "How did the Roman Republic's institutional design prevent the concentration of power in a single leader?"
+    expect_any: [world-history-roman-republic]
diff --git a/datasets/negatives-ood-v1/queries.yaml b/datasets/negatives-ood-v1/queries.yaml
new file mode 100644
index 0000000..3e4b936
--- /dev/null
+++ b/datasets/negatives-ood-v1/queries.yaml
@@ -0,0 +1,78 @@
+# Off-corpus negatives (expect_none) riding the diverse-v2 corpus. Plausible,
+# well-formed zh/en questions on topics ADJACENT to the corpus domains but not
+# answerable by any document here — a healthy engine returns nothing relevant.
+# LLM-generated (MiniMax) then verified to be genuinely uncovered. ids are
+# language-prefixed for the per-language record. expect_none is diagnostic-only in
+# dikw-core (no threshold key), so this set is observe-only.
+queries:
+  # --- zh (11) ---
+  - id: zh-cooking-mapo-tofu
+    q: "如何在家制作正宗的四川麻婆豆腐？"
+    expect_none: true
+  - id: zh-sports-baseball-pitcher
+    q: "棒球比赛中，投手可以在哪些情况下被迫退场？"
+    expect_none: true
+  - id: zh-devops-docker-env-vars
+    q: "在Docker容器中，如何设置环境变量并确保在重启后仍然有效？"
+    expect_none: true
+  - id: zh-weather-beijing-smog
+    q: "今天北京的天气情况如何，是否有雾霾？"
+    expect_none: true
+  - id: zh-travel-tokyo-itinerary
+    q: "计划去日本东京旅行，推荐的七天行程和必备物品是什么？"
+    expect_none: true
+  - id: zh-app-wechat-backup
+    q: "如何在微信中设置聊天记录自动备份到云端？"
+    expect_none: true
+  - id: zh-health-headache-sorethroat
+    q: "如果出现轻度头痛和喉咙痛，应该如何自我护理？"
+    expect_none: true
+  - id: zh-music-guitar-chords
+    q: "学习吉他时，如何正确练习和弦转换？"
+    expect_none: true
+  - id: zh-film-parasite-director
+    q: "电影《寄生虫》的导演是谁？"
+    expect_none: true
+  - id: zh-car-engine-overheat
+    q: "汽车发动机出现高温警报时，应该立即采取哪些措施？"
+    expect_none: true
+  - id: zh-gardening-tomato-pests
+    q: "在春季种植番茄时，如何防治常见病虫害？"
+    expect_none: true
+  # --- en (12) ---
+  - id: en-cooking-margherita-pizza
+    q: "How do you make a classic Margherita pizza from scratch?"
+    expect_none: true
+  - id: en-sports-soccer-offside
+    q: "What are the official rules for an offside call in soccer?"
+    expect_none: true
+  - id: en-devops-jenkins-cicd
+    q: "How can I set up a CI/CD pipeline using Jenkins and Docker?"
+    expect_none: true
+  - id: en-weather-nyc-forecast
+    q: "What is the weather forecast for New York City today?"
+    expect_none: true
+  - id: en-travel-italy-itinerary
+    q: "What are the best travel itineraries for a 10-day trip to Italy?"
+    expect_none: true
+  - id: en-app-instagram-2fa
+    q: "How do I enable two-factor authentication on my Instagram account?"
+    expect_none: true
+  - id: en-health-sorethroat-remedies
+    q: "What are effective home remedies for a sore throat?"
+    expect_none: true
+  - id: en-music-notation-basics
+    q: "Can you explain the basics of music notation for beginners?"
+    expect_none: true
+  - id: en-film-inception-themes
+    q: "What are the main themes in the film Inception directed by Christopher Nolan?"
+    expect_none: true
+  - id: en-car-brake-spongy
+    q: "What steps should I take if my car's brake pedal feels spongy?"
+    expect_none: true
+  - id: en-gardening-small-backyard
+    q: "How do I start a vegetable garden in a small backyard?"
+    expect_none: true
+  - id: en-space-jwst-status
+    q: "What is the current status of the James Webb Space Telescope mission?"
+    expect_none: true
diff --git a/scripts/generate_candidates.py b/scripts/generate_candidates.py
index 30aa862..1512d43 100644
--- a/scripts/generate_candidates.py
+++ b/scripts/generate_candidates.py
@@ -11,16 +11,24 @@ def main() -> int:
     parser = argparse.ArgumentParser(description="Generate retrieval query candidates.")
     add_common_args(parser)
     parser.add_argument("--queries", type=int, default=30)
+    parser.add_argument(
+        "--instruction",
+        default="",
+        help="Extra steering appended to the prompt (e.g. positives-only, "
+        "coverage/balance, or expect_none-only constraints).",
+    )
     args = parser.parse_args()
 
     corpus_dir = dataset_dir(args.dataset) / "corpus"
     corpus = _read_corpus(corpus_dir)
+    extra = f"{args.instruction.strip()}\n\n" if args.instruction.strip() else ""
     system = "You generate retrieval-evaluation query candidates with exact document stems."
     user = (
         f"Generate {args.queries} retrieval query candidates from this corpus. "
         "Return a JSON array. Each object must include q, type, expect_any, evidence, "
         "confidence, and rationale. Use expect_none=true for out-of-domain negatives. "
         "expect_any values must be file stems from the corpus.\n\n"
+        f"{extra}"
         f"{corpus}"
     )
     task = prompt_task(
diff --git a/src/dikw_data/llm_client.py b/src/dikw_data/llm_client.py
index e7e5262..0a92df3 100644
--- a/src/dikw_data/llm_client.py
+++ b/src/dikw_data/llm_client.py
@@ -16,6 +16,14 @@
 RETRYABLE_STATUSES = {408, 409, 429, 529}
 NON_RETRYABLE_STATUSES = {400, 401, 403}
 
+# Output-token ceiling per completion. MiniMax-M2.7 is a reasoning model: its
+# internal reasoning counts against ``max_tokens``, so a tight budget truncates
+# the actual answer mid-output (observed: a JSON array cut off after one element
+# with ``stop_reason: max_tokens``). Give generation enough room for reasoning
+# plus a multi-item JSON payload. Mirrors the eval plan's note that reasoning
+# models need a larger token budget.
+MAX_OUTPUT_TOKENS = 16000
+
 
 class MiniMaxCallError(RuntimeError):
     def __init__(self, message: str, *, status_code: int | None = None) -> None:
@@ -77,7 +85,7 @@ async def complete(self, *, system: str, user: str, model: str) -> str:
                 model=model,
                 system=system,
                 messages=[{"role": "user", "content": user}],
-                max_tokens=4096,
+                max_tokens=MAX_OUTPUT_TOKENS,
                 temperature=0.2,
             )
         except Exception as e:  # SDK exception types differ across versions.

From e2f5a0cea5e9ffde9fec228872a22a25fe261e73 Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 22:10:18 +0800
Subject: [PATCH 6/7] feat(datasets): calibrate domain-bilingual-v1 thresholds
 at observed-margin
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Canonical doc/hybrid observed 1.0 on every metric (the 24-distinct-topic corpus
saturates the vector/hybrid views). Gate at observed - margin: hit@k/mrr 0.95,
ndcg/recall 0.97. This is a regression-detector floor, not a discriminative
benchmark; a denser, deliberately confusable domain-bilingual-v2 is noted as the
discriminative follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 datasets/domain-bilingual-v1/dataset.yaml | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
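
A minimal sketch of the observed - margin arithmetic behind these gates. It is
not code from this repository: the `MARGINS` table and the `calibrate` helper
are hypothetical names, and the observed values are the saturated 1.0 readings
quoted in the commit message.

```python
# Hypothetical helper illustrating "gate = observed - margin"; not part of dikw-data.
MARGINS = {
    "hit_at_3": 0.05,
    "hit_at_10": 0.05,
    "mrr": 0.05,
    "ndcg_at_10": 0.03,
    "recall_at_100": 0.03,
}


def calibrate(observed, margins=MARGINS, ndigits=2):
    """Return one threshold per metric: the observed value minus its margin."""
    return {
        name: round(observed[name] - margin, ndigits)
        for name, margin in margins.items()
    }


# Canonical doc/hybrid observed 1.0 on every metric in the calibration run.
observed = {name: 1.0 for name in MARGINS}
thresholds = calibrate(observed)
print(thresholds)
```

Rounding to two places keeps floating-point noise out of the values that get
committed to dataset.yaml.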

diff --git a/datasets/domain-bilingual-v1/dataset.yaml b/datasets/domain-bilingual-v1/dataset.yaml
index 5dfb4aa..ba653b6 100644
--- a/datasets/domain-bilingual-v1/dataset.yaml
+++ b/datasets/domain-bilingual-v1/dataset.yaml
@@ -5,4 +5,15 @@ description: >
   set; the zh/en split is recorded in reports/BASELINES.md (see
   docs/dikw-eval-plan.md §2.4). Thresholds are set at observed − margin from a
   real-vector calibration run.
-thresholds: {}
+  NOTE: this 24-distinct-topic corpus saturates the vector/hybrid views at 1.0, so
+  these gates are a high floor (regression detector), not a discriminative
+  benchmark — a denser, deliberately-confusable domain-bilingual-v2 is the
+  discriminative follow-up. See reports/BASELINES.md (2026-06-26).
+# Canonical doc/hybrid observed = 1.0 across the board (2026-06-26 calibration).
+# Gate at observed − margin: −0.05 for hit@k / mrr, −0.03 for ndcg / recall.
+thresholds:
+  hit_at_3: 0.95
+  hit_at_10: 0.95
+  mrr: 0.95
+  ndcg_at_10: 0.97
+  recall_at_100: 0.97

From 3a79e58d125703d788b9119fc543779a96dabe1e Mon Sep 17 00:00:00 2001
From: Le He <hele@Les-MacBook-Pro.local>
Date: Fri, 26 Jun 2026 22:11:56 +0800
Subject: [PATCH 7/7] eval: record Phase 1 in-house calibration
 (domain-bilingual-v1 + negatives-ood-v1)

Records the blended per-mode table and the zh/en split (vector/hybrid saturate at
1.0; bm25 mrr 0.985 / ndcg_at_10 0.989 is the only signal), the
domain-bilingual-v1 floor at observed - margin, and the negatives-ood-v1
observe-only diagnostics. The splitter reconciles with the engine within 1e-9.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 reports/BASELINES.md | 65 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
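
A small arithmetic check on the bm25 row, offered as a sketch and not as the
engine's code. It assumes the standard binary-gain nDCG with a log2 discount
and one relevant document per query (each query lists a single `expect_any`
stem). The per-query ranks are not part of this record, so the rank list below
is an assumption: 33 queries answered at rank 1 and one at rank 2. That case
happens to reproduce the recorded 0.985 / 0.989.

```python
import math


def reciprocal_rank(rank):
    """Reciprocal rank of the single relevant document (1-indexed rank)."""
    return 1.0 / rank


def ndcg_single_relevant(rank):
    """Binary-gain nDCG when exactly one document is relevant.

    The ideal DCG is 1.0 (relevant doc at rank 1), so nDCG reduces to the
    discounted gain at the observed rank.
    """
    return 1.0 / math.log2(rank + 1)


# Assumed ranks: 34 positives, one of them retrieved at rank 2 by bm25.
ranks = [1] * 33 + [2]

mrr = sum(reciprocal_rank(r) for r in ranks) / len(ranks)
ndcg = sum(ndcg_single_relevant(r) for r in ranks) / len(ranks)
print(round(mrr, 3), round(ndcg, 3))  # 0.985 0.989
```

The match is only evidence of consistency; other rank patterns are not ruled
out by the rounded figures alone.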

diff --git a/reports/BASELINES.md b/reports/BASELINES.md
index 25ed4ef..130860c 100644
--- a/reports/BASELINES.md
+++ b/reports/BASELINES.md
@@ -77,3 +77,68 @@ by absolute path from `dikw-core/evals/datasets/` (out-of-tree). Canonical view
 thresholds get set at `observed − margin` once the in-house sets
 (`domain-bilingual-v1`, `negatives-ood-v1`) exist — see `docs/dikw-eval-plan.md`
 §2.3 and §3.
+
+## 2026-06-26 — domain-bilingual-v1 + negatives-ood-v1 (Phase 1 in-house sets)
+
+Run from `dikw-data` against a read-only `dikw-core` v0.6.2 (editable, `[cjk]`).
+Provider: **MiniMax-M2.7** (LLM, used only for query *generation*, not retrieval)
+plus **Gitee Qwen3-Embedding-0.6B@1024** (embeddings) and sqlite. `--retrieval all`,
+`--eval retrieval`, `--cache read_write`, `serve-and-run`, 1 run each. Both reuse
+the `synthetic-diverse-v2` 24-doc corpus (12 zh / 12 en). Queries are
+LLM-generated through the MiniMax factory (`scripts/generate_candidates.py`), then
+human-verified as gold. Canonical view is `doc/hybrid`. See design:
+`docs/phase1-inhouse-datasets-design.md`.
+
+**domain-bilingual-v1** (34 positives: 18 zh + 16 en, every doc covered) —
+`passed: True`, exit 0.
+
+| mode | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 |
+|---|---|---|---|---|---|
+| bm25 | 1.000 | 1.000 | 0.985 | 0.989 | 1.000 |
+| vector | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+| **hybrid** | **1.000** | **1.000** | **1.000** | **1.000** | **1.000** |
+
+Per-language split (canonical `doc/hybrid`, via `tools/split_metrics_by_lang.py`):
+
+| lang | n | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 |
+|---|---|---|---|---|---|---|
+| all | 34 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+| zh | 18 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+| en | 16 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
+
+- **Saturates at 1.0** on vector/hybrid — 24 distinct-topic docs are trivially
+  separable by the embedder. Only **bm25** carries signal (`mrr 0.985`,
+  `ndcg_at_10 0.989`), from the deliberately intra-cluster-confusable history
+  queries. The splitter's `all` block reconciles with the engine's blended doc
+  metrics to within 1e-9 (the tool is validated against the engine's own formulas).
+- **Gate set at `observed − margin`** (first committed in-house floor):
+  `hit_at_3 0.95 / hit_at_10 0.95 / mrr 0.95 / ndcg_at_10 0.97 / recall_at_100 0.97`
+  (−0.05 hit@k/mrr, −0.03 ndcg/recall). This is a **regression-detector floor, not
+  a discriminative benchmark** — a denser, deliberately-confusable
+  `domain-bilingual-v2` is the discriminative follow-up.
+
+**negatives-ood-v1** (23 `expect_none`: 11 zh + 12 en, riding the same corpus) —
+`passed: True`, exit 0.
+
+- `thresholds: {}`, `metrics: {}` — **observe-only**. `expect_none` is *diagnostic
+  only* in dikw-core (`runner.py:244`: no threshold key, no exit-1), and doc-level
+  retrieval cannot abstain (it always returns a ranked list), so there is no
+  "satisfaction" metric to gate at this layer.
+- Diagnostic: for every off-topic query the top-ranked doc is an unrelated corpus
+  doc (e.g. `麻婆豆腐` → `science-plate-tectonics`), with no spurious strong match
+  into a doc from the query's neighbouring domain. Score-based pos-vs-neg separation
+  needs the served `retrieve` contract (scored cutoff) and is out of scope here.
+
+**Cross-cutting**
+
+- Read-only held: both datasets live under `dikw-data/datasets/` (corpus copied
+  from `synthetic-diverse-v2`). `git -C ../dikw-core status` stays clean.
+- Factory fix: MiniMax-M2.7 reasoning was exhausting the 4096-token output cap and
+  truncating candidate JSON mid-array (`stop_reason: max_tokens`); raised the
+  generation output budget to 16000 (`src/dikw_data/llm_client.py`).
+
+**Gates.** First committed in-house floor: `domain-bilingual-v1` (saturated →
+regression-detector). `negatives-ood-v1` observe-only. Recalibrate / promote once a
+discriminative `domain-bilingual-v2` exists (corpus > ~50 docs, deliberately
+confusable) — see `docs/dikw-eval-plan.md` §2.3/§3 and
+`docs/phase1-inhouse-datasets-design.md`.