feat(datasets): Phase 1 in-house sets — domain-bilingual-v1 + negatives-ood-v1 by helebest · Pull Request #7 · OpenDIKW/dikw-data

helebest · 2026-06-26T14:13:20Z

Implements eval-plan datasets ii (domain-bilingual-v1) and iii
(negatives-ood-v1) — the first in-house Phase-1 retrieval sets. Design:
docs/phase1-inhouse-datasets-design.md; plan: docs/superpowers/plans/.

What's here

- domain-bilingual-v1: 34 verified positives (18 zh + 16 en) over the reused
  synthetic-diverse-v2 24-doc corpus; every doc covered; intra-cluster-confusable
  history queries for ranking signal. Single dataset, blended engine gate; zh/en
  split recorded in reports/BASELINES.md (§2.4) via tools/split_metrics_by_lang.py.
- negatives-ood-v1: 23 verified expect_none queries (11 zh + 12 en),
  plausible-but-uncovered; observe-only (expect_none is diagnostic-only in dikw-core).
- tools/split_metrics_by_lang.py (+ tests): per-language metric split from an
  eval NDJSON; formulas mirror dikw-core's, and the split reconciles with the
  engine within 1e-9.
- Factory: --instruction steering for generate_candidates.py; raised MiniMax
  output budget to 16000 (reasoning was truncating candidate JSON at 4096).
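
The per-language split mentioned above can be pictured roughly as follows. This is a minimal sketch, not the actual tools/split_metrics_by_lang.py: it assumes one JSON object per NDJSON line with hypothetical fields `lang` and `first_relevant_rank` (1-based, or null when nothing relevant was returned), and it uses simplified hit@k and MRR definitions rather than dikw-core's exact formulas.

```python
import json
from collections import defaultdict


def split_by_lang(ndjson_text, k=10):
    """Average hit@k and MRR per language, plus a blended "all" group.

    Assumed row shape (hypothetical): {"lang": "zh", "first_relevant_rank": 2}
    """
    groups = defaultdict(list)
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the NDJSON
        row = json.loads(line)
        groups[row["lang"]].append(row["first_relevant_rank"])
        groups["all"].append(row["first_relevant_rank"])

    out = {}
    for lang, ranks in groups.items():
        n = len(ranks)
        out[lang] = {
            "n": n,
            # A query is a hit when its first relevant doc is within the top k.
            "hit_at_k": sum(1 for r in ranks if r and r <= k) / n,
            # Reciprocal rank is 0.0 when no relevant doc was returned.
            "mrr": sum(1.0 / r if r else 0.0 for r in ranks) / n,
        }
    return out


def reconciles(split, tol=1e-9):
    """Check that the count-weighted per-language means match the blended group."""
    langs = [name for name in split if name != "all"]
    total = sum(split[name]["n"] for name in langs)
    for metric in ("hit_at_k", "mrr"):
        blended = sum(split[name]["n"] * split[name][metric] for name in langs) / total
        if abs(blended - split["all"][metric]) > tol:
            return False
    return True
```

The `reconciles` check mirrors the idea of comparing the split against the engine's blended numbers: because each metric here is a per-query mean, the count-weighted average of the zh and en groups should equal the blended value up to floating-point error.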

Calibration (real-vector, dikw-core v0.6.2, MiniMax+Gitee)

- domain-bilingual-v1: passed, exit 0. Saturates at 1.0 on vector/hybrid (24
  distinct topics); bm25 mrr 0.985 / ndcg 0.989 is the only signal. Gate at
  observed−margin (hit@k/mrr 0.95, ndcg/recall 0.97): a regression-detector
  floor, not a discriminative benchmark. Denser domain-bilingual-v2 is the
  discriminative follow-up.
- negatives-ood-v1: passed, exit 0, observe-only.
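
The observed−margin gate above amounts to simple arithmetic, sketched below. The function and key names are hypothetical, and the margins (0.05 for hit@k/mrr, 0.03 for ndcg/recall) are inferred from the saturated 1.0 observation and the stated floors rather than taken from dataset.yaml.

```python
def calibrate_floor(observed, margins):
    """Return per-metric floors: the observed value minus that metric's margin."""
    # Rounding keeps floors like 0.95 free of floating-point noise.
    return {name: round(value - margins[name], 4) for name, value in observed.items()}


def passes_gate(metrics, floors):
    """A run passes when every gated metric is at or above its floor."""
    return all(metrics[name] >= floor for name, floor in floors.items())


# Saturated observation from the calibration run (vector/hybrid views).
observed = {"hit_at_k": 1.0, "mrr": 1.0, "ndcg": 1.0, "recall": 1.0}
# Inferred margins (assumption): 1.0 - 0.05 = 0.95 and 1.0 - 0.03 = 0.97.
margins = {"hit_at_k": 0.05, "mrr": 0.05, "ndcg": 0.03, "recall": 0.03}

floors = calibrate_floor(observed, margins)
```

With a saturated observation, the floor only trips when a later run drops by more than the margin, which is why it works as a regression detector and not as a benchmark that separates retrieval modes.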

Verification

ruff + mypy src + pytest green; both datasets validate; eval-gate content
check passes locally; dikw-core tracked tree clean (read-only held).

Merge order #5 → #6 → #7 (GitHub auto-retargets on each merge).

🤖 Generated with Claude Code

…atives-ood-v1)

Approved design implementing eval-plan datasets ii/iii: single bilingual dataset with engine-gated blended thresholds plus an offline zh/en split recorded in BASELINES.md, a reused 24-doc corpus, LLM-generated-then-verified queries, and an observe-only off-corpus negatives set. Calibrate thresholds at observed - margin from one real-vector run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

8-task plan: per-language metric splitter (TDD), corpus scaffold, LLM query generation via the MiniMax factory, curation/materialization, real-vector calibration with observed-margin thresholds, zh/en split into BASELINES.md, and the stacked PR. dikw-core read-only throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Recovers the zh/en breakdown (docs/dikw-eval-plan.md §2.4) from an eval NDJSON's per_query rows, since dataset.yaml thresholds are flat. Metric formulas mirror dikw-core/src/dikw_core/eval/metrics.py exactly so split_metrics(all) reconciles with the engine's blended doc metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…us copy)

Both reuse synthetic-diverse-v2's 24-doc corpus (12 zh / 12 en). dataset.yaml thresholds left empty; domain-bilingual-v1 is calibrated to observed-margin after the real-vector run, negatives-ood-v1 stays observe-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…via factory

domain-bilingual-v1: 34 verified positives (18 zh + 16 en) covering all 24 docs, with deliberate intra-cluster-confusable queries in the history clusters for ranking signal.
negatives-ood-v1: 23 verified expect_none queries (11 zh + 12 en), plausible-but-uncovered.
Generated through the MiniMax factory (scripts/generate_candidates.py --instruction steering) then human-verified gold.

Fixes:
- generate_candidates.py: add --instruction passthrough for targeted generation.
- llm_client.py: raise output budget to 16000 tokens; MiniMax-M2.7 reasoning was exhausting the 4096 cap and truncating JSON mid-array (stop_reason max_tokens).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
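
The truncation failure described in this commit (JSON cut off mid-array when the output budget runs out) can be guarded against along these lines. This is only an illustrative sketch: `parse_candidates`, `TruncatedOutputError`, and the idea that the client exposes the response text and a `stop_reason` string are assumptions, not the actual llm_client.py code.

```python
import json


class TruncatedOutputError(RuntimeError):
    """Raised when the model stopped because it hit the output-token budget."""


def parse_candidates(text, stop_reason):
    """Parse candidate JSON, refusing responses that were cut off by the token cap.

    `text` and `stop_reason` are assumed to come from a hypothetical client
    response; "max_tokens" is the stop reason quoted in the commit message.
    """
    if stop_reason == "max_tokens":
        # A truncated array is not worth repairing; fail loudly so the caller
        # can retry with a larger output budget.
        raise TruncatedOutputError("output budget exhausted before the JSON was complete")
    return json.loads(text)
```

Checking the stop reason before parsing turns a confusing JSON decode error into an explicit signal that the budget, not the prompt, needs adjusting.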

…margin

Canonical doc/hybrid observed 1.0 (the 24-distinct-topic corpus saturates the vector/hybrid views). Gate at observed-margin: hit@k/mrr 0.95, ndcg/recall 0.97, which is a regression-detector floor, not a discriminative benchmark. Denser confusable domain-bilingual-v2 noted as the discriminative follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tives-ood-v1)

Blended per-mode table + zh/en split (saturated 1.0; bm25 mrr 0.985/ndcg 0.989 the only signal), domain-bilingual-v1 floor at observed-margin, negatives-ood-v1 observe-only diagnostics. Splitter reconciles with the engine within 1e-9.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

helebest force-pushed the eval/anchor-calibration branch from 8cbc031 to 1af56be June 28, 2026 12:57

helebest force-pushed the eval/phase1-inhouse-datasets branch from 0c0795a to 57d6eb1 June 28, 2026 12:57

helebest force-pushed the eval/anchor-calibration branch from 1af56be to 3d46bf4 June 28, 2026 12:58

helebest changed the base branch from eval/anchor-calibration to main June 28, 2026 12:59

Le He and others added 7 commits June 28, 2026 21:00

helebest force-pushed the eval/phase1-inhouse-datasets branch from 57d6eb1 to 3a79e58 June 28, 2026 13:01

helebest merged commit 6df064f into main Jun 28, 2026

helebest mentioned this pull request Jun 28, 2026

domain-bilingual-v2 — discriminative confusable retrieval gate (corpus + query drafts) #11

Merged

helebest deleted the eval/phase1-inhouse-datasets branch June 28, 2026 14:45

helebest merged 7 commits into main from eval/phase1-inhouse-datasets
