feat(datasets): Phase 1 in-house sets — domain-bilingual-v1 + negatives-ood-v1 #7

Merged
helebest merged 7 commits into main from eval/phase1-inhouse-datasets
Jun 28, 2026

@helebest

Collaborator

Implements eval-plan datasets ii (domain-bilingual-v1) and iii
(negatives-ood-v1) — the first in-house Phase-1 retrieval sets. Design:
docs/phase1-inhouse-datasets-design.md; plan: docs/superpowers/plans/.

What's here

  • domain-bilingual-v1 — 34 verified positives (18 zh + 16 en) over the reused
    synthetic-diverse-v2 24-doc corpus; every doc covered; intra-cluster-confusable
    history queries for ranking signal. Single dataset, blended engine gate; zh/en
    split recorded in reports/BASELINES.md (§2.4) via tools/split_metrics_by_lang.py.
  • negatives-ood-v1 — 23 verified expect_none queries (11 zh + 12 en),
    plausible-but-uncovered; observe-only (expect_none is diagnostic-only in dikw-core).
  • tools/split_metrics_by_lang.py (+ tests) — per-language metric split from an
    eval NDJSON; formulas mirror dikw-core's, reconciles with the engine within 1e-9.
  • Factory: --instruction steering for generate_candidates.py; raised MiniMax
    output budget to 16000 (reasoning was truncating candidate JSON at 4096).
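
The per-language split described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/split_metrics_by_lang.py: the row fields `lang`, `ranked`, and `gold` are assumed names for this example, and only MRR is shown.

```python
import json
from collections import defaultdict


def reciprocal_rank(ranked_ids, gold_ids):
    """Return 1/rank of the first gold doc in the ranking, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


def split_mrr_by_lang(ndjson_lines):
    """Compute MRR per language plus a blended "all" bucket.

    Each non-empty line is assumed to be one per-query JSON object with the
    hypothetical fields "lang", "ranked" (doc ids, best first) and "gold".
    """
    buckets = defaultdict(list)
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        rr = reciprocal_rank(row["ranked"], set(row["gold"]))
        buckets[row["lang"]].append(rr)
        buckets["all"].append(rr)
    return {lang: sum(values) / len(values) for lang, values in buckets.items()}


# Illustrative rows only; these are not taken from the real eval output.
sample = [
    '{"lang": "zh", "ranked": ["d1", "d2"], "gold": ["d1"]}',
    '{"lang": "zh", "ranked": ["d3", "d4"], "gold": ["d4"]}',
    '{"lang": "en", "ranked": ["d5"], "gold": ["d5"]}',
]
result = split_mrr_by_lang(sample)
```

Keeping an "all" bucket alongside the per-language ones makes it possible to compare the blended value against the engine's reported metric, which is the kind of reconciliation the 1e-9 tolerance refers to.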

Calibration (real-vector, dikw-core v0.6.2, MiniMax+Gitee)

  • domain-bilingual-v1: passed, exit 0. Saturates at 1.0 on vector/hybrid (24
    distinct topics); bm25 mrr 0.985 / ndcg 0.989 the only signal. Gate at
    observed−margin (hit@k/mrr 0.95, ndcg/recall 0.97) — a regression-detector
    floor, not a discriminative benchmark. Denser domain-bilingual-v2 is the
    discriminative follow-up.
  • negatives-ood-v1: passed, exit 0, observe-only.
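
As a rough illustration of the observed−margin gate, the sketch below derives floors from one observed run. The metric key names are assumptions for this example, and the margins (0.05 and 0.03) are inferred from the observed 1.0 and the stated floors rather than quoted from the design doc.

```python
# Margins inferred from observed 1.0 and the floors 0.95 / 0.97 (assumption).
MARGINS = {"hit_at_k": 0.05, "mrr": 0.05, "ndcg": 0.03, "recall": 0.03}


def calibrate(observed, margins=MARGINS):
    """Set each threshold to observed minus margin, clamped at 0 and rounded to 2 places."""
    return {
        metric: round(max(0.0, observed[metric] - margin), 2)
        for metric, margin in margins.items()
    }


def passes(observed, thresholds):
    """A run passes only if every gated metric is at or above its floor."""
    return all(observed[metric] >= floor for metric, floor in thresholds.items())


# The saturated run from the calibration above: every gated metric at 1.0.
saturated = {"hit_at_k": 1.0, "mrr": 1.0, "ndcg": 1.0, "recall": 1.0}
thresholds = calibrate(saturated)

# A hypothetical regressed run, for illustration only.
regressed = {"hit_at_k": 1.0, "mrr": 0.90, "ndcg": 0.96, "recall": 1.0}
```

Because the observed values are saturated, a floor set this way only catches drops below the margin. It says nothing about ranking quality above it, which matches the "regression-detector floor" framing.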

Verification

  • ruff + mypy src + pytest green; both datasets validate; eval-gate content
    check passes locally; dikw-core tracked tree clean (read-only held).

Merge order: #5, then #6, then #7 (GitHub auto-retargets on each merge).

🤖 Generated with Claude Code

@helebest helebest force-pushed the eval/anchor-calibration branch from 8cbc031 to 1af56be Compare June 28, 2026 12:57
@helebest helebest force-pushed the eval/phase1-inhouse-datasets branch from 0c0795a to 57d6eb1 Compare June 28, 2026 12:57
@helebest helebest force-pushed the eval/anchor-calibration branch from 1af56be to 3d46bf4 Compare June 28, 2026 12:58
@helebest helebest changed the base branch from eval/anchor-calibration to main June 28, 2026 12:59
Le He and others added 7 commits June 28, 2026 21:00
…atives-ood-v1)

Approved design implementing eval-plan datasets ii/iii: single bilingual
dataset with engine-gated blended thresholds plus an offline zh/en split
recorded in BASELINES.md, a reused 24-doc corpus, LLM-generated-then-verified
queries, and an observe-only off-corpus negatives set. Calibrate thresholds at
observed - margin from one real-vector run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
8-task plan: per-language metric splitter (TDD), corpus scaffold, LLM query
generation via the MiniMax factory, curation/materialization, real-vector
calibration with observed-margin thresholds, zh/en split into BASELINES.md, and
the stacked PR. dikw-core read-only throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Recovers the zh/en breakdown (docs/dikw-eval-plan.md §2.4) from an eval NDJSON's
per_query rows, since dataset.yaml thresholds are flat. Metric formulas mirror
dikw-core/src/dikw_core/eval/metrics.py exactly so split_metrics(all) reconciles
with the engine's blended doc metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…us copy)

Both reuse synthetic-diverse-v2's 24-doc corpus (12 zh / 12 en). dataset.yaml
thresholds left empty; domain-bilingual-v1 is calibrated to observed-margin after
the real-vector run, negatives-ood-v1 stays observe-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…via factory

domain-bilingual-v1: 34 verified positives (18 zh + 16 en) covering all 24 docs,
with deliberate intra-cluster-confusable queries in the history clusters for
ranking signal. negatives-ood-v1: 23 verified expect_none queries (11 zh + 12 en),
plausible-but-uncovered.

Generated through the MiniMax factory (scripts/generate_candidates.py --instruction
steering) then human-verified gold. Fixes:
- generate_candidates.py: add --instruction passthrough for targeted generation.
- llm_client.py: raise output budget to 16000 tokens — MiniMax-M2.7 reasoning was
  exhausting the 4096 cap and truncating JSON mid-array (stop_reason max_tokens).
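
One way to surface this failure mode early is to check the stop reason before parsing. The sketch below is hypothetical and is not the code in llm_client.py: `parse_candidates` and `TruncatedOutputError` are names invented for this example, and it assumes the client exposes the response text and a `stop_reason` string.

```python
import json


class TruncatedOutputError(RuntimeError):
    """Raised when the model stopped because it hit the output token cap."""


def parse_candidates(text, stop_reason):
    """Parse candidate JSON, refusing output that was cut off by the token cap.

    Failing on stop_reason first gives a clearer error than the JSON decode
    error that a mid-array truncation would otherwise produce.
    """
    if stop_reason == "max_tokens":
        raise TruncatedOutputError(
            "output hit the token cap; raise the output budget and retry"
        )
    return json.loads(text)
```

Output that is malformed for other reasons still raises `json.JSONDecodeError` from `json.loads`, so the two failure modes stay distinguishable.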

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…margin

Canonical doc/hybrid observed 1.0 (the 24-distinct-topic corpus saturates the
vector/hybrid views). Gate at observed-margin: hit@k/mrr 0.95, ndcg/recall 0.97 —
a regression-detector floor, not a discriminative benchmark. Denser confusable
domain-bilingual-v2 noted as the discriminative follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tives-ood-v1)

Blended per-mode table + zh/en split (saturated 1.0; bm25 mrr 0.985/ndcg 0.989 the
only signal), domain-bilingual-v1 floor at observed-margin, negatives-ood-v1
observe-only diagnostics. Splitter reconciles with the engine within 1e-9.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@helebest helebest force-pushed the eval/phase1-inhouse-datasets branch from 57d6eb1 to 3a79e58 Compare June 28, 2026 13:01
@helebest helebest merged commit 6df064f into main Jun 28, 2026
@helebest helebest deleted the eval/phase1-inhouse-datasets branch June 28, 2026 14:45