domain-bilingual-v2 — discriminative confusable retrieval gate (corpus + query drafts) by helebest · Pull Request #11 · OpenDIKW/dikw-data

helebest · 2026-06-28T13:15:46Z

Draft — corpus + query drafts are in; awaiting human gold-verification + calibration.

The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid
at 1.0). Design: docs/phase1-domain-bilingual-v2-design.md.

What's in this PR now

- Corpus: 56 docs (8 intra-cluster-confusable clusters × 7, 28 zh / 28 en),
  codex gpt-5.5 xhigh. Sibling-injection: each doc's prompt names its cluster
  siblings → docs overlap in vocabulary but own distinct answerable facts.
- queries.yaml: 56 confusable queries (1/doc), DRAFT. Each is lexically tempting
  toward siblings yet uniquely answered by its gold doc.
- scripts/generate_domain_bilingual_v2.py: the 8-cluster spec as code, with corpus
  (--provider codex) and query (--queries) generation.
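
A minimal sketch of the sibling-injection idea described above, not the code in scripts/generate_domain_bilingual_v2.py. The `DocSpec` fields, the `build_prompt` helper, and the prompt wording are all hypothetical; the only thing taken from the PR is that each doc's prompt names its cluster siblings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocSpec:
    """Hypothetical per-document spec; field names are assumptions."""
    doc_id: str
    lang: str      # "zh" or "en"
    topic: str
    key_fact: str  # the fact only this doc should answer


def build_prompt(doc: DocSpec, cluster: list[DocSpec]) -> str:
    """Name the cluster siblings in the prompt so the generated doc shares
    their vocabulary while keeping its own answerable fact."""
    siblings = [d for d in cluster if d.doc_id != doc.doc_id]
    sibling_lines = "\n".join(f"- {d.topic}" for d in siblings)
    return (
        f"Write a document in language '{doc.lang}' about: {doc.topic}\n"
        f"It must state this fact, which no sibling states: {doc.key_fact}\n"
        "Sibling topics in the same cluster (reuse their terminology, "
        "but do not state their facts):\n"
        f"{sibling_lines}\n"
    )
```

Only the sibling topics are injected, never the sibling facts, which is one way to get vocabulary overlap without leaking another doc's answer.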

Status / what's left

✅ lint-type-test green; validate_dataset passes.
❌ baseline-must-update red by design — v2 thresholds are placeholder
(empty); the BASELINES entry + gate land only after the single calibration run,
which is deferred to post-dikw-core#249/#250 (parallel-plan Workstream D).
This PR is not mergeable until then.
👤 Human gold-verification needed on queries.yaml: confirm each gold is the
uniquely correct answer, drop or fix ambiguous queries, and tighten queries that
over-name the gold (some embed the answer's distinctive term verbatim, which can
make retrieval too easy — they should describe the answer without naming it).
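
One way a reviewer might pre-screen for over-naming before reading each query by hand is a verbatim-substring check. This is a sketch under assumptions: the `verbatim_terms` helper and the idea of a per-doc list of distinctive terms are hypothetical, and nothing here reflects the actual schema of queries.yaml.

```python
def verbatim_terms(query: str, distinctive_terms: list[str]) -> list[str]:
    """Return the distinctive terms that appear verbatim in the query.

    Matching is a case-insensitive substring test, so it also works for
    zh text, which has no whitespace word boundaries.
    """
    q = query.casefold()
    return [t for t in distinctive_terms if t.casefold() in q]


# Hypothetical example: the first query names the answer term outright,
# the second only describes it.
terms = ["write-ahead log"]
flagged = verbatim_terms("How does the Write-Ahead Log survive a crash?", terms)
clean = verbatim_terms("Which mechanism replays changes after a crash?", terms)
```

A non-empty result only marks a query for human review; whether the term really gives the answer away is still a judgment call.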

🤖 Generated with Claude Code

…eval gate)

domain-bilingual-v1 saturates the vector/hybrid views at 1.0 (24 mutually-distinct topics), so its gate is a high regression floor, not a discriminative benchmark. v2 fixes this at the corpus level: 8 intra-cluster-confusable topic clusters (~56 docs, 28 zh / 28 en) where a query's gold doc must out-rank ~6 near-neighbour siblings, giving ndcg_at_10 / mrr real ranking signal.

Design only (no spend yet): corpus + confusable queries are generated via codex gpt-5.5 xhigh after sign-off; threshold calibration is deferred to the single post-#249/#250 run (parallel-plan Workstream D). dikw-core stays read-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
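
To see why a sibling out-ranking the gold doc moves mrr and ndcg_at_10, here is a small sketch using the textbook definitions for one gold doc per query with binary relevance. It is illustrative only; the function names are hypothetical and dikw-core's own metric code may differ in detail.

```python
import math


def reciprocal_rank(ranked: list[str], gold: str) -> float:
    """1 / rank of the gold doc, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id == gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gold: str, k: int = 10) -> float:
    """nDCG@k with a single relevant doc: the ideal DCG is 1, so the
    score reduces to 1 / log2(rank + 1) when the gold is in the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id == gold:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Gold ranked first: both metrics are 1.0.
top = ["gold", "sib1", "sib2"]
# Two siblings out-rank the gold: rr = 1/3, ndcg = 1/log2(4) = 0.5.
third = ["sib1", "sib2", "gold"]
```

With mutually distinct topics the gold almost always lands at rank 1 and both metrics pin at 1.0; with confusable siblings, each lost position costs measurable score.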

…s (codex)

The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid at 1.0). 8 intra-cluster-confusable clusters x 7 = 56 docs (28 zh / 28 en) plus 56 confusable queries (1/doc), all generated by codex gpt-5.5 xhigh.

The mechanism is sibling-injection: each doc's prompt names its cluster siblings so documents overlap in vocabulary/framing but own distinct answerable facts; each query is phrased to be lexically tempting toward the siblings yet uniquely answered by its gold doc. That gives ndcg_at_10 / mrr real ranking signal.

- scripts/generate_domain_bilingual_v2.py: the 8-cluster spec as code + corpus (--provider codex) and query (--queries) generation, reusing the factory (RetryingMiniMaxClient, retries, audit, --resume).
- datasets/domain-bilingual-v2/{corpus (56 .md), queries.yaml, dataset.yaml}.

queries.yaml is a DRAFT: gold targets need human verification (uniqueness, sibling-wrongness, and tightening queries that over-name the gold). Thresholds are placeholder (empty), to be calibrated at observed-margin from the single post-#249/#250 run (Workstream D).

validate_dataset passes; ruff + mypy src + pytest green (87).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ector run)

One real-vector run (dikw-core 0.6.4, Gitee Qwen3@1024, --retrieval all, cold embed): hit_at_3 1.000 / hit_at_10 1.000 / mrr 0.991 / ndcg_at_10 0.993 / recall_at_100 1.000.

Thresholds set at observed-margin (-0.05 hit@k/mrr, -0.03 ndcg/recall): hit_at_3 0.95 / hit_at_10 0.95 / mrr 0.94 / ndcg_at_10 0.96 / recall_at_100 0.97.

Honest finding: only PARTIALLY de-saturated. bm25 (ndcg 0.967) carries real intra-cluster signal, so the confusable corpus works, but the DRAFT queries over-name their gold (they embed the answer term verbatim), so vector/hybrid stay ~0.99 and the zh slice is fully 1.0. A query gold-tightening pass is the discriminative follow-up; recalibrate after it.

reports/BASELINES.md records the full per-mode + zh/en split. v2 calibration is independent of #249 (no negatives) and #250 (single fixed config).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
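
The observed-margin arithmetic in the commit above can be reproduced as follows. The observed values and margins come from the commit message; rounding to two decimals is an assumption that happens to match the stated thresholds, and the `thresholds` helper is a hypothetical name, not a function from the repository.

```python
# Observed metrics and margins as quoted in the commit message.
OBSERVED = {
    "hit_at_3": 1.000,
    "hit_at_10": 1.000,
    "mrr": 0.991,
    "ndcg_at_10": 0.993,
    "recall_at_100": 1.000,
}
MARGIN = {
    "hit_at_3": 0.05,
    "hit_at_10": 0.05,
    "mrr": 0.05,
    "ndcg_at_10": 0.03,
    "recall_at_100": 0.03,
}


def thresholds(observed: dict[str, float], margin: dict[str, float]) -> dict[str, float]:
    """Gate threshold = observed value minus its margin, rounded to two
    decimals (the rounding rule is an assumption, not stated in the PR)."""
    return {name: round(value - margin[name], 2) for name, value in observed.items()}


gate = thresholds(OBSERVED, MARGIN)
# gate: hit_at_3 0.95, hit_at_10 0.95, mrr 0.94, ndcg_at_10 0.96, recall_at_100 0.97
```

Because vector/hybrid still sit near 0.99, these thresholds act mostly as a regression floor until the query-tightening pass lowers the observed scores.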

helebest and others added 2 commits June 28, 2026 21:21

helebest force-pushed the eval/domain-bilingual-v2 branch from f75c829 to dd00c23 Compare June 28, 2026 14:00

helebest changed the title ~~[design] domain-bilingual-v2 — discriminative confusable retrieval gate~~ domain-bilingual-v2 — discriminative confusable retrieval gate (corpus + query drafts) Jun 28, 2026

helebest marked this pull request as ready for review June 28, 2026 14:59

helebest merged commit 544656e into main Jun 28, 2026
3 checks passed

helebest deleted the eval/domain-bilingual-v2 branch June 28, 2026 15:00

helebest merged 3 commits into main from eval/domain-bilingual-v2
