domain-bilingual-v2: discriminative confusable retrieval gate (corpus + query drafts) #11
Merged
…eval gate)

domain-bilingual-v1 saturates the vector/hybrid views at 1.0 (24 mutually-distinct topics), so its gate is a high regression floor, not a discriminative benchmark. v2 fixes this at the corpus level: 8 intra-cluster-confusable topic clusters (~56 docs, 28 zh / 28 en) where a query's gold doc must out-rank ~6 near-neighbour siblings, giving ndcg_at_10 / mrr real ranking signal.

Design only (no spend yet): corpus + confusable queries are generated via codex gpt-5.5 xhigh after sign-off; threshold calibration is deferred to the single post-#249/#250 run (parallel-plan Workstream D). dikw-core stays read-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
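To illustrate why out-ranking siblings gives mrr and ndcg_at_10 real signal, here is a minimal sketch that assumes binary relevance with exactly one gold doc per query. The function names and doc ids are hypothetical, and the project's eval harness may define these metrics differently.

```python
import math


def reciprocal_rank(ranked_ids, gold_id):
    """Return 1/rank of the gold doc, or 0.0 if it is not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gold_id, k=10):
    """Binary-relevance nDCG@k with a single gold doc.

    The ideal ranking puts the gold at rank 1 (ideal DCG = 1.0),
    so nDCG reduces to 1 / log2(1 + rank).
    """
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / math.log2(1 + rank)
    return 0.0


# Hypothetical cluster of 7: the gold doc plus 6 near-neighbour siblings.
siblings = [f"sib{i}" for i in range(1, 7)]
clean_ranking = ["gold"] + siblings                      # gold wins outright
confused_ranking = siblings[:2] + ["gold"] + siblings[2:]  # two siblings out-rank it

rr_clean = reciprocal_rank(clean_ranking, "gold")
rr_confused = reciprocal_rank(confused_ranking, "gold")
ndcg_clean = ndcg_at_k(clean_ranking, "gold")
ndcg_confused = ndcg_at_k(confused_ranking, "gold")
```

Under these assumptions both rankings still count as a hit at 3 and at 10, but the reciprocal rank and nDCG of the confused ranking drop, which is the kind of signal a saturated corpus cannot produce.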
…s (codex)
The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid
at 1.0). 8 intra-cluster-confusable clusters x 7 = 56 docs (28 zh / 28 en) plus 56
confusable queries (1/doc), all generated by codex gpt-5.5 xhigh.
The mechanism is sibling-injection: each doc's prompt names its cluster siblings so
documents overlap in vocabulary/framing but own distinct answerable facts; each
query is phrased to be lexically tempting toward the siblings yet uniquely answered
by its gold doc. That gives ndcg_at_10 / mrr real ranking signal.
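The sibling-injection idea can be sketched roughly as follows. This is not the code in scripts/generate_domain_bilingual_v2.py: `DocSpec`, `build_doc_prompt`, the prompt wording, and the example topics are all hypothetical and only show the shape of a prompt that names a doc's cluster siblings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocSpec:
    doc_id: str
    lang: str      # "zh" or "en"
    topic: str
    own_fact: str  # the fact only this doc should be able to answer


def build_doc_prompt(doc, cluster):
    """Build a generation prompt that names every other doc in the cluster."""
    siblings = [d for d in cluster if d.doc_id != doc.doc_id]
    sibling_lines = "\n".join(f"- {s.topic}" for s in siblings)
    return (
        f"Write a short document in language '{doc.lang}' about: {doc.topic}\n"
        f"It must state this fact, which no sibling states: {doc.own_fact}\n"
        "Share vocabulary and framing with these sibling topics, "
        "but do not answer their questions:\n"
        f"{sibling_lines}\n"
    )


# Hypothetical 3-doc cluster (the real clusters have 7 docs each).
cluster = [
    DocSpec("c1-d1", "en", "cache eviction by recency", "evicts the least recently used entry"),
    DocSpec("c1-d2", "en", "cache eviction by frequency", "evicts the least frequently used entry"),
    DocSpec("c1-d3", "zh", "cache eviction by age", "evicts the oldest inserted entry"),
]

prompt = build_doc_prompt(cluster[0], cluster)
```

Naming the siblings in the prompt is what pushes the generated docs toward shared vocabulary, while the per-doc fact keeps each one uniquely answerable.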
- scripts/generate_domain_bilingual_v2.py: the 8-cluster spec as code + corpus
(--provider codex) and query (--queries) generation, reusing the factory
(RetryingMiniMaxClient, retries, audit, --resume).
- datasets/domain-bilingual-v2/{corpus (56 .md), queries.yaml, dataset.yaml}.
queries.yaml is a DRAFT: gold targets need human verification (uniqueness,
sibling-wrongness, and tightening queries that over-name the gold). Thresholds are
placeholders (empty), to be calibrated at observed-margin from the single
post-#249/#250 run (Workstream D). validate_dataset passes; ruff + mypy src + pytest
green (87).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f75c829 to
dd00c23
…ector run)

One real-vector run (dikw-core 0.6.4, Gitee Qwen3@1024, --retrieval all, cold embed):
hit_at_3 1.000 / hit_at_10 1.000 / mrr 0.991 / ndcg_at_10 0.993 / recall_at_100 1.000.

Thresholds set at observed-margin (-0.05 hit@k/mrr, -0.03 ndcg/recall):
hit_at_3 0.95 / hit_at_10 0.95 / mrr 0.94 / ndcg_at_10 0.96 / recall_at_100 0.97.

Honest finding: only PARTIALLY de-saturated. bm25 (ndcg 0.967) carries real intra-cluster signal, so the confusable corpus works, but the DRAFT queries over-name their gold (embed the answer term verbatim), so vector/hybrid stay ~0.99 and the zh slice is fully 1.0. A query gold-tightening pass is the discriminative follow-up; recalibrate after it.

reports/BASELINES.md records the full per-mode + zh/en split. v2 calibration is independent of #249 (no negatives) and #250 (single fixed config).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
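The observed-margin arithmetic above can be reproduced with a short sketch. The observed values and margins are the ones quoted in the commit message; the `calibrate` helper and the choice to round down to two decimals are assumptions made for illustration, not the project's calibration code.

```python
from decimal import Decimal, ROUND_DOWN

# Observed metrics from the single real-vector run quoted above.
OBSERVED = {
    "hit_at_3": Decimal("1.000"),
    "hit_at_10": Decimal("1.000"),
    "mrr": Decimal("0.991"),
    "ndcg_at_10": Decimal("0.993"),
    "recall_at_100": Decimal("1.000"),
}

# Margins quoted above: -0.05 for hit@k and mrr, -0.03 for ndcg and recall.
MARGIN = {
    "hit_at_3": Decimal("0.05"),
    "hit_at_10": Decimal("0.05"),
    "mrr": Decimal("0.05"),
    "ndcg_at_10": Decimal("0.03"),
    "recall_at_100": Decimal("0.03"),
}


def calibrate(observed, margin):
    """Subtract the margin, then round down to 2 decimals (assumed rounding rule)."""
    return {
        metric: (observed[metric] - margin[metric]).quantize(
            Decimal("0.01"), rounding=ROUND_DOWN
        )
        for metric in observed
    }


thresholds = calibrate(OBSERVED, MARGIN)
for metric, value in thresholds.items():
    print(f"{metric}: {value}")
```

Decimal is used instead of float so that values such as 1.000 - 0.05 do not pick up binary rounding noise before truncation.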
Draft — corpus + query drafts are in; awaiting human gold-verification + calibration.
The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid at 1.0). Design: docs/phase1-domain-bilingual-v2-design.md.

What's in this PR now
- … codex gpt-5.5 xhigh. Sibling-injection: each doc's prompt names its cluster siblings → docs overlap in vocabulary but own distinct answerable facts.
- … toward siblings yet uniquely answered by its gold doc.
- scripts/generate_domain_bilingual_v2.py: the 8-cluster spec as code, with corpus (--provider codex) and query (--queries) generation.

Status / what's left
- lint-type-test green; validate_dataset passes.
- baseline-must-update red by design: v2 thresholds are placeholder (empty); the BASELINES entry + gate land only after the single calibration run, which is deferred to post-dikw-core#249/#250 (parallel-plan Workstream D). This PR is not mergeable until then.
- queries.yaml: confirm each gold is the uniquely correct answer, drop/fix ambiguous ones, and tighten queries that over-name the gold (some embed the answer's distinctive term verbatim, which can make retrieval too easy; they should describe the answer without naming it).
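As a possible aid for the tightening pass, the sketch below flags queries that contain a distinctive term of their gold doc verbatim. The record fields (`id`, `gold`, `query`), the `distinctive_terms` mapping, and the example data are hypothetical; the real queries.yaml schema is not shown here, and the terms would have to be curated by a human.

```python
def over_named_queries(queries, distinctive_terms):
    """Return ids of queries whose text contains a gold-doc term verbatim.

    Uses case-insensitive substring matching, which needs no tokenizer
    and so behaves the same way for zh and en text.
    """
    flagged = []
    for q in queries:
        text = q["query"].casefold()
        terms = distinctive_terms.get(q["gold"], [])
        if any(term.casefold() in text for term in terms):
            flagged.append(q["id"])
    return flagged


# Hypothetical data: one query names the answer, one only describes it.
queries = [
    {"id": "q1", "gold": "doc-a", "query": "What does the Halvorsen protocol require?"},
    {"id": "q2", "gold": "doc-a", "query": "Which procedure requires two independent sign-offs?"},
]
distinctive_terms = {"doc-a": ["Halvorsen protocol"]}

flagged = over_named_queries(queries, distinctive_terms)
print(flagged)
```

A flagged query is only a candidate for rewording; whether it actually makes retrieval too easy remains a human judgment.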
🤖 Generated with Claude Code