
domain-bilingual-v2 — discriminative confusable retrieval gate (corpus + query drafts)#11

Merged
helebest merged 3 commits into main from eval/domain-bilingual-v2 on Jun 28, 2026

helebest (Collaborator) commented Jun 28, 2026

Draft — corpus + query drafts are in; awaiting human gold-verification + calibration.

The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid
at 1.0). Design: docs/phase1-domain-bilingual-v2-design.md.

What's in this PR now

  • Corpus: 56 docs (8 intra-cluster-confusable clusters × 7 docs each, 28 zh / 28 en),
    generated with codex gpt-5.5 xhigh. Sibling-injection: each doc's prompt names its
    cluster siblings, so docs overlap in vocabulary but each owns distinct answerable facts.
  • queries.yaml: 56 confusable queries (1/doc), DRAFT. Each is lexically tempting
    toward siblings yet uniquely answered by its gold doc.
  • scripts/generate_domain_bilingual_v2.py — the 8-cluster spec as code, with corpus
    (--provider codex) and query (--queries) generation.
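
A minimal sketch of the sibling-injection idea described above, for readers who want to see its shape. It is not the code in scripts/generate_domain_bilingual_v2.py: `DocSpec`, `build_prompt`, the prompt wording, and the sample topics are all hypothetical and only illustrate "each doc's prompt names its cluster siblings".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocSpec:
    """Hypothetical per-document spec: an id, a title, and the fact only this doc states."""
    doc_id: str
    title: str
    distinct_fact: str


def build_prompt(target: DocSpec, cluster: list[DocSpec]) -> str:
    """Build a generation prompt that names every sibling of `target` in its cluster."""
    siblings = [d for d in cluster if d.doc_id != target.doc_id]
    sibling_lines = "\n".join(f"- {d.title}" for d in siblings)
    return (
        f"Write a document titled '{target.title}'.\n"
        f"It must state this fact, which no sibling states: {target.distinct_fact}\n"
        "Sibling documents in the same cluster "
        "(share their terminology, not their facts):\n"
        f"{sibling_lines}"
    )


# Hypothetical three-doc cluster; the real clusters hold 7 docs each.
cluster = [
    DocSpec("c1-d1", "Token bucket rate limiting", "the bucket refills once per second"),
    DocSpec("c1-d2", "Leaky bucket rate limiting", "the queue drains at a fixed rate"),
    DocSpec("c1-d3", "Sliding window rate limiting", "the window spans sixty seconds"),
]

prompt = build_prompt(cluster[0], cluster)
print(prompt)
```

Naming the siblings in the prompt is what pushes the generated docs toward shared vocabulary, while the per-doc fact keeps each one uniquely answerable.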

Status / what's left

  • lint-type-test green; validate_dataset passes.
  • baseline-must-update red by design — v2 thresholds are placeholder
    (empty); the BASELINES entry + gate land only after the single calibration run,
    which is deferred to post-dikw-core#249/#250 (parallel-plan Workstream D).
    This PR is not mergeable until then.
  • 👤 Human gold-verification needed on queries.yaml: confirm each gold is the
    uniquely correct answer, drop or fix ambiguous queries, and tighten queries that
    over-name the gold (some embed the answer's distinctive term verbatim, which can
    make retrieval too easy; they should describe the answer without naming it).
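
One way the over-naming check could be pre-screened before human review, as a rough sketch. The record shape below (`id`, `text`, `gold_term`) is an assumption and not the actual queries.yaml schema; a reviewer would still need to supply each gold doc's distinctive term.

```python
def over_named(queries: list[dict[str, str]]) -> list[str]:
    """Return ids of queries whose text contains the gold doc's distinctive term verbatim."""
    flagged = []
    for q in queries:
        # casefold() gives a case-insensitive substring match; zh text is unaffected by it.
        if q["gold_term"].casefold() in q["text"].casefold():
            flagged.append(q["id"])
    return flagged


# Hypothetical records, not taken from the dataset.
sample = [
    {"id": "q1", "text": "How often does the token bucket refill?", "gold_term": "token bucket"},
    {"id": "q2", "text": "Which limiter lets short bursts through?", "gold_term": "token bucket"},
]

print(over_named(sample))  # ['q1']
```

A verbatim match only catches the crudest cases; paraphrased or translated mentions of the term would still need the human pass.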

🤖 Generated with Claude Code

helebest and others added 2 commits June 28, 2026 21:21
…eval gate)

domain-bilingual-v1 saturates the vector/hybrid views at 1.0 (24 mutually-distinct
topics), so its gate is a high regression floor, not a discriminative benchmark.
v2 fixes this at the corpus level: 8 intra-cluster-confusable topic clusters
(~56 docs, 28 zh / 28 en) where a query's gold doc must out-rank ~6 near-neighbour
siblings, giving ndcg_at_10 / mrr real ranking signal.

Design only (no spend yet) — corpus + confusable queries generated via codex
gpt-5.5 xhigh after sign-off; threshold calibration deferred to the single
post-#249/#250 run (parallel-plan Workstream D). dikw-core stays read-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s (codex)

The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid
at 1.0). 8 intra-cluster-confusable clusters x 7 = 56 docs (28 zh / 28 en) plus 56
confusable queries (1/doc), all generated by codex gpt-5.5 xhigh.

The mechanism is sibling-injection: each doc's prompt names its cluster siblings so
documents overlap in vocabulary/framing but own distinct answerable facts; each
query is phrased to be lexically tempting toward the siblings yet uniquely answered
by its gold doc. That gives ndcg_at_10 / mrr real ranking signal.
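
To make the "ranking signal" concrete, here is a small sketch of the two metrics for a query with exactly one gold doc, assuming the standard binary-relevance definitions. The evaluation harness's own implementation may differ in details; the function names and doc ids are illustrative only.

```python
import math


def reciprocal_rank(ranked: list[str], gold: str) -> float:
    """1/rank of the gold doc (1-based), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id == gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gold: str, k: int = 10) -> float:
    """nDCG@k with a single relevant doc.

    The ideal DCG is 1/log2(2) = 1, so nDCG reduces to the gold doc's discounted gain.
    """
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id == gold:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Hypothetical ranking where one sibling out-ranks the gold doc.
ranked = ["sibling-a", "gold", "sibling-b"]
print(reciprocal_rank(ranked, "gold"))  # 0.5
print(round(ndcg_at_k(ranked, "gold"), 3))  # 0.631
```

Under these definitions hit_at_k cannot tell rank 1 from rank 2, while mrr and ndcg_at_10 both drop as soon as a sibling out-ranks the gold, which is why a confusable corpus gives them something to measure.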

- scripts/generate_domain_bilingual_v2.py: the 8-cluster spec as code + corpus
  (--provider codex) and query (--queries) generation, reusing the factory
  (RetryingMiniMaxClient, retries, audit, --resume).
- datasets/domain-bilingual-v2/{corpus (56 .md), queries.yaml, dataset.yaml}.

queries.yaml is a DRAFT: gold targets need human verification (uniqueness,
sibling-wrongness, and tightening queries that over-name the gold). Thresholds are
placeholder (empty) — calibrated at observed-margin from the single post-#249/#250
run (Workstream D). validate_dataset passes; ruff + mypy src + pytest green (87).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@helebest helebest force-pushed the eval/domain-bilingual-v2 branch from f75c829 to dd00c23 Compare June 28, 2026 14:00
@helebest helebest changed the title from "[design] domain-bilingual-v2 — discriminative confusable retrieval gate" to "domain-bilingual-v2 — discriminative confusable retrieval gate (corpus + query drafts)" Jun 28, 2026
…ector run)

One real-vector run (dikw-core 0.6.4, Gitee Qwen3@1024, --retrieval all, cold embed):
hit_at_3 1.000 / hit_at_10 1.000 / mrr 0.991 / ndcg_at_10 0.993 / recall_at_100 1.000.

Thresholds set at observed-margin (-0.05 hit@k/mrr, -0.03 ndcg/recall):
hit_at_3 0.95 / hit_at_10 0.95 / mrr 0.94 / ndcg_at_10 0.96 / recall_at_100 0.97.
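
The arithmetic behind these thresholds can be reproduced with a short sketch. The observed values and margins come from the lines above; truncating to two decimals is an assumption (ordinary rounding gives the same numbers here), and the variable names are illustrative.

```python
from decimal import Decimal, ROUND_DOWN

# Observed metrics from the single real-vector run.
OBSERVED = {
    "hit_at_3": "1.000",
    "hit_at_10": "1.000",
    "mrr": "0.991",
    "ndcg_at_10": "0.993",
    "recall_at_100": "1.000",
}

# Margins: -0.05 for hit@k and mrr, -0.03 for ndcg and recall.
MARGIN = {
    "hit_at_3": "0.05",
    "hit_at_10": "0.05",
    "mrr": "0.05",
    "ndcg_at_10": "0.03",
    "recall_at_100": "0.03",
}

# Decimal avoids binary-float artefacts such as 1.0 - 0.05 not being exactly 0.95.
thresholds = {
    metric: (Decimal(OBSERVED[metric]) - Decimal(MARGIN[metric])).quantize(
        Decimal("0.01"), rounding=ROUND_DOWN  # assumed: truncate to two decimals
    )
    for metric in OBSERVED
}

for metric, value in thresholds.items():
    print(f"{metric} {value}")
```

This yields 0.95 / 0.95 / 0.94 / 0.96 / 0.97, matching the thresholds listed above.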

Honest finding: only PARTIALLY de-saturated. bm25 (ndcg 0.967) carries real
intra-cluster signal — the confusable corpus works — but the DRAFT queries over-name
their gold (embed the answer term verbatim), so vector/hybrid stay ~0.99 and the zh
slice is fully 1.0. A query gold-tightening pass is the discriminative follow-up;
recalibrate after it. reports/BASELINES.md records the full per-mode + zh/en split.

v2 calibration is independent of #249 (no negatives) and #250 (single fixed config).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@helebest helebest marked this pull request as ready for review June 28, 2026 14:59
@helebest helebest merged commit 544656e into main Jun 28, 2026
3 checks passed
@helebest helebest deleted the eval/domain-bilingual-v2 branch June 28, 2026 15:00