From 73dfffc41fced805959a5a7bd9690389ca5f384a Mon Sep 17 00:00:00 2001 From: holo Date: Sun, 28 Jun 2026 21:15:41 +0800 Subject: [PATCH 1/3] docs: design for domain-bilingual-v2 (discriminative confusable retrieval gate) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit domain-bilingual-v1 saturates the vector/hybrid views at 1.0 (24 mutually-distinct topics), so its gate is a high regression floor, not a discriminative benchmark. v2 fixes this at the corpus level: 8 intra-cluster-confusable topic clusters (~56 docs, 28 zh / 28 en) where a query's gold doc must out-rank ~6 near-neighbour siblings, giving ndcg_at_10 / mrr real ranking signal. Design only (no spend yet) — corpus + confusable queries generated via codex gpt-5.5 xhigh after sign-off; threshold calibration deferred to the single post-#249/#250 run (parallel-plan Workstream D). dikw-core stays read-only. Co-Authored-By: Claude Opus 4.8 --- docs/phase1-domain-bilingual-v2-design.md | 151 ++++++++++++++++++++++ 1 file changed, 151 insertions(+) create mode 100644 docs/phase1-domain-bilingual-v2-design.md diff --git a/docs/phase1-domain-bilingual-v2-design.md b/docs/phase1-domain-bilingual-v2-design.md new file mode 100644 index 0000000..816b92d --- /dev/null +++ b/docs/phase1-domain-bilingual-v2-design.md @@ -0,0 +1,151 @@ +# Phase 1 follow-up: `domain-bilingual-v2` — the discriminative retrieval gate (design) + +> **Status:** design draft for review (2026-06-28); implementation pending your sign-off. +> Implements the `domain-bilingual-v2` follow-up named in +> [`phase1-inhouse-datasets-design.md`](phase1-inhouse-datasets-design.md) §10 and in +> `datasets/domain-bilingual-v1/dataset.yaml`. Follows the threshold methodology in +> [`dikw-eval-plan.md`](dikw-eval-plan.md) §2.3 and the bilingual-split rule in §2.4. +> `dikw-core` stays **read-only**. 
Built in parallel with dikw-core#249/#250 — neither +> blocks this construction (only the final threshold calibration waits; see §6). + +## 1. Context — why v2 + +`domain-bilingual-v1` landed (#7) but its own `dataset.yaml` records the problem: its +24 docs span ~10 **mutually-distinct** topics, so the vector/hybrid views **saturate at +1.0** — every query trivially retrieves its lone on-topic doc. The committed thresholds +are therefore a *high regression floor*, not a *discriminative benchmark*: they catch a +pipeline breakage but cannot detect a ranking-quality regression, which is exactly what +"the gate that matters" must measure. + +`v2` fixes this at the **corpus** level (not by tuning anything): a denser corpus of +**deliberately intra-cluster-confusable** documents, so a query's gold doc must +*out-rank its near-neighbours*. That gives `ndcg_at_10` / `mrr` real signal while +`recall_at_100` still anchors pipeline health. The confusability is the whole point — +we engineer the corpus to be hard, then (§6) record observed and gate at +`observed − margin` per §2.3; we do **not** engineer queries to a target. + +## 2. Decisions (proposed) + +1. **New, denser corpus — not a reuse of v1.** v1's 16-line stubs across distinct + topics cannot be made confusable by query wording alone. v2 is freshly authored as + **8 topic clusters × ~7 docs = ~56 docs**, each doc ~250–500 words so chunking has + substance and siblings genuinely overlap in vocabulary. +2. **Single dataset + recorded zh/en split** (same as v1, §2.4): `dataset.yaml + thresholds:` is a flat map; zh/en metrics are recomputed offline from the run's + NDJSON via the existing `tools/split_metrics_by_lang.py` and recorded in + `reports/BASELINES.md`. **4 clusters are zh, 4 are en** (28/28), so the zh slice + exercises the jieba/CJK path and the en slice does not — query language matches its + gold doc's language. +3. 
**Confusability lives *within* a cluster, distinctness *across* clusters.** Sibling
+   docs share vocabulary (a query must discriminate among them → ranking signal); the 8
+   clusters are mutually distinct (→ `recall_at_100` stays healthy, the set isn't
+   pathologically hard). This is the lever that de-saturates `ndcg`/`mrr` without making
+   the floor meaningless.
+4. **Generate with codex gpt-5.5 xhigh** (your choice): the corpus via
+   `scripts/generate_bilingual_corpus.py --provider codex`, the confusable queries via
+   `scripts/generate_candidates.py --provider codex` with `--instruction` steering. xhigh
+   reasoning matters most for writing queries whose *wrong* answer is a plausible sibling.
+5. **Calibration is deferred** to the single post-#249/#250 pass (Workstream D of the
+   parallel plan). v2 ships with placeholder thresholds and is gated only after a real-vector run.
+
+## 3. Corpus design — 8 confusable clusters (~56 docs, 28 zh / 28 en)
+
+Each cluster's docs are mutually confusable (shared terms, adjacent concepts), so a
+well-crafted query about one doc has 6 plausible-but-wrong siblings. Each stem is the
+cluster's stem prefix joined to a per-doc slug by `-`; frontmatter carries `title`, `language`, `source:
+openai-codex-synthetic`.
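To make the ranking-signal argument concrete, here is a minimal Python sketch, assuming binary relevance and exactly one gold doc per query. It uses the textbook definitions of reciprocal rank and nDCG@k, which may differ in detail from how the eval harness computes `mrr` and `ndcg_at_10`. The function names, stems, and rankings are hypothetical, chosen only for illustration.

```python
import math


def reciprocal_rank(ranked_stems, gold_stem):
    """Return 1/rank of the gold doc (1-indexed), or 0.0 if it is absent."""
    for rank, stem in enumerate(ranked_stems, start=1):
        if stem == gold_stem:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_stems, gold_stem, k=10):
    """nDCG@k for binary relevance with a single gold doc.

    The ideal ranking puts the gold doc first (ideal DCG = 1/log2(2) = 1),
    so nDCG reduces to 1/log2(rank + 1) when the gold doc is in the top k.
    """
    for rank, stem in enumerate(ranked_stems[:k], start=1):
        if stem == gold_stem:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Illustrative rankings for one query (assumed stems, not real run output).
gold = "tang-liangshui"
saturated = ["tang-liangshui", "tang-zuyongdiao", "tang-juntian"]  # gold first
confused = ["tang-zuyongdiao", "tang-juntian", "tang-liangshui"]   # siblings win

for name, ranking in [("saturated", saturated), ("confused", confused)]:
    rr = round(reciprocal_rank(ranking, gold), 3)
    ndcg = round(ndcg_at_k(ranking, gold), 3)
    print(name, rr, ndcg)
```

With the gold doc at rank 1, both metrics are 1.0, which is the v1 saturation case. Pushing it to rank 3 behind two siblings drops reciprocal rank to about 0.333 and nDCG@10 to 0.5, while a recall-style metric is unaffected because the doc is still retrieved.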
+
+### zh clusters (jieba/CJK path) — 28 docs
+
+| cluster (stem prefix) | ~7 docs | intra-cluster confusion axis |
+|---|---|---|
+| `tang` 唐朝制度与历史 (Tang institutions and history) | 建立 / 均田制 / 租庸调制 / 科举制 / 唐与西域 / 安史之乱 / 两税法 | institutions vs events vs reforms; same-dynasty terminology overlaps heavily |
+| `china-money` 中国货币金融史 (history of Chinese money and finance) | 交子纸币 / 北宋通货膨胀 / 白银货币化 / 票号钱庄 / 盐引制度 / 青苗法 / 一条鞭法 | monetary instruments vs fiscal reforms; spans dynasties but shares one theme |
+| `tcm` 中医基础理论 (foundations of traditional Chinese medicine) | 阴阳 / 五行 / 气血津液 / 经络 / 脏腑 / 四诊 / 辨证论治 | theoretical concepts cite one another, with blurry boundaries |
+| `china-lit` 中国古典文学 (classical Chinese literature) | 红楼梦 / 三国演义 / 水浒传 / 西游记 / 唐诗 / 宋词 / 元曲 | the four great novels are confusable with each other; the three verse genres (shi, ci, qu) are confusable |
+
+### en clusters (non-CJK path) — 28 docs
+
+| cluster (stem prefix) | ~7 docs | intra-cluster confusion axis |
+|---|---|---|
+| `crypto` Cryptography | symmetric-key / public-key / hash-functions / digital-signatures / diffie-hellman / tls-handshake / block-vs-stream | key-exchange vs encryption vs integrity, shared jargon |
+| `cell-energy` Cellular energy & photosynthesis | light-reactions / calvin-cycle / glycolysis / cellular-respiration / electron-transport-chain / chemiosmosis / photorespiration | overlapping biochemical pathways & molecules |
+| `french-rev` French Revolution & Napoleonic era | causes / estates-general-1789 / reign-of-terror / thermidor / napoleon-rise / napoleonic-wars / congress-of-vienna | sequential phases of one era, shared actors |
+| `macro-money` Money & inflation (macro) | money-supply / inflation-causes / central-bank-policy / interest-rates-yield-curve / quantitative-easing / phillips-curve / hyperinflation | interlocking macro concepts, shared vocabulary |
+
+(Exact per-cluster doc count may flex 6–8 to keep each cluster internally coherent; total
+stays ~56, zh/en stays balanced. Final stems are fixed at materialization.)
+
+## 4. Query design — ~70 queries, deliberately confusable
+
+- **Coverage:** every doc gets ≥ 1 positive; ~35 zh / ~35 en; ids prefixed `zh-` / `en-`
+  (the `split_metrics_by_lang` contract), language matching the gold doc.
+- **Confusable subset (≥ 40%):** queries phrased so a *sibling* doc is the plausible
+  wrong top hit — e.g.
`zh-tang-two-tax-vs-zuyongdiao`: "唐朝后期取代租庸调、按土地和资产 + 征税的赋税制度是什么?" must rank `tang-两税法` above `tang-租庸调制`. These carry the + `ndcg`/`mrr` signal. +- **`expect_any: [stem]`** single-gold positives (exactly-one-doc answers), per the + contract. No `expect_none` here — negatives remain `negatives-ood-v1`'s job. +- Generated via `generate_candidates.py --provider codex --instruction ""`, then **human-verified** (you, via the `queries.yaml` PR diff): confirm the gold stem is the *uniquely correct* answer and siblings are genuinely wrong; drop ambiguous; dedup; rebalance zh/en. + +## 5. Generation + verification workflow (no `dikw-core` calls) + +1. **Scaffold** `datasets/domain-bilingual-v2/` (corpus dir + empty `queries.yaml` + + `dataset.yaml` with placeholder thresholds). +2. **Generate corpus (codex):** per-cluster prompts to + `generate_bilingual_corpus.py --provider codex`, one doc per stem, stamping + `language:` and `source: openai-codex-synthetic`. Cluster prompts instruct codex to + make siblings *overlap in vocabulary but differ in the answerable fact*. +3. **Generate queries (codex):** per-cluster `generate_candidates.py --provider codex + --instruction …`, emphasizing the intra-cluster-confusable subset. +4. **Human-verify (maintainer):** curate `queries.yaml` — gold uniqueness, sibling + wrongness, language tag, balance, dedup. Optional `scripts/llm_review.py` second pass. +5. **Validate ($0):** `scripts/validate_dataset.py datasets/domain-bilingual-v2` must pass. +6. **CI:** `datasets/**` change → the eval-gate workflow requires a new `BASELINES.md` + entry (lands in §6's calibration commit). + +## 6. Threshold calibration — deferred to the post-#249/#250 pass + +Per the parallel plan's Workstream D, v2 joins the **single** real-vector calibration run +once dikw-core ships #249/#250: + +- One run `scripts/run_eval.py --datasets …/domain-bilingual-v2 --retrieval all` + (`--cache read_write`; cold-embed once). Confirm `summary.json worst_exit_code == 0`. 
+- `tools/split_metrics_by_lang.py` → zh/en split into `reports/BASELINES.md`. +- Gate `domain-bilingual-v2` at `observed − margin` (−0.05 `hit@k`/`mrr`, −0.03 + `ndcg`/`recall`), written into `dataset.yaml`. +- **De-saturation acceptance:** hybrid `ndcg_at_10` should land meaningfully **< 1.0** + (the confusable design worked). If it still saturates, log it and densify further + (more siblings per cluster) before gating. + +Until then `dataset.yaml` carries placeholder thresholds and the dataset is observe-only. + +## 7. Deliverables + +- `datasets/domain-bilingual-v2/**` (corpus ~56 `.md` + `queries.yaml` + `dataset.yaml`). +- `reports/BASELINES.md` entry (added in the calibration commit, Workstream D). +- This design doc; a phase-note touch in `dikw-eval-plan.md` §2.2 (v2 row). +- Branch `eval/domain-bilingual-v2`; PR base `main`. + +## 8. Risks + +1. **Codex throughput / cost:** xhigh at concurrency 1 with 600s timeouts → ~56 docs is + slow and real spend. Mitigation: generate per-cluster in batches with `--resume`; if + time/cost bites, fall back to M3 for the *plainer* docs and keep codex for the + confusable queries (your call at generation time). +2. **Over-confusability:** if siblings overlap so much that *no* query has a unique gold, + `expect_any` becomes ill-defined. Mitigation: each doc must own ≥ 1 distinct + answerable fact; the human-verify step rejects non-unique golds. +3. **Residual saturation:** distinct-cluster structure may still retrieve easily. + Mitigation: the §6 de-saturation check gates whether v2 actually earned its purpose; + densify if not. +4. **codex OAuth refresh_token rotation** (heavy generation): may invalidate the codex + CLI login. Decouple first with `python -m dikw_data.codex_auth login`. + +## 9. Acceptance criteria + +- `validate_dataset.py` passes; `ruff` / `mypy src` / `pytest` green. +- ~56 docs, 28 zh / 28 en; every doc has ≥ 1 positive; ~35/~35 zh/en queries; ids + prefixed `zh-`/`en-`; ≥ 40% intra-cluster-confusable. 
+- (Workstream D) calibration run `worst_exit_code == 0`; `BASELINES.md` entry with + blended + zh/en split; hybrid `ndcg_at_10` de-saturated (< 1.0) or the limitation logged. From dd00c23562397c7f5f54f79c2c330493a464253d Mon Sep 17 00:00:00 2001 From: holo Date: Sun, 28 Jun 2026 22:00:29 +0800 Subject: [PATCH 2/3] =?UTF-8?q?feat(datasets):=20domain-bilingual-v2=20?= =?UTF-8?q?=E2=80=94=20confusable=20corpus=20+=20query=20drafts=20(codex)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The discriminative follow-up to domain-bilingual-v1 (which saturates vector/hybrid at 1.0). 8 intra-cluster-confusable clusters x 7 = 56 docs (28 zh / 28 en) plus 56 confusable queries (1/doc), all generated by codex gpt-5.5 xhigh. The mechanism is sibling-injection: each doc's prompt names its cluster siblings so documents overlap in vocabulary/framing but own distinct answerable facts; each query is phrased to be lexically tempting toward the siblings yet uniquely answered by its gold doc. That gives ndcg_at_10 / mrr real ranking signal. - scripts/generate_domain_bilingual_v2.py: the 8-cluster spec as code + corpus (--provider codex) and query (--queries) generation, reusing the factory (RetryingMiniMaxClient, retries, audit, --resume). - datasets/domain-bilingual-v2/{corpus (56 .md), queries.yaml, dataset.yaml}. queries.yaml is a DRAFT: gold targets need human verification (uniqueness, sibling-wrongness, and tightening queries that over-name the gold). Thresholds are placeholder (empty) — calibrated at observed-margin from the single post-#249/#250 run (Workstream D). validate_dataset passes; ruff + mypy src + pytest green (87). 
Co-Authored-By: Claude Opus 4.8 --- .../corpus/cell-energy-calvin-cycle.md | 31 ++ .../corpus/cell-energy-chemiosmosis.md | 29 ++ .../corpus/cell-energy-electron-transport.md | 29 ++ .../corpus/cell-energy-glycolysis.md | 36 ++ .../corpus/cell-energy-light-reactions.md | 39 ++ .../corpus/cell-energy-photorespiration.md | 25 ++ .../corpus/cell-energy-respiration.md | 31 ++ .../corpus/china-lit-hongloumeng.md | 19 + .../corpus/china-lit-sanguo.md | 19 + .../corpus/china-lit-shuihu.md | 16 + .../corpus/china-lit-songci.md | 19 + .../corpus/china-lit-tangshi.md | 15 + .../corpus/china-lit-xiyouji.md | 19 + .../corpus/china-lit-yuanqu.md | 19 + .../corpus/china-money-jiaozi.md | 19 + .../corpus/china-money-piaohao.md | 19 + .../corpus/china-money-qingmiao.md | 19 + .../corpus/china-money-silver.md | 19 + .../corpus/china-money-song-inflation.md | 19 + .../corpus/china-money-yanyin.md | 19 + .../corpus/china-money-yitiaobian.md | 19 + .../corpus/crypto-block-stream.md | 34 ++ .../corpus/crypto-diffie-hellman.md | 34 ++ .../domain-bilingual-v2/corpus/crypto-hash.md | 27 ++ .../corpus/crypto-public-key.md | 27 ++ .../corpus/crypto-signatures.md | 28 ++ .../corpus/crypto-symmetric.md | 25 ++ .../domain-bilingual-v2/corpus/crypto-tls.md | 32 ++ .../corpus/french-rev-causes.md | 27 ++ .../corpus/french-rev-estates-general.md | 29 ++ .../corpus/french-rev-napoleon-rise.md | 27 ++ .../corpus/french-rev-napoleonic-wars.md | 27 ++ .../corpus/french-rev-terror.md | 25 ++ .../corpus/french-rev-thermidor.md | 29 ++ .../corpus/french-rev-vienna.md | 27 ++ .../corpus/macro-cb-policy.md | 33 ++ .../corpus/macro-hyperinflation.md | 27 ++ .../corpus/macro-inflation-causes.md | 35 ++ .../corpus/macro-interest-rates.md | 32 ++ .../corpus/macro-money-supply.md | 31 ++ .../corpus/macro-phillips.md | 31 ++ .../corpus/macro-quantitative-easing.md | 29 ++ .../domain-bilingual-v2/corpus/tang-anshi.md | 19 + .../corpus/tang-founding.md | 19 + .../corpus/tang-juntian.md | 19 + 
.../domain-bilingual-v2/corpus/tang-keju.md | 19 + .../corpus/tang-liangshui.md | 21 ++ .../domain-bilingual-v2/corpus/tang-xiyu.md | 19 + .../corpus/tang-zuyongdiao.md | 19 + .../corpus/tcm-bianzheng.md | 17 + .../domain-bilingual-v2/corpus/tcm-jingluo.md | 19 + .../domain-bilingual-v2/corpus/tcm-qixue.md | 19 + .../domain-bilingual-v2/corpus/tcm-sizhen.md | 19 + .../domain-bilingual-v2/corpus/tcm-wuxing.md | 19 + .../domain-bilingual-v2/corpus/tcm-yinyang.md | 19 + .../domain-bilingual-v2/corpus/tcm-zangfu.md | 23 ++ datasets/domain-bilingual-v2/dataset.yaml | 12 + datasets/domain-bilingual-v2/queries.yaml | 172 +++++++++ scripts/generate_domain_bilingual_v2.py | 342 ++++++++++++++++++ 59 files changed, 1891 insertions(+) create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-calvin-cycle.md create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-chemiosmosis.md create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-electron-transport.md create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-glycolysis.md create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-light-reactions.md create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-photorespiration.md create mode 100644 datasets/domain-bilingual-v2/corpus/cell-energy-respiration.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-hongloumeng.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-sanguo.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-shuihu.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-songci.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-tangshi.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-xiyouji.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-lit-yuanqu.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-money-jiaozi.md create mode 100644 
datasets/domain-bilingual-v2/corpus/china-money-piaohao.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-money-qingmiao.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-money-silver.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-money-song-inflation.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-money-yanyin.md create mode 100644 datasets/domain-bilingual-v2/corpus/china-money-yitiaobian.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-block-stream.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-diffie-hellman.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-hash.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-public-key.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-signatures.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-symmetric.md create mode 100644 datasets/domain-bilingual-v2/corpus/crypto-tls.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-causes.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-estates-general.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-napoleon-rise.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-napoleonic-wars.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-terror.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-thermidor.md create mode 100644 datasets/domain-bilingual-v2/corpus/french-rev-vienna.md create mode 100644 datasets/domain-bilingual-v2/corpus/macro-cb-policy.md create mode 100644 datasets/domain-bilingual-v2/corpus/macro-hyperinflation.md create mode 100644 datasets/domain-bilingual-v2/corpus/macro-inflation-causes.md create mode 100644 datasets/domain-bilingual-v2/corpus/macro-interest-rates.md create mode 100644 datasets/domain-bilingual-v2/corpus/macro-money-supply.md create mode 100644 
datasets/domain-bilingual-v2/corpus/macro-phillips.md create mode 100644 datasets/domain-bilingual-v2/corpus/macro-quantitative-easing.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-anshi.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-founding.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-juntian.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-keju.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-liangshui.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-xiyu.md create mode 100644 datasets/domain-bilingual-v2/corpus/tang-zuyongdiao.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-bianzheng.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-jingluo.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-qixue.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-sizhen.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-wuxing.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-yinyang.md create mode 100644 datasets/domain-bilingual-v2/corpus/tcm-zangfu.md create mode 100644 datasets/domain-bilingual-v2/dataset.yaml create mode 100644 datasets/domain-bilingual-v2/queries.yaml create mode 100644 scripts/generate_domain_bilingual_v2.py diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-calvin-cycle.md b/datasets/domain-bilingual-v2/corpus/cell-energy-calvin-cycle.md new file mode 100644 index 0000000..55f907a --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-calvin-cycle.md @@ -0,0 +1,31 @@ +--- +title: The Calvin cycle +language: en +source: openai-codex-synthetic +--- + +# The Calvin cycle + +## Role in cellular energy + +The Calvin cycle is the light-independent carbon-fixation pathway of photosynthesis. It runs in the **stroma of chloroplasts**, where enzymes use the chemical energy of **ATP** and the reducing power of **NADPH** to convert inorganic carbon dioxide into carbohydrate. 
It is “light-independent” because its reactions do not require photons directly, although they depend on ATP and NADPH supplied by the photosynthetic energy system. + +The pathway is historically associated with **Melvin Calvin**, **Andrew Benson**, and **James Bassham**, who traced carbon atoms using radioactive carbon-14. In many plants it is called the **C3 pathway** because the first stable product has three carbon atoms. + +## Carbon fixation by RuBisCO + +The defining enzyme of the Calvin cycle is **RuBisCO**: ribulose-1,5-bisphosphate carboxylase/oxygenase. In its carbon-fixing role, RuBisCO attaches **CO₂** to the five-carbon acceptor **ribulose-1,5-bisphosphate** (**RuBP**). The unstable six-carbon intermediate immediately yields two molecules of **3-phosphoglycerate** (**3-PGA**). + +This carboxylation step is the entry point for atmospheric carbon into organic metabolism. Unlike pathways that break down sugar for energy, the Calvin cycle uses energy input to build reduced carbon compounds that can later support sucrose, starch, cellulose, and other plant molecules. + +## Reduction and sugar output + +After fixation, 3-PGA is phosphorylated by **ATP** and reduced by **NADPH** to form **glyceraldehyde-3-phosphate** (**G3P**), also called triose phosphate. G3P is the main carbohydrate product exported from the cycle. It is not yet glucose, but two G3P molecules can be used by plant metabolism to assemble a six-carbon sugar. + +For every **3 CO₂** fixed, the cycle produces one net G3P molecule while consuming **9 ATP** and **6 NADPH**. The remaining carbon skeletons stay in the cycle so that the CO₂ acceptor can be rebuilt. + +## Regeneration of RuBP + +Most G3P molecules are rearranged through a series of enzyme-catalysed steps to regenerate **RuBP**, allowing RuBisCO to continue carbon fixation. This regeneration phase also requires ATP. 
+ +Overall, the Calvin cycle is the central anabolic route by which photosynthetic cells turn CO₂ into usable carbohydrate, coupling carbon fixation to ATP consumption and NADPH oxidation. diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-chemiosmosis.md b/datasets/domain-bilingual-v2/corpus/cell-energy-chemiosmosis.md new file mode 100644 index 0000000..4a50205 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-chemiosmosis.md @@ -0,0 +1,29 @@ +--- +title: Chemiosmosis and ATP synthase +language: en +source: openai-codex-synthetic +--- + +# Chemiosmosis and ATP Synthase + +## The proton-motive force + +Chemiosmosis is the coupling of an electrochemical proton gradient to ATP production. The stored energy is called the **proton-motive force** (PMF), often written as Δp. It has two parts: a membrane voltage, **Δψ**, and a proton concentration difference, **ΔpH**. Together they make one side of an energy-transducing membrane more positive and more acidic than the other. + +In mitochondria, this membrane is the **inner mitochondrial membrane**; in chloroplasts, it is the **thylakoid membrane**. In both settings, ATP synthase does not create the gradient by itself. Instead, it uses the existing proton-motive force as its immediate energy source. Protons move down their electrochemical gradient through ATP synthase, and that downhill movement is what powers the uphill phosphorylation of **ADP + inorganic phosphate (Pi)** to form **ATP**. + +Peter Mitchell proposed this chemiosmotic mechanism in the 1960s, explaining how a membrane gradient could link energy capture to ATP formation. + +## ATP synthase as a rotary enzyme + +ATP synthase is a membrane protein complex commonly called **F₀F₁-ATP synthase**. The **F₀** portion sits in the membrane and forms the proton channel. The **F₁** portion projects from the membrane and contains catalytic sites where ATP is made. 
+ +As protons pass through F₀, they bind and release from subunits in the rotating **c-ring**. This rotation turns the central **γ (gamma) stalk**, which extends into the F₁ head. The γ stalk forces the three catalytic **β subunits** of F₁ through different conformations: loose, tight, and open. This binding-change mechanism, associated with Paul D. Boyer, allows ADP and Pi to bind, ATP to form tightly, and ATP to be released. + +John E. Walker’s structural work helped show how the enzyme’s architecture supports this rotary catalysis. + +## The direct link to ATP formation + +The key point is that ATP synthase is driven by **proton flow down the proton-motive force**, not by direct electron transfer, light absorption, carbon fixation, or glucose splitting. Those processes may supply energy to cells in other steps, but the immediate driver of ATP synthase is the PMF across a membrane. + +When the gradient is strong, proton flow turns the enzyme and favors ATP production. When the gradient collapses, ATP synthase can no longer efficiently phosphorylate ADP. Thus, chemiosmosis explains how a membrane-stored electrochemical gradient is converted into the chemical bond energy of ATP. diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-electron-transport.md b/datasets/domain-bilingual-v2/corpus/cell-energy-electron-transport.md new file mode 100644 index 0000000..966cbbc --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-electron-transport.md @@ -0,0 +1,29 @@ +--- +title: The electron transport chain +language: en +source: openai-codex-synthetic +--- + +# The electron transport chain + +## Location and purpose + +The electron transport chain, or ETC, is a set of protein complexes and mobile electron carriers embedded in the inner mitochondrial membrane. Its defining role is to pass high-energy electrons from reduced carriers such as NADH and FADH₂ to a final electron acceptor, while using the released energy to pump protons across the membrane. 
+ +In mitochondria, protons are moved from the matrix into the intermembrane space. This unequal distribution creates an electrochemical proton gradient: the intermembrane space becomes more acidic and positively charged relative to the matrix. The ETC therefore does not directly “make sugar” or split glucose; its central job is membrane-based electron transfer coupled to proton pumping. + +## Carriers and complexes + +Electrons enter the chain mainly through Complex I, also called NADH dehydrogenase, when NADH is oxidised. Complex I transfers electrons to ubiquinone, also known as coenzyme Q, and pumps protons across the inner membrane. + +FADH₂-associated electrons enter through Complex II, succinate dehydrogenase. Complex II passes electrons to ubiquinone but does not pump protons. This difference matters because electrons entering through Complex I contribute more strongly to the proton gradient. + +Reduced ubiquinone carries electrons within the membrane to Complex III, the cytochrome bc₁ complex. Complex III transfers electrons to cytochrome c, a small mobile protein on the outer side of the inner membrane, and contributes further proton movement into the intermembrane space. + +Cytochrome c delivers electrons to Complex IV, cytochrome c oxidase. Complex IV transfers electrons to molecular oxygen, producing water, and pumps additional protons across the membrane. + +## Building the gradient + +The ETC’s proton pumping is directional and organised. For a pair of electrons from NADH, Complex I, Complex III, and Complex IV together move multiple protons from the matrix to the intermembrane space. A common accounting is about 10 protons translocated per NADH: 4 by Complex I, 4 by Complex III, and 2 by Complex IV. Electrons entering through FADH₂ bypass Complex I, so fewer protons are pumped. + +The resulting proton gradient stores energy across the inner mitochondrial membrane. 
This stored energy is the immediate product of the electron carriers’ activity and links oxidation of electron carriers to later ATP production. diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-glycolysis.md b/datasets/domain-bilingual-v2/corpus/cell-energy-glycolysis.md new file mode 100644 index 0000000..49be97e --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-glycolysis.md @@ -0,0 +1,36 @@ +--- +title: Glycolysis +language: en +source: openai-codex-synthetic +--- + +# Glycolysis + +## Core Role in Cellular Energy + +Glycolysis is the cytosolic pathway that splits one six-carbon glucose molecule into two three-carbon pyruvate molecules. It is the first stage of glucose breakdown in many cells and provides a small, immediate net yield of ATP without requiring oxygen directly. Because it occurs in the cytosol, glycolysis does not depend on membrane-bound energy-converting structures. + +The pathway is also called the Embden–Meyerhof–Parnas pathway. Its central purpose is controlled chemical rearrangement: glucose is phosphorylated, cleaved, oxidized, and converted into pyruvate while conserving some released energy in ATP and reduced electron carriers. + +## ATP Investment and Payoff + +Glycolysis contains ten enzyme-catalyzed steps, commonly grouped into two phases. + +In the energy-investment phase, the cell spends 2 ATP to activate glucose. Key enzymes include **hexokinase** or **glucokinase**, which phosphorylates glucose, and **phosphofructokinase-1 (PFK-1)**, a major regulatory enzyme that commits the sugar to glycolytic breakdown. + +In the energy-payoff phase, the two three-carbon intermediates are processed to produce ATP and NADH. **Glyceraldehyde-3-phosphate dehydrogenase** reduces **NAD+** to **NADH**, capturing high-energy electrons. ATP is formed by substrate-level phosphorylation, especially through **phosphoglycerate kinase** and **pyruvate kinase**. 
+ +For each glucose molecule, the net glycolytic yield is: + +- **2 pyruvate** +- **2 ATP net** +- **2 NADH** +- **2 water molecules** + +Although 4 ATP are produced in the payoff phase, 2 ATP were used earlier, so the net ATP gain is only 2. + +## Pyruvate as the Product + +The defining endpoint of glycolysis is pyruvate, not carbon dioxide or a fully oxidized waste product. Pyruvate retains much of the original chemical energy of glucose and can be used in different downstream pathways depending on cell type and conditions. + +Glycolysis is therefore best understood as a rapid, cytosolic glucose-splitting process: it converts glucose into pyruvate while producing a modest net supply of ATP and NADH for cellular energy metabolism. diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-light-reactions.md b/datasets/domain-bilingual-v2/corpus/cell-energy-light-reactions.md new file mode 100644 index 0000000..f23a14d --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-light-reactions.md @@ -0,0 +1,39 @@ +--- +title: The light-dependent reactions +language: en +source: openai-codex-synthetic +--- + +# The light-dependent reactions + +## Purpose and location + +The light-dependent reactions are the chloroplast processes that convert absorbed light into the chemical energy carriers **ATP** and **NADPH**. They occur in the **thylakoid membrane**, a folded membrane system inside chloroplasts, with the **thylakoid lumen** on one side and the **stroma** on the other. + +These reactions use two linked photosystems: **Photosystem II (PSII)** and **Photosystem I (PSI)**. Their pigments, including chlorophyll a, chlorophyll b, and carotenoids, capture photons and pass excitation energy to special chlorophyll pairs called **P680** in PSII and **P700** in PSI. + +## Water splitting at Photosystem II + +The sequence begins at **Photosystem II**. When P680 absorbs light, it loses high-energy electrons to an electron acceptor. 
The missing electrons are replaced by electrons taken from water by the **oxygen-evolving complex**, a protein complex containing a manganese-calcium cluster. + +For every two water molecules split, the reaction produces: + +- 4 electrons for the photosynthetic electron transport chain +- 4 protons released into the thylakoid lumen +- 1 molecule of oxygen gas, **O₂** + +Thus, the oxygen released by photosynthesis comes directly from **water**, not from carbon dioxide. + +## Electron flow and proton movement + +Electrons from PSII move through a chain of thylakoid carriers, including **plastoquinone (PQ)**, the **cytochrome b6f complex**, and **plastocyanin (PC)**. As electrons pass through cytochrome b6f, additional protons are moved from the stroma into the thylakoid lumen. + +This electron flow builds a proton concentration difference across the thylakoid membrane. The lumen becomes more acidic than the stroma, storing energy as an electrochemical gradient. + +## ATP and NADPH formation + +The proton gradient is used by **chloroplast ATP synthase** to convert ADP and inorganic phosphate into **ATP** on the stromal side of the membrane. At the same time, electrons reaching **Photosystem I** are re-energized by light absorbed at P700. + +The excited electrons from PSI pass to **ferredoxin**, then to **ferredoxin–NADP⁺ reductase (FNR)**. FNR reduces **NADP⁺** to **NADPH** using electrons and protons from the stroma. + +Together, PSII, PSI, water splitting, electron transport, and ATP synthase form the light-dependent reactions: a light-powered system that produces **ATP, NADPH, and O₂** in the thylakoids. 
diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-photorespiration.md b/datasets/domain-bilingual-v2/corpus/cell-energy-photorespiration.md new file mode 100644 index 0000000..68f0251 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-photorespiration.md @@ -0,0 +1,25 @@ +--- +title: Photorespiration +language: en +source: openai-codex-synthetic +--- + +# Photorespiration + +## RuBisCO as Oxygenase + +Photorespiration is the wasteful side reaction that occurs when RuBisCO, the enzyme ribulose-1,5-bisphosphate carboxylase/oxygenase, reacts with oxygen gas instead of carbon dioxide. In the productive Calvin cycle, RuBisCO attaches CO2 to ribulose-1,5-bisphosphate (RuBP), creating molecules that enter carbon fixation and can ultimately support sugar production. In photorespiration, the same active site accepts O2, so RuBP is oxygenated rather than carboxylated. + +This oxygenase reaction produces one molecule of 3-phosphoglycerate, which can still rejoin Calvin-cycle metabolism, and one molecule of 2-phosphoglycolate, which cannot be used directly for sugar synthesis. The 2-phosphoglycolate must be recycled through a salvage pathway involving the chloroplast, peroxisome, and mitochondrion. During this recycling, plants lose previously fixed carbon as CO2 and consume cellular energy, including ATP and reducing power, without making additional carbohydrate. + +## Contrast with the Calvin Cycle + +The Calvin cycle is a carbon-gaining pathway: RuBisCO uses CO2, RuBP is regenerated, and ATP and NADPH help convert fixed carbon into energy-rich carbohydrate intermediates. Photorespiration is carbon-losing: RuBisCO uses O2, part of the substrate is diverted into phosphoglycolate repair, and CO2 is released back to the cell and atmosphere. + +Thus the crucial distinction is RuBisCO’s substrate choice. Carboxylase activity supports carbon fixation. Oxygenase activity competes with that fixation and reduces photosynthetic efficiency. 
This is why photorespiration is often described as “wasteful,” even though it is a real metabolic pathway that plants must run to recover carbon from toxic or unusable phosphoglycolate. + +## Conditions and Biological Importance + +Photorespiration is especially common in C3 plants such as wheat, rice, and soybean when leaves are hot, dry, or CO2-limited. Under these conditions, stomata may close to conserve water, internal CO2 drops, and O2 becomes more likely to occupy RuBisCO’s active site. The result is less net carbon gain for the same investment of ATP and NADPH. + +Some plants reduce this problem. C4 plants such as maize and sugarcane concentrate CO2 near RuBisCO, while CAM plants such as pineapple separate CO2 uptake from daytime carbon fixation. These adaptations do not change what photorespiration is: the oxygenase activity of RuBisCO that opposes the Calvin cycle’s productive carboxylation. diff --git a/datasets/domain-bilingual-v2/corpus/cell-energy-respiration.md b/datasets/domain-bilingual-v2/corpus/cell-energy-respiration.md new file mode 100644 index 0000000..f850eeb --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/cell-energy-respiration.md @@ -0,0 +1,31 @@ +--- +title: Cellular respiration overview +language: en +source: openai-codex-synthetic +--- + +# Cellular respiration overview + +Cellular respiration is the catabolic pathway that oxidises glucose completely to carbon dioxide and water while conserving part of the released energy as ATP. In aerobic eukaryotic cells, its main stages are glycolysis in the cytosol, pyruvate oxidation and the Krebs cycle in the mitochondrial matrix, and the electron transport chain on the inner mitochondrial membrane. + +The overall reaction is commonly summarized as: + +**C₆H₁₂O₆ + 6 O₂ → 6 CO₂ + 6 H₂O + ATP + heat** + +## Carbon flow: glucose to carbon dioxide + +Respiration begins when one molecule of glucose, a six-carbon sugar, is converted by glycolysis into two molecules of pyruvate. 
This stage produces a small amount of ATP directly and transfers high-energy electrons to **NAD⁺**, forming **NADH**. + +Each pyruvate then enters the mitochondrion and is converted by the **pyruvate dehydrogenase complex** into **acetyl-CoA**. This link reaction releases one CO₂ per pyruvate, so two CO₂ are produced per glucose before the Krebs cycle begins. + +In the **Krebs cycle**, also called the citric acid cycle or TCA cycle, each acetyl-CoA combines with **oxaloacetate** to form citrate. Through a series of enzyme-catalysed steps in the mitochondrial matrix, the two-carbon acetyl group is fully oxidised. Per glucose, the cycle releases four additional CO₂, produces two ATP or GTP by substrate-level phosphorylation, and loads electrons onto **NADH** and **FADH₂**. + +## Electron flow: carriers to oxygen + +By the end of glycolysis, pyruvate oxidation, and the Krebs cycle, the carbon atoms of glucose have been released as six CO₂ molecules. Most usable energy, however, is stored temporarily in reduced electron carriers: approximately ten NADH and two FADH₂ per glucose. + +These carriers deliver electrons to the **electron transport chain**. As electrons pass through membrane protein complexes, their energy is used to help establish a proton gradient across the inner mitochondrial membrane. **Oxygen** is the final electron acceptor; it combines with electrons and protons to form water. + +## ATP yield and purpose + +The proton-motive force generated by electron transport drives oxidative phosphorylation, producing most of the ATP made during aerobic respiration. In many eukaryotic cells, complete oxidation of one glucose yields about **30–32 ATP**, depending on shuttle systems and membrane efficiency. The essential outcome is the controlled oxidation of glucose to CO₂ and H₂O, coupled to ATP production. 
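The carrier and CO₂ counts above can be checked with a short tally. This is a minimal sketch, not a biochemical model: the per-stage numbers follow the standard textbook bookkeeping, and the ATP-per-carrier ratios (2.5 per NADH, 1.5 per FADH₂) are assumed approximations that do not appear in the text. The `cytosolic_nadh_as_fadh2` flag is a hypothetical switch standing in for the shuttle-system difference.

```python
# Per-glucose bookkeeping for aerobic respiration.
# Stage outputs follow the standard textbook tally; "atp" counts only
# ATP/GTP made by substrate-level phosphorylation.
STAGES = {
    "glycolysis":         {"co2": 0, "nadh": 2, "fadh2": 0, "atp": 2},
    "pyruvate_oxidation": {"co2": 2, "nadh": 2, "fadh2": 0, "atp": 0},
    "krebs_cycle":        {"co2": 4, "nadh": 6, "fadh2": 2, "atp": 2},
}

# ASSUMED approximate yields from oxidative phosphorylation.
ATP_PER_NADH = 2.5
ATP_PER_FADH2 = 1.5


def totals():
    """Sum each output across the three stages."""
    keys = ("co2", "nadh", "fadh2", "atp")
    return {k: sum(stage[k] for stage in STAGES.values()) for k in keys}


def estimated_atp(cytosolic_nadh_as_fadh2=False):
    """Estimate total ATP per glucose under the assumed ratios.

    If cytosolic_nadh_as_fadh2 is True, the two glycolytic NADH are
    counted at the FADH2 rate, modelling a less efficient shuttle.
    """
    t = totals()
    nadh, fadh2 = t["nadh"], t["fadh2"]
    if cytosolic_nadh_as_fadh2:
        moved = STAGES["glycolysis"]["nadh"]
        nadh -= moved
        fadh2 += moved
    return t["atp"] + nadh * ATP_PER_NADH + fadh2 * ATP_PER_FADH2
```

Under these assumptions the two shuttle cases give the upper and lower ends of the quoted 30–32 ATP range.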
diff --git a/datasets/domain-bilingual-v2/corpus/china-lit-hongloumeng.md b/datasets/domain-bilingual-v2/corpus/china-lit-hongloumeng.md new file mode 100644 index 0000000..2eb75dd --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-hongloumeng.md @@ -0,0 +1,19 @@ +--- +title: 红楼梦 +language: zh +source: openai-codex-synthetic +--- + +# 红楼梦 + +## 作品定位与成书 + +《红楼梦》是清代曹雪芹创作的章回体长篇小说,又名《石头记》。通行本一百二十回,前八十回通常认为出自曹雪芹之手,后四十回与程伟元、高鹗整理本关系密切。小说以贾、史、王、薛四大家族为背景,集中书写宁国府、荣国府由富贵鼎盛走向衰败的过程。 + +## 贾府兴衰与主要人物 + +贾府的繁华以贾母、贾政、王夫人、王熙凤等人物维系,又因奢靡、权势依附、内部腐败而逐渐崩塌。元春省亲和大观园的营建显示家族声势,抄检大观园、获罪抄家则揭示盛极而衰。王熙凤精明强干却贪权弄术,贾探春敏锐能干而难改大势,贾琏、贾珍等人物表现出贵族子弟的败坏。 + +## 宝黛爱情与核心主题 + +贾宝玉厌恶功名利禄,珍重少女才情;林黛玉敏感聪慧,以诗才和真情回应宝玉。二人的“木石前盟”与薛宝钗所代表的“金玉良缘”形成冲突,最终以黛玉病逝、宝玉出家象征理想爱情的破灭。小说通过金陵十二钗的命运,表现青春、爱情、女性悲剧与封建礼教的压迫,是中国古典小说中人物群像、心理描写和家族兴衰叙事的高峰。 diff --git a/datasets/domain-bilingual-v2/corpus/china-lit-sanguo.md b/datasets/domain-bilingual-v2/corpus/china-lit-sanguo.md new file mode 100644 index 0000000..9ab9194 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-sanguo.md @@ -0,0 +1,19 @@ +--- +title: 三国演义 +language: zh +source: openai-codex-synthetic +--- + +# 三国演义 + +## 群雄割据与结构 + +罗贯中的《三国演义》是明代章回体历史小说,以“天下大势,分久必合,合久必分”为总纲,铺写东汉末年黄巾起义后中央失序、州郡豪强并起的局面。董卓挟天子入洛阳,袁绍、袁术、吕布、刘表、刘璋、张鲁等各据一方;曹操“挟天子以令诸侯”,刘备辗转徐州、荆州、益州,孙权据江东,最终形成魏、蜀、吴三分天下。小说以政权兴衰、谋略较量和忠义名分组织情节。 + +## 主要人物形象 + +作品塑造了曹操、刘备、孙权三方核心人物。曹操兼具雄才与猜忌,是北方霸主;刘备以仁德、汉室宗亲自居,关羽重义、张飞勇猛;诸葛亮出山后辅佐蜀汉,以隆中对、草船借箭、空城计等表现智谋。江东方面有孙权、周瑜、鲁肃、陆逊,体现守业、联盟与水战才略。司马懿则在后期与诸葛亮相持,为魏晋更替埋下伏笔。 + +## 重大战役与主题 + +官渡之战中曹操以少胜多击败袁绍,奠定统一北方基础;赤壁之战中孙刘联盟借长江天险与火攻破曹操,确立三国鼎立格局;夷陵之战中陆逊火烧连营,使刘备伐吴失败,蜀汉由盛转弱。汉中争夺、关羽失荆州、诸葛亮北伐与街亭失守,继续展示战略、地理和人心的作用。全书主题集中于乱世英雄、忠义伦理、权谋成败与历史循环。 diff --git a/datasets/domain-bilingual-v2/corpus/china-lit-shuihu.md b/datasets/domain-bilingual-v2/corpus/china-lit-shuihu.md new file mode 100644 index 0000000..7756b78 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-shuihu.md @@ -0,0 +1,16 @@ +--- +title: 水浒传 +language: zh +source: openai-codex-synthetic +--- + +# 水浒传 + 
+## 作品概况与梁山聚义 +《水浒传》为施耐庵创作的中国古典章回体白话小说。作品以北宋末年社会秩序失衡为背景,围绕梁山泊聚义展开:晁盖等人智取生辰纲,宋江上梁山后确立“替天行道”的旗号,最终形成一百单八将的英雄集团。天罡、地煞的排名,把分散的江湖故事组织成完整的群像叙事。 + +## 梁山好汉群像 +梁山好汉出身各异,有禁军教头林冲、花和尚鲁智深、行者武松、黑旋风李逵、浪子燕青、智多星吴用等。作品重在描写人物性格与行动:林冲由忍到反,武松刚烈复仇,鲁智深扶危济困,宋江以义气和名望凝聚众人。这些人物不是单纯的传奇武勇,而是在压迫、冤狱、逼迫中被推向反抗。 + +## 官逼民反主题 +“官逼民反”是《水浒传》的核心主题。高俅、蔡京等权势人物以及地方贪官、恶霸共同构成黑暗政治环境,逼使良民、军官、商贩、僧人和江湖人走投无路。小说既歌颂梁山好汉的侠义、忠诚与反抗精神,也写出招安后的矛盾结局,显示民间力量与朝廷秩序之间难以调和的冲突。 diff --git a/datasets/domain-bilingual-v2/corpus/china-lit-songci.md b/datasets/domain-bilingual-v2/corpus/china-lit-songci.md new file mode 100644 index 0000000..a78a75d --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-songci.md @@ -0,0 +1,19 @@ +--- +title: 宋词 +language: zh +source: openai-codex-synthetic +--- + +# 宋词 + +## 体裁与词牌格律 + +宋词是中国古典文学中与诗、曲并列的重要韵文体裁,兴盛于两宋。它依“词牌”填作,词牌如《浣溪沙》《蝶恋花》《雨霖铃》《念奴娇》《水调歌头》等,规定字数、句式、平仄、押韵和分片方式。词又称“长短句”,按篇幅常分小令、中调、长调;上下片的章法、换韵与声律,是理解其艺术特点的关键。 + +## 婉约与豪放 + +宋词最常见的风格概括是婉约与豪放。婉约词重含蓄细腻,常写离愁、相思、春恨、身世之感,语言清丽,情绪回环。柳永《雨霖铃》、晏殊《浣溪沙》、李清照《声声慢》是典型代表。豪放词则气象开阔,常写怀古、报国、人生感慨和壮志难酬,意象宏大,节奏奔放。苏轼《念奴娇·赤壁怀古》、辛弃疾《永遇乐·京口北固亭怀古》最能体现这一面貌。 + +## 代表词人 + +北宋词人有晏殊、欧阳修、柳永、苏轼、秦观、周邦彦;南宋词人有李清照、辛弃疾、姜夔、吴文英等。柳永发展慢词,苏轼扩大题材,李清照以个人情感入词,辛弃疾融家国之思与英雄气概,周邦彦、姜夔则重音律精严。宋词的价值正在于词牌格律与个人情志相结合,形成多样而鲜明的文学风貌。 diff --git a/datasets/domain-bilingual-v2/corpus/china-lit-tangshi.md b/datasets/domain-bilingual-v2/corpus/china-lit-tangshi.md new file mode 100644 index 0000000..a74125c --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-tangshi.md @@ -0,0 +1,15 @@ +--- +title: 唐诗 +language: zh +source: openai-codex-synthetic +--- + +# 唐诗 + +## 体裁与格律 + +唐诗是中国古典诗歌在唐代达到高峰的总称,既承接汉魏六朝古体,又形成格律严整的近体诗。近体诗以律诗、绝句最具代表性:律诗通常八句,分首联、颔联、颈联、尾联,五言、七言为主,讲究平仄、押韵、粘对,颔联与颈联多须对仗;绝句四句,篇幅短而意境集中,也有五绝、七绝之分。它们以有限字数表现山水、边塞、送别、怀古、社会忧患等主题,成为后世评诗、选本和教学的核心体裁。 + +## 代表诗人与风格 + +李白被称为“诗仙”,代表作有《将进酒》《蜀道难》《静夜思》,风格豪放飘逸,善用夸张、想象和神话色彩,突出个体精神与盛唐气象。杜甫被称为“诗圣”,《春望》《登高》《三吏》《三别》体现沉郁顿挫的艺术风格,以严整律诗和深厚同情记录时代苦难。王维长于山水田园,诗中有画;孟浩然清淡自然;高适、岑参写边塞雄浑;白居易提倡通俗明白,以讽喻诗关注现实;李商隐、杜牧则代表晚唐精巧含蓄、咏史伤时的面貌。 diff --git 
a/datasets/domain-bilingual-v2/corpus/china-lit-xiyouji.md b/datasets/domain-bilingual-v2/corpus/china-lit-xiyouji.md new file mode 100644 index 0000000..4f78f90 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-xiyouji.md @@ -0,0 +1,19 @@ +--- +title: 西游记 +language: zh +source: openai-codex-synthetic +--- + +# 西游记 + +## 取经故事 + +《西游记》是明代吴承恩整理创作的长篇章回体神魔小说,以唐僧师徒西天取经为主线。故事从花果山石猴出世、拜师学艺、大闹天宫写起,孙悟空被如来佛祖压在五行山下;后经观音菩萨安排,随唐僧玄奘赴灵山求取真经。途中收服猪八戒、沙悟净,并有白龙马相随,历经九九八十一难,如三打白骨精、车迟国斗法、真假美猴王、火焰山借芭蕉扇等,最终取得经卷,修成正果。 + +## 师徒四人形象 + +唐僧虔诚慈悲、意志坚定,但有时辨识不明、过于执拗,是取经事业的精神核心。孙悟空机智勇敢、神通广大,能七十二变、腾云驾雾,既桀骜不驯又忠心护师,是最具光彩的英雄形象。猪八戒贪吃好色、爱占便宜,常打退堂鼓,却也憨厚可亲、能出力助战。沙悟净沉稳寡言、任劳任怨,维系队伍秩序,体现坚忍与忠诚。 + +## 主题与艺术特点 + +作品借神魔世界和降妖伏魔的情节,表现磨难、修行、信念与团队协作。它融合民间传说、宗教想象和讽刺笔法,情节奇幻而结构连贯,人物性格鲜明,使取经之路成为中国古典小说中最具代表性的成长与考验叙事之一。 diff --git a/datasets/domain-bilingual-v2/corpus/china-lit-yuanqu.md b/datasets/domain-bilingual-v2/corpus/china-lit-yuanqu.md new file mode 100644 index 0000000..dc4af3f --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-lit-yuanqu.md @@ -0,0 +1,19 @@ +--- +title: 元曲 +language: zh +source: openai-codex-synthetic +--- + +# 元曲 + +## 元杂剧与散曲 + +元曲是中国古典文学中与诗词、小说并列的重要体裁,主要包括元杂剧和散曲。元杂剧以舞台演出为核心,通常“一本四折”,有时加“楔子”,结合唱、念、做、打,按宫调和曲牌组织音乐,角色有末、旦、净、丑等。散曲则偏重抒情吟咏,分小令和套数,语言较诗词更口语化,如马致远《天净沙·秋思》、张养浩《山坡羊·潼关怀古》皆为名篇。 + +## 代表作家与作品 + +关汉卿是元杂剧最重要的作家之一,代表作有《窦娥冤》《救风尘》《望江亭》,善写下层人物的冤屈、机智与抗争。王实甫《西厢记》以才子佳人故事展现爱情理想;白朴有《梧桐雨》《墙头马上》;马致远除散曲外,杂剧《汉宫秋》亦著名;郑光祖《倩女离魂》以离奇情节表现情感执着。关汉卿、白朴、马致远、郑光祖常被称为“元曲四大家”。 + +## 艺术特点 + +元曲的突出特点是“本色当行”:语言明白晓畅,善用俗语、衬字和市井声口;结构紧凑,冲突集中,适合舞台呈现;情感表达既有慷慨悲歌,也有婉转哀怨。它把现实批判、人物塑造和音乐曲调结合起来,使中国古典文学在诗词之外形成了更具戏剧性和民间气息的高峰。 diff --git a/datasets/domain-bilingual-v2/corpus/china-money-jiaozi.md b/datasets/domain-bilingual-v2/corpus/china-money-jiaozi.md new file mode 100644 index 0000000..158e9fc --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-jiaozi.md @@ -0,0 +1,19 @@ +--- +title: 交子与纸币的起源 +language: zh +source: openai-codex-synthetic +--- + +# 交子与纸币的起源 + +## 产生背景:四川铁钱与商人信用 + 
+北宋前期,四川(成都府、益州路)长期行用铁钱,铜钱输入受限,铁钱价值低而重量大,商旅纳税、买卖和远途运输都不便。成都富商因此开设“交子铺”:客户交纳现钱,取得一纸凭证,可在同业铺户兑回或转让。这种凭证最初只是民间信用票据。宋真宗景德年间,益州知州张咏整顿十六户富商联合发行,以印押、暗记和担保维持信用;但私营者挪用本钱、兑付不稳,促使官府接管。 + +## 发行机制:益州交子务 + +宋仁宗天圣元年(1023),朝廷在成都设置益州交子务,收归官办并发行官交子。它被普遍视为世界最早由政府主持的纸币。官交子按“界”分期发行,设最高额度,常见记载为一界一百二十五万余贯,并以三十六万贯铁钱作本钱准备。票面由官府印造,载明贯数、界分、兑付期限,加盖官印,严禁私造。 + +## 流通与财政意义 + +官交子可在四川市场支付、缴纳和兑换,期满以旧换新,收取少量工墨费。其核心机制是把沉重的金属钱转化为可携带、可转让、可兑付的国家信用通货,并通过额度、本钱和期限约束发行。交子的出现,标志着北宋财政与商业已能用制度化信用补充金属货币。 diff --git a/datasets/domain-bilingual-v2/corpus/china-money-piaohao.md b/datasets/domain-bilingual-v2/corpus/china-money-piaohao.md new file mode 100644 index 0000000..6e2ba93 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-piaohao.md @@ -0,0 +1,19 @@ +--- +title: 票号与钱庄 +language: zh +source: openai-codex-synthetic +--- + +# 票号与钱庄 + +## 晋商票号的汇兑网络 + +票号是清代以汇兑为核心的传统金融机构,代表为山西平遥的日升昌,常与雷履泰、李大全等晋商姓名相连。其基本做法是:客户在甲地交付白银或钱款,由票号开给汇票,凭密押、平码和印鉴在乙地分号兑付。总号设在平遥、祁县、太谷等地,分号遍布北京、天津、汉口、上海、广州以及西北边地,形成跨省清算网络。票号收取汇费,兼办官款、商款、军饷和税银的运送替代业务,减少实银长途押运的风险与成本。 + +## 钱庄的存放款与地方信用 + +钱庄多见于江南、市镇和通商口岸,上海、宁波、绍兴钱庄尤盛。它们经营存款、放款、贴现、拆借和银钱兑换,服务对象包括商号、牙行、作坊和富户。客户存入银两或制钱,钱庄登记账户并支付或免收利息;商人周转不足时,可凭信用、抵押或保人取得短期放款。钱庄还签发庄票、会票,办理同业清算,按当日银钱比价折算,体现白银与铜钱并用环境下的市场定价。 + +## 运作机制与历史地位 + +票号重在远距离汇兑,钱庄重在地方存放款和商业周转;二者都依赖字号声誉、熟人担保、账簿核算和同业规约。官府财政、盐商、粮商、布商等常借其调拨资金,使传统金融嵌入赋税、贸易和货币流通之中。晚清以后,近代银行兴起,票号因组织封闭、资本有限而衰落,钱庄则部分转入银行体系或继续承担地方信用中介功能。 diff --git a/datasets/domain-bilingual-v2/corpus/china-money-qingmiao.md b/datasets/domain-bilingual-v2/corpus/china-money-qingmiao.md new file mode 100644 index 0000000..d33ccc5 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-qingmiao.md @@ -0,0 +1,19 @@ +--- +title: 王安石青苗法 +language: zh +source: openai-codex-synthetic +--- + +# 王安石青苗法 + +## 低息官贷设计 + +青苗法是北宋熙宁变法中的财政与信用制度,主持者为王安石,得到宋神宗支持,熙宁二年(1069)由制置三司条例司推行。其核心是以常平、广惠等仓本钱为资本,在农民春夏青黄不接、作物未熟之时发放“青苗钱”或折谷官贷,秋收后归还本息。制度名义上采取自愿请贷、按户等第给付,每期取息二分左右,低于民间豪强、高利贷者在荒歉时索取的重息,因此被称为“低息官贷”。 + +## 抑兼并与增财政 + 
+青苗法的目标不只是救济贫农,更是“抑兼并、通货财”。王安石认为,民间富户乘贫民急需种子、口粮时放债,兼并土地、控制乡里;国家若以较低利率提供信用,可以减少农户被迫典卖田产,削弱兼并之家。同时,官府收取利息,使常平仓资本流动起来,增加经费来源,达到“不加赋而国用足”的财政效果。它体现了北宋以国家力量介入货币、借贷和赋役秩序的改革思路。 + +## 争议与后果 + +反对者如司马光、苏轼等指责青苗法“与民争利”。争议集中在三点:其一,地方官为完成放贷额度,常将自愿变为摊派;其二,胥吏评等、催收、折算时容易侵扰农户;其三,二分息虽低于私贷,却仍可能在歉收年份转为沉重负担。支持者强调它能平抑高利贷、稳定小农;反对者则认为执行失真会伤民。青苗法因而成为王安石变法中最能体现财政增收与社会救济张力的制度。 diff --git a/datasets/domain-bilingual-v2/corpus/china-money-silver.md b/datasets/domain-bilingual-v2/corpus/china-money-silver.md new file mode 100644 index 0000000..f158daf --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-silver.md @@ -0,0 +1,19 @@ +--- +title: 明清白银货币化 +language: zh +source: openai-codex-synthetic +--- + +# 明清白银货币化 + +## 明代白银成为主要通货 + +明初曾重视铜钱与官府信用货币,但民间大额交易、田赋折纳和商税结算逐渐转向白银。到嘉靖、万历时期,白银以“两、钱、分”计重,按成色折算,成为市场买卖、官府会计和财政征收的核心尺度。铜钱仍用于小额零售,白银则承担大额支付、储藏财富和跨区域清算功能。 + +## 海外白银流入 + +16世纪以后,中国对白银需求扩大,与海外贸易相互推动。日本石见银山等矿产白银经海商输入东南沿海;西属美洲波托西、萨卡特卡斯的白银则通过马尼拉大帆船贸易,经吕宋、澳门、月港等渠道进入中国。隆庆开关后,福建、广东海贸活跃,丝绸、瓷器等商品大量换取外银,形成持续的白银输入。 + +## 银本位的形成 + +明代财政改革中的赋役折银,使国家收入越来越依赖白银;市场价格、地租、商税和军饷也普遍以银计价。白银并非统一铸币,而是以银锭、碎银按重量流通,因此银本位表现为“称量货币”体系。清代继承此格局,库平银、漕项、地丁等财政核算均以银两为准。由此,白银成为明清财政、市场和长途贸易共同认可的主通货。 diff --git a/datasets/domain-bilingual-v2/corpus/china-money-song-inflation.md b/datasets/domain-bilingual-v2/corpus/china-money-song-inflation.md new file mode 100644 index 0000000..cbe00c7 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-song-inflation.md @@ -0,0 +1,19 @@ +--- +title: 北宋的通货膨胀 +language: zh +source: openai-codex-synthetic +--- + +# 北宋的通货膨胀 + +## 官办交子与发行纪律 + +四川行用铁钱,面值小而重量大,商旅纳税都不便。天圣元年(1023),北宋在益州设交子务,将交子纳入官办,设“三年一界”、限额印造,并以铁钱留作本钱。这个安排本是为稳定市易、便利解纳;一旦界数、准备和兑现被破坏,纸券便会脱离钱本。 + +## 交子钱引超发与物价 + +神宗以后,西北边费、河防、军粮和赈济屡需现钱,朝廷与转运司常把交子、钱引当作预支财政的手段。旧界尚未收换,新界已经加印,准备金又被移作他用。至徽宗崇宁、大观间,四川钱引名额远过可兑铁钱,民间按折价收受,米麦、布帛、房租出现“现钱价”与“引价”的差别。通胀的核心原因,是官府以纸面发行弥补支出,而非商品突然增加。 + +## 铜钱铁钱并行的困局 + +北宋并无统一金属通货:京畿、东南多用铜钱,川峡多用铁钱,陕西等地又有混用。官定铜铁比价常落后于市价,铜钱外流或被囤积,铁钱因笨重而折价,跨路交易必须反复折算。交子、钱引原可缓解铁钱携带之弊,超发后却放大铜钱、铁钱与纸券之间的价差,造成拒收、套换和财政信用下跌。 
diff --git a/datasets/domain-bilingual-v2/corpus/china-money-yanyin.md b/datasets/domain-bilingual-v2/corpus/china-money-yanyin.md new file mode 100644 index 0000000..cd25070 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-yanyin.md @@ -0,0 +1,19 @@ +--- +title: 盐引制度 +language: zh +source: openai-codex-synthetic +--- + +# 盐引制度 + +## 制度原理 + +盐引,又称盐钞,是食盐专卖下由官府签发的取盐凭证。国家控制盐场、灶户和销售区域,商人不能自由购盐,须先向户部、盐运司或指定官库缴纳钱物,取得注明盐额、场灶、行销地和期限的盐引,再到两淮、两浙、河东等盐区支盐贩卖。唐代刘晏整顿榷盐后,宋、元、明、清各代均以类似办法把食盐收益纳入财政。 + +## 有价凭证的流通 + +盐引的价值来自国家专卖权:持引者可按额取得官盐并在规定口岸、州县销售。因此它不仅是行政许可,也是可计价、可转让、可抵押的有价凭证。商人之间常按盐价、路费、滞引风险折价买卖盐引;盐业资本集中地区如扬州、徽州,围绕引额形成囤积、拆分和转售。盐引本身不等同通货,但在特定行业内具有准金融工具性质。 + +## 财政工具与开中法 + +盐引使官府能够先收款、后给盐,把未来盐利提前变成财政收入。明代“开中法”尤其典型:商人向边镇输纳粮草,换取盐引,再赴盐场支盐销售,朝廷借此解决军饷、转运和边防供给问题。其风险在于引额滥发、支盐迟滞和私盐冲击。一旦官府信用下降,盐引价格便会折损,财政收入与市场秩序同时受损。 diff --git a/datasets/domain-bilingual-v2/corpus/china-money-yitiaobian.md b/datasets/domain-bilingual-v2/corpus/china-money-yitiaobian.md new file mode 100644 index 0000000..822673d --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/china-money-yitiaobian.md @@ -0,0 +1,19 @@ +--- +title: 一条鞭法 +language: zh +source: openai-codex-synthetic +--- + +# 一条鞭法 + +## 背景与推行 + +一条鞭法是明代中后期财政改革的核心,而非单一货币发行制度。嘉靖、隆庆间,南直隶、浙江等地已有把赋税和差役合并征收的做法;万历初年,内阁首辅张居正借整饬户部、清丈田亩之机,将其推广为全国性制度。万历九年(1581)清丈土地,校正鱼鳞图册和黄册,目的在于核实田亩、丁口与应纳钱粮,减少豪强隐占、地方加派和胥吏中饱。 + +## 制度内容 + +所谓“一条鞭”,关键在“赋役合并、折银征收”。原来分散的夏税、秋粮、均徭、里甲、杂泛差役等名目,尽量合为一项总额;应服的力役折成雇役银,连同田赋按州县核定,主要依据田亩兼及丁口编派。纳税者不再逐项交纳实物或亲身应役,而多以白银缴纳,地方官汇总解送户部或用于本地额定支出。 + +## 影响与限度 + +一条鞭法简化了征收科目,使国家财政更依赖银两通货,也提高了中央掌握钱粮的能力。它在形式上减轻了百姓往返奔役和多头摊派的负担,但执行效果取决于清丈是否真实、银价是否稳定、官吏是否守额。晚明财政压力加重后,旧的杂派和临时加征仍可能复出,因此它是赋役制度和银纳税制的重大调整,并非彻底免役或永久减税。 diff --git a/datasets/domain-bilingual-v2/corpus/crypto-block-stream.md b/datasets/domain-bilingual-v2/corpus/crypto-block-stream.md new file mode 100644 index 0000000..e30b2a2 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-block-stream.md @@ -0,0 +1,34 @@ +--- +title: Block vs stream ciphers +language: en 
+source: openai-codex-synthetic +--- + +# Block vs stream ciphers + +Block ciphers and stream ciphers are symmetric confidentiality primitives, but they package encryption differently. The practical choice affects padding, nonce handling, performance, error behavior, and how safely software can encrypt records, files, or network packets. + +## Block ciphers and modes + +A block cipher is a keyed permutation on fixed-size blocks. AES, for example, operates on 128-bit blocks; it does not by itself define how to encrypt a message of arbitrary length. A mode of operation supplies that missing structure. + +Common modes make different tradeoffs: + +- **ECB** encrypts each block independently and is unsafe for most data because equal plaintext blocks produce equal ciphertext blocks. +- **CBC** chains blocks using an initialization vector and requires padding for non-multiple block lengths; decryption errors can affect neighboring blocks. +- **CTR** encrypts counters to create a keystream, making a block cipher behave much like a stream cipher; it needs a unique nonce/counter combination. +- **GCM** combines counter-style encryption with an authentication tag and is widely used when both confidentiality and tamper detection are required. + +Block-cipher modes are often a good fit for standardized hardware support, bulk file encryption, and systems that already rely on mature AES implementations. Their main complexity is that the mode rules matter as much as the underlying cipher. + +## Stream ciphers + +A stream cipher generates a pseudorandom keystream from a secret key plus a nonce or initialization value, then XORs that keystream with plaintext. Encryption and decryption are the same XOR operation. ChaCha20 and Salsa20 are modern examples; RC4 is a historical example that should not be used in new designs. + +Stream ciphers naturally handle data of any length, including single bytes, without padding. 
They are useful for low-latency communication, software-only environments, and packet formats where buffering full blocks would be inconvenient. Many designs use counters internally, allowing efficient seeking or parallel generation. + +## Tradeoffs and safety rules + +The most important shared rule is uniqueness: never reuse the same keystream. Reusing a stream-cipher key/nonce pair, or reusing a CTR-mode counter sequence, exposes the XOR of the plaintexts and can reveal both messages. + +Block ciphers provide a compact, well-analyzed core but require a correct mode. Stream ciphers provide a direct keystream interface but place strict responsibility on nonce management. In practice, use vetted constructions such as AES-GCM or ChaCha20-Poly1305 rather than assembling raw primitives manually. diff --git a/datasets/domain-bilingual-v2/corpus/crypto-diffie-hellman.md b/datasets/domain-bilingual-v2/corpus/crypto-diffie-hellman.md new file mode 100644 index 0000000..761eee5 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-diffie-hellman.md @@ -0,0 +1,34 @@ +--- +title: Diffie–Hellman key exchange +language: en +source: openai-codex-synthetic +--- + +# Diffie–Hellman key exchange + +Diffie–Hellman key exchange is the standard primitive for **agreeing a shared secret over a public channel with no prior shared key**. Published in 1976 by Whitfield Diffie and Martin Hellman, with related ideas from Ralph Merkle, it lets two parties derive the same secret value even though an eavesdropper can observe every message they send. + +Diffie–Hellman is not message encryption and not an identity proof by itself. Its output is shared secret material that another protocol can turn into usable session keys. + +## Core finite-field exchange + +In classic finite-field Diffie–Hellman, both parties agree on public parameters: a large prime modulus `p` and a generator `g` of a suitable cyclic group. These values do not need to be secret. + +1. 
Alice chooses a private random exponent `a` and sends `A = g^a mod p`. +2. Bob chooses a private random exponent `b` and sends `B = g^b mod p`. +3. Alice computes `s = B^a mod p`. +4. Bob computes `s = A^b mod p`. + +Both results equal `g^(ab) mod p`, so Alice and Bob arrive at the same shared secret. An observer sees `p`, `g`, `A`, and `B`, but should not be able to recover `a`, `b`, or `g^(ab)` if the discrete logarithm problem is hard in the chosen group. + +## Variants and named groups + +Modern deployments often use elliptic-curve Diffie–Hellman, usually called ECDH. Instead of modular exponentiation in a finite field, ECDH uses scalar multiplication on an elliptic-curve group. Common named choices include Curve25519, used through X25519, and NIST P-256. Finite-field deployments may use standardized safe-prime groups such as those described in RFC 7919. + +Ephemeral Diffie–Hellman uses fresh private values for each session. This prevents reuse of the same exchange secret and is the usual choice when building secure channels. + +## Security properties and limits + +Diffie–Hellman protects against a passive network observer, but unauthenticated Diffie–Hellman is vulnerable to an active man-in-the-middle who substitutes public values. Therefore real protocols pair it with a separate authentication mechanism and usually feed the raw shared value into a key-derivation step before use. + +The distinctive purpose of Diffie–Hellman is key agreement: two parties with no prior shared key can create one across an untrusted public network. 
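The finite-field exchange can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the default parameters `p = 23`, `g = 5` are toy values chosen so the arithmetic can be followed by hand. They offer no security, and the sketch omits the authentication and key-derivation steps that real protocols add.

```python
import secrets


def dh_public(g, private, p):
    """Public value sent over the channel: g^private mod p."""
    return pow(g, private, p)


def dh_shared(peer_public, private, p):
    """Shared secret: peer_public^private mod p."""
    return pow(peer_public, private, p)


def demo_exchange(p=23, g=5):
    """Run one toy exchange and return both parties' computed secrets."""
    # Private exponents drawn from 1..p-2 (toy range, not a real group size).
    a = secrets.randbelow(p - 2) + 1
    b = secrets.randbelow(p - 2) + 1
    A = dh_public(g, a, p)  # Alice -> Bob
    B = dh_public(g, b, p)  # Bob -> Alice
    return dh_shared(B, a, p), dh_shared(A, b, p)
```

With the fixed exponents `a = 6` and `b = 15`, the public values are 8 and 19, and both sides arrive at the shared value 2.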
diff --git a/datasets/domain-bilingual-v2/corpus/crypto-hash.md b/datasets/domain-bilingual-v2/corpus/crypto-hash.md new file mode 100644 index 0000000..8037639 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-hash.md @@ -0,0 +1,27 @@ +--- +title: Cryptographic hash functions +language: en +source: openai-codex-synthetic +--- + +# Cryptographic hash functions + +## SHA-2 as a one-way digest primitive + +A cryptographic hash function maps an input message of any practical length to a fixed-size value called a **hash**, **digest**, or **checksum**. The SHA-2 family, standardized by NIST in **FIPS 180-4**, includes **SHA-224**, **SHA-256**, **SHA-384**, **SHA-512**, **SHA-512/224**, and **SHA-512/256**. For example, SHA-256 produces a 256-bit digest, commonly written as 64 hexadecimal characters. + +A hash function is not a cipher. It has no encryption key, no decryption operation, and no way to recover the original message from the digest. Its security goal is **one-wayness**: given a digest, it should be computationally infeasible to find any message that produced it. This property is also called **preimage resistance**. + +Small input changes should produce unrelated-looking digest changes. The text `hello` and the text `Hello` produce completely different SHA-256 outputs, which makes hashes useful as compact fingerprints for data. + +## Collision resistance and integrity + +A second core property is **collision resistance**: it should be infeasible to find two different messages with the same digest. Since every fixed-size digest has a finite number of possible values, collisions must exist mathematically, but a secure hash makes finding one beyond practical reach. SHA-256 is designed for about 128 bits of collision security because of the birthday bound. + +Hashes also aim for **second-preimage resistance**: given one specific message, it should be infeasible to construct a different message with the same digest. 
This matters for file replacement attacks, software archives, logs, and document records. + +For integrity checking, a trusted publisher can provide a SHA-256 digest for a release file. A user downloads the file, computes the digest locally, and compares it with the trusted value. If the values differ, the data was corrupted or altered. If they match, the file is very likely identical to the one represented by the trusted digest. + +## What hashes do not provide + +Hash functions provide integrity evidence, not confidentiality. Publishing a SHA-256 digest does not hide the message, and hashing predictable data does not make it secret. A hash is also not proof of who created a message unless it is combined with a separate authentication mechanism. In applied cryptography, SHA-2 is therefore best understood as a one-way, collision-resistant fingerprint primitive for detecting change, not as encryption. diff --git a/datasets/domain-bilingual-v2/corpus/crypto-public-key.md b/datasets/domain-bilingual-v2/corpus/crypto-public-key.md new file mode 100644 index 0000000..955acef --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-public-key.md @@ -0,0 +1,27 @@ +--- +title: Public-key cryptography +language: en +source: openai-codex-synthetic +--- + +# Public-key cryptography + +Public-key cryptography uses an **asymmetric key pair**: a public key that may be distributed widely and a private key that must remain secret. Its central confidentiality pattern is simple: **encrypt with the public key, decrypt with the private key**. Anyone who knows the recipient’s public key can create ciphertext, but only the holder of the matching private key can recover the plaintext. + +## Asymmetric key pairs and confidentiality + +Unlike a shared-secret scheme, the two keys in an asymmetric pair are mathematically related but not interchangeable. 
The public key is placed in directories, certificates, configuration files, or messages; the private key is protected in a software keystore, hardware security module, smart card, or other restricted environment. + +This model is useful when two parties do not already share a secret key. A sender can obtain Alice’s public key and encrypt a message for Alice without learning her private key. If Alice’s private key is lost, ciphertext encrypted to the corresponding public key is effectively unrecoverable. If the private key is copied or exposed, confidentiality for messages encrypted to that public key is no longer trustworthy. + +## RSA encryption + +**RSA**, named for Ron Rivest, Adi Shamir, and Leonard Adleman, is the classic public-key encryption algorithm. An RSA public key contains a modulus `n` and public exponent `e`; the private key contains secret values including a private exponent `d`. The modulus is derived from two large random primes. In simplified form, RSA encryption computes ciphertext from a message using the public key, and decryption reverses the operation using the private key. + +Real RSA encryption must use a padding and encoding scheme. The modern standard is **RSAES-OAEP**, specified in **PKCS #1** and **RFC 8017**. “Textbook RSA” without padding is unsafe because it is deterministic and structurally malleable. Common contemporary RSA modulus sizes are **2048 bits** at a minimum, with **3072 bits** or larger chosen for longer security margins according to guidance such as NIST publications. + +## Practical limits and responsibilities + +RSA is not normally used to encrypt large files directly. It can encrypt only data smaller than the modulus after padding overhead, so systems often encrypt a compact secret value for the recipient and protect bulk data separately. 
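The size limit can be made concrete. For RSAES-OAEP, RFC 8017 bounds the message length at `k - 2*hLen - 2` bytes, where `k` is the modulus length in bytes and `hLen` is the hash output length. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and it performs no encryption.

```python
import hashlib


def oaep_max_message_bytes(modulus_bits, hash_name="sha256"):
    """Largest plaintext, in bytes, that RSAES-OAEP can carry."""
    k = (modulus_bits + 7) // 8                      # modulus length in bytes
    h_len = hashlib.new(hash_name).digest_size       # hash output length
    return k - 2 * h_len - 2
```

A 2048-bit key with SHA-256 therefore carries at most 190 bytes, which is enough for a symmetric key but not for bulk data.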
+ +Public-key encryption provides confidentiality for the encrypted content, but it does not automatically prove who created the ciphertext or that external metadata was not changed. Correct public-key use therefore requires authentic public keys, secure random number generation, safe private-key storage, and well-reviewed implementations such as OpenSSL, BoringSSL, or platform cryptographic libraries. diff --git a/datasets/domain-bilingual-v2/corpus/crypto-signatures.md b/datasets/domain-bilingual-v2/corpus/crypto-signatures.md new file mode 100644 index 0000000..e631618 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-signatures.md @@ -0,0 +1,28 @@ +--- +title: Digital signatures +language: en +source: openai-codex-synthetic +--- + +# Digital signatures + +## Purpose and security property + +A digital signature is a public-key primitive used to prove that a particular private key approved an exact message. In common signing formats—such as CMS/PKCS #7 `SignedData` used by S/MIME, OpenPGP signatures, and PDF Advanced Electronic Signatures—a signer first obtains a message digest of the bytes to be authorized and then applies a signature algorithm with the private key. The result is a compact signature value stored with or beside the document. + +The security goal is authenticity and integrity: if the document changes by one byte, recomputing the digest makes verification fail. A signature can also support non-repudiation: when a private key is assigned to Alice, kept under her exclusive control, and certified or logged, a valid signature is evidence that Alice authorized the signed content. In practice, non-repudiation also depends on key custody, timestamps, revocation status, policy, and audit records; mathematics alone does not prove human intent. + +## Signing a digest with the private key + +A typical signing workflow is: + +1. 
Serialize or canonicalize the exact content to be signed, including context such as `invoice-v1`, a document identifier, or signing time if required. +2. Produce a fixed-length digest of that content, for example a SHA-256 digest. +3. Use the private signing key with an algorithm such as RSA-PSS, ECDSA over NIST P-256, or Ed25519ph to generate a signature over the digest or algorithm-defined transcript. +4. Publish the message, signature, algorithm identifiers, and the signer’s public-key certificate or fingerprint. + +The private key never leaves the signer’s control, often remaining inside a YubiKey, smart card, or FIPS-validated hardware security module. It is not accurate to describe modern signing as simply “encrypting with the private key”; real schemes use padding, domain separation, nonce rules, and algorithm-specific checks to prevent forgery. + +## Verification with the public key + +A verifier obtains the signer’s authentic public key, recomputes the digest from the received bytes, and checks the signature according to the declared algorithm. Verification returns valid or invalid; it does not recover hidden content or provide secrecy. A valid result means the holder of the matching private key signed that exact digest, so later modification, substitution, or accidental corruption is detectable. diff --git a/datasets/domain-bilingual-v2/corpus/crypto-symmetric.md b/datasets/domain-bilingual-v2/corpus/crypto-symmetric.md new file mode 100644 index 0000000..b98698f --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-symmetric.md @@ -0,0 +1,25 @@ +--- +title: Symmetric-key ciphers +language: en +source: openai-codex-synthetic +--- + +# Symmetric-key ciphers + +## Shared-secret encryption + +A symmetric-key cipher protects confidentiality when the sender and receiver already possess the same secret key. The same key is used to transform plaintext into ciphertext and to reverse that process back into plaintext. 
Anyone without the shared key should be unable to recover the original message, even if they can observe or store the ciphertext. + +This model is common in file encryption, database field encryption, backup protection, and private application-to-application channels where a key has already been provisioned. Its main operational rule is simple: the key must remain secret and must be available only to authorized parties. If the shared key is exposed, past and future ciphertext protected with that key may be at risk, depending on the mode of operation and key-rotation practices. + +## AES as the standard block cipher + +The Advanced Encryption Standard, or AES, is the dominant modern symmetric block cipher. It was standardized by NIST in FIPS 197 and is based on the Rijndael design by Joan Daemen and Vincent Rijmen. AES operates on fixed-size 128-bit blocks and supports 128-bit, 192-bit, and 256-bit keys. These variants are commonly written as AES-128, AES-192, and AES-256. + +AES by itself encrypts exactly one block at a time. Real messages are usually longer than 128 bits, so AES is used with a mode of operation that defines how multiple blocks are processed and how repeated plaintext patterns are avoided. + +## Modes, IVs, and correct use + +Common AES modes include CBC, CTR, and GCM. CBC mode requires an unpredictable initialization vector and padding such as PKCS#7 for messages that are not an exact multiple of the block size. CTR mode turns AES block operations into a counter-based construction and requires that the same counter/nonce value never be reused with the same key. GCM is widely used because it combines AES encryption with an authentication tag, but it also depends critically on unique nonces. + +Correct symmetric encryption requires more than choosing AES. Implementations must generate high-entropy keys, use approved modes, handle IVs or nonces correctly, rotate keys when appropriate, and erase keys from memory when no longer needed. 
AES is considered strong when used this way; most failures come from key exposure, nonce reuse, weak random-number generation, or custom modes rather than from the AES algorithm itself. diff --git a/datasets/domain-bilingual-v2/corpus/crypto-tls.md b/datasets/domain-bilingual-v2/corpus/crypto-tls.md new file mode 100644 index 0000000..ec2c61c --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/crypto-tls.md @@ -0,0 +1,32 @@ +--- +title: The TLS handshake +language: en +source: openai-codex-synthetic +--- + +# The TLS handshake + +## Purpose in HTTPS + +The TLS handshake is the protocol step that lets a client and server agree on fresh session keys and lets the client authenticate that it is talking to the intended server. In HTTPS, this happens before HTTP requests are sent, for example when a browser connects to `https://www.example.com`. + +TLS does not replace ciphers, hashes, signatures, certificates, or key exchange. Instead, it coordinates them. Modern TLS 1.3, standardized in RFC 8446, combines these primitives into a short sequence of messages that establishes an encrypted channel with server identity checking. + +## Main handshake flow + +A typical TLS 1.3 handshake begins with a **ClientHello**. The client sends supported protocol versions, cipher suites, random data, extensions such as SNI, and key-share information. + +The server replies with a **ServerHello**, selecting the TLS version, cipher suite, and matching key-share parameters. From the exchanged values, both sides derive temporary session secrets for this connection. These secrets are then expanded into traffic keys used to protect later handshake and application records. + +Next, the server sends its **Certificate** message. This usually contains an X.509 certificate chain linking the server name, such as `www.example.com`, to a public key trusted through a certificate authority such as DigiCert or Let’s Encrypt. 
The client validates the chain, the hostname, expiration dates, and policy requirements. + +The server then sends **CertificateVerify**, proving possession of the private key corresponding to the certificate. Finally, both sides send **Finished** messages, which confirm that the same handshake transcript and derived secrets were used. + +## What the handshake guarantees + +The TLS handshake provides two central results: + +1. **Session key negotiation:** client and server end with shared traffic keys that were not sent directly over the network. +2. **Server authentication:** the client gains evidence that the server controls the private key for a certificate valid for the requested name. + +After the handshake, application data such as HTTP headers, cookies, and request bodies is protected using the negotiated record-layer algorithms. Session resumption and PSK modes can shorten later connections, but the core idea remains the same: TLS is the protocol that combines cryptographic primitives to authenticate the server and establish keys for a secure session. diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-causes.md b/datasets/domain-bilingual-v2/corpus/french-rev-causes.md new file mode 100644 index 0000000..ca2c7c6 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-causes.md @@ -0,0 +1,27 @@ +--- +title: Causes of the French Revolution +language: en +source: openai-codex-synthetic +--- + +# Causes of the French Revolution + +## The Pre-1789 Fiscal Crisis + +The French Revolution grew out of a deep financial emergency in the **Ancien Régime** monarchy of **Louis XVI**. By the late 1780s, the crown was burdened by debts from earlier conflicts, especially the **Seven Years’ War** and French support for the **American War of Independence**. Interest payments consumed a large share of royal revenue, leaving little room for ordinary administration, court expenses at **Versailles**, or military costs. + +The tax system made the crisis worse. 
Direct and indirect taxes such as the **taille**, **vingtième**, and **gabelle** fell unevenly, with many nobles, clergy, provinces, and officeholders enjoying exemptions or special privileges. Tax farming, venal offices, and local privileges made collection inefficient and widely resented. Reform ministers including **Anne-Robert-Jacques Turgot**, **Jacques Necker**, **Charles Alexandre de Calonne**, and **Étienne Charles de Loménie de Brienne** attempted to reorganize royal finances, but proposals for broader taxation of privileged groups met resistance from the **Parlement of Paris** and the **Assembly of Notables** in 1787. + +By 1788, the monarchy faced a severe credit crisis and struggled to borrow. Bad harvests and rising bread prices intensified social distress, turning a fiscal problem into a broader political crisis. + +## Enlightenment Ideas and Political Criticism + +The intellectual climate of eighteenth-century France also weakened the legitimacy of absolute monarchy and inherited privilege. **Montesquieu** argued for separation of powers and criticized despotism. **Voltaire** attacked religious intolerance and arbitrary authority. **Jean-Jacques Rousseau** emphasized popular sovereignty and the “general will,” suggesting that political authority should rest on the consent of the people rather than tradition alone. + +These ideas circulated through salons, pamphlets, newspapers, Masonic lodges, and educated urban networks. They did not cause revolution by themselves, but they gave critics a language of **rights**, **liberty**, **reason**, **representation**, and **constitutional reform**. Enlightenment arguments made the privileges of the Old Regime appear irrational and unjust. + +## The Society of Estates + +French society was legally divided into three estates. The **First Estate** was the clergy, which owned land, collected tithes, and held major influence in education and welfare. 
The **Second Estate** was the nobility, with honorific status, seigneurial rights, and access to high military and administrative offices. The **Third Estate** included everyone else: wealthy bourgeois lawyers and merchants, artisans, urban laborers, and the peasantry. + +Most taxes and feudal dues fell on the Third Estate, while many privileged groups defended exemptions. This unequal structure created resentment because economic importance, legal status, and political influence no longer aligned. The fiscal crisis, Enlightenment critique, and estate-based inequality together made the Old Regime unstable before 1789. diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-estates-general.md b/datasets/domain-bilingual-v2/corpus/french-rev-estates-general.md new file mode 100644 index 0000000..1b71e36 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-estates-general.md @@ -0,0 +1,29 @@ +--- +title: The Estates-General and 1789 +language: en +source: openai-codex-synthetic +--- + +# The Estates-General and 1789 + +The year 1789 turned a royal political assembly into a revolutionary break with old authority. Its key sequence ran from the convocation of the Estates-General at Versailles, to the formation of the National Assembly, to the fall of the Bastille in Paris, and finally to the Declaration of the Rights of Man and of the Citizen. + +## Convocation of the Estates-General + +Louis XVI summoned the Estates-General to meet on 5 May 1789 at Versailles, the first such meeting since 1614. Deputies gathered in the hall of the Menus-Plaisirs, organized as the First Estate clergy, the Second Estate nobility, and the Third Estate commoners. About 1,200 deputies attended, with the Third Estate granted roughly twice the representation of either privileged order. + +Across the kingdom, electoral assemblies prepared *cahiers de doléances*, or lists of grievances, which gave the meeting a public and national character. 
The central procedural dispute was whether votes would be counted by order, giving each estate one collective vote, or by head, allowing the larger Third Estate to prevail with allies from the clergy and nobility. + +## From Estates to National Assembly + +When negotiations over voting stalled, Third Estate deputies asserted that they represented the nation itself. On 17 June 1789 they declared themselves the National Assembly. Three days later, after finding their meeting hall closed, they moved to an indoor tennis court and swore the Tennis Court Oath, promising not to separate until France had a constitution. + +Jean-Sylvain Bailly presided over the oath, while figures such as Honoré-Gabriel Riqueti, comte de Mirabeau, defended the Assembly’s authority. On 27 June, Louis XVI ordered the remaining clergy and noble deputies to join the new body. On 9 July it took the name National Constituent Assembly. + +## The Bastille + +Tension in Paris sharpened after royal troops gathered near the capital and Jacques Necker was dismissed on 11 July. On 14 July 1789, crowds seized arms from the Hôtel des Invalides and moved to the Bastille, seeking gunpowder. The fortress, commanded by Bernard-René de Launay, held only seven prisoners, but it symbolized arbitrary royal power. Its capture made Paris a decisive revolutionary force and helped bring the National Guard, associated with the marquis de Lafayette, into prominence. + +## Declaration of Rights + +On 26 August 1789, the National Constituent Assembly adopted the Declaration of the Rights of Man and of the Citizen. It proclaimed that sovereignty resided in the nation, that citizens were equal before the law, and that liberty, property, security, and resistance to oppression were natural rights. The Declaration became the core statement of 1789’s political principles. 
diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-napoleon-rise.md b/datasets/domain-bilingual-v2/corpus/french-rev-napoleon-rise.md new file mode 100644 index 0000000..bf85843 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-napoleon-rise.md @@ -0,0 +1,27 @@ +--- +title: Napoleon's rise to power +language: en +source: openai-codex-synthetic +--- + +# Napoleon's rise to power + +Napoleon Bonaparte’s ascent from celebrated general to ruler of France unfolded during the unstable final phase of the French Revolution. His rise depended on military prestige, constitutional maneuvering, and a carefully managed transfer of authority from republican institutions to personal rule. + +## The Brumaire Coup + +The decisive break came in the coup of **18–19 Brumaire, Year VIII** of the revolutionary calendar (**9–10 November 1799**). At the time, the **Directory** appeared divided and unable to command confidence. Napoleon joined with political figures including **Emmanuel-Joseph Sieyès**, **Roger Ducos**, and his brother **Lucien Bonaparte**, who was president of the **Council of Five Hundred**. + +The conspirators arranged for the legislative councils to move from Paris to **Saint-Cloud**, claiming that the republic faced a security threat. There, pressure was placed on the **Council of Ancients** and the **Council of Five Hundred** to accept a new executive arrangement. When resistance mounted, troops loyal to Napoleon intervened and dispersed the deputies. The coup ended the Directory and presented the change as a defense of order, law, and the Revolution’s achievements rather than a simple military seizure. + +## The Consulate + +After Brumaire, France was reorganized under the **Constitution of Year VIII**. The new government, the **Consulate**, placed executive power in the hands of three consuls, but real authority belonged to Napoleon as **First Consul**. 
Sieyès and Ducos were soon replaced by **Jean-Jacques-Régis de Cambacérès** and **Charles-François Lebrun**, making the regime appear collective while centralizing decisions around Napoleon. + +The Consulate used plebiscites, administrative reform, and the language of republican stability to strengthen Napoleon’s position. Prefects, centralized ministries, and loyal institutions tied the provinces more closely to Paris. In **1802**, following another plebiscite, Napoleon became **Consul for Life**, a major step from revolutionary republican office toward hereditary sovereignty. + +## From First Consul to Emperor + +The final transformation came in **1804**. The **Senate** approved the creation of the **French Empire**, and on **18 May 1804** Napoleon was proclaimed **Emperor of the French**. The title suggested continuity with national sovereignty while replacing the temporary framework of the Consulate with dynastic rule. + +On **2 December 1804**, Napoleon’s coronation took place at **Notre-Dame Cathedral in Paris**, in the presence of **Pope Pius VII**. By crowning himself, Napoleon symbolized that his authority came not simply from the Church or old monarchy, but from his own power and the revolutionary state he had mastered. diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-napoleonic-wars.md b/datasets/domain-bilingual-v2/corpus/french-rev-napoleonic-wars.md new file mode 100644 index 0000000..0990de4 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-napoleonic-wars.md @@ -0,0 +1,27 @@ +--- +title: The Napoleonic Wars +language: en +source: openai-codex-synthetic +--- + +# The Napoleonic Wars + +## Coalitions and Campaigns + +The Napoleonic Wars were a series of European conflicts fought between Napoleonic France and shifting coalitions of rival powers from 1803 to 1815. They grew out of the revolutionary era but became a continent-wide struggle over empire, law, armies, and political authority. 
Britain was France’s most persistent enemy, while Austria, Russia, Prussia, Sweden, Spain, and Portugal entered coalitions at different moments. + +The Third Coalition, including Britain, Austria, and Russia, was defeated on land at Ulm and Austerlitz in 1805, although the British Royal Navy under Horatio Nelson destroyed the Franco-Spanish fleet at Trafalgar. The Fourth Coalition followed in 1806–07, when Prussia and Russia challenged French dominance. Napoleon won decisive victories at Jena-Auerstedt and Friedland, leading to the Treaties of Tilsit. + +The Fifth Coalition was centered on Austria’s renewed war in 1809 and ended after the French victory at Wagram. Meanwhile, the Peninsular War in Spain and Portugal became a grinding conflict involving guerrilla resistance, British forces under Arthur Wellesley, and French imperial armies. It weakened Napoleon’s authority and tied down large numbers of troops. + +## The Continental System + +The Continental System was Napoleon’s economic campaign against Britain. Established through the Berlin Decree of 1806 and strengthened by the Milan Decree of 1807, it attempted to close European ports to British goods and disrupt British commerce. Napoleon hoped that economic pressure would force Britain to negotiate. + +The system was difficult to enforce. Smuggling continued, neutral shipping was contested, and many European merchants suffered from the blockade. The policy strained relations with allies and dependent states, especially where trade with Britain was economically important. Russia’s growing reluctance to uphold the system helped create the crisis that led to Napoleon’s invasion of Russia in 1812. + +## Defeat and Final Coalition + +The Russian campaign was a turning point. The Grande Armée captured Moscow but was devastated by distance, winter, scorched-earth tactics, and supply failures. 
In 1813, the Sixth Coalition—Britain, Russia, Prussia, Austria, and Sweden—defeated Napoleon at Leipzig, the “Battle of Nations.” + +After Napoleon returned for a final campaign in 1815, the Seventh Coalition defeated him at Waterloo. The battle brought together the Duke of Wellington’s Anglo-allied army and Gebhard Leberecht von Blücher’s Prussian forces, ending the military phase of Napoleonic domination in Europe. diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-terror.md b/datasets/domain-bilingual-v2/corpus/french-rev-terror.md new file mode 100644 index 0000000..4770564 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-terror.md @@ -0,0 +1,25 @@ +--- +title: The Reign of Terror +language: en +source: openai-codex-synthetic +--- + +# The Reign of Terror + +## Jacobin Ascendancy + +The Reign of Terror was the emergency phase of the French Revolution in 1793–1794, when revolutionary government claimed that liberty, the republic and public safety could be defended only by severe repression. Its political center was the Jacobin Club, especially the Montagnard deputies in the National Convention. Maximilien Robespierre, Louis Antoine de Saint-Just and Georges Couthon became leading voices for a republic based on virtue, vigilance and punishment of counter-revolution. + +The Jacobins drew support from Parisian sans-culottes and from militants who feared conspiracy, invasion, hoarding and royalist restoration. Their opponents included Girondins, refractory clergy, federalist rebels and alleged enemies of the people. Jacobin language blended popular sovereignty with suspicion: citizenship meant loyalty to the revolution, while neutrality could be treated as treason. + +## The Committee of Public Safety + +The Committee of Public Safety, created by the National Convention in April 1793, became the central executive organ of revolutionary government. 
By the summer and autumn of 1793 it directed military mobilization, surveillance, economic controls and political discipline across France. It worked alongside the Committee of General Security, the Revolutionary Tribunal in Paris and representatives-on-mission sent to departments and armies. + +Key measures gave the Terror legal form. The Law of Suspects of 17 September 1793 authorized the arrest of people accused of hostility to the republic, including former nobles, priests, hoarders and political dissidents. The Revolutionary Tribunal judged enemies of the revolution with increasing speed. In practice, revolutionary legality and political necessity became inseparable. + +## Robespierre and the Executions of 1793–1794 + +Robespierre did not personally order every execution, but he became the most famous symbol of the Terror. In his speech of 5 February 1794, he described terror as “prompt, severe, inflexible justice,” inseparable from revolutionary virtue. Under this logic, the guillotine became both a punishment and a public warning. + +Notable victims included Marie Antoinette in October 1793, leading Girondins, Olympe de Gouges, Jacques Hébert and the Hébertists, Georges Danton and Camille Desmoulins. The Law of 22 Prairial, passed on 10 June 1794, sharply limited defense rights and broadened the definition of enemies of the people. This accelerated the “Great Terror” in Paris. Across France, roughly 16,000 people were legally executed, with many more killed through shootings, drownings and civil-war repression. 
diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-thermidor.md b/datasets/domain-bilingual-v2/corpus/french-rev-thermidor.md new file mode 100644 index 0000000..620669d --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-thermidor.md @@ -0,0 +1,29 @@ +--- +title: The Thermidorian Reaction +language: en +source: openai-codex-synthetic +--- + +# The Thermidorian Reaction + +## Fall of Robespierre + +The Thermidorian Reaction was the turning point in the French Revolution that brought down Maximilien Robespierre and ended the most intense phase of revolutionary emergency government. It began on **9 Thermidor Year II** in the revolutionary calendar, or **27 July 1794**, inside the **National Convention** in Paris. + +Robespierre, closely associated with the **Committee of Public Safety**, had become politically isolated. Deputies feared that further accusations of treason would soon be directed against them. When Robespierre and his ally **Louis Antoine de Saint-Just** attempted to speak, hostile members of the Convention shouted them down. Robespierre, Saint-Just, **Georges Couthon**, and other supporters were arrested. + +That night, the Paris Commune tried to rally support for Robespierre at the **Hôtel de Ville**, but the Convention declared the rebels outlawed. Forces loyal to the Convention moved against them. On **10 Thermidor Year II** (**28 July 1794**), Robespierre, Saint-Just, Couthon, and their principal allies were executed by guillotine. + +## The End of the Terror + +The fall of Robespierre did not immediately create calm, but it marked the end of the system most closely identified with the **Terror**. The Thermidorians reduced the authority of the Committee of Public Safety, limited revolutionary courts, and moved against the political culture of denunciation. The **Jacobin Club** in Paris was closed in November 1794. 
+ +Many prisoners held as suspects were released, and the laws that had accelerated political trials were weakened or abandoned. Former militants and officials linked to the Terror were attacked in public life, while anti-Jacobin violence known as the **White Terror** appeared in parts of France. The Reaction therefore replaced one form of revolutionary severity with a more conservative republican backlash. + +## The Directory + +The Thermidorian settlement produced a new constitutional order: the **Constitution of Year III**, adopted in **1795**. It created the **Directory**, a republican government designed to prevent both dictatorship and popular radicalism. + +The legislature was divided into the **Council of Five Hundred** and the **Council of Ancients**. Executive power was held by five Directors, including figures such as **Paul Barras**, **Lazare Carnot**, **Jean-François Reubell**, **Louis-Marie de La Révellière-Lépeaux**, and **Étienne-François Letourneur**. + +The Directory governed France from **1795 to 1799**. It faced inflation, food shortages, royalist plots, Jacobin opposition, and dependence on the army to maintain order. Its instability made it a transitional regime between the revolutionary Convention and the later authoritarian phase of French political life. diff --git a/datasets/domain-bilingual-v2/corpus/french-rev-vienna.md b/datasets/domain-bilingual-v2/corpus/french-rev-vienna.md new file mode 100644 index 0000000..7f082d0 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/french-rev-vienna.md @@ -0,0 +1,27 @@ +--- +title: The Congress of Vienna +language: en +source: openai-codex-synthetic +--- + +# The Congress of Vienna + +## Purpose and Principles + +The Congress of Vienna was the diplomatic settlement that reorganized Europe after Napoleon’s fall from power. Meeting formally from 1814 to 1815, it produced the **Final Act of 9 June 1815**, which became the central framework for the post-Napoleon order. 
Its leading figures included **Klemens von Metternich** of Austria, **Robert Stewart, Viscount Castlereagh** of Britain, **Tsar Alexander I** of Russia, **Frederick William III** of Prussia, and **Charles-Maurice de Talleyrand** representing France. + +The settlement aimed to prevent renewed revolutionary and imperial domination by restoring political stability. Its guiding ideas were **legitimacy**, meaning the return of lawful dynasties where practical; **compensation**, meaning territorial rewards for victorious powers; and the **balance of power**, meaning that no single state should be strong enough to dominate the continent. + +## Territorial Settlement + +The Congress redrew borders without creating a single European empire. France was reduced broadly to its pre-expansion frontiers and placed under the restored **Bourbon monarchy of Louis XVIII**. The settlement did not destroy France, because the negotiators believed a weakened but functioning France was necessary for European stability. + +To contain France, the Congress strengthened neighboring states. The **United Kingdom of the Netherlands** joined the former Dutch Republic with the Austrian Netherlands under **King William I**. **Prussia** gained important territories in the Rhineland and Westphalia. **Austria** received control of **Lombardy-Venetia** and retained major influence in Italy. The **Kingdom of Sardinia-Piedmont** was restored and strengthened, including Genoa. + +In central Europe, the old imperial structure was not revived. Instead, the **German Confederation** was created as a loose association of thirty-nine German states, chaired by Austria. In the east, the Polish-Saxon question was settled by creating a reduced **Kingdom of Poland** linked to Russia, while Prussia and Austria received compensations. **Swiss neutrality** was formally recognized. 
+ +## Restoration and the Vienna System + +The Congress of Vienna established a conservative international order often called the **Vienna System** or the **Concert of Europe**. Its purpose was not simply to restore old rulers, but to manage future crises through consultation among the great powers. The **Quadruple Alliance** of Britain, Austria, Prussia, and Russia supported the settlement, while the **Holy Alliance** expressed a broader monarchical vision. + +The result was a durable balance of power. Although unrest and nationalism continued, the Vienna settlement helped prevent a general European war for decades and defined restoration politics after the French Revolution and Napoleonic era. diff --git a/datasets/domain-bilingual-v2/corpus/macro-cb-policy.md b/datasets/domain-bilingual-v2/corpus/macro-cb-policy.md new file mode 100644 index 0000000..a0b42f1 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-cb-policy.md @@ -0,0 +1,33 @@ +--- +title: Central-bank monetary policy +language: en +source: openai-codex-synthetic +--- + +# Central-bank monetary policy + +Central-bank monetary policy is the set of decisions and operating procedures used to guide short-term financial conditions and keep inflation near a publicly stated objective. Institutions such as the U.S. Federal Reserve, the European Central Bank, and the Bank of England use monetary policy to influence spending, saving, credit conditions, and expectations across the economy. + +## The policy rate + +The **policy rate** is the central bank’s main advertised interest-rate instrument. It is the short-term rate, or target range, that policymakers adjust when they want monetary conditions to become tighter or looser. + +In the United States, the Federal Open Market Committee sets a target range for the **federal funds rate**. In the United Kingdom, the Monetary Policy Committee sets **Bank Rate**. 
In the euro area, the European Central Bank announces rates such as the **deposit facility rate** and main refinancing rate. + +A higher policy rate generally makes overnight and short-term borrowing more expensive, restraining credit and aggregate demand. A lower policy rate generally makes financing cheaper and supports spending. The policy rate also works through expectations: households, firms, and financial markets react not only to today’s rate, but also to the central bank’s guidance about future decisions. + +## Open-market operations + +**Open-market operations** are the buying and selling of financial assets by a central bank to implement its policy stance. In normal operations, this often means transactions in government securities or short-term repurchase agreements. + +For example, the Federal Reserve Bank of New York’s trading desk conducts operations to keep the federal funds rate within the FOMC’s target range. Purchases add liquidity to the banking system, while sales or reverse transactions drain liquidity. The purpose is operational: to align market conditions with the announced policy rate. + +Open-market operations are therefore not just symbolic announcements. They are the practical mechanism by which a central bank transmits its chosen stance into money-market rates and day-to-day financial conditions. + +## Inflation targeting + +**Inflation targeting** is a monetary-policy framework in which the central bank publicly commits to an inflation objective and uses its instruments to achieve it over time. Many advanced-economy central banks use a target near **2 percent**. + +The Federal Reserve states a longer-run goal of 2 percent inflation measured by the personal consumption expenditures price index. The European Central Bank aims for 2 percent inflation over the medium term, measured using the Harmonised Index of Consumer Prices. The Bank of England has a 2 percent CPI inflation target set by the UK government. 
+ +Inflation targeting relies on credibility, forecasts, and communication. If inflation is expected to remain above target, policymakers may raise the policy rate and tighten conditions. If inflation is expected to remain below target, they may lower the policy rate and ease conditions. diff --git a/datasets/domain-bilingual-v2/corpus/macro-hyperinflation.md b/datasets/domain-bilingual-v2/corpus/macro-hyperinflation.md new file mode 100644 index 0000000..59fccce --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-hyperinflation.md @@ -0,0 +1,27 @@ +--- +title: Hyperinflation +language: en +source: openai-codex-synthetic +--- + +# Hyperinflation + +## What makes inflation “hyper” + +Hyperinflation is an extreme breakdown of money’s function as a store of value and unit of account. A common benchmark, associated with economist Phillip Cagan, is inflation of **50 percent or more per month**. At that speed, the general price level does not merely rise; prices become unstable within weeks, days, or even hours. + +Unlike ordinary inflation, hyperinflation is dominated by a collapse of confidence in the currency. Households, firms, and workers try to avoid holding money because its purchasing power is disappearing. Wages are spent immediately, sellers change price lists repeatedly, and long-term contracts are shortened or indexed. The result is a self-reinforcing process: faster spending raises prices, higher prices reduce trust in money, and reduced trust accelerates spending again. + +## Currency collapse and the feedback loop + +Hyperinflation often appears when a government can no longer reliably finance itself through taxes or borrowing and instead covers spending by issuing rapidly depreciating money. The central issue is not just “too much money” in a mechanical sense, but the expectation that money will keep losing value. 
Once that expectation takes hold, people shift into foreign currency, barter, gold, durable goods, or any asset expected to hold value better than the domestic currency. + +Currency collapse is visible in everyday behavior: shops quote prices in U.S. dollars or euros, salaries are renegotiated constantly, and banknotes require ever-larger denominations. A currency reform may remove zeros or replace the unit entirely, but redenomination alone does not end hyperinflation unless confidence in fiscal and monetary stability is restored. + +## Historical episodes + +The classic case is **Weimar Germany in 1923**, when the Papiermark collapsed after wartime debts, reparations pressures, political crisis, and money issuance fed a loss of confidence. Prices rose so quickly that workers were sometimes paid more than once per day. + +**Hungary in 1946** experienced one of the most severe recorded hyperinflations, destroying the pengő and leading to the introduction of the forint. In **Zimbabwe in 2008**, the Zimbabwe dollar became nearly unusable; the government issued enormous-denomination notes, including a 100 trillion dollar bill, before transactions shifted heavily to foreign currencies. + +More recently, **Venezuela** suffered hyperinflation during the late 2010s, with repeated bolívar redenominations and widespread dollar pricing. These episodes show that hyperinflation is not simply high inflation; it is a monetary and political collapse in which the public abandons the currency itself. 
diff --git a/datasets/domain-bilingual-v2/corpus/macro-inflation-causes.md b/datasets/domain-bilingual-v2/corpus/macro-inflation-causes.md new file mode 100644 index 0000000..0edec6a --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-inflation-causes.md @@ -0,0 +1,35 @@ +--- +title: Causes of inflation +language: en +source: openai-codex-synthetic +--- + +# Causes of inflation + +Inflation is a sustained rise in the general price level, not merely a one-time increase in the price of a single good. Economists usually group its causes into two broad categories: demand-pull inflation and cost-push inflation. A monetary framework, the quantity theory of money, explains why persistent inflation is often linked to money growth over time. + +## Demand-pull inflation + +Demand-pull inflation occurs when aggregate demand grows faster than the economy’s capacity to produce goods and services. In simple terms, too much spending is chasing too few goods. Households, firms, government, or foreign buyers may all contribute to stronger demand. + +For example, if consumer spending and business investment rise rapidly while factories, workers, and supply chains are already near capacity, sellers can raise prices without losing many customers. This can happen during a strong recovery, after large fiscal transfers, or when credit and confidence expand quickly. The key feature is that the pressure begins on the demand side: buyers are willing and able to spend more at current prices than the economy can supply. + +Demand-pull inflation is usually associated with rising output in the short run, but if real production cannot keep pace, the result is a higher overall price level. + +## Cost-push inflation + +Cost-push inflation begins on the supply side. It occurs when firms face higher production costs or reduced productive capacity and pass those costs into prices. 
Common sources include higher energy prices, more expensive imported inputs, droughts that reduce food supply, or wage increases not matched by productivity growth. + +A classic example is the 1973–1974 oil shock, when OPEC’s embargo sharply increased petroleum prices for countries such as the United States and the United Kingdom. Because oil was an input into transportation, manufacturing, and heating, the shock raised costs across many sectors. + +Cost-push inflation can be especially difficult because prices rise even as real output may weaken. Unlike demand-pull inflation, the initial impulse is not excessive spending but a negative supply shock. + +## The quantity theory of money + +The quantity theory of money links the money supply to the price level through the equation of exchange: + +**M × V = P × Y** + +Here, **M** is the money stock, **V** is velocity, **P** is the price level, and **Y** is real output. Irving Fisher helped formalize this relationship, and Milton Friedman later summarized its implication with the phrase that inflation is “always and everywhere a monetary phenomenon.” + +The theory does not claim that every short-run price change is caused by money. Velocity can shift, and real output can be disrupted. Its central message is that, over the long run, if money grows persistently faster than real output, the price level tends to rise. diff --git a/datasets/domain-bilingual-v2/corpus/macro-interest-rates.md b/datasets/domain-bilingual-v2/corpus/macro-interest-rates.md new file mode 100644 index 0000000..7dc7d8b --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-interest-rates.md @@ -0,0 +1,32 @@ +--- +title: Interest rates and the yield curve +language: en +source: openai-codex-synthetic +--- + +# Interest rates and the yield curve + +## Nominal and real interest rates + +A **nominal interest rate** is the stated rate on a loan, deposit, Treasury bill, corporate bond, or mortgage contract. If a one-year U.S. 
Treasury bill pays 5%, that 5% is the nominal return in dollars before adjusting for changes in the purchasing power of money. + +A **real interest rate** measures the return after expected inflation is removed. The common approximation is: + +**real interest rate ≈ nominal interest rate − expected inflation** + +For example, if a 10-year Treasury note yields 4.5% and expected inflation is 2.5%, the expected real yield is about 2.0%. More exactly, economists use the Fisher relation: +**1 + nominal rate = (1 + real rate) × (1 + expected inflation).** + +Real rates matter because households, firms, and governments care about purchasing power, not just the number of currency units repaid. A high nominal rate can still imply a low real rate if inflation expectations are also high. The U.S. Treasury’s **Treasury Inflation-Protected Securities (TIPS)** market gives one direct market measure of real yields; the gap between a nominal Treasury yield and a comparable TIPS yield is called the **breakeven inflation rate**. + +## The term structure of interest rates + +The **term structure of interest rates** is the pattern of yields on otherwise comparable debt securities with different maturities at a single point in time. In practice, analysts often examine U.S. Treasury securities such as the 3-month bill, 2-year note, 10-year note, and 30-year bond because they are widely traded benchmarks. + +A **yield curve** is the graph of this term structure, with maturity on the horizontal axis and yield on the vertical axis. A normal upward-sloping curve means longer maturities have higher yields than short maturities. A flat curve means short and long yields are similar. An inverted curve means short-term yields exceed long-term yields, as occurred in parts of the U.S. Treasury curve in 2019 and 2022. + +## Why yield curves differ across maturities + +Three forces are central. 
First, investors form expectations about future short-term rates, which are influenced by inflation expectations, economic growth, and central-bank communication. Second, investors may require a **term premium** for holding long-term bonds, because their prices are more sensitive to interest-rate changes. Third, liquidity and risk characteristics can affect yields, even among high-quality issuers. + +The yield curve is therefore not a money-supply measure or an inflation statistic by itself. It is a market-based map of nominal and real interest rates across time horizons. diff --git a/datasets/domain-bilingual-v2/corpus/macro-money-supply.md b/datasets/domain-bilingual-v2/corpus/macro-money-supply.md new file mode 100644 index 0000000..3ca15cb --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-money-supply.md @@ -0,0 +1,31 @@ +--- +title: The money supply and money creation +language: en +source: openai-codex-synthetic +--- + +# The money supply and money creation + +Money supply measures describe how much money is available in an economy for spending, saving, and payment. In macroeconomics, these measures are called **monetary aggregates**. They are used by institutions such as the **Federal Reserve**, the **Bank of England**, and the **European Central Bank** to track liquidity in the financial system. + +## M0, M1, and M2 + +Definitions vary slightly by country, but the main aggregates are: + +- **M0**, often called the **monetary base**, is the narrowest measure. It includes physical currency in circulation, such as U.S. dollar notes and coins, plus bank reserves held at the central bank. Reserves are money used by commercial banks to settle payments with each other. +- **M1** includes money that can be spent almost immediately. In the United States, this includes currency held by the public, checking account deposits, and other checkable deposits. M1 is therefore broader than M0 because it includes bank deposits used for everyday payments. 
+- **M2** includes all of M1 plus relatively liquid savings instruments. In the U.S. definition, this includes savings deposits, money market deposit accounts, small-denomination time deposits under $100,000, and retail money market mutual fund balances. + +The key difference is liquidity: M0 is base money, M1 is spendable money, and M2 includes money-like assets that can be converted into spendable balances fairly easily. + +## Fractional-reserve banking + +Most money used by households and firms is not printed currency. It is **deposit money** created by commercial banks such as **JPMorgan Chase**, **HSBC**, or **Barclays**. + +Under fractional-reserve banking, a bank keeps only a fraction of its deposits as cash or reserves and uses the rest of its balance sheet to make loans. When a bank approves a loan, it does not usually hand out pre-existing cash. Instead, it records a new asset, the loan, and a new liability, the borrower’s deposit. That new deposit is part of the money supply. + +For example, if a bank grants a $10,000 business loan, the borrower’s account may increase by $10,000. The economy now has an additional $10,000 of deposit money. When the borrower repays the principal, that deposit money is extinguished. + +## Limits on money creation + +Banks cannot create unlimited money. They need reserves for payment settlement, capital to absorb losses, and borrowers who are creditworthy. Regulations such as **Basel III** liquidity and capital standards also constrain lending. In the United States, reserve requirements on many deposits were set to 0% in 2020, but banks still hold reserves for settlement and risk management. 
diff --git a/datasets/domain-bilingual-v2/corpus/macro-phillips.md b/datasets/domain-bilingual-v2/corpus/macro-phillips.md new file mode 100644 index 0000000..39d46dd --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-phillips.md @@ -0,0 +1,31 @@ +--- +title: The Phillips curve +language: en +source: openai-codex-synthetic +--- + +# The Phillips curve + +## Short-run inflation–unemployment tradeoff + +The Phillips curve describes an inverse short-run relationship between unemployment and inflation. In 1958, New Zealand economist A. W. Phillips studied British data from 1861 to 1957 and found that periods of low unemployment tended to coincide with faster wage growth. Paul Samuelson and Robert Solow later applied the idea to U.S. price inflation in 1960, making it central to macroeconomic policy debates. + +The basic intuition is that when unemployment is low, workers have more bargaining power and firms face stronger demand for labor. Wages and prices may rise faster. When unemployment is high, weak labor markets put downward pressure on wage growth and inflation. In this short-run setting, policymakers appeared to face a menu: accept somewhat higher inflation to reduce unemployment, or accept higher unemployment to reduce inflation. + +A simple expectations-augmented form is: + +**inflation = expected inflation − responsiveness × unemployment gap + shock** + +where the unemployment gap is the difference between actual unemployment and the economy’s sustainable rate, often called the **natural rate of unemployment** or **NAIRU**. + +## Expectations and the breakdown + +The original Phillips curve broke down as a stable policy tradeoff in the late 1960s and 1970s. Milton Friedman and Edmund Phelps argued that workers and firms care about expected inflation, not just current labor-market conditions. If policymakers repeatedly push unemployment below its sustainable rate, people revise inflation expectations upward. 
The result is not permanently lower unemployment, but accelerating inflation. + +The most famous breakdown occurred during the 1970s, when the United States and other advanced economies experienced **stagflation**: high inflation together with high unemployment. Oil-price shocks in 1973 and 1979 worsened the problem, but the deeper lesson was that the Phillips curve was not a fixed menu of choices. Once inflation expectations adjusted, the apparent tradeoff shifted. + +## Policy interpretation + +Modern macroeconomics treats the Phillips curve as a short-run relationship, not a long-run bargain. In the long run, unemployment tends to return toward its natural rate, so the long-run Phillips curve is often drawn as vertical. Inflation may change, but permanently higher inflation does not buy permanently lower unemployment. + +The Phillips curve remains useful for analyzing business cycles, wage pressure, spare capacity, and the credibility of anti-inflation policy. Its key warning is that the inflation–unemployment tradeoff can exist temporarily, but it breaks down when expectations become unanchored or when major supply disturbances move inflation and unemployment in the same direction. diff --git a/datasets/domain-bilingual-v2/corpus/macro-quantitative-easing.md b/datasets/domain-bilingual-v2/corpus/macro-quantitative-easing.md new file mode 100644 index 0000000..481ac48 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/macro-quantitative-easing.md @@ -0,0 +1,29 @@ +--- +title: Quantitative easing +language: en +source: openai-codex-synthetic +--- + +# Quantitative easing + +## Purpose at the zero lower bound + +Quantitative easing (QE) is an unconventional monetary policy used when a central bank’s short-term policy rate is at, or very near, the zero lower bound and cannot be cut enough to support demand. 
Instead of relying mainly on small changes in the overnight rate, the central bank conducts large-scale asset purchases to influence broader financial conditions. + +The modern examples most often cited are the Bank of Japan’s purchases beginning in 2001, the Federal Reserve’s programs after the 2008 financial crisis, the Bank of England’s gilt purchases starting in 2009, and the European Central Bank’s public-sector purchase program announced under Mario Draghi in 2015. Similar tools were expanded again during the COVID-19 market stress of March 2020. + +## What central banks buy + +Under QE, a central bank creates bank reserves and uses them to buy financial assets from the private sector or financial institutions. The assets are typically safe, liquid securities: U.S. Treasury securities, agency mortgage-backed securities, UK gilts, Japanese government bonds, or euro-area sovereign bonds. The purchases are “large-scale” because they expand the central bank balance sheet by hundreds of billions or trillions of dollars, pounds, yen, or euros. + +QE differs from routine open-market operations because of its scale, persistence, and target. Ordinary operations keep the policy rate close to the central bank’s chosen level. QE aims to reduce longer-term borrowing costs and ease credit conditions after the policy rate has already reached its lower bound. + +## Transmission channels and limits + +QE works through several channels. By removing longer-maturity assets from the market, it can lower term premiums and long-term yields. By committing to a larger balance sheet, it can signal that monetary policy will remain accommodative for some time. By improving liquidity in stressed markets, it can support lending and reduce risk spreads. + +QE is not the same as directly giving cash to households or permanently financing government spending. It changes the composition of private-sector portfolios and increases reserve balances in the banking system. 
Its effect on inflation and output depends on expectations, financial-market conditions, bank and borrower behavior, and the credibility of the central bank. + +## Risks and exit + +Major concerns include distorted asset prices, reduced market functioning, political pressure on central-bank independence, and losses when interest rates rise. Exiting QE usually means slowing purchases, stopping reinvestments, or allowing securities to mature. The central point is that quantitative easing is large-scale asset buying designed for circumstances in which conventional rate cuts are constrained by the zero lower bound. diff --git a/datasets/domain-bilingual-v2/corpus/tang-anshi.md b/datasets/domain-bilingual-v2/corpus/tang-anshi.md new file mode 100644 index 0000000..932987e --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-anshi.md @@ -0,0 +1,19 @@ +--- +title: 安史之乱 +language: zh +source: openai-codex-synthetic +--- + +# 安史之乱 + +## 起因:盛唐制度的裂缝 + +安史之乱指天宝十四载(755)至广德元年(763)安禄山、史思明父子相继发动的叛乱。唐玄宗后期,中央对地方节度使依赖加深,范阳、平卢、河东等边镇兵多财重,兼领三镇的安禄山势力坐大。朝廷内部李林甫、杨国忠相继专权,任人失衡,府兵、赋役和土地秩序的松弛削弱了中央调度能力;玄宗宠信杨贵妃一族,又与安禄山、杨国忠矛盾激化,成为叛乱导火索。 + +## 经过:两京沦陷与平叛 + +755 年,安禄山以“讨杨国忠”为名自范阳起兵,南下河北、河南,攻陷洛阳,建立“大燕”。756 年潼关失守,长安陷落;玄宗西逃蜀地,马嵬驿兵变中杨国忠被杀,杨贵妃死。太子李亨在灵武即位,是为唐肃宗。此后郭子仪、李光弼等组织平叛,并借回纥兵力收复长安、洛阳。安禄山被其子安庆绪所杀,史思明复起后又称帝,最终也为其子史朝义所杀。763 年,史朝义败死,叛乱结束。 + +## 影响:由盛转衰 + +八年战乱使北方户口流亡、田园荒芜,国家财政和赋役基础遭到重创。朝廷为平叛承认大量降将和地方军镇,节度使拥兵自重,藩镇割据成为中晚唐政治顽疾。皇权威望下降,宦官、军将与地方势力共同牵制中央;唐朝虽恢复名义统一,却失去开元以来的强盛格局,安史之乱遂被视为唐代由盛转衰的关键转折。 diff --git a/datasets/domain-bilingual-v2/corpus/tang-founding.md b/datasets/domain-bilingual-v2/corpus/tang-founding.md new file mode 100644 index 0000000..9717bc4 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-founding.md @@ -0,0 +1,19 @@ +--- +title: 唐朝的建立与统一 +language: zh +source: openai-codex-synthetic +--- + +# 唐朝的建立与统一 + +## 太原起兵与 618 年建唐 + +隋末民变四起、东都与江都政局失控,李渊以太原留守身份掌握晋阳兵力。617 年,他在刘文静、裴寂等支持下起兵,李建成、李世民分领军队南下,攻入关中,占据长安,并拥立代王杨侑为帝,以取得政治名义。618 年,隋炀帝在江都被弑后,李渊受禅称帝,国号唐,建元武德,都长安,是为唐高祖。 + +## 
唐初统一战争 + +建唐并不等于全国立即统一。武德年间,唐廷先后面对薛举、薛仁杲据陇右,刘武周据并州,王世充据洛阳,窦建德据河北,萧铣据江陵等割据势力。李世民在浅水原、虎牢等战役中击败强敌,621 年擒窦建德、降王世充,奠定中原归唐。随后唐军继续平定江南、淮南等地,至武德后期,全国主要割据力量基本被纳入中央控制。 + +## 关陇集团的政治基础 + +唐初政权的核心基础来自西魏、北周以来形成的关陇军事贵族集团。李氏家族本属陇西高门,又长期参与北周、隋朝政治,与独孤氏、窦氏、长孙氏等门阀通婚结盟。关中地理形胜、长安政治象征、府兵将领网络与旧贵族声望,共同支撑了李渊起兵的合法性和动员能力,也使唐初中央政权具有浓厚的关陇色彩。 diff --git a/datasets/domain-bilingual-v2/corpus/tang-juntian.md b/datasets/domain-bilingual-v2/corpus/tang-juntian.md new file mode 100644 index 0000000..3c75817 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-juntian.md @@ -0,0 +1,19 @@ +--- +title: 唐朝的均田制 +language: zh +source: openai-codex-synthetic +--- + +# 唐朝的均田制 + +## 授田对象 + +唐朝均田制承袭北魏、隋的律令传统,在《唐令·田令》中以州县户籍为本。授田对象首先是编户民中十八岁以上的男口,即中男、丁男;成年妇女通常不另受田,寡妻妾可按减额给田。老男、笃疾、废疾者也给少量口分田。工商户及官户、杂户、奴婢、部曲等,因身份和附籍关系减半或另有额限;不在籍的人口不得授田。 + +## 口分田与永业田 + +一丁标准为一顷,即唐亩百亩:口分田八十亩,永业田二十亩。口分田以“口”为据,不能自由买卖,身死、年老或迁徙时原则上交还州县,再授他人。永业田归户长期占有,可传给子孙,多用于桑麻、果木等经营,买卖亦受令文限制。寡妻妾三十亩、老疾四十亩等减额授田,通常不构成完整的八十与二十结构。 + +## 土地分配原理 + +均田制并非平均分割天下田土,而是律令国家把无主地、公田和可籍田按丁口、年龄、身份定额配置。宽乡地多可足额,狭乡地少则折减;口分田的授还循环形成可再分配的土地池。其目的在于限制兼并、维持小农耕作基础,并使户籍、土地与赋役管理相互衔接。 diff --git a/datasets/domain-bilingual-v2/corpus/tang-keju.md b/datasets/domain-bilingual-v2/corpus/tang-keju.md new file mode 100644 index 0000000..f5ef2ab --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-keju.md @@ -0,0 +1,19 @@ +--- +title: 唐朝的科举制 +language: zh +source: openai-codex-synthetic +--- + +# 唐朝的科举制 + +## 常科与制科 + +唐朝科举是国家选官制度的重要组成部分,服务于中央集权与官僚治理。考试大体分为常科与制科。常科为定期举行的贡举,由州县荐送士人,尚书省礼部主持考试,及第后取得“出身”,仍须经过吏部铨选方能授官。制科则由皇帝临时下诏设置,带有应时取才性质,如“贤良方正”“直言极谏”等名目,常以策问考察政务见识,可用于破格任用或提拔。 + +## 进士与明经 + +唐代科目众多,进士科与明经科最具代表性。明经重在儒家经典,常考帖经、墨义,强调记诵与经义熟习,录取相对较多。进士科声望最高,重诗赋、策论和文辞才识,名额较少,竞争激烈,及第者常被视为清望之选。唐太宗以后科举逐渐制度化,武则天时期扩大取士规模并重视殿廷策问,进一步提高了科举在政治生活中的地位。 + +## 对门阀士族的冲击 + +科举并未立刻消除家世、荐举和社会资源的影响,但它改变了魏晋以来门阀士族凭谱牒与门第垄断官位的局面。寒门庶族士人可凭经学、文章和考试成绩进入仕途,地方人才也能通过贡举进入中央官僚体系。随着进士声望上升,政治评价逐渐从“出身高低”转向“文才与政见”,门阀士族的世袭优势受到持续冲击,唐代官僚结构因此更具流动性。 diff --git a/datasets/domain-bilingual-v2/corpus/tang-liangshui.md 
b/datasets/domain-bilingual-v2/corpus/tang-liangshui.md new file mode 100644 index 0000000..a9ae2ee --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-liangshui.md @@ -0,0 +1,21 @@ +--- +title: 两税法 +language: zh +source: openai-codex-synthetic +--- + +# 两税法 + +## 改革背景 + +两税法是唐代中后期财政制度的关键转折。建中元年(780 年),唐德宗任用宰相杨炎推行新法,目的在于整顿户籍散乱、赋役名目繁杂和国家财政不足的局面。此前以人丁和旧籍为中心的赋役办法已难维持,中央需要一种更能反映现实财富与土地占有状况的征税制度。 + +## 主要内容 + +两税法的核心,是以资产和土地为主要征税依据,取代原先的租庸调。新法规定“户无主客,以见居为簿”,按现居地登记纳税;“人无丁中,以贫富为差”,依据户等、资产多少和田亩数量核定税额。国家先估算一年财政所需,再分摊到各地征收,体现“量出制入”的原则。 + +征收时间也被制度化为一年两次:夏税和秋税。夏税通常限六月以前完成,秋税限十一月以前完成,因此称“两税”。这一安排减少了旧制中多种徭役、杂税并行的混乱,使财政收入更集中、更便于核算。 + +## 历史影响 + +两税法标志着唐朝赋税制度从按丁征发转向按财产和土地征收,是中国古代税制史上的重大改革。它强化了政府对现实经济资源的掌握,也承认土地兼并和户口流动后的社会变化。此后,土地、资产与货币化征收在国家财政中的地位上升,唐朝后期及后世赋税制度均深受其影响。 diff --git a/datasets/domain-bilingual-v2/corpus/tang-xiyu.md b/datasets/domain-bilingual-v2/corpus/tang-xiyu.md new file mode 100644 index 0000000..b651678 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-xiyu.md @@ -0,0 +1,19 @@ +--- +title: 唐朝与西域的关系 +language: zh +source: openai-codex-synthetic +--- + +# 唐朝与西域的关系 + +## 都护府与安西四镇 + +唐朝经营西域,核心在于以都护府统辖军镇、羁縻州和交通要道。安西都护府先后治西州、高昌、龟兹等地,是朝廷控制天山南路的枢纽;北庭都护府设于庭州,主要镇抚天山北路。安西四镇通常指龟兹、于阗、疏勒、碎叶,早期名目曾有调整,焉耆也一度居其列。四镇既是边防据点,也是管理胡商、使节、僧侣往来的行政网络。 + +## 丝绸之路的经营 + +西域政策服务于长安—河西—西州—龟兹—葱岭一线的丝绸之路经营。唐朝通过驿站、烽燧、屯田、镇兵和互市制度,保障商旅安全,维持贡使往来,并使粟特商人、吐火罗僧侣、波斯货物能够进入内地市场。西州、庭州、龟兹等地因此兼具军事、财政和交通意义。 + +## 与突厥、回纥的关系 + +唐在西域首先面对突厥势力,尤其是西突厥诸部对绿洲城邦的影响。朝廷常以册封、征讨、设置羁縻府州等方式分化其部众,并借都护府维持对天山南北的控制。回纥兴起后,唐与其既有同盟、册封和绢马贸易,也有围绕北庭、河西通道与商路利益的摩擦。西域关系因此不是单纯的战争史,而是军镇制度、交通经营与草原政治共同作用的结果。 diff --git a/datasets/domain-bilingual-v2/corpus/tang-zuyongdiao.md b/datasets/domain-bilingual-v2/corpus/tang-zuyongdiao.md new file mode 100644 index 0000000..12d1464 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tang-zuyongdiao.md @@ -0,0 +1,19 @@ +--- +title: 租庸调制 +language: zh +source: openai-codex-synthetic +--- + +# 租庸调制 + +## 制度基础 + 
+租庸调制是唐代前期以均田制为土地前提、以户籍和丁男为征收单位的赋役制度。武德年间确立相关法令,贞观时期继续推行,国家通过均田授受、户籍登记和计丁征课,把土地占有、农业生产与财政役使连接起来。其核心不是按田亩多少直接征税,而是依“丁”承担固定的租、庸、调义务。 + +## 租、庸、调的内容 + +“租”主要纳粟,通常每丁岁纳粟二石,是国家取得粮食的重要来源。“调”按地方物产征收绢、绵或布、麻,常见标准为绢二丈、绵三两,或折纳布帛。“庸”本指正役,即成年丁男每年为官府服役一定日数;若不实际服役,则可纳绢布代役,故称“庸”。这种以实物和劳役并行的设计,使唐廷既能获得仓储粮粟、织物贡赋,也能调动人力修筑、运输和官府杂役。 + +## 运行意义与局限 + +租庸调制依赖均田制、编户齐民和基层里坊乡里管理。户籍清楚、土地大体均平时,它有利于固定财政收入,减轻临时摊派,并强化中央对民户的控制。唐代前期政治秩序稳定、农业恢复,与这一制度密切相关。但随着人口流动、逃户增加、土地兼并和户籍失实,按丁征收的基础逐渐松动,租、庸、调难以准确落实,财政压力也随之加重。 diff --git a/datasets/domain-bilingual-v2/corpus/tcm-bianzheng.md b/datasets/domain-bilingual-v2/corpus/tcm-bianzheng.md new file mode 100644 index 0000000..53e668a --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-bianzheng.md @@ -0,0 +1,17 @@ +--- +title: 辨证论治 +language: zh +source: openai-codex-synthetic +--- + +# 辨证论治 + +辨证论治是中医诊疗的核心方法:先据望、闻、问、切所得,分析病位、病性与邪正关系,形成“证”;再依法立方施治。同一病名可因证不同而治法不同,同一证候亦可见于不同疾病,故重在“证”的判断。 + +## 八纲辨证 + +八纲即阴、阳、表、里、寒、热、虚、实。《黄帝内经》以来以阴阳为总纲:阳证多见发热、躁动、脉数等偏盛偏亢表现;阴证多见畏寒、沉静、脉迟等偏衰偏寒表现,但不可机械等同。表里辨病位:表证病在肌表,常见恶寒发热、头身疼痛;里证病在内,常见胸腹、二便、神志等变化。寒热辨病性:寒证见冷痛、喜暖、舌淡;热证见口渴、烦躁、舌红。虚实辨邪正盛衰:虚证为正气不足,如倦怠、自汗、脉弱;实证为邪气壅盛,如胀满拒按、痰涎壅盛、脉实。 + +## 论治原则 + +论治须与八纲相应。《素问·至真要大论》提出“寒者热之,热者寒之”;临床又有“虚则补之,实则泻之”“表者解之,里者治之”。如风寒表实宜辛温解表,里热实结宜清热泻下,气虚宜补气扶正,阴虚内热宜滋阴清热。病情复杂时需分清主次,遵循急则治标、缓则治本、标本兼顾;若表里同病、寒热错杂、虚实夹杂,则应解表与清里、温化与清泄、扶正与祛邪配合使用。张仲景《伤寒论》的六经治法虽另有体系,但其“随证治之”的精神正体现辨证论治。 diff --git a/datasets/domain-bilingual-v2/corpus/tcm-jingluo.md b/datasets/domain-bilingual-v2/corpus/tcm-jingluo.md new file mode 100644 index 0000000..5ea4bb5 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-jingluo.md @@ -0,0 +1,19 @@ +--- +title: 经络学说 +language: zh +source: openai-codex-synthetic +--- + +# 经络学说 + +## 十二经脉的循行 + +经络学说以《灵枢·经脉》为重要依据,说明气血在人体内外上下的运行通路。十二经脉包括手太阴肺经、手阳明大肠经、足阳明胃经、足太阴脾经、手少阴心经、手太阳小肠经、足太阳膀胱经、足少阴肾经、手厥阴心包经、手少阳三焦经、足少阳胆经、足厥阴肝经。其循行规律为:手三阴从胸走手,手三阳从手走头,足三阳从头走足,足三阴从足走腹胸。气血依次相贯,形成“肺—大肠—胃—脾—心—小肠—膀胱—肾—心包—三焦—胆—肝—复入肺”的循环。 + +## 奇经八脉与统摄作用 + 
+奇经八脉为督脉、任脉、冲脉、带脉、阴跷脉、阳跷脉、阴维脉、阳维脉。督脉行背部正中,总督一身之阳经;任脉行胸腹正中,联系诸阴经;冲脉起于胞中,循腹上行,为十二经脉气血会聚之处;带脉环腰一周,约束纵行经脉。阴跷、阳跷起于足跟附近,分别循内踝、外踝上行;阴维、阳维联络阴经与阳经,使全身经气相维相系。 + +## 气血运行通路 + +十二经脉为经络系统主干,奇经八脉为调节与蓄溢通道。经气通过经脉、络脉、经筋和皮部,由内达外、由表入里,维持四肢百骸与头面躯干的联系。临床认识经络,不在于单列部位,而在于把循行线、交接点和气血先后贯通起来。 diff --git a/datasets/domain-bilingual-v2/corpus/tcm-qixue.md b/datasets/domain-bilingual-v2/corpus/tcm-qixue.md new file mode 100644 index 0000000..c843f2f --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-qixue.md @@ -0,0 +1,19 @@ +--- +title: 气血津液 +language: zh +source: openai-codex-synthetic +--- + +# 气血津液 + +## 生成来源 + +气、血、津液是中医说明人体生命活动的基本物质。《灵枢·决气》将气、血、津、液分论,强调其皆与水谷精微有关。气的生成以肾中精气为根,以脾胃化生的水谷精气为源,并赖肺吸入清气相合,形成元气、宗气、营气、卫气等。血由水谷精微化生,需脾胃运化、心主血脉、肝藏血共同维持。津液来源于饮食水谷,经脾胃吸收转输,肺宣发肃降,肾蒸腾气化,三焦通调水道而布散全身。 + +## 主要功能 + +气的功能概括为推动、温煦、防御、固摄、气化:能推动血行津布,温养脏腑形体,护卫肌表,固摄血液、汗液、尿液等,并参与精微物质转化。血以濡养和化神为主,充养皮肉筋骨,维持面色、爪甲、目睛与神志活动。津液包括较清稀的“津”和较稠厚的“液”:津多布于肌表孔窍,润泽皮毛并化为汗;液多灌注关节、脑髓、骨节,起滋润濡养作用。 + +## 相互关系 + +气与血关系密切,常概括为“气为血之帅,血为气之母”。气能生血、行血、摄血;血能载气、养气。若气虚,则可见血生不足、血行无力或出血不止;若血虚,则气失所依。气与津液相互为用:气能生津、行津、摄津,津液也能载气,津亏可使气少,暴吐暴泻、大汗可致“气随津脱”。血与津液同源于水谷精微,故称“津血同源”;失血可伤津,津亏亦可使血少而运行不畅。三者协调,则脏腑得养、形神相安;失调则形成气虚、血瘀、津亏等病机。 diff --git a/datasets/domain-bilingual-v2/corpus/tcm-sizhen.md b/datasets/domain-bilingual-v2/corpus/tcm-sizhen.md new file mode 100644 index 0000000..0c56bb7 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-sizhen.md @@ -0,0 +1,19 @@ +--- +title: 四诊 +language: zh +source: openai-codex-synthetic +--- + +# 四诊 + +## 含义与原则 + +四诊是中医诊察疾病、搜集病情资料的基本方法,见于《黄帝内经》以来的医籍,后世张仲景、李时珍等均重视其临床价值。四诊即望、闻、问、切,核心在“四诊合参”:不凭单一征象定论,而以外在表现、患者主诉、声音气味与按脉触诊相互校验,综合判断正邪盛衰、病位深浅与病势变化,为辨证论治提供依据。 + +## 四诊内容 + +望诊重在观察神色、形态、皮肤、头面五官及舌象,如舌质、舌苔、润燥、胖瘦等。闻诊包括听声音与嗅气味,察语声、咳嗽、呼吸、呃逆,以及口气、汗液、二便和分泌物气味。问诊须追问寒热、汗出、疼痛、饮食口味、睡眠、二便、既往病史、用药史;妇人兼问经带胎产,小儿兼问喂养与起病经过,临床常参照《十问歌》次第询问。切诊以寸口脉为主,三指分候寸、关、尺,察浮沉、迟数、强弱,并可按胸腹、肌肤、手足温度。 + +## 合参方法 + +四诊合参通常先整体望察,再定向询问,以闻诊、切诊复核。记录时应区分“医者所见”“患者所述”和“按切所得”。若舌象、脉象、症状一致,可作为主要诊断依据;若相互矛盾,应复查病程、诱因、饮食及服药情况,避免舍证从脉或舍脉从证。四诊资料互相印证,方能形成可靠的中医诊断判断。 
diff --git a/datasets/domain-bilingual-v2/corpus/tcm-wuxing.md b/datasets/domain-bilingual-v2/corpus/tcm-wuxing.md new file mode 100644 index 0000000..18cd824 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-wuxing.md @@ -0,0 +1,19 @@ +--- +title: 五行学说 +language: zh +source: openai-codex-synthetic +--- + +# 五行学说 + +## 相生相克 + +五行学说见于《尚书·洪范》及《黄帝内经》,以木、火、土、金、水概括人体生理、病理与自然变化。相生次序为:木生火、火生土、土生金、金生水、水生木,表示资生、促进与接续;相克次序为:木克土、土克水、水克火、火克金、金克木,表示制约、平衡与防止偏亢。生中有克、克中有生,称为制化;制约太过为相乘,反向欺侮为相侮,如木乘土、土侮木,常用于说明病势传变。 + +## 与五脏的配属 + +中医以五脏为核心建立配属:肝属木,心属火,脾属土,肺属金,肾属水。由此可推及五志、五色、五味等线索,如怒多及肝,苦入心,甘入脾,辛入肺,咸入肾。配属不是机械对应,而是用来标明脏与脏之间的资助、约束和易感方向。 + +## 在辨证中的运用 + +辨证时,五行帮助分析本脏病与他脏相及。母病及子如肾水不足不能涵养肝木,可见肝阳偏亢;子病犯母如心火亢盛耗伤肝阴。相乘相侮可解释“肝木乘脾土”所见胁胀、食少、泄泻,或“肝火犯肺”所见咳嗽、胸胁不舒。治法据此立意:虚则补其母,如培土生金、滋水涵木;实则泻其子或抑强扶弱,如抑木扶土、佐金平木。五行辨证强调关系网络,为论治提供方向。 diff --git a/datasets/domain-bilingual-v2/corpus/tcm-yinyang.md b/datasets/domain-bilingual-v2/corpus/tcm-yinyang.md new file mode 100644 index 0000000..d12fc03 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-yinyang.md @@ -0,0 +1,19 @@ +--- +title: 阴阳学说 +language: zh +source: openai-codex-synthetic +--- + +# 阴阳学说 + +## 基本含义 + +阴阳学说是中医基础理论的总纲,用以说明人体生命活动与疾病变化的根本规律。《黄帝内经·素问·阴阳应象大论》称:“阴阳者,天地之道也。”在医学中,阳多指运动、温煦、升发、外向、功能;阴多指静守、滋润、下降、内在、物质。二者不是固定物名,而是对相关事物和现象属性的概括。 + +## 对立、互根与消长 + +阴阳首先表现为对立制约,如寒与热、动与静、上与下相互区别并相互约束。其次为互根互用,即“阴在内,阳之守也;阳在外,阴之使也”,没有阴的濡养,阳无所附;没有阳的推动,阴不能化生和运行。阴阳又处于消长变化中:昼夜、四时、人体兴奋与休息,均体现阴消阳长或阳消阴长。正常消长维持动态平衡,过度偏盛偏衰则可成病。 + +## 转化及生理病理应用 + +阴阳转化指在一定条件下,对立双方可向其相反方面变化,《内经》所谓“重阴必阳,重阳必阴”即此意。生理上,温煦与滋润、推动与收藏相互协调,形成“阴平阳秘,精神乃治”。病理上,阳盛可见热象,阴盛可见寒象;阴虚则阳相对偏亢,阳虚则阴相对偏盛。治疗应用重在审察阴阳偏颇,遵循“损其有余,补其不足”,以扶阳、滋阴、清热、散寒等方法恢复平衡。 diff --git a/datasets/domain-bilingual-v2/corpus/tcm-zangfu.md b/datasets/domain-bilingual-v2/corpus/tcm-zangfu.md new file mode 100644 index 0000000..bcd6b61 --- /dev/null +++ b/datasets/domain-bilingual-v2/corpus/tcm-zangfu.md @@ -0,0 +1,23 @@ +--- +title: 脏腑学说(藏象) +language: zh +source: openai-codex-synthetic +--- + +# 脏腑学说(藏象) + +## 总则 + 
+脏腑学说又称藏象,见于《黄帝内经》《素问·灵兰秘典论》。“藏”指内在脏器及其精气活动,“象”指外在征象。它以功能系统说明人体生理:五脏藏精气而不泻,六腑传化物而不藏。临床分析病位时,常据脏腑功能盛衰及相互联系,而非单纯解剖位置。 + +## 五脏功能 + +心主血脉、藏神,统摄血行和神志;肺主气司呼吸,宣发肃降,通调水道,外合皮毛;脾主运化水谷、升清、统血,为气血生化之源;肝主疏泄、调畅气机,又藏血以濡养筋目;肾藏精,主生长发育与生殖,主水、纳气,关联骨髓耳齿及二便。 + +## 六腑功能 + +胆贮排胆汁,助消化并主决断;胃受纳、腐熟水谷,为水谷之海;小肠受盛化物、泌别清浊;大肠传导糟粕、排出大便;膀胱贮尿、排尿,赖气化运行;三焦主持上中下三部气机与水道,使脏腑之间通行布散。 + +## 表里配属 + +五脏六腑一阴一阳相配:心与小肠、肺与大肠、脾与胃、肝与胆、肾与膀胱相表里,另有心包与三焦相表里。表里关系说明功能协同与病变影响,如脾失运化可及胃纳,肺失宣降可累大肠传导;此为辨识脏腑病位的要点。 diff --git a/datasets/domain-bilingual-v2/dataset.yaml b/datasets/domain-bilingual-v2/dataset.yaml new file mode 100644 index 0000000..b0bd02b --- /dev/null +++ b/datasets/domain-bilingual-v2/dataset.yaml @@ -0,0 +1,12 @@ +name: domain-bilingual-v2 +description: > + Discriminative bilingual (28 zh / 28 en) domain retrieval set: 8 + intra-cluster-confusable topic clusters (~56 docs) where a query's gold doc + must out-rank ~6 near-neighbour siblings, so ndcg_at_10 / mrr carry real + ranking signal — unlike domain-bilingual-v1, whose 24 mutually-distinct topics + saturate vector/hybrid at 1.0. The engine gates one blended flat-threshold set; + the zh/en split is recorded in reports/BASELINES.md (see + docs/dikw-eval-plan.md §2.4). See docs/phase1-domain-bilingual-v2-design.md. + Thresholds are PLACEHOLDER (empty): calibrated at observed - margin from the + single post-#249/#250 real-vector run (parallel-plan Workstream D). +thresholds: {} diff --git a/datasets/domain-bilingual-v2/queries.yaml b/datasets/domain-bilingual-v2/queries.yaml new file mode 100644 index 0000000..4efb3a6 --- /dev/null +++ b/datasets/domain-bilingual-v2/queries.yaml @@ -0,0 +1,172 @@ +# DRAFT — codex-generated confusable queries; gold targets need human +# verification (uniqueness, sibling-wrongness, dedup, balance). See +# docs/phase1-domain-bilingual-v2-design.md §4. +queries: + - id: zh-tang-founding + q: "唐朝建立与统一过程中,李渊太原起兵、618年建唐、唐初统一战争同关陇集团的政治基础之间有什么关系?" 
+ expect_any: [tang-founding] + - id: zh-tang-juntian + q: "唐朝均田制中授田对象是谁,口分田与永业田如何划分,其土地分配原理怎样成为租庸调等赋役制度的基础?" + expect_any: [tang-juntian] + - id: zh-tang-zuyongdiao + q: "唐代以均田制为基础、按丁征收的租庸调制中,租、庸、调分别对应粟、代役和绢布的哪些赋役内容?" + expect_any: [tang-zuyongdiao] + - id: zh-tang-keju + q: "唐朝选官制度里,常科与制科、进士科与明经科各有什么特点,科举为何会冲击门阀士族的政治基础?" + expect_any: [tang-keju] + - id: zh-tang-xiyu + q: "唐朝经营西域时,安西四镇和都护府如何维护丝绸之路,并与突厥、回纥关系相联系?" + expect_any: [tang-xiyu] + - id: zh-tang-anshi + q: "755—763年安禄山、史思明叛乱的起因和经过是什么,为什么说安史之乱使唐朝由盛转衰?" + expect_any: [tang-anshi] + - id: zh-tang-liangshui + q: "780年杨炎推行的两税法如何以资产和土地为征税依据、实行夏秋两征,并取代以均田为基础的租庸调制?" + expect_any: [tang-liangshui] + - id: zh-china-money-jiaozi + q: "北宋四川交子为什么被称为世界最早纸币,它产生时的商业背景和由私商到官府的发行兑换机制是什么?" + expect_any: [china-money-jiaozi] + - id: zh-china-money-song-inflation + q: "北宋交子钱引因超发造成通货膨胀的过程是什么,铜钱与铁钱并行又怎样加剧货币问题?" + expect_any: [china-money-song-inflation] + - id: zh-china-money-silver + q: "明清时期海外白银流入如何推动白银成为主要通货,并促成中国银本位或白银货币化的形成?" + expect_any: [china-money-silver] + - id: zh-china-money-piaohao + q: "晋商票号的异地汇兑业务如何运作,它与钱庄的存款、放款等传统金融功能有何区别?" + expect_any: [china-money-piaohao] + - id: zh-china-money-yanyin + q: "食盐专卖制度下盐引或盐钞为什么能作为有价凭证流通,并怎样被政府用作财政筹款工具?" + expect_any: [china-money-yanyin] + - id: zh-china-money-qingmiao + q: "王安石青苗法如何通过低息官贷在青黄不接时发放贷款,以抑制兼并、增加财政收入,并为何引发争议?" + expect_any: [china-money-qingmiao] + - id: zh-china-money-yitiaobian + q: "张居正推行的一条鞭法怎样把赋役合并并折银征收,它与明代财政改革和白银使用有什么关系?" + expect_any: [china-money-yitiaobian] + - id: zh-tcm-yinyang + q: "在解释五脏气血的生理病理变化时,如何运用阴阳的对立制约、互根互用、消长平衡和相互转化?" + expect_any: [tcm-yinyang] + - id: zh-tcm-wuxing + q: "将五脏配属木火土金水后,怎样依据五行相生相克分析脏腑病变传变并用于辨证?" + expect_any: [tcm-wuxing] + - id: zh-tcm-qixue + q: "气、血、津液分别怎样生成并发挥推动、濡养、滋润等功能,它们之间又如何相互化生和影响运行?" + expect_any: [tcm-qixue] + - id: zh-tcm-jingluo + q: "十二经脉与奇经八脉的循行路线如何构成气血运行通路,并联系脏腑内外表里?" + expect_any: [tcm-jingluo] + - id: zh-tcm-zangfu + q: "五脏六腑各自的生理功能是什么,脏与腑之间如何形成表里配合关系?" 
+ expect_any: [tcm-zangfu] + - id: zh-tcm-sizhen + q: "诊断同一病证时,如何把望、闻、问、切所得的症状体征进行四诊合参,而不是只凭阴阳寒热下结论?" + expect_any: [tcm-sizhen] + - id: zh-tcm-bianzheng + q: "如何用阴阳、表里、寒热、虚实八纲归纳望闻问切资料,并据此确定相应的论治原则?" + expect_any: [tcm-bianzheng] + - id: zh-china-lit-hongloumeng + q: "哪部中国古典小说通过贾府兴衰和宝黛爱情,塑造贾宝玉、林黛玉、薛宝钗等主要人物并揭示封建家族衰败主题?" + expect_any: [china-lit-hongloumeng] + - id: zh-china-lit-sanguo + q: "哪部中国古典小说描写东汉末年群雄割据,并围绕曹操、刘备、诸葛亮等人物与赤壁之战、官渡之战等重大战役展开?" + expect_any: [china-lit-sanguo] + - id: zh-china-lit-shuihu + q: "哪部中国古典小说以梁山好汉聚义为核心,写宋江、林冲、武松等人物在官逼民反主题下反抗官府的故事?" + expect_any: [china-lit-shuihu] + - id: zh-china-lit-xiyouji + q: "哪部中国古典小说讲述唐僧师徒四人西天取经,并通过孙悟空、猪八戒、沙僧等形象表现神魔冒险与修行主题?" + expect_any: [china-lit-xiyouji] + - id: zh-china-lit-tangshi + q: "在中国古典诗词体裁中,哪一类以律诗、绝句为重要形式,并以李白、杜甫等诗人的不同风格为代表?" + expect_any: [china-lit-tangshi] + - id: zh-china-lit-songci + q: "在中国古典诗词中,哪种体裁依照词牌格律填写,形成婉约与豪放风格,并以苏轼、李清照、辛弃疾等词人为代表?" + expect_any: [china-lit-songci] + - id: zh-china-lit-yuanqu + q: "在中国古典文学体裁中,哪一类包括元杂剧与散曲,以关汉卿等作家和通俗化、舞台化的艺术特点区别于唐诗宋词?" + expect_any: [china-lit-yuanqu] + - id: en-crypto-symmetric + q: "How does AES use one shared secret key for both encryption and decryption in a block cipher, rather than using an RSA public/private key pair?" + expect_any: [crypto-symmetric] + - id: en-crypto-public-key + q: "In RSA public-key cryptography, how does encrypting with a public key and decrypting with a private key differ from shared-secret AES encryption?" + expect_any: [crypto-public-key] + - id: en-crypto-hash + q: "Why are SHA-2 cryptographic hash functions considered one-way integrity checks with collision resistance rather than encryption or digital signatures?" + expect_any: [crypto-hash] + - id: en-crypto-signatures + q: "How does a digital signature sign a message digest with a private key to provide authenticity and non-repudiation, unlike merely hashing or encrypting the message?" 
+ expect_any: [crypto-signatures] + - id: en-crypto-diffie-hellman + q: "How does Diffie-Hellman let two parties agree on a shared secret over a public channel without already having a shared AES key?" + expect_any: [crypto-diffie-hellman] + - id: en-crypto-tls + q: "During the TLS handshake, how are server authentication and session key negotiation combined using public-key methods, Diffie-Hellman, hashes, and symmetric encryption?" + expect_any: [crypto-tls] + - id: en-crypto-block-stream + q: "What are the tradeoffs between block ciphers using modes of operation and stream ciphers, especially compared with AES-style symmetric encryption?" + expect_any: [crypto-block-stream] + - id: en-cell-energy-light-reactions + q: "Which photosynthesis stage in the thylakoid membranes, rather than the Calvin carbon-fixation cycle, uses photosystems to split water and make ATP and NADPH from light?" + expect_any: [cell-energy-light-reactions] + - id: en-cell-energy-calvin-cycle + q: "Which light-independent photosynthetic process uses RuBisCO plus ATP and NADPH from the light reactions to fix CO2 and build sugar instead of splitting water?" + expect_any: [cell-energy-calvin-cycle] + - id: en-cell-energy-glycolysis + q: "Which dedicated pathway in cellular respiration splits glucose into pyruvate in the cytosol for only a small net ATP yield before the Krebs cycle and electron transport chain?" + expect_any: [cell-energy-glycolysis] + - id: en-cell-energy-respiration + q: "Which overview ties glycolysis, the Krebs cycle, and the electron transport chain together as the complete oxidation of glucose to CO2 and water?" + expect_any: [cell-energy-respiration] + - id: en-cell-energy-electron-transport + q: "Which inner-membrane electron carrier system pumps protons to build the gradient later used by chemiosmosis and ATP synthase?" 
+ expect_any: [cell-energy-electron-transport] + - id: en-cell-energy-chemiosmosis + q: "Which mechanism uses the proton-motive force from electron transport to drive ATP synthase in phosphorylating ADP into ATP?" + expect_any: [cell-energy-chemiosmosis] + - id: en-cell-energy-photorespiration + q: "Which RuBisCO-related process is wasteful because oxygen is used instead of CO2, competing with the Calvin cycle rather than fixing carbon into sugar?" + expect_any: [cell-energy-photorespiration] + - id: en-french-rev-causes + q: "How did France's pre-1789 fiscal crisis, Enlightenment ideas, and society of estates create the conditions for the French Revolution before the Estates-General assembled?" + expect_any: [french-rev-causes] + - id: en-french-rev-estates-general + q: "What happened in 1789 when the Estates-General was convened, the National Assembly formed, the Bastille was stormed, and the Declaration of Rights was issued?" + expect_any: [french-rev-estates-general] + - id: en-french-rev-terror + q: "How did the Jacobins, the Committee of Public Safety, and Robespierre define the Reign of Terror through executions in 1793–94?" + expect_any: [french-rev-terror] + - id: en-french-rev-thermidor + q: "How did the Thermidorian Reaction bring Robespierre's fall, end the Terror, and lead to the Directory after the Jacobin phase of the Revolution?" + expect_any: [french-rev-thermidor] + - id: en-french-rev-napoleon-rise + q: "How did Napoleon rise from the Brumaire coup to the Consulate and then become Emperor after the French Revolution?" + expect_any: [french-rev-napoleon-rise] + - id: en-french-rev-napoleonic-wars + q: "Which coalitions, major campaigns, and the Continental System shaped the Napoleonic Wars during Napoleon's empire?" + expect_any: [french-rev-napoleonic-wars] + - id: en-french-rev-vienna + q: "How did the 1815 Congress of Vienna create a post-Napoleon settlement based on balance of power and restoration after the Napoleonic Wars?" 
+ expect_any: [french-rev-vienna] + - id: en-macro-money-supply + q: "How do M0, M1, and M2 differ, and how do fractional-reserve commercial banks create new money through lending?" + expect_any: [macro-money-supply] + - id: en-macro-inflation-causes + q: "What is the difference between demand-pull and cost-push inflation, and how does the quantity theory of money explain rising prices?" + expect_any: [macro-inflation-causes] + - id: en-macro-cb-policy + q: "How does a central bank use the policy rate, open-market operations, and inflation targeting to conduct monetary policy?" + expect_any: [macro-cb-policy] + - id: en-macro-interest-rates + q: "What is the difference between nominal and real interest rates, and what does the yield curve reveal about the term structure of rates?" + expect_any: [macro-interest-rates] + - id: en-macro-quantitative-easing + q: "Why do central banks use large-scale asset purchases as quantitative easing when interest rates are at the zero lower bound?" + expect_any: [macro-quantitative-easing] + - id: en-macro-phillips + q: "What does the Phillips curve imply about the short-run inflation-unemployment tradeoff, and why did that relationship break down?" + expect_any: [macro-phillips] + - id: en-macro-hyperinflation + q: "What makes hyperinflation a self-reinforcing episode of very high inflation leading to currency collapse, as seen in historical cases?" + expect_any: [macro-hyperinflation] diff --git a/scripts/generate_domain_bilingual_v2.py b/scripts/generate_domain_bilingual_v2.py new file mode 100644 index 0000000..39405a2 --- /dev/null +++ b/scripts/generate_domain_bilingual_v2.py @@ -0,0 +1,342 @@ +"""Generate the ``domain-bilingual-v2`` discriminative corpus. + +Per ``docs/phase1-domain-bilingual-v2-design.md``: 8 intra-cluster-confusable +topic clusters (~56 docs, 28 zh / 28 en). 
The whole point is *confusability* — +each doc's prompt names its cluster **siblings** so the model writes documents +that overlap in vocabulary and framing but differ in the one answerable fact, +forcing the retriever to discriminate (so ``ndcg_at_10`` / ``mrr`` carry signal +instead of saturating at 1.0 like ``domain-bilingual-v1``). + +Reuses the factory (``RetryingMiniMaxClient`` + the selected provider's transport, +retries, audit, ``--resume``) and ``generate_bilingual_corpus._write_markdown``. +Run per-cluster to keep slow codex (concurrency 1) batches small and resumable: + + uv run python scripts/generate_domain_bilingual_v2.py --provider codex --cluster crypto + uv run python scripts/generate_domain_bilingual_v2.py --provider codex # all + uv run python scripts/generate_domain_bilingual_v2.py --provider codex --dry-run # $0 plan +""" + +from __future__ import annotations + +import argparse +import asyncio +import json +import sys +from dataclasses import dataclass +from pathlib import Path + +from dikw_data.audit import AuditStore +from dikw_data.llm_client import RetryingMiniMaxClient, TaskResult +from dikw_data.pipeline import add_provider_args, load_config_from_args +from dikw_data.tasks import LLMTask, hash_text + +DATASET = "domain-bilingual-v2" +PROMPT_VERSION = "v2" +STAGE = "generate_domain_bilingual_v2" + +# Frontmatter provenance marker by provider (mirrors generate_bilingual_corpus). +SOURCE_MARKERS = { + "minimax": "minimax-synthetic", + "deepseek": "deepseek-synthetic", + "codex": "openai-codex-synthetic", +} + + +@dataclass(frozen=True) +class DocSpec: + slug: str # stem = "<prefix>-<slug>" + title: str + focus: str # the distinctive, answerable fact this doc must uniquely own + + +@dataclass(frozen=True) +class Cluster: + prefix: str + language: str # "zh" | "en" + topic: str + docs: tuple[DocSpec, ...] + + +# 8 clusters x 7 docs = 56 (28 zh / 28 en). 
Within a cluster the docs share +# vocabulary (confusable); across clusters they are distinct (recall stays healthy). +CLUSTERS: tuple[Cluster, ...] = ( + Cluster("tang", "zh", "唐朝的制度、改革与重大历史事件", ( + DocSpec("founding", "唐朝的建立与统一", "李渊太原起兵、618 年建唐、唐初统一战争与关陇集团的政治基础"), + DocSpec("juntian", "唐朝的均田制", "均田制的授田对象、口分田与永业田的划分及土地分配原理"), + DocSpec("zuyongdiao", "租庸调制", "以均田为基础、按丁征收的租(粟)庸(代役)调(绢布)赋役制度"), + DocSpec("keju", "唐朝的科举制", "常科与制科、进士与明经科目,以及科举对门阀士族的冲击"), + DocSpec("xiyu", "唐朝与西域的关系", "安西四镇、都护府、丝绸之路经营与对突厥回纥的关系"), + DocSpec("anshi", "安史之乱", "755–763 年安禄山史思明叛乱的起因经过及唐由盛转衰的影响"), + DocSpec("liangshui", "两税法", "780 年杨炎两税法以资产和土地、夏秋两征取代租庸调的改革"), + )), + Cluster("china-money", "zh", "中国古代的货币、白银与财政改革", ( + DocSpec("jiaozi", "交子与纸币的起源", "北宋四川交子作为世界最早纸币的产生背景与发行机制"), + DocSpec("song-inflation", "北宋的通货膨胀", "交子钱引超发引发的通胀与铜钱铁钱并行的货币问题"), + DocSpec("silver", "明清白银货币化", "明代白银成为主要通货、海外白银流入与银本位的形成"), + DocSpec("piaohao", "票号与钱庄", "晋商票号的汇兑业务与钱庄的存放款等传统金融机构运作"), + DocSpec("yanyin", "盐引制度", "食盐专卖下盐引(盐钞)作为有价凭证与财政工具的运作"), + DocSpec("qingmiao", "王安石青苗法", "青苗法的低息官贷设计、抑兼并增财政的目标及其争议"), + DocSpec("yitiaobian", "一条鞭法", "明代张居正一条鞭法将赋役合并、折银征收的改革"), + )), + Cluster("tcm", "zh", "中医的基础理论概念", ( + DocSpec("yinyang", "阴阳学说", "阴阳的对立、互根、消长、转化及其在生理病理中的应用"), + DocSpec("wuxing", "五行学说", "五行的相生相克、与五脏的配属及在辨证中的运用"), + DocSpec("qixue", "气血津液", "气、血、津液的生成、功能与相互关系"), + DocSpec("jingluo", "经络学说", "十二经脉与奇经八脉的循行及气血运行通路"), + DocSpec("zangfu", "脏腑学说(藏象)", "五脏六腑的生理功能与表里关系"), + DocSpec("sizhen", "四诊", "望、闻、问、切四诊合参的诊断方法"), + DocSpec("bianzheng", "辨证论治", "八纲辨证(阴阳表里寒热虚实)与论治原则"), + )), + Cluster("china-lit", "zh", "中国古典小说与诗词体裁", ( + DocSpec("hongloumeng", "红楼梦", "曹雪芹《红楼梦》的贾府兴衰、宝黛爱情及主要人物与主题"), + DocSpec("sanguo", "三国演义", "罗贯中《三国演义》的群雄割据、主要人物与重大战役"), + DocSpec("shuihu", "水浒传", "施耐庵《水浒传》的梁山好汉与官逼民反主题"), + DocSpec("xiyouji", "西游记", "吴承恩《西游记》的取经故事与师徒四人形象"), + DocSpec("tangshi", "唐诗", "唐诗的律诗绝句体裁、李白杜甫等代表诗人及风格"), + DocSpec("songci", "宋词", "宋词的婉约与豪放、词牌格律及代表词人"), + DocSpec("yuanqu", "元曲", "元杂剧与散曲、关汉卿等代表作家及艺术特点"), + )), + Cluster("crypto", 
"en", "applied cryptography primitives and protocols", ( + DocSpec("symmetric", "Symmetric-key ciphers", "AES and shared-secret block-cipher encryption with one shared key"), + DocSpec("public-key", "Public-key cryptography", "RSA and asymmetric key pairs encrypting with a public, decrypting with a private key"), + DocSpec("hash", "Cryptographic hash functions", "SHA-2 one-way functions, collision resistance, and integrity (not encryption)"), + DocSpec("signatures", "Digital signatures", "signing a digest with a private key for authenticity and non-repudiation"), + DocSpec("diffie-hellman", "Diffie–Hellman key exchange", "agreeing a shared secret over a public channel with no prior shared key"), + DocSpec("tls", "The TLS handshake", "negotiating session keys and authenticating a server by combining the other primitives"), + DocSpec("block-stream", "Block vs stream ciphers", "contrasting block ciphers (and modes) with stream ciphers and their tradeoffs"), + )), + Cluster("cell-energy", "en", "cellular energy: photosynthesis and respiration", ( + DocSpec("light-reactions", "The light-dependent reactions", "thylakoid photosystems splitting water to make ATP and NADPH from light"), + DocSpec("calvin-cycle", "The Calvin cycle", "light-independent carbon fixation by RuBisCO using ATP/NADPH to build sugar"), + DocSpec("glycolysis", "Glycolysis", "splitting glucose to pyruvate in the cytosol for a small net ATP yield"), + DocSpec("respiration", "Cellular respiration overview", "oxidising glucose to CO2 and water across glycolysis, the Krebs cycle and the ETC"), + DocSpec("electron-transport", "The electron transport chain", "inner-membrane electron carriers pumping protons to build a gradient"), + DocSpec("chemiosmosis", "Chemiosmosis and ATP synthase", "the proton-motive force driving ATP synthase to phosphorylate ADP"), + DocSpec("photorespiration", "Photorespiration", "RuBisCO's wasteful oxygenase activity, contrasted with the Calvin cycle"), + )), + 
Cluster("french-rev", "en", "the French Revolution and Napoleonic era", ( + DocSpec("causes", "Causes of the French Revolution", "the pre-1789 fiscal crisis, Enlightenment ideas and the society of estates"), + DocSpec("estates-general", "The Estates-General and 1789", "the 1789 convocation, the National Assembly, the Bastille and the Declaration of Rights"), + DocSpec("terror", "The Reign of Terror", "the Jacobins, the Committee of Public Safety and Robespierre's 1793–94 executions"), + DocSpec("thermidor", "The Thermidorian Reaction", "the fall of Robespierre, the end of the Terror and the Directory"), + DocSpec("napoleon-rise", "Napoleon's rise to power", "the Brumaire coup, the Consulate and Napoleon becoming Emperor"), + DocSpec("napoleonic-wars", "The Napoleonic Wars", "the coalitions, major campaigns and the Continental System"), + DocSpec("vienna", "The Congress of Vienna", "the 1815 post-Napoleon settlement, balance of power and restoration"), + )), + Cluster("macro", "en", "macroeconomics: money, inflation, and policy", ( + DocSpec("money-supply", "The money supply and money creation", "M0/M1/M2 and how fractional-reserve banks create money"), + DocSpec("inflation-causes", "Causes of inflation", "demand-pull vs cost-push inflation and the quantity theory of money"), + DocSpec("cb-policy", "Central-bank monetary policy", "the policy rate, open-market operations and inflation targeting"), + DocSpec("interest-rates", "Interest rates and the yield curve", "nominal vs real rates and the term structure of interest rates"), + DocSpec("quantitative-easing", "Quantitative easing", "large-scale asset purchases as unconventional policy at the zero lower bound"), + DocSpec("phillips", "The Phillips curve", "the short-run inflation–unemployment tradeoff and its breakdown"), + DocSpec("hyperinflation", "Hyperinflation", "self-reinforcing very high inflation, historical episodes and currency collapse"), + )), +) + + +def _doc_task(cluster: Cluster, doc: DocSpec) -> 
tuple[LLMTask, str]: + """Build one corpus-doc task; return (task, stem). Siblings are named in the + prompt so the model writes a confusable-but-distinct document.""" + is_zh = cluster.language == "zh" + language_name = "Chinese" if is_zh else "English" + siblings = "; ".join(f"{d.title} ({d.focus})" for d in cluster.docs if d.slug != doc.slug) + length = "350–600 个汉字" if is_zh else "250–450 words" + system = ( + "You write synthetic but internally consistent Markdown documents for a " + "retrieval-evaluation corpus." + ) + user = ( + f"Write one {language_name} Markdown document for a retrieval-evaluation corpus.\n" + f"Cluster topic: {cluster.topic}.\n" + f"Document title: {doc.title}.\n" + f"This document must be THE single authoritative source for: {doc.focus}.\n\n" + f"Sibling documents in the SAME cluster cover: {siblings}.\n" + "Deliberately share general vocabulary, framing and terminology with those siblings so " + "the documents are easily confused by a retriever — BUT make this document the uniquely " + "correct answer for its own focus, and keep each sibling's distinctive facts OUT of it.\n\n" + f"Length: {length}. Include one H1 heading and at least two H2 headings, concrete named " + "entities and specifics. Output Markdown only. Do not output JSON or code fences, and do " + "not mention that the document is synthetic." 
+ ) + stem = f"{cluster.prefix}-{doc.slug}" + source = f"{DATASET}:{stem}:{cluster.language}:{PROMPT_VERSION}" + task = LLMTask( + dataset=DATASET, + stage=STAGE, + source_hash=hash_text(source), + prompt_version=PROMPT_VERSION, + system=system, + user=user, + expected_json=False, + ) + return task, stem + + +def build_tasks(clusters: tuple[Cluster, ...]) -> tuple[list[LLMTask], dict[str, tuple[str, str, str]]]: + """Return (tasks, meta_by_task_id) where meta is (stem, language, title).""" + tasks: list[LLMTask] = [] + meta: dict[str, tuple[str, str, str]] = {} + for cluster in clusters: + for doc in cluster.docs: + task, stem = _doc_task(cluster, doc) + tasks.append(task) + meta[task.task_id] = (stem, cluster.language, doc.title) + return tasks, meta + + +def _materialize(results: list[TaskResult], meta: dict[str, tuple[str, str, str]], marker: str) -> int: + corpus_dir = Path("datasets") / DATASET / "corpus" + corpus_dir.mkdir(parents=True, exist_ok=True) + written = 0 + for result in results: + if result.status != "succeeded" or not isinstance(result.result, dict): + continue + text = result.result.get("text") + if not isinstance(text, str) or not text.strip(): + continue + stem, language, title = meta[result.task_id] + markdown = text.strip() + if not markdown.lstrip().startswith("#"): + markdown = f"# {title}\n\n{markdown}" + frontmatter = f"---\ntitle: {title}\nlanguage: {language}\nsource: {marker}\n---\n\n" + (corpus_dir / f"{stem}.md").write_text(frontmatter + markdown + "\n", encoding="utf-8") + written += 1 + return written + + +def _query_task(cluster: Cluster) -> tuple[LLMTask, str]: + """One JSON-returning task per cluster: ask for a confusable query per doc. + + Per-cluster (not per-doc) so the model sees all siblings at once and crafts + queries that *discriminate* among them — the whole point of v2. 
+ """ + is_zh = cluster.language == "zh" + language_name = "Chinese" if is_zh else "English" + listing = "\n".join(f"- stem={cluster.prefix}-{d.slug} | title={d.title} | focus: {d.focus}" for d in cluster.docs) + n = len(cluster.docs) + system = "You write retrieval-evaluation queries, each with an exact gold document target." + user = ( + f"Below are {n} documents forming one topic cluster ({cluster.topic}). They deliberately " + "overlap in vocabulary. For EACH document write exactly one retrieval query in " + f"{language_name} that:\n" + "- is answered UNIQUELY and correctly by THAT document,\n" + "- is NOT correctly answerable by any sibling in the list, yet\n" + "- is phrased to be lexically tempting toward the siblings (shares terms) so a retriever " + "must discriminate.\n\n" + f"Documents:\n{listing}\n\n" + f"Return ONLY a JSON array of {n} objects, each {{\"stem\": <stem>, " + "\"q\": <query>}. Use each stem exactly once." + ) + source = f"{DATASET}:queries:{cluster.prefix}:{PROMPT_VERSION}" + task = LLMTask( + dataset=DATASET, + stage=f"{STAGE}_queries", + source_hash=hash_text(source), + prompt_version=PROMPT_VERSION, + system=system, + user=user, + expected_json=True, + ) + return task, cluster.prefix + + +def _write_queries(results: list[TaskResult], cluster_by_task: dict[str, Cluster]) -> int: + """Parse per-cluster JSON into a queries.yaml DRAFT (human-verified later). + + id = ``<lang>-<stem>`` (the ``zh-``/``en-`` prefix the split tool keys on); + expect_any = the single gold stem, kept only if it resolves to a corpus doc. + """ + corpus_dir = Path("datasets") / DATASET / "corpus" + lines = [ + "# DRAFT — codex-generated confusable queries; gold targets need human", + "# verification (uniqueness, sibling-wrongness, dedup, balance). 
See", + "# docs/phase1-domain-bilingual-v2-design.md §4.", + "queries:", + ] + written = 0 + for result in results: + cluster = cluster_by_task.get(result.task_id) + if cluster is None or result.status != "succeeded": + continue + rows = result.result if isinstance(result.result, list) else [] + valid_stems = {f"{cluster.prefix}-{d.slug}" for d in cluster.docs} + for row in rows: + if not isinstance(row, dict): + continue + stem = str(row.get("stem", "")).strip() + q = str(row.get("q", "")).strip() + if not q or stem not in valid_stems or not (corpus_dir / f"{stem}.md").is_file(): + continue + lines.append(f" - id: {cluster.language}-{stem}") + lines.append(f" q: {json.dumps(q, ensure_ascii=False)}") + lines.append(f" expect_any: [{stem}]") + written += 1 + (Path("datasets") / DATASET / "queries.yaml").write_text("\n".join(lines) + "\n", encoding="utf-8") + return written + + +def main() -> int: + parser = argparse.ArgumentParser(description="Generate the domain-bilingual-v2 confusable corpus.") + parser.add_argument("--queries", action="store_true", help="generate queries.yaml draft (not corpus)") + parser.add_argument("--cluster", help="generate only this cluster prefix (default: all)") + parser.add_argument("--resume", action="store_true") + parser.add_argument("--retry-failed", action="store_true") + parser.add_argument("--max-attempts", type=int, default=None) + parser.add_argument("--concurrency", type=int, default=None) + parser.add_argument("--dry-run", action="store_true") + add_provider_args(parser) + args = parser.parse_args() + + clusters = CLUSTERS + if args.cluster: + clusters = tuple(c for c in CLUSTERS if c.prefix == args.cluster) + if not clusters: + print(f"ERROR: no cluster with prefix {args.cluster!r}; have {[c.prefix for c in CLUSTERS]}") + return 2 + + config = load_config_from_args(args) + kind = "queries" if args.queries else "docs" + print(f"# {kind} for {len(clusters)} cluster(s) via {args.provider} ({config.model})") + + if args.queries: 
+ cluster_by_task = {(_query_task(c)[0].task_id): c for c in clusters} + qtasks = [_query_task(c)[0] for c in clusters] + if args.dry_run: + for c in clusters: + print(f" {c.prefix}: {len(c.docs)} queries [{c.language}]") + return 0 + else: + tasks, meta = build_tasks(clusters) + if args.dry_run: + for task in tasks: + stem, language, _ = meta[task.task_id] + print(f" {stem} [{language}]") + return 0 + + audit = AuditStore(DATASET) + client = RetryingMiniMaxClient(config=config, audit=audit) + run_tasks = qtasks if args.queries else tasks + results = asyncio.run( + client.run_many( + run_tasks, + concurrency=args.concurrency, + resume=args.resume, + retry_failed=args.retry_failed, + max_attempts=args.max_attempts, + ) + ) + for result in results: + print({"task_id": result.task_id[:12], "status": result.status, "attempts": result.attempts}) + + if args.queries: + written = _write_queries(results, cluster_by_task) + print(f"wrote {written} queries to datasets/{DATASET}/queries.yaml") + else: + written = _materialize(results, meta, SOURCE_MARKERS[args.provider]) + print(f"wrote {written} corpus files under datasets/{DATASET}/corpus") + return 1 if any(r.status in {"failed", "needs_manual_review"} for r in results) else 0 + + +if __name__ == "__main__": + sys.exit(main()) From 927716321887e7ac701b8f6b5eb1547638a161d8 Mon Sep 17 00:00:00 2001 From: holo Date: Sun, 28 Jun 2026 22:57:48 +0800 Subject: [PATCH 3/3] eval: calibrate domain-bilingual-v2 + gate at observed-margin (real-vector run) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit One real-vector run (dikw-core 0.6.4, Gitee Qwen3@1024, --retrieval all, cold embed): hit_at_3 1.000 / hit_at_10 1.000 / mrr 0.991 / ndcg_at_10 0.993 / recall_at_100 1.000. Thresholds set at observed-margin (-0.05 hit@k/mrr, -0.03 ndcg/recall): hit_at_3 0.95 / hit_at_10 0.95 / mrr 0.94 / ndcg_at_10 0.96 / recall_at_100 0.97. Honest finding: only PARTIALLY de-saturated. 
bm25 (ndcg 0.967) carries real intra-cluster signal — the confusable corpus works — but the DRAFT queries over-name their gold (embed the answer term verbatim), so vector/hybrid stay ~0.99 and the zh slice is fully 1.0. A query gold-tightening pass is the discriminative follow-up; recalibrate after it. reports/BASELINES.md records the full per-mode + zh/en split. v2 calibration is independent of #249 (no negatives) and #250 (single fixed config). Co-Authored-By: Claude Opus 4.8 --- datasets/domain-bilingual-v2/dataset.yaml | 15 ++++++-- reports/BASELINES.md | 43 +++++++++++++++++++++++ 2 files changed, 55 insertions(+), 3 deletions(-) diff --git a/datasets/domain-bilingual-v2/dataset.yaml b/datasets/domain-bilingual-v2/dataset.yaml index b0bd02b..931ac5d 100644 --- a/datasets/domain-bilingual-v2/dataset.yaml +++ b/datasets/domain-bilingual-v2/dataset.yaml @@ -7,6 +7,15 @@ description: > saturate vector/hybrid at 1.0. The engine gates one blended flat-threshold set; the zh/en split is recorded in reports/BASELINES.md (see docs/dikw-eval-plan.md §2.4). See docs/phase1-domain-bilingual-v2-design.md. - Thresholds are PLACEHOLDER (empty): calibrated at observed - margin from the - single post-#249/#250 real-vector run (parallel-plan Workstream D). -thresholds: {} + Thresholds calibrated at observed - margin from a real-vector run (2026-06-28, + dikw-core 0.6.4 + Gitee Qwen3@1024) — see reports/BASELINES.md. NOTE: the current + queries are DRAFT and over-name their gold, so vector/hybrid only partially + de-saturated (ndcg 0.993; zh slice still 1.0); bm25 (ndcg 0.967) carries the + intra-cluster signal. A query gold-tightening pass is the discriminative + follow-up — recalibrate after it. +thresholds: + hit_at_3: 0.95 + hit_at_10: 0.95 + mrr: 0.94 + ndcg_at_10: 0.96 + recall_at_100: 0.97 diff --git a/reports/BASELINES.md b/reports/BASELINES.md index 130860c..da72fbb 100644 --- a/reports/BASELINES.md +++ b/reports/BASELINES.md @@ -142,3 +142,46 @@ regression-detector). 
`negatives-ood-v1` observe-only. Recalibrate / promote onc discriminative `domain-bilingual-v2` exists (corpus > ~50 docs, deliberately confusable) — see `docs/dikw-eval-plan.md` §2.3/§3 and `docs/phase1-inhouse-datasets-design.md`. + +## 2026-06-28 — domain-bilingual-v2 calibration (discriminative confusable set) + +**Config.** dikw-core 0.6.4; provider: MiniMax-M3 (LLM, unused — retrieval-only) + +Gitee Qwen3-Embedding-0.6B@1024 (embeddings) + sqlite + jieba. `--retrieval all`, +`--eval retrieval`, `--cache read_write`; cold-embedded the 56-doc corpus once. + +**domain-bilingual-v2** (56 docs / 56 queries; 8 intra-cluster-confusable clusters, +28 zh + 28 en; corpus + queries via codex gpt-5.5 xhigh) — `passed: True`, exit 0. + +Canonical (doc / hybrid): +`hit_at_3 1.000 / hit_at_10 1.000 / mrr 0.991 / ndcg_at_10 0.993 / recall_at_100 1.000` + +Per-mode (`--retrieval all`): + +| mode | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 | +|---|---|---|---|---|---| +| bm25 | 1.000 | 1.000 | 0.955 | 0.967 | 1.000 | +| vector | 1.000 | 1.000 | 0.991 | 0.993 | 1.000 | +| hybrid | 1.000 | 1.000 | 0.991 | 0.993 | 1.000 | + +zh/en split (offline `tools/split_metrics_by_lang.py`, reconciles with the engine's +blended doc metrics): + +| lang | n | hit_at_3 | hit_at_10 | mrr | ndcg_at_10 | recall_at_100 | +|---|---|---|---|---|---|---| +| all | 56 | 1.000 | 1.000 | 0.991 | 0.993 | 1.000 | +| zh | 28 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | +| en | 28 | 1.000 | 1.000 | 0.982 | 0.987 | 1.000 | + +- **Only partially de-saturated.** hybrid/vector `ndcg_at_10 0.993` (vs v1's 1.000) + — barely below saturation, and the **zh slice is fully 1.0**. The confusable + corpus itself works — **bm25 `ndcg 0.967` carries real intra-cluster signal** — but + the **draft queries over-name their gold** (they embed the answer's distinctive + term verbatim), making vector retrieval trivial. 
A **query gold-tightening pass** + (human verification of `queries.yaml`: describe the answer without naming it) is + the discriminative follow-up; recalibrate after it (vector/hybrid should fall + toward bm25). +- **Gate set at `observed − margin`** (−0.05 hit@k/mrr, −0.03 ndcg/recall): + `hit_at_3 0.95 / hit_at_10 0.95 / mrr 0.94 / ndcg_at_10 0.96 / recall_at_100 0.97`. + A regression-detector floor over **draft** queries — recalibrate after gold-tightening. +- v2's retrieval calibration is independent of dikw-core#249 (no `expect_none` + negatives) and #250 (single fixed retrieval config, no ablation sweep).
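
The `observed - margin` gate recorded above can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the project's calibration tooling: the `gate_thresholds` helper and the round-down-to-two-decimals rule are hypothetical, introduced only for illustration. The observed values and margins are the ones recorded in this entry.

```python
from decimal import Decimal, ROUND_DOWN

# Observed canonical (doc / hybrid) metrics from the run recorded in this entry.
OBSERVED = {
    "hit_at_3": "1.000",
    "hit_at_10": "1.000",
    "mrr": "0.991",
    "ndcg_at_10": "0.993",
    "recall_at_100": "1.000",
}

# Margins as stated in the entry: 0.05 for hit@k and mrr, 0.03 for ndcg and recall.
MARGINS = {
    "hit_at_3": "0.05",
    "hit_at_10": "0.05",
    "mrr": "0.05",
    "ndcg_at_10": "0.03",
    "recall_at_100": "0.03",
}


def gate_thresholds(observed: dict[str, str], margins: dict[str, str]) -> dict[str, float]:
    """Return observed - margin per metric, truncated to two decimals.

    Truncation (ROUND_DOWN) is an assumption; Decimal keeps the subtraction exact
    so binary floating-point error cannot flip the second decimal place.
    """
    step = Decimal("0.01")
    return {
        name: float((Decimal(value) - Decimal(margins[name])).quantize(step, rounding=ROUND_DOWN))
        for name, value in observed.items()
    }


THRESHOLDS = gate_thresholds(OBSERVED, MARGINS)
```

With these inputs the helper yields the same five values committed under `thresholds:` in `datasets/domain-bilingual-v2/dataset.yaml`. Rounding down rather than to nearest keeps the gate from ever sitting above `observed - margin`; for this run both rules give identical numbers.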