[felix] Step 2: Schema v2 landing + bilingual prompts + homonym fix by felix-windsor · Pull Request #5 · felix-windsor/Multi-Model-Knowledge-RAG-System

felix-windsor · 2026-05-22T02:11:07Z

Summary

Step 2 of Graph RAG refactor. Lands schema v2 (software-project-doc focus),
bilingual prompt with 7 review-derived rules, and fixes normalizer homonym bug.

Changes

- schemas.py: v1 (Chinese IT) → v2 (English canonical, 15 entity types in 5 tiers, 14 relation types). Preserve v1 as schemas_v1_legacy.py.
- prompts.py: bilingual EN+ZH, 7 hard rules, Tier-grouped entity list, forward-compat Tier Z fallback. 3170 chars.
- normalizer.py: homonym fix: (name, type) composite identity. Disambiguation suffix only when conflict exists.
- test_extractor.py: 14 new tests covering 7 Step 2 changes.
- Test cleanup: remove dangling-reference test file, update 2 sidecar tests to v2 drift expectations.
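
The homonym rule above can be sketched as follows. This is a minimal illustration, not the code in normalizer.py: the `Entity` dataclass and `merge_entities` function are hypothetical stand-ins, and case-insensitive name matching via `casefold()` is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for the project's Entity model."""
    name: str
    type: str


def merge_entities(entities: list[Entity]) -> list[Entity]:
    """Merge duplicates by (name, type); suffix names only on a type conflict."""
    # Deduplicate on the composite key, keeping first-seen order.
    unique: dict[tuple[str, str], Entity] = {}
    for entity in entities:
        key = (entity.name.casefold(), entity.type)  # casefold is an assumption
        unique.setdefault(key, entity)

    # Collect the types seen for each name so homonyms can be detected.
    types_by_name: dict[str, set[str]] = defaultdict(set)
    for name_key, entity_type in unique:
        types_by_name[name_key].add(entity_type)

    merged = []
    for (name_key, entity_type), entity in unique.items():
        if len(types_by_name[name_key]) > 1:
            # Same name, different types: keep both nodes and disambiguate.
            merged.append(Entity(f"{entity.name}[{entity_type}]", entity_type))
        else:
            # No conflict: leave the name untouched (no suffix).
            merged.append(entity)
    return merged
```

With this sketch, two `agent` entities of types `Module` and `Stakeholder` come out as `agent[Module]` and `agent[Stakeholder]`, while a lone `parser` entity keeps its plain name.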

Verification

```
pytest tests/rag_core/ → 20 passed
pytest tests/ (full) → 98 passed / 0 failed
```

Schema-agnostic proof

`schemas_v1_legacy.py` retained as evidence that the main pipeline (normalizer,
scorer, extractor, prompts) only depends on schema interfaces (canonicalize_*,
ENTITY_TYPES, RELATION_COMPATIBILITY), not specific type names. Replacing
schemas.py with v2 required zero changes to the main pipeline.

Next: Step 3 — LightRAGWrapper main-pipeline integration

- Remove handbook.md, STARTUP_GUIDE.md (content merged into README)
- Remove standalone comprehensive_test.py (covered by backend/tests)
- Remove empty docker-compose.prod.yml stub
- Remove deprecated /api/* routes (kept /api/v1/* as the only API surface)
- Remove integration test for deprecated routes (test_document_flow.py)
- All remaining tests pass

The 200-doc / 420-case synthetic_controlled benchmark is superseded by the public corpus and no longer referenced. Delete the directory wholesale to unclutter the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 2 of Graph RAG refactor: replaces v1 IT-system Chinese schema with v2 software-project-doc English canonical schema, adds bilingual prompts with 7 review-derived extraction rules, and fixes normalizer homonym bug.

## Changes

- schemas.py: replace v1 (15 Chinese types) with v2 (15 English canonical types in 5 tiers, 14 relations). Add SCHEMA_VERSION + SCHEMA_APPLICABLE_DOMAIN. Preserve v1 as schemas_v1_legacy.py for schema-agnostic demonstration.
- prompts.py: rewrite as bilingual (EN + ZH) prompt with 7 hard rules distilled from 5-document schema review. Add ENTITY_TYPES_BY_TIER grouping with Tier Z fallback for forward-compat. Prompt length 3170 chars.
- normalizer.py: fix homonym bug. Merging condition upgraded from name-only to (name, type) composite. Same-name-different-type entities preserved as separate nodes with disambiguation suffix (e.g. 'agent[Module]' vs 'agent[Stakeholder]'). Relation endpoints flagged 'homonym_ambiguous' when ambiguous.
- test_extractor.py: rewrite with 14 tests covering 7 core Step 2 changes (v2 canonical / aliases / bilingual prompt / tier grouping / homonym preservation / same-type merge / no-suffix defensive guard). Replaces 2 stale self-validating tests anchored to v1 Chinese names.
- Test cleanup: delete test_synthetic_controlled_eval_dataset.py (dangling reference to deleted benchmark). Update 2 sidecar tests with v2 drift counts (entity drift 1→3, relation drift 1→2) and set-based comparison to avoid pinning Counter.most_common() ordering.

## Verification

- pytest tests/rag_core/: 20 passed / 0 failed
- pytest tests/ (full suite): 98 passed / 0 failed
- No regression in any preserved test
- schemas_v1_legacy.py preserved for schema-agnostic architecture proof
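
For context on the set-based comparison mentioned in the test cleanup: `Counter.most_common()` orders equal counts by first-seen order, so a list comparison can fail when only the input order changes. The snippet below is a small illustration of that behavior; the label names and the `drift_summary` helper are made up and are not taken from the sidecar tests.

```python
from collections import Counter


def drift_summary(labels: list[str]) -> list[tuple[str, int]]:
    """Hypothetical helper: count drifted labels, most frequent first."""
    return Counter(labels).most_common()


# The same labels observed in a different order (illustrative names only).
run_a = drift_summary(["Widget", "Gadget"])
run_b = drift_summary(["Gadget", "Widget"])

# Ties keep first-seen order, so the two lists differ.
lists_equal = run_a == run_b
# Comparing as sets ignores ordering while still checking every count.
sets_equal = set(run_a) == set(run_b)
```

Here `lists_equal` is `False` and `sets_equal` is `True`, which is why a set (or dict) comparison is the more stable assertion for tied counts.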

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d720607538

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-22T02:21:10Z

+    for block in blocks:
+        if len(current) + len(block) + 2 <= max_chars:
+            current = f"{current}\n\n{block}".strip()
+            continue


Preserve overflow text when chunking long blocks

When a single paragraph/block exceeds max_chars, assigning current = block[:max_chars] drops the remainder of that block entirely instead of emitting additional chunks. This causes silent data loss in sidecar extraction reports for long paragraphs (common in PRD/SRS docs), so reported entities/relations and drift metrics can be materially incomplete even though processing appears successful.
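
One possible shape for the fix is sketched below. It keeps the packing rule from the diff (blocks joined by a blank line, at most `max_chars` per chunk) but slices an oversized block into several chunks instead of keeping only its prefix. The function name `chunk_blocks` is hypothetical, and the hard character slicing is an assumption; a real fix might prefer sentence or token boundaries.

```python
def chunk_blocks(blocks: list[str], max_chars: int) -> list[str]:
    """Pack blocks into chunks of at most max_chars without dropping text."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for block in blocks:
        # Same rule as the diff: join with a blank line if the result still fits.
        if current and len(current) + len(block) + 2 <= max_chars:
            current = f"{current}\n\n{block}"
            continue
        # First block of a chunk that fits on its own.
        if not current and len(block) <= max_chars:
            current = block
            continue
        # The block does not fit: flush what has been accumulated so far.
        if current:
            chunks.append(current)
            current = ""
        # Emit full-size slices of an oversized block instead of truncating.
        while len(block) > max_chars:
            chunks.append(block[:max_chars])
            block = block[max_chars:]
        # The remaining tail (possibly the whole block) starts the next chunk.
        current = block
    if current:
        chunks.append(current)
    return chunks
```

Slicing at a fixed character offset can cut a word or sentence in half, so extraction quality near the cut may suffer; the point of the sketch is only that no text is silently lost.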

chatgpt-codex-connector · 2026-05-22T02:21:10Z

+        if current:
+            chunks.append({"chunk_id": f"input-{index}", "text": current})
+            index += 1
+        current = block[:max_chars]


Split oversized input blocks instead of truncating

The chunk loader for evaluation repeats the same truncation pattern (current = block[:max_chars]), which discards text after the first max_chars of any long block. In benchmark runs this can corrupt experiment results by evaluating only a prefix of long sections while still counting the case as fully processed.
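
A cheap guard against this failure mode in benchmark runs is to compare how much text the chunks carry against the source. The helper below is a hypothetical sketch, not part of the PR: it assumes chunks are dicts with a `"text"` key, as in the diff, and it counts characters rather than matching content, so it detects truncation but not reordering or substitution.

```python
def coverage_ratio(source_text: str, chunks: list[dict]) -> float:
    """Ratio of non-whitespace characters in the chunks to those in the source."""
    source_chars = sum(1 for ch in source_text if not ch.isspace())
    if source_chars == 0:
        return 1.0  # nothing to cover
    chunk_chars = sum(
        1 for chunk in chunks for ch in chunk["text"] if not ch.isspace()
    )
    # Cap at 1.0 in case chunks overlap or repeat text.
    return min(chunk_chars / source_chars, 1.0)
```

A benchmark harness could refuse to count a case as processed when the ratio falls below a threshold close to 1.0; the exact threshold is a judgment call and is not specified in this PR.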

chatgpt-codex-connector · 2026-05-22T02:21:10Z

+from app.rag_core.schemas import Entity, Relation
+
+
+WRAPPER_CHARS = "\"'`“”‘’《》〈〉（）()[]【】{}"


Keep bracket suffixes during entity-name normalization

Including [] in WRAPPER_CHARS makes normalize_entity_name strip the trailing ] from homonym-disambiguated names like agent[Module], which the new prompt explicitly encourages. In mixed outputs where entities are auto-disambiguated to bracketed names but relation endpoints already contain bracketed suffixes from the model, endpoint normalization can produce mismatched names and valid relations are marked as missing-endpoint invalid.
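
One way to address this, sketched under assumptions, is to strip wrapper characters only when a matching pair encloses the entire name, so a trailing `[Module]` suffix is left alone. The `WRAPPER_PAIRS` table and `_encloses_whole` helper below are hypothetical; only the name `normalize_entity_name` and the character set come from the review comment, and the real function may do more than this.

```python
# Opening wrapper -> closing wrapper, derived from the WRAPPER_CHARS string.
WRAPPER_PAIRS = {
    '"': '"', "'": "'", "`": "`", "“": "”", "‘": "’",
    "《": "》", "〈": "〉", "（": "）", "(": ")", "[": "]",
    "【": "】", "{": "}",
}


def _encloses_whole(text: str, open_ch: str, close_ch: str) -> bool:
    """True if the first character's wrapper is closed by the last character."""
    if open_ch == close_ch:
        # Symmetric quotes: accept only if the quote does not recur inside.
        return open_ch not in text[1:-1]
    depth = 0
    for index, ch in enumerate(text):
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth == 0:
                return index == len(text) - 1
    return False


def normalize_entity_name(name: str) -> str:
    """Strip wrappers only when a matching pair encloses the whole name."""
    text = name.strip()
    while len(text) >= 2:
        close_ch = WRAPPER_PAIRS.get(text[0])
        if close_ch != text[-1] or not _encloses_whole(text, text[0], close_ch):
            break
        text = text[1:-1].strip()
    return text
```

Under this sketch `《agent》` and `[agent]` both normalize to `agent`, while `agent[Module]` is returned unchanged, so entity names and relation endpoints that carry the same bracketed suffix still match.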

chatgpt-codex-connector · 2026-05-22T02:21:10Z

-                pdf_path=pdf_path, output_dir=output_dir, lang=lang, **kwargs
-            )
+            text_path = Path(text_path)
+            text = text_path.read_text(encoding="utf-8")


Restore non-UTF8 fallback for text document parsing

This now reads text files with encoding="utf-8" only, so .txt/.md files encoded as GBK/Latin-1/CP1252 fail parsing outright. The previous path handled common fallback encodings, so this is a regression that can block ingestion of legacy enterprise documents even when the content is otherwise valid.
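
A fallback along the lines the comment asks for could look like the sketch below. The function name and the encoding order are assumptions, not the previous implementation: `utf-8-sig` is tried first (it also accepts plain UTF-8), then `gbk`, then `latin-1`, which accepts any byte sequence and therefore acts as a last resort.

```python
from pathlib import Path

# Assumed order; latin-1 never fails to decode, so it must come last.
FALLBACK_ENCODINGS = ("utf-8-sig", "gbk", "latin-1")


def read_text_with_fallback(path, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding_used), trying each encoding in order."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

Guessing by trial decoding is heuristic: some Latin-1 or CP1252 byte sequences also decode as valid GBK and would come back as the wrong characters. Returning the encoding that was used lets the caller log it and spot such cases.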

chatgpt-codex-connector · 2026-05-22T02:21:10Z

+
+        try:
+            entities, relations, drift = self._coerce_payload(repair_result.payload, chunk_id)
+        except ValidationError:


Skip malformed items instead of dropping whole chunk

A single ValidationError from one bad entity or relation currently aborts the entire chunk extraction and returns an empty ExtractionResult, because the broad except ValidationError wraps the full payload coercion path. With noisy LLM output, one malformed record (for example an empty name/endpoint) can silently discard many otherwise valid entities and relations from the same chunk.
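
The suggestion amounts to moving the `try`/`except` inside the per-item loop. The sketch below shows that pattern with stand-in types: the local `ValidationError` and `Entity` classes and the `coerce_entities` function are hypothetical, chosen so the example runs without the project's models or any third-party validation library.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Stand-in for the validation error raised by the real models."""


@dataclass
class Entity:
    """Hypothetical minimal entity; the real model has more fields."""
    name: str
    type: str

    def __post_init__(self):
        if not self.name.strip():
            raise ValidationError("entity name must not be empty")


def coerce_entities(raw_items: list[dict]):
    """Validate items one by one; return (valid_entities, rejected_items)."""
    valid, rejected = [], []
    for raw in raw_items:
        try:
            # Treat a missing or null field as an empty string.
            name = str(raw.get("name") or "")
            entity_type = str(raw.get("type") or "")
            valid.append(Entity(name=name, type=entity_type))
        except ValidationError as exc:
            # Record the bad item instead of aborting the whole chunk.
            rejected.append({"item": raw, "error": str(exc)})
    return valid, rejected
```

Keeping the rejected items, rather than discarding them, makes it possible to report how many records were skipped per chunk, so the loss is visible instead of silent.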

felix-windsor and others added 6 commits May 21, 2026 16:36

feat: add enterprise RAG evaluation benchmark

d80fe2b

refactor: separate synthetic and public benchmarks

110eae1

refactor: replace public benchmark corpus

5b02aa5

chatgpt-codex-connector Bot reviewed May 22, 2026

felix-windsor wants to merge 6 commits into main from codex/rag-core-extraction
