[felix] Step 2: Schema v2 landing + bilingual prompts + homonym fix #5
felix-windsor wants to merge 6 commits into
- Remove handbook.md, STARTUP_GUIDE.md (content merged into README)
- Remove standalone comprehensive_test.py (covered by backend/tests)
- Remove empty docker-compose.prod.yml stub
- Remove deprecated /api/* routes (kept /api/v1/* as the only API surface)
- Remove integration test for deprecated routes (test_document_flow.py)
- All remaining tests pass
The 200-doc / 420-case synthetic_controlled benchmark is superseded by the public corpus and is no longer referenced. Delete the directory wholesale to unclutter the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 2 of the Graph RAG refactor: replaces the v1 IT-system Chinese schema with the v2 software-project-doc English canonical schema, adds bilingual prompts with 7 review-derived extraction rules, and fixes the normalizer homonym bug.

## Changes

- schemas.py: replace v1 (15 Chinese types) with v2 (15 English canonical types in 5 tiers, 14 relations). Add SCHEMA_VERSION + SCHEMA_APPLICABLE_DOMAIN. Preserve v1 as schemas_v1_legacy.py for schema-agnostic demonstration.
- prompts.py: rewrite as a bilingual (EN + ZH) prompt with 7 hard rules distilled from a 5-document schema review. Add ENTITY_TYPES_BY_TIER grouping with a Tier Z fallback for forward-compat. Prompt length 3170 chars.
- normalizer.py: fix the homonym bug. The merging condition is upgraded from name-only to a (name, type) composite. Same-name-different-type entities are preserved as separate nodes with a disambiguation suffix (e.g. 'agent[Module]' vs 'agent[Stakeholder]'). Relation endpoints are flagged 'homonym_ambiguous' when ambiguous.
- test_extractor.py: rewrite with 14 tests covering 7 core Step 2 changes (v2 canonical / aliases / bilingual prompt / tier grouping / homonym preservation / same-type merge / no-suffix defensive guard). Replaces 2 stale self-validating tests anchored to v1 Chinese names.
- Test cleanup: delete test_synthetic_controlled_eval_dataset.py (dangling reference to the deleted benchmark). Update 2 sidecar tests with v2 drift counts (entity drift 1→3, relation drift 1→2) and set-based comparison to avoid pinning Counter.most_common() ordering.

## Verification

- pytest tests/rag_core/: 20 passed / 0 failed
- pytest tests/ (full suite): 98 passed / 0 failed
- No regression in any preserved test
- schemas_v1_legacy.py preserved for schema-agnostic architecture proof
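The (name, type) merge rule described for normalizer.py can be illustrated with a minimal sketch. It is not the code in this PR: the `Entity` dataclass, the `merge_entities` helper, and the case-insensitive name key are assumptions made for illustration. Only the composite key and the suffix-on-conflict behaviour come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Hypothetical stand-in for the project's Entity model.
    name: str
    type: str
    aliases: set = field(default_factory=set)


def merge_entities(entities):
    """Merge entities on (name, type); suffix names only when a name has several types."""
    merged = {}
    for ent in entities:
        # Assumption: names are compared case-insensitively after trimming.
        key = (ent.name.strip().lower(), ent.type)
        if key in merged:
            merged[key].aliases |= ent.aliases  # same name and type: merge
        else:
            merged[key] = Entity(ent.name.strip(), ent.type, set(ent.aliases))

    # Collect every type seen for each name to detect homonyms.
    types_by_name = defaultdict(set)
    for name_key, type_ in merged:
        types_by_name[name_key].add(type_)

    result = []
    for (name_key, type_), ent in merged.items():
        if len(types_by_name[name_key]) > 1:
            # Homonym: keep both nodes, disambiguated as name[Type].
            ent = Entity(f"{ent.name}[{type_}]", type_, ent.aliases)
        result.append(ent)
    return result
```

With this rule, two `agent` entities of types `Module` and `Stakeholder` stay separate and receive suffixes, while a name that appears under a single type keeps its plain form.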
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d720607538
```python
for block in blocks:
    if len(current) + len(block) + 2 <= max_chars:
        current = f"{current}\n\n{block}".strip()
        continue
```
Preserve overflow text when chunking long blocks
When a single paragraph/block exceeds max_chars, assigning current = block[:max_chars] drops the remainder of that block entirely instead of emitting additional chunks. This causes silent data loss in sidecar extraction reports for long paragraphs (common in PRD/SRS docs), so reported entities/relations and drift metrics can be materially incomplete even though processing appears successful.
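One way to avoid the data loss described above is to slice an oversized block into consecutive pieces rather than keeping only its prefix. The sketch below is an assumption about how such a fix could look, not the PR's code: `chunk_blocks` is a hypothetical name, and numbering chunk IDs from 0 is a guess, since the diff excerpt does not show where `index` starts.

```python
def chunk_blocks(blocks, max_chars):
    """Pack blocks into chunks of at most max_chars without dropping any text."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks = []
    current = ""

    def flush():
        # Emit the buffered text as a chunk, if there is any.
        nonlocal current
        if current:
            chunks.append({"chunk_id": f"input-{len(chunks)}", "text": current})
            current = ""

    for block in blocks:
        if not block:
            continue
        # Append to the buffer when the block plus a blank-line separator fits.
        if current and len(current) + len(block) + 2 <= max_chars:
            current = f"{current}\n\n{block}"
            continue
        flush()
        # Oversized block: emit full-size slices and keep the tail as the new buffer.
        while len(block) > max_chars:
            chunks.append({"chunk_id": f"input-{len(chunks)}", "text": block[:max_chars]})
            block = block[max_chars:]
        current = block

    flush()
    return chunks
```

Slicing at a fixed character offset can cut a sentence in half. A fuller fix might prefer sentence or whitespace boundaries, but even this simple version keeps every character of the input in some chunk.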
```python
if current:
    chunks.append({"chunk_id": f"input-{index}", "text": current})
    index += 1
current = block[:max_chars]
```
Split oversized input blocks instead of truncating
The chunk loader for evaluation repeats the same truncation pattern (current = block[:max_chars]), which discards text after the first max_chars of any long block. In benchmark runs this can corrupt experiment results by evaluating only a prefix of long sections while still counting the case as fully processed.
```python
from app.rag_core.schemas import Entity, Relation


WRAPPER_CHARS = "\"'`“”‘’《》〈〉()()[]【】{}"
```
Keep bracket suffixes during entity-name normalization
Including [] in WRAPPER_CHARS makes normalize_entity_name strip the trailing ] from homonym-disambiguated names like agent[Module], which the new prompt explicitly encourages. In mixed outputs where entities are auto-disambiguated to bracketed names but relation endpoints already contain bracketed suffixes from the model, endpoint normalization can produce mismatched names and valid relations are marked as missing-endpoint invalid.
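A possible way to address this is to detect a trailing `[Type]` suffix before stripping wrapper characters, so the suffix survives normalization. The sketch below is hypothetical: the `TYPE_SUFFIX` pattern and the assumption that type names are identifier-like are not taken from the PR, and `WRAPPER_CHARS` is copied from the diff excerpt as shown.

```python
import re

# Copied from the diff excerpt above.
WRAPPER_CHARS = "\"'`“”‘’《》〈〉()()[]【】{}"
# Wrappers that can be stripped safely before looking for a [Type] suffix.
NON_BRACKET_WRAPPERS = WRAPPER_CHARS.replace("[", "").replace("]", "")
# Assumed suffix shape: a non-empty base name followed by [TypeName].
TYPE_SUFFIX = re.compile(r"^(?P<base>.*?\S)\s*\[(?P<type>[A-Za-z_]\w*)\]$")


def normalize_entity_name(raw):
    """Strip wrapper characters but keep a trailing [Type] disambiguation suffix."""
    # First remove quotes and other non-bracket wrappers around the whole name.
    text = raw.strip(NON_BRACKET_WRAPPERS + " \t\n")
    match = TYPE_SUFFIX.match(text)
    if match:
        base = match.group("base").strip(WRAPPER_CHARS + " ")
        if base:
            return f"{base}[{match.group('type')}]"
    # No suffix found: fall back to plain wrapper stripping.
    return text.strip(WRAPPER_CHARS + " ")
```

Under these assumptions, `agent[Module]` passes through unchanged, while a name that is only wrapped in brackets, such as `[agent]`, is still unwrapped to `agent`.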
```python
    pdf_path=pdf_path, output_dir=output_dir, lang=lang, **kwargs
)
text_path = Path(text_path)
text = text_path.read_text(encoding="utf-8")
```
Restore non-UTF8 fallback for text document parsing
This now reads text files with encoding="utf-8" only, so .txt/.md files encoded as GBK/Latin-1/CP1252 fail parsing outright. The previous path handled common fallback encodings, so this is a regression that can block ingestion of legacy enterprise documents even when the content is otherwise valid.
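A fallback of the kind the comment asks for could look like the sketch below. The previous implementation is not shown in this PR, so the helper name `read_text_with_fallback` and the encoding order are assumptions. The order is a heuristic: `latin-1` accepts any byte sequence, so it goes last.

```python
from pathlib import Path

# Assumed fallback order; utf-8-sig also handles plain UTF-8 and strips a BOM.
FALLBACK_ENCODINGS = ("utf-8-sig", "gbk", "cp1252", "latin-1")


def read_text_with_fallback(path, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding_used), trying each encoding until one decodes cleanly."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without a catch-all encoding.
    return data.decode("utf-8", errors="replace"), "utf-8-replace"
```

Returning the encoding that succeeded lets the caller log it, which helps when a legacy file decodes without error but under the wrong code page.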
```python
try:
    entities, relations, drift = self._coerce_payload(repair_result.payload, chunk_id)
except ValidationError:
```
Skip malformed items instead of dropping whole chunk
A single ValidationError from one bad entity or relation currently aborts the entire chunk extraction and returns an empty ExtractionResult, because the broad except ValidationError wraps the full payload coercion path. With noisy LLM output, one malformed record (for example an empty name/endpoint) can silently discard many otherwise valid entities and relations from the same chunk.
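The per-item handling suggested here can be sketched by moving the `try`/`except` inside the loop over records. Everything in the block is a stand-in: the local `ValidationError` class and `Entity` dataclass mimic the project's schema types only loosely, and `coerce_entities` is a hypothetical helper, not the PR's `_coerce_payload`.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Stand-in for the schema library's validation error (assumption)."""


@dataclass(frozen=True)
class Entity:
    # Hypothetical minimal model: rejects empty names on construction.
    name: str
    type: str

    def __post_init__(self):
        if not self.name.strip():
            raise ValidationError("entity name must be non-empty")


def coerce_entities(raw_items):
    """Validate items one by one, collecting valid entities and skipped records."""
    valid, skipped = [], []
    for idx, item in enumerate(raw_items):
        try:
            name = str(item.get("name") or "")
            type_ = str(item.get("type") or "")
            valid.append(Entity(name=name, type=type_))
        except (ValidationError, AttributeError) as exc:
            # One malformed record no longer discards the rest of the chunk.
            skipped.append((idx, str(exc)))
    return valid, skipped
```

Keeping the skipped records, rather than silently ignoring them, leaves room to report how many items were dropped per chunk.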
Summary
Step 2 of Graph RAG refactor. Lands schema v2 (software-project-doc focus),
bilingual prompt with 7 review-derived rules, and fixes normalizer homonym bug.
Changes
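- schemas.py: replace v1 (15 Chinese types) with v2 (15 English canonical types in 5 tiers, 14 relation types). Preserve v1 as schemas_v1_legacy.py.
- prompts.py: bilingual (EN + ZH) prompt with 7 hard rules, ENTITY_TYPES_BY_TIER grouping, and a forward-compat Tier Z fallback. 3170 chars.
- normalizer.py: merge on a (name, type) composite instead of name only. Disambiguation suffix only when a conflict exists.
- Tests: rewrite test_extractor.py with 14 tests and update 2 sidecar tests to v2 drift expectations.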
Verification
```
pytest tests/rag_core/ → 20 passed
pytest tests/ (full) → 98 passed / 0 failed
```
Schema-agnostic proof
`schemas_v1_legacy.py` retained as evidence that the main pipeline (normalizer,
scorer, extractor, prompts) only depends on schema interfaces (canonicalize_*,
ENTITY_TYPES, RELATION_COMPATIBILITY), not specific type names. Replacing
schemas.py with v2 required zero changes to the main pipeline.
Next: Step 3 — LightRAGWrapper main-pipeline integration