[felix] Step 2: Schema v2 landing + bilingual prompts + homonym fix #5

Open: felix-windsor wants to merge 6 commits into main from codex/rag-core-extraction

Summary

Step 2 of Graph RAG refactor. Lands schema v2 (software-project-doc focus),
bilingual prompt with 7 review-derived rules, and fixes normalizer homonym bug.

Changes

  • schemas.py: v1 (Chinese IT) → v2 (English canonical, 15 entity types
    in 5 tiers, 14 relation types). Preserve v1 as schemas_v1_legacy.py.
  • prompts.py: bilingual EN+ZH, 7 hard rules, Tier-grouped entity list,
    forward-compat Tier Z fallback. 3170 chars.
  • normalizer.py: homonym fix — (name, type) composite identity.
    Disambiguation suffix only when conflict exists.
  • test_extractor.py: 14 new tests covering 7 Step 2 changes.
  • Test cleanup: remove dangling-reference test file, update 2 sidecar
    tests to v2 drift expectations.
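
The composite-identity rule from the normalizer.py item can be sketched as below. This is a minimal illustration rather than the code in this PR: `merge_entities` and the `(name, type)` tuple representation are hypothetical, and only the `name[Type]` suffix format is taken from the commit message.

```python
from collections import defaultdict


def merge_entities(entities):
    """Merge duplicates by (name, type); add a suffix only on a type conflict.

    `entities` is an iterable of (name, type) pairs (hypothetical shape).
    """
    # Deduplicate on the composite key while preserving first-seen order.
    unique = list(dict.fromkeys(entities))

    # Record every distinct type observed for each name.
    types_by_name = defaultdict(set)
    for name, etype in unique:
        types_by_name[name].add(etype)

    merged = []
    for name, etype in unique:
        if len(types_by_name[name]) > 1:
            # Homonym: keep both nodes and disambiguate with a type suffix.
            merged.append((f"{name}[{etype}]", etype))
        else:
            # No conflict: the name stays untouched.
            merged.append((name, etype))
    return merged


result = merge_entities([
    ("agent", "Module"),
    ("agent", "Stakeholder"),
    ("agent", "Module"),      # same (name, type): merged into the first
    ("scheduler", "Module"),  # unique name: no suffix
])
```

Here `result` keeps `agent[Module]` and `agent[Stakeholder]` as separate nodes, while `scheduler` is left without a suffix because nothing conflicts with it.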

Verification

```
pytest tests/rag_core/ → 20 passed
pytest tests/ (full) → 98 passed / 0 failed
```

Schema-agnostic proof

`schemas_v1_legacy.py` retained as evidence that the main pipeline (normalizer,
scorer, extractor, prompts) only depends on schema interfaces (canonicalize_*,
ENTITY_TYPES, RELATION_COMPATIBILITY), not specific type names. Replacing
schemas.py with v2 required zero changes to the main pipeline.
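
A toy sketch of what "depends only on schema interfaces" could look like in practice. Everything here is an assumption for illustration: `count_known_types`, the two toy schema objects, and the concrete function name `canonicalize_entity_type` (the description only says `canonicalize_*`). The type names are placeholders, not the real v1 or v2 type lists.

```python
from types import SimpleNamespace


def count_known_types(schema, raw_types):
    """Count raw labels that the given schema maps onto one of its entity types.

    Only the schema's interface is used: a canonicalize function and
    ENTITY_TYPES. No concrete type name appears in this function.
    """
    known = 0
    for raw in raw_types:
        canonical = schema.canonicalize_entity_type(raw)
        if canonical in schema.ENTITY_TYPES:
            known += 1
    return known


# Two interchangeable toy schemas (hypothetical contents).
toy_v1 = SimpleNamespace(
    ENTITY_TYPES={"系统"},
    canonicalize_entity_type=lambda raw: {"system": "系统"}.get(raw.lower(), raw),
)
toy_v2 = SimpleNamespace(
    ENTITY_TYPES={"Module", "Stakeholder"},
    canonicalize_entity_type=lambda raw: raw.strip().title(),
)

v1_known = count_known_types(toy_v1, ["System", "系统", "other"])
v2_known = count_known_types(toy_v2, [" module", "STAKEHOLDER", "widget"])
```

The same pipeline function runs unchanged against either schema object, which is the property the retained legacy file is meant to demonstrate.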

Next: Step 3 — LightRAGWrapper main-pipeline integration

felix-windsor and others added 6 commits May 21, 2026 16:36
- Remove handbook.md, STARTUP_GUIDE.md (content merged into README)
- Remove standalone comprehensive_test.py (covered by backend/tests)
- Remove empty docker-compose.prod.yml stub
- Remove deprecated /api/* routes (kept /api/v1/* as the only API surface)
- Remove integration test for deprecated routes (test_document_flow.py)
- All remaining tests pass

The 200-doc / 420-case synthetic_controlled benchmark is superseded by the
public corpus and no longer referenced. Delete the directory wholesale to
unclutter the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 2 of Graph RAG refactor — replaces v1 IT-system Chinese schema with
v2 software-project-doc English canonical schema, adds bilingual prompts
with 7 review-derived extraction rules, and fixes normalizer homonym bug.

## Changes

- schemas.py: replace v1 (15 Chinese types) with v2 (15 English canonical
  types in 5 tiers, 14 relations). Add SCHEMA_VERSION + SCHEMA_APPLICABLE_DOMAIN.
  Preserve v1 as schemas_v1_legacy.py for schema-agnostic demonstration.

- prompts.py: rewrite as bilingual (EN + ZH) prompt with 7 hard rules
  distilled from 5-document schema review. Add ENTITY_TYPES_BY_TIER
  grouping with Tier Z fallback for forward-compat. Prompt length 3170 chars.

- normalizer.py: fix homonym bug — merging condition upgraded from
  name-only to (name, type) composite. Same-name-different-type entities
  preserved as separate nodes with disambiguation suffix (e.g. 'agent[Module]'
  vs 'agent[Stakeholder]'). Relation endpoints flagged 'homonym_ambiguous'
  when ambiguous.

- test_extractor.py: rewrite with 14 tests covering 7 core Step 2 changes
  (v2 canonical / aliases / bilingual prompt / tier grouping / homonym
  preservation / same-type merge / no-suffix defensive guard). Replaces
  2 stale self-validating tests anchored to v1 Chinese names.

- Test cleanup: delete test_synthetic_controlled_eval_dataset.py (dangling
  reference to deleted benchmark). Update 2 sidecar tests with v2 drift
  counts (entity drift 1→3, relation drift 1→2) and set-based comparison
  to avoid pinning Counter.most_common() ordering.
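
The set-based comparison mentioned in the test cleanup item can be illustrated as follows. This is a sketch with hypothetical drift labels and a hypothetical `drift_counts` helper. `Counter.most_common()` returns tied counts in first-seen order, so a list comparison depends on the order in which labels were extracted.

```python
from collections import Counter


def drift_counts(labels):
    """Return (label, count) pairs, most frequent first."""
    return Counter(labels).most_common()


# Same labels and counts, seen in a different order (hypothetical data).
run_a = drift_counts(["Widget", "Gadget", "Widget", "Gadget", "Gizmo"])
run_b = drift_counts(["Gadget", "Widget", "Gadget", "Widget", "Gizmo"])

# List equality pins the tie ordering and therefore differs between runs.
lists_equal = run_a == run_b

# Set equality checks only the (label, count) pairs.
sets_equal = set(run_a) == set(run_b)
```

With tied counts for `Widget` and `Gadget`, `lists_equal` is false while `sets_equal` is true, which is why the updated tests compare sets.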

## Verification

- pytest tests/rag_core/: 20 passed / 0 failed
- pytest tests/ (full suite): 98 passed / 0 failed
- No regression in any preserved test
- schemas_v1_legacy.py preserved for schema-agnostic architecture proof

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d720607538

for block in blocks:
    if len(current) + len(block) + 2 <= max_chars:
        current = f"{current}\n\n{block}".strip()
        continue

P1: Preserve overflow text when chunking long blocks

When a single paragraph/block exceeds max_chars, assigning current = block[:max_chars] drops the remainder of that block entirely instead of emitting additional chunks. This causes silent data loss in sidecar extraction reports for long paragraphs (common in PRD/SRS docs), so reported entities/relations and drift metrics can be materially incomplete even though processing appears successful.
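
One way to address this, sketched under assumptions: `chunk_blocks` is a hypothetical standalone function, chunk IDs are numbered from 0, and oversized blocks are cut at fixed character offsets rather than at sentence boundaries. Only the `chunk_id` / `text` dictionary shape and the `\n\n` joiner come from the reviewed code.

```python
def chunk_blocks(blocks, max_chars):
    """Pack blocks into chunks of at most max_chars without dropping text."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks = []
    current = ""
    for block in blocks:
        # Slice an oversized block into pieces instead of truncating it.
        pieces = [block[i:i + max_chars] for i in range(0, len(block), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
                continue
            # The piece does not fit: flush the current chunk and start a new one.
            chunks.append({"chunk_id": f"input-{len(chunks)}", "text": current})
            current = piece
    if current:
        chunks.append({"chunk_id": f"input-{len(chunks)}", "text": current})
    return chunks


chunks = chunk_blocks(["aaaa", "b" * 25, "cc"], max_chars=10)
```

The 25-character block is emitted as three pieces, so joining the chunk texts recovers every input character. A production version would likely prefer splitting on sentence or line boundaries.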

    if current:
        chunks.append({"chunk_id": f"input-{index}", "text": current})
        index += 1
    current = block[:max_chars]

P2: Split oversized input blocks instead of truncating

The chunk loader for evaluation repeats the same truncation pattern (current = block[:max_chars]), which discards text after the first max_chars of any long block. In benchmark runs this can corrupt experiment results by evaluating only a prefix of long sections while still counting the case as fully processed.

from app.rag_core.schemas import Entity, Relation


WRAPPER_CHARS = "\"'`“”‘’《》〈〉()()[]【】{}"

P2: Keep bracket suffixes during entity-name normalization

Including [] in WRAPPER_CHARS makes normalize_entity_name strip the trailing ] from homonym-disambiguated names like agent[Module], which the new prompt explicitly encourages. In mixed outputs where entities are auto-disambiguated to bracketed names but relation endpoints already contain bracketed suffixes from the model, endpoint normalization can produce mismatched names and valid relations are marked as missing-endpoint invalid.
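
A possible shape for the fix, as a sketch only. The real `normalize_entity_name` is not shown in this review, so the body below is an assumption. `SUFFIX_RE` and `QUOTE_CHARS` are hypothetical helpers, and `WRAPPER_CHARS` is reduced to an ASCII subset for brevity.

```python
import re

# ASCII-only subset for illustration; the reviewed constant also lists CJK wrappers.
WRAPPER_CHARS = "\"'`()[]{}"
QUOTE_CHARS = "\"'` "

# Matches a trailing disambiguation suffix such as "agent[Module]".
SUFFIX_RE = re.compile(r"^(?P<base>.+?)\[(?P<type>[A-Za-z_]+)\]$")


def normalize_entity_name(name):
    """Strip wrapper characters but keep a trailing [Type] suffix intact."""
    # Remove outer quotes first so a quoted, suffixed name is still recognized.
    name = name.strip(QUOTE_CHARS)
    match = SUFFIX_RE.match(name)
    if match:
        base = match.group("base").strip(WRAPPER_CHARS + " ")
        if base:
            return f"{base}[{match.group('type')}]"
    # No suffix present: strip all wrapper characters as before.
    return name.strip(WRAPPER_CHARS + " ")


plain = normalize_entity_name('"agent"')
wrapped = normalize_entity_name("[agent]")
suffixed = normalize_entity_name("agent[Module]")
quoted_suffixed = normalize_entity_name('"agent[Module]"')
```

With this approach, entity names and relation endpoints that already carry a bracketed suffix normalize to the same string, so endpoint lookup no longer fails on a stripped `]`.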

pdf_path=pdf_path, output_dir=output_dir, lang=lang, **kwargs
)
text_path = Path(text_path)
text = text_path.read_text(encoding="utf-8")

P2: Restore non-UTF8 fallback for text document parsing

This now reads text files with encoding="utf-8" only, so .txt/.md files encoded as GBK/Latin-1/CP1252 fail parsing outright. The previous path handled common fallback encodings, so this is a regression that can block ingestion of legacy enterprise documents even when the content is otherwise valid.
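
A minimal sketch of an encoding fallback, assuming a fixed trial order. `read_text_with_fallback` and `FALLBACK_ENCODINGS` are hypothetical names, and the previous implementation's actual encoding list is not shown in this review. Latin-1 goes last because it decodes any byte sequence and would otherwise mask the other candidates.

```python
import tempfile
from pathlib import Path

# Trial order is an assumption; latin-1 never fails, so it must come last.
FALLBACK_ENCODINGS = ("utf-8", "gbk", "latin-1")


def read_text_with_fallback(path, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")


# Demonstration with a GBK-encoded file, which is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes("需求文档".encode("gbk"))
    text, used = read_text_with_fallback(legacy)
```

Trial-order decoding is a heuristic: a byte sequence can decode without error under the wrong encoding, so logging the encoding that was used helps when auditing ingested documents.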

try:
    entities, relations, drift = self._coerce_payload(repair_result.payload, chunk_id)
except ValidationError:

P1: Skip malformed items instead of dropping whole chunk

A single ValidationError from one bad entity or relation currently aborts the entire chunk extraction and returns an empty ExtractionResult, because the broad except ValidationError wraps the full payload coercion path. With noisy LLM output, one malformed record (for example an empty name/endpoint) can silently discard many otherwise valid entities and relations from the same chunk.
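
Per-item validation could look roughly like this. The sketch is self-contained and uses stand-ins: the `ValidationError` class and the `Entity` dataclass below are simplified placeholders for the project's real models, and `coerce_entities` is a hypothetical helper, not the actual `_coerce_payload`.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Stand-in for the validation error raised by the real models."""


@dataclass
class Entity:
    """Simplified placeholder for the project's Entity model."""
    name: str
    type: str

    def __post_init__(self):
        if not isinstance(self.name, str) or not self.name.strip():
            raise ValidationError("entity name must be a non-empty string")


def coerce_entities(raw_items):
    """Validate items one by one; collect failures instead of aborting."""
    entities, skipped = [], []
    for index, raw in enumerate(raw_items):
        try:
            entities.append(Entity(**raw))
        except (ValidationError, TypeError) as exc:
            # TypeError covers missing or unexpected keys and non-mapping items.
            skipped.append((index, str(exc)))
    return entities, skipped


entities, skipped = coerce_entities([
    {"name": "agent", "type": "Module"},
    {"name": "", "type": "Module"},       # malformed: empty name
    {"name": "scheduler", "type": "Module"},
])
```

The malformed second item is recorded in `skipped` while the two valid entities survive. Returning the skipped items also gives the caller something to log, so the loss is no longer silent.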
