Releases: arthurfantaci/graphrag-api-db
v0.2.0
What's New
Community Summary Embeddings (#36)
- New `CommunityEmbedder` class embeds Community node summaries with Voyage AI `voyage-4` (1024d)
- `community_summary_embeddings` vector index enables semantic search over communities
- Idempotent: only embeds communities missing `summary_embedding`
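The idempotent pass can be sketched as follows. This is an illustration only, not the actual `CommunityEmbedder` API: the function and query names are hypothetical, `driver` stands for a Neo4j driver, and `embed` stands for a Voyage AI `voyage-4` call returning a 1024-dimension vector.

```python
# Hypothetical sketch: embed only communities missing summary_embedding.

FETCH_MISSING = """
MATCH (c:Community)
WHERE c.summary IS NOT NULL AND c.summary_embedding IS NULL
RETURN elementId(c) AS id, c.summary AS summary
"""

STORE = """
MATCH (c:Community) WHERE elementId(c) = $id
SET c.summary_embedding = $vector
"""

def embed_missing_communities(driver, embed):
    """Embed communities lacking summary_embedding; re-runs are no-ops."""
    with driver.session() as session:
        rows = session.run(FETCH_MISSING).data()
        for row in rows:
            session.run(STORE, id=row["id"], vector=embed(row["summary"]))
    return len(rows)  # communities embedded in this run
```

Because the fetch query filters on `summary_embedding IS NULL`, a second run finds nothing to do, which is what makes the operation idempotent.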
Gleaning Entity Label Fix (#37)
- Fixed `ExtractionGleaner._merge_gleaned_results()` MERGE pattern to include `__Entity__` and `__KGBuilder__` labels
- Gleaned entities are now visible to the entity resolver, cross-label dedup, and downstream queries
- Added `examples/backfill_entity_labels.py` utility for repairing existing data
- Added `examples/diagnose_concept_anomaly.py` diagnostic script
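The label bug can be illustrated generically. The Cypher below is not the project's actual merge query; it only shows why a node created without the shared labels stays invisible to queries that match on them.

```python
# Illustrative only, not the ExtractionGleaner code.

# Before the fix: the gleaned node gets only its specific type label,
# so MATCH (e:__Entity__) never sees it.
MERGE_BEFORE = """
MERGE (e:Concept {name: $name})
"""

# After the fix: the node also carries the shared labels, so the
# entity resolver and cross-label dedup can find it.
MERGE_AFTER = """
MERGE (e:Concept:__Entity__:__KGBuilder__ {name: $name})
"""
```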
Rename Jama-Prefixed Classes (#39)
- `JamaGuideScraper` → `GuideScraper`
- `JamaHTMLLoader` → `GuideHTMLLoader`
- `JamaKGPipelineConfig` → `KGPipelineConfig`
- `create_jama_kg_pipeline()` → `create_kg_pipeline()`
- User-Agent updated to `GuideScraper/0.1.0`
- All proper nouns (URLs, "Jama Software", "Jama Connect") preserved
Documentation Overhaul
- README.md comprehensively updated: architecture diagram with all 10 post-processing steps, 15 features, complete project structure (42 source files, 13 tests, 4 examples), full schema (18 node types, 16 relationship types), Voyage AI configuration, community search queries
- CLAUDE.md updated with CommunityEmbedder module, gleaning label requirement, renumbered modules
Full Changelog
v0.1.0 — Initial Release
The first release of the GraphRAG Knowledge Graph Pipeline — a complete end-to-end system that scrapes Jama Software's Essential Guide to Requirements Management and Traceability and loads it into a Neo4j knowledge graph using LLM-based entity extraction, vector embeddings, and community detection.
Highlights
- 5-stage pipeline: Scrape → Extract & Embed → Normalize → Supplement → Validate
- Schema-constrained extraction with 10 node types and 10 relationship types via `neo4j_graphrag`
- Voyage AI `voyage-4` embeddings (1024d) with automatic OpenAI fallback
- Leiden community detection with LLM-generated community summaries
- Industry taxonomy normalization consolidating 100+ variants into 18 canonical industries
- Comprehensive validation with pass/fail checks and idempotent repair operations
Added
- Async web scraping pipeline with `httpx` and optional Playwright for JS-rendered content
- Neo4j GraphRAG integration via `neo4j_graphrag.SimpleKGPipeline`
- LangChain `HTMLHeaderTextSplitter` for hierarchical document chunking
- Optional Chonkie `SemanticChunker` with Savitzky-Golay boundary detection
- Entity post-processing pipeline: normalize, deduplicate, cleanup, consolidate, backfill, summarize
- LangExtract augmentation with source grounding (text span provenance)
- Supplementary graph structure: Chapter, Resource (Image/Video/Webinar), and Glossary nodes
- CLI with `scrape` and `validate` subcommands, dry-run support, and cost estimation
- Pre-flight validation before pipeline ingestion
- CI/CD pipeline with linting (Ruff), type checking (ty), unit tests, and integration tests
- PEP 561 `py.typed` marker for downstream type-checking support
- Example scripts in `examples/` directory (knowledge graph querying)
- Contributing guide, Dependabot configuration, and README badges
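The hierarchical-chunking idea behind header-based splitting can be shown with a stdlib-only toy. This is not LangChain's `HTMLHeaderTextSplitter` implementation; it only demonstrates the principle of attaching the current header path to each chunk.

```python
from html.parser import HTMLParser

class HeaderChunker(HTMLParser):
    """Toy splitter: cut a new chunk at each h1/h2 and record the
    current header path as that chunk's metadata."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # list of (header_path, text)
        self.path = {}        # e.g. {"h1": "...", "h2": "..."}
        self.in_header = None
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self._flush()
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            text = "".join(self.buffer).strip()
            # h1 resets the path; h2 nests under the current h1.
            self.path = {"h1": text} if tag == "h1" else {**self.path, "h2": text}
            self.in_header, self.buffer = None, []

    def handle_data(self, data):
        self.buffer.append(data)

    def _flush(self):
        text = "".join(self.buffer).strip()
        if text and self.in_header is None:
            self.chunks.append((dict(self.path), text))
        self.buffer = []

def split_html(html):
    parser = HeaderChunker()
    parser.feed(html)
    parser._flush()  # emit trailing body text after the last header
    return parser.chunks
```

Each emitted chunk carries its `{"h1": ..., "h2": ...}` context, which is what makes header-aware chunks useful for retrieval: the hierarchy survives into chunk metadata.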
Changed
- Consolidated `models_core.py` into the `models` subpackage as `models/content.py` for structural consistency
- Unified `LLM_EXTRACTED_ENTITY_LABELS` as a single-source `frozenset` in `extraction/schema.py`
- Curated public API exports in the `postprocessing` and `validation` packages
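The single-source pattern amounts to declaring the label set once and having downstream code import it instead of redeclaring it. The labels below are illustrative, not the project's actual set:

```python
# extraction/schema.py-style single source of truth (labels are examples).
LLM_EXTRACTED_ENTITY_LABELS: frozenset[str] = frozenset({
    "Concept", "Process", "Role", "Tool",
})

def is_extracted_label(label: str) -> bool:
    """Downstream modules check membership rather than keeping their own copy."""
    return label in LLM_EXTRACTED_ENTITY_LABELS
```

Using a `frozenset` makes the shared constant hashable and immutable, so no consumer can drift the set out of sync by mutating it.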
Fixed
- Cypher double-WHERE syntax in relabel query (use WITH bridge)
- Chunk ordering property name (`chunk_index` → `index`)
- Voyage AI dimensions in `.env.example` (1536d → 1024d)
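The double-WHERE fix reflects a Cypher rule: each reading clause takes at most one `WHERE`, so a second filter must be bridged with `WITH`. A generic illustration follows; this is not the project's actual relabel query:

```python
# Invalid Cypher: two WHERE clauses attached to a single MATCH.
BAD = """
MATCH (n:Concept)
WHERE n.name IS NOT NULL
WHERE NOT n:__Entity__
SET n:__Entity__
"""

# Valid: WITH re-scopes the row so a second WHERE can follow.
GOOD = """
MATCH (n:Concept)
WHERE n.name IS NOT NULL
WITH n
WHERE NOT n:__Entity__
SET n:__Entity__
"""
```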