Releases: arthurfantaci/graphrag-api-db

v0.2.0

21 Feb 20:09
701e7fb

What's New

Community Summary Embeddings (#36)

  • New CommunityEmbedder class embeds Community node summaries with Voyage AI voyage-4 (1024d)
  • community_summary_embeddings vector index enables semantic search over communities
  • Idempotent — only embeds communities missing summary_embedding
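The idempotency rule above can be sketched in a few lines. This is a minimal illustration, not the actual CommunityEmbedder code: the dict-based community records, the `embed_missing_summaries` helper, and the `fake_embed` stand-in for the Voyage AI call are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def embed_missing_summaries(
    communities: Sequence[Dict],
    embed: Callable[[List[str]], List[Vector]],
) -> int:
    """Embed only communities that have a summary but no summary_embedding.

    Returns the number of communities embedded on this call.
    """
    pending = [
        c for c in communities
        if c.get("summary") and c.get("summary_embedding") is None
    ]
    if not pending:
        return 0  # nothing to do, so a re-run costs no embedding calls
    vectors = embed([c["summary"] for c in pending])
    for community, vector in zip(pending, vectors):
        community["summary_embedding"] = vector
    return len(pending)


def fake_embed(texts: List[str]) -> List[Vector]:
    """Hypothetical stand-in for the embedding API: fixed 1024-d vectors."""
    return [[float(len(text))] * 1024 for text in texts]


communities = [
    {"id": 1, "summary": "Requirements traceability", "summary_embedding": None},
    {"id": 2, "summary": "Risk management", "summary_embedding": [0.0] * 1024},
]
first_run = embed_missing_summaries(communities, fake_embed)
second_run = embed_missing_summaries(communities, fake_embed)
```

Running the helper twice over the same records embeds the one missing summary on the first pass and nothing on the second, which is the behaviour the release note describes.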

Gleaning Entity Label Fix (#37)

  • Fixed ExtractionGleaner._merge_gleaned_results() MERGE pattern to include __Entity__ and __KGBuilder__ labels
  • Gleaned entities are now visible to entity resolver, cross-label dedup, and downstream queries
  • Added examples/backfill_entity_labels.py utility for repairing existing data
  • Added examples/diagnose_concept_anomaly.py diagnostic script
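To make the label fix concrete, here is a hedged sketch of a MERGE statement that carries both extra labels. The `build_gleaned_merge` helper, the `name` merge key, and the `$properties` parameter are assumptions for illustration; the real query inside ExtractionGleaner._merge_gleaned_results() is not shown in these notes.

```python
def build_gleaned_merge(label: str) -> str:
    """Return a Cypher MERGE that tags a gleaned entity with the shared labels.

    The `name` key and `$properties` parameter are hypothetical choices
    for this sketch.
    """
    # Labels cannot be passed as query parameters, so validate before interpolating.
    if not label.isidentifier():
        raise ValueError(f"unsafe label: {label!r}")
    return (
        f"MERGE (n:`{label}`:`__Entity__`:`__KGBuilder__` {{name: $name}}) "
        "SET n += $properties"
    )


query = build_gleaned_merge("Concept")
```

Without `__Entity__` and `__KGBuilder__` in the pattern, a node created by the gleaning pass would carry only its type label, which matches the visibility problem described above.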

Rename Jama-Prefixed Classes (#39)

  • JamaGuideScraper → GuideScraper
  • JamaHTMLLoader → GuideHTMLLoader
  • JamaKGPipelineConfig → KGPipelineConfig
  • create_jama_kg_pipeline() → create_kg_pipeline()
  • User-Agent updated to GuideScraper/0.1.0
  • All proper nouns (URLs, "Jama Software", "Jama Connect") preserved

Documentation Overhaul

  • README.md comprehensively updated: architecture diagram with all 10 post-processing steps, 15 features, complete project structure (42 source files, 13 tests, 4 examples), full schema (18 node types, 16 relationship types), Voyage AI configuration, community search queries
  • CLAUDE.md updated with CommunityEmbedder module, gleaning label requirement, renumbered modules

Full Changelog: v0.1.0...v0.2.0

v0.1.0 — Initial Release

20 Feb 17:32
e794341

The first release of the GraphRAG Knowledge Graph Pipeline — a complete end-to-end system that scrapes Jama Software's Essential Guide to Requirements Management and Traceability and loads it into a Neo4j knowledge graph using LLM-based entity extraction, vector embeddings, and community detection.

Highlights

  • 5-stage pipeline: Scrape → Extract & Embed → Normalize → Supplement → Validate
  • Schema-constrained extraction with 10 node types and 10 relationship types via neo4j_graphrag
  • Voyage AI voyage-4 embeddings (1024d) with automatic OpenAI fallback
  • Leiden community detection with LLM-generated community summaries
  • Industry taxonomy normalization consolidating 100+ variants into 18 canonical industries
  • Comprehensive validation with pass/fail checks and idempotent repair operations
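As an illustration of the taxonomy normalization highlight, the sketch below collapses spelling variants onto canonical names through a lookup table. The variant table, the canonical names, and the `normalize_industry` helper are hypothetical; the pipeline's actual 18-industry taxonomy is not listed in these notes.

```python
from typing import Optional

# Hypothetical variant table; keys are lower-cased and whitespace-collapsed.
INDUSTRY_VARIANTS = {
    "medical device": "Medical Devices",
    "medical devices": "Medical Devices",
    "medtech": "Medical Devices",
    "auto": "Automotive",
    "automotive": "Automotive",
    "aerospace and defense": "Aerospace & Defense",
    "aerospace & defense": "Aerospace & Defense",
}


def normalize_industry(raw: str) -> Optional[str]:
    """Map a raw industry string to its canonical name, or None if unknown."""
    # Lower-case, treat hyphens as spaces, and collapse repeated whitespace.
    key = " ".join(raw.lower().replace("-", " ").split())
    return INDUSTRY_VARIANTS.get(key)


canonical = normalize_industry("  Medical-Device ")
unknown = normalize_industry("Basket Weaving")
```

Returning None for unknown values keeps the decision about unmapped industries (drop, flag, or keep as-is) with the caller rather than hiding it in the lookup.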

Added

  • Async web scraping pipeline with httpx and optional Playwright for JS-rendered content
  • Neo4j GraphRAG integration via neo4j_graphrag.SimpleKGPipeline
  • LangChain HTMLHeaderTextSplitter for hierarchical document chunking
  • Optional Chonkie SemanticChunker with Savitzky-Golay boundary detection
  • Entity post-processing pipeline: normalize, deduplicate, cleanup, consolidate, backfill, summarize
  • LangExtract augmentation with source grounding (text span provenance)
  • Supplementary graph structure: Chapter, Resource (Image/Video/Webinar), and Glossary nodes
  • CLI with scrape and validate subcommands, dry-run support, and cost estimation
  • Pre-flight validation before pipeline ingestion
  • CI/CD pipeline with linting (Ruff), type checking (ty), unit tests, and integration tests
  • PEP 561 py.typed marker for downstream type checking support
  • Example scripts in examples/ directory (knowledge graph querying)
  • Contributing guide, Dependabot configuration, and README badges

Changed

  • Consolidated models_core.py into models/content.py subpackage for structural consistency
  • Unified LLM_EXTRACTED_ENTITY_LABELS as single-source frozenset in extraction/schema.py
  • Curated public API exports in postprocessing and validation packages

Fixed

  • Cypher double-WHERE syntax in relabel query (use WITH bridge)
  • Chunk ordering property name (chunk_index → index)
  • Voyage AI dimensions in .env.example (1536d → 1024d)
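The double-WHERE fix relies on a general Cypher rule: a single MATCH accepts only one WHERE, so a second filter needs a WITH clause in between. The sketch below shows the pattern with a hypothetical `bridge_filters` helper and made-up predicates and labels; the project's actual relabel query is not reproduced here.

```python
from typing import Sequence


def bridge_filters(
    match_clause: str,
    variable: str,
    predicates: Sequence[str],
    tail: str,
) -> str:
    """Chain predicates so each WHERE after the first follows a WITH bridge."""
    if not predicates:
        return f"{match_clause} {tail}"
    parts = [f"{match_clause} WHERE {predicates[0]}"]
    for predicate in predicates[1:]:
        # "WHERE a WHERE b" is a syntax error; "WITH n WHERE b" is valid.
        parts.append(f"WITH {variable} WHERE {predicate}")
    parts.append(tail)
    return " ".join(parts)


relabel_query = bridge_filters(
    "MATCH (n:Concept)",
    "n",
    ["n.name IS NOT NULL", "n.name IN $names"],
    "SET n:Industry REMOVE n:Concept",  # hypothetical relabel step
)
```

Combining the predicates with AND in one WHERE would also work; the WITH bridge is useful when the two filters are assembled by separate pieces of code.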