AstraGraph

AST (Abstract Syntax Trees) + RAG (Retrieval-Augmented Generation) + GRAPH (Graph-based).

AstraGraph is a powerful GraphRAG for source code that understands the true structure of your codebase. It ingests a repository into a Neo4j property graph and a Qdrant vector store simultaneously, giving LLMs the ability to navigate your code exactly how a developer would—tracing imports, inheritance, and call chains to answer complex questions by combining structural (Cypher) and semantic (embedding) retrieval through a LangGraph agent.

Web interface

showcase.mp4

The UI (/ui) is a single-page app with no build step.

Left panel — chat:

Hybrid/graph/vector mode selector
Send a question, get an answer with source badges
Click a source badge to isolate that node in the graph (fetches its 1-hop neighborhood from the server if not currently visible)

Right panel — graph:

Cytoscape.js force-directed graph of the repo's call/inheritance/import structure
Nodes sized by in-degree (structural importance)
Type filters: function, class, module, package
Node count selector: 25 / 50 / 75 / 100 / 200 / 500
Layout selector: cose, concentric, breadthfirst, circle, grid
File Viewer: Clicking a node in the graph will open the file viewer. For functions, it displays the syntax-highlighted source code with correct line numbers. For classes and modules, it shows a detailed metadata card.
Dark / light theme toggle, persisted in localStorage

Ingest modal:

Paste a GitHub URL (Currently only works for python projects)
Phase pills + progress bar track the ingestion pipeline in real time
Errors surface inline

The Strength of Hybrid Retrieval

Most code search tools struggle because they are purely semantic (embeddings miss exact references and structural hierarchy). AstraGraph leverages abstract syntax trees to extract the true structure of the code. It then runs both semantic and graph retrieval in parallel and fuses the results using Reciprocal Rank Fusion (RRF), unlocking capabilities standard RAG cannot match. Providing your LLM with both structured and unstructured context (and their interdependencies). Astragraph does:

Multi-hop reasoning — Because code is represented as a connected graph, AstraGraph excels at multi-hop questions. It can accurately trace complex dependencies, such as answering "What downstream functions are affected if I change the return type of load_data in utils.py?", by seamlessly traversing CALLS, INHERITS, and IMPORTS relationships.
Graph retrieval — Cypher queries over a property graph ensure perfect accuracy for structural questions: "what calls build_graph?", "what classes inherit from BaseModel?", "what does this module import?"
Vector retrieval — sentence-transformer embeddings over function and class bodies handle conceptual queries: "how does the pipeline handle errors?", "where is retry logic implemented?"
Hybrid precision (default) — both retrievers run at double top_k, then RRF merges and reranks the results. You get the exactness of structural connections combined with the fuzziness of semantic search.

The LLM only sees the most relevant top-k retrieved nodes and their multi-hop context. It never touches the full codebase, preventing context window bloat and reducing hallucinations.

Architecture

Ingestion pipeline (two passes)

Pass 1 — per file:

Parse source with tree-sitter
Extract entities: Repository → Package → Module → Class/Function → Attribute/Parameter
Accumulate all entities in memory
After all files: bulk-write to Neo4j in ~17 round trips (not one per entity)

Between passes: build a repo-wide function lookup by name and qualified name.

Pass 2 — repo-wide:

Resolve CALLS_UNKNOWN edges → upgrade to CALLS where the callee is found in the lookup
Embed all functions and classes with sentence-transformers (all-MiniLM-L6-v2, 384d)
Upsert embeddings into Qdrant

Graph schema

Repository
  └─[PART_OF]─ Package
                └─[CONTAINS]─ Module
                               ├─[DEFINED_IN]─ Class
                               │    ├─[METHOD_OF]─ Function
                               │    └─[DEFINES_ATTR]─ Attribute
                               └─[DEFINED_IN]─ Function
                                    └─[HAS_PARAM]─ Parameter

Cross-cutting edges: CALLS, INHERITS, IMPORTS, IMPORTS_EXTERNAL.
Unresolved edges are written to :Unresolved audit nodes via HAS_UNRESOLVED — never silently dropped.

Agent (LangGraph)

Five-node StateGraph:

route_query → [graph_retrieval, vector_retrieval] → rrf_merge → synthesize

route_query classifies the query as graph, vector, or hybrid (default). For hybrid, both retrievers run at 2×top_k and RRF reranks before the LLM synthesizes an answer.

Storage abstraction

pipeline.py and the agent depend on GraphStore and VectorStore Protocols (storage/protocols.py), not on Neo4j or Qdrant client classes. The concrete implementations (Neo4jStore, QdrantStore) are protocol-conforming facades. Adding a new backend (e.g. KuzuDB + ChromaDB) means one new file per store — no changes to the pipeline or agent.

UUIDs

All node UUIDs are deterministic: MD5(repo_id + "::" + file_path + "::" + qualified_name). Every write is idempotent (MERGE in Neo4j, upsert in Qdrant). The same UUID appears in both stores, so graph and vector results can be cross-referenced.

Tech stack

Layer	Technology
Parsing	tree-sitter (Python, TypeScript, Go)
Graph DB	Neo4j 5
Vector DB	Qdrant
Embeddings	sentence-transformers `all-MiniLM-L6-v2` (384d, CPU)
Agent	LangGraph
LLM	Claude (Anthropic) / Groq (OpenAI-compat) / Ollama
API	FastAPI + Uvicorn
Frontend	Vanilla JS + Cytoscape.js + highlight.js
Container	Docker Compose

Performance

Measured on the FastAPI repo (527 Python files):

Phase	Time
Total ingestion	~45s
Neo4j entity writes	< 1s
Neo4j relationship writes	1.1s
CPU embedding (512 batch)	~42s

Running locally

Prerequisites

Docker and Docker Compose
Python 3.12+ with uv

1. Start the databases

docker compose up neo4j qdrant -d

Wait ~15s for Neo4j to initialise.

2. Install dependencies

uv sync

3. Configure

Copy and edit the environment:

# Minimum required for Anthropic (default provider):
export ANTHROPIC_API_KEY=sk-ant-...

# Or use Groq (free tier, fast):
export LLM_PROVIDER=openai_compat
export OPENAI_COMPAT_URL=https://api.groq.com/openai/v1
export OPENAI_COMPAT_APIKEY=gsk_...
export LLM_MODEL=llama-3.3-70b-versatile

# Or use Ollama (fully local):
export LLM_PROVIDER=ollama
export LLM_MODEL=llama3.2:3b

4. Ingest a repository

# Ingest a local path
uv run python cli.py --repo ./path/to/repo --repo-id my-repo

# Ingest only Python files
uv run python cli.py --repo ./path/to/repo --languages python

# Dry run (extract only, no DB writes)
uv run python cli.py --repo ./path/to/repo --dry-run

Ingestion can also be triggered via the UI or API (see below).

5. Start the API server

uv run uvicorn api.server:app --reload

Open http://localhost:8000/ui for the web interface.

Running with Docker Compose (production)

# Create a .env file:
cat > .env <<EOF
NEO4J_PASSWORD=your-password
LLM_PROVIDER=openai_compat
OPENAI_COMPAT_APIKEY=gsk_...
LLM_MODEL=llama-3.3-70b-versatile
EOF

docker compose up --build -d

The API is bound to 127.0.0.1:8000.

Environment variables

Variable	Default	Description
`NEO4J_URI`	`bolt://localhost:7687`	Neo4j connection
`NEO4J_USER`	`neo4j`	Neo4j username
`NEO4J_PASSWORD`	`password`	Neo4j password
`QDRANT_HOST`	`localhost`	Qdrant host
`QDRANT_PORT`	`6333`	Qdrant port
`COLLECTION`	`codebase`	Qdrant collection name
`EMBED_MODEL`	`all-MiniLM-L6-v2`	sentence-transformers model
`EMBED_BATCH_SIZE`	`512`	Embedding batch size
`LLM_PROVIDER`	`anthropic`	`anthropic` \| `openai_compat` \| `ollama`
`LLM_MODEL`	`claude-sonnet-4-6`	Model name for the chosen provider
`ANTHROPIC_API_KEY`	—	Required when `LLM_PROVIDER=anthropic`
`OPENAI_COMPAT_URL`	`https://api.groq.com/openai/v1`	Base URL for OpenAI-compatible provider
`OPENAI_COMPAT_APIKEY`	—	Required when `LLM_PROVIDER=openai_compat`
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama server URL

API reference

Method	Path	Description
`GET`	`/health`	Liveness check
`GET`	`/config`	Exposes non-secret config fields (embed model, LLM, etc.)
`GET`	`/repos`	List all ingested repositories
`DELETE`	`/repos/{repo_id}`	Delete a repo from Neo4j and Qdrant
`POST`	`/ingest`	Clone a GitHub URL and run ingestion in the background
`GET`	`/ingest/status/{job_id}`	Poll ingestion job status and phase
`POST`	`/query`	Full RAG query — retrieval + LLM synthesis
`POST`	`/retrieve`	Retrieval only — no LLM, used by evals
`GET`	`/graph/{repo_id}`	Top-N nodes by structural importance + edges, for visualisation
`GET`	`/graph/{repo_id}/node/{uuid}`	Single node + 1-hop neighborhood
`GET`	`/ui`	Web interface (served from `static/`)

POST /query

{
  "query": "what calls build_graph?",
  "repo_id": "astragraph",
  "mode": "hybrid",
  "top_k": 5
}

Response includes answer (LLM synthesis) and provenance (list of retrieved nodes with file path, line range, and retrieval score).

POST /ingest

{
  "github_url": "https://github.com/tiangolo/fastapi",
  "repo_id": "fastapi"
}

Returns a job_id immediately. Poll /ingest/status/{job_id} to track phases: cloning → parsing → writing_entities → writing_relationships → resolving_calls → embedding → writing_vectors → done.

Evaluation

Three evaluation layers are planned; Layer 3 is implemented:

Layer 3 — Structural + semantic query set (done)
evals/dataset/structural.json (11 structural Cypher-derived cases) and evals/dataset/semantic.json (11 semantic source-reading cases). Run with:

uv run python evals/structural_eval.py --dataset structural
uv run python evals/structural_eval.py --dataset semantic
uv run python evals/structural_eval.py --dataset both

Results show graph mode wins 11/11 structural queries; all-MiniLM-L6-v2 is the bottleneck on semantic queries.

Layer 1 — CoIR baseline (not implemented)
evals/coir_eval.py exists as a stub. Intended to run CodeSearchNet through the vector retriever and report precision@5 / recall@5 against an external leaderboard.

Layer 2 — RAGAS synthetic dataset (not implemented)
Auto-generate ~50 (query, ground-truth contexts, answer) triples from ingested function bodies using an LLM, then score faithfulness, answer relevancy, context precision, and context recall.

Language support

Language	Extraction	Status
Python	Full — functions, classes, attributes, parameters, imports, calls	Complete
TypeScript	Stub — file parsed, entities not extracted	Partial
Go	Stub — file parsed, entities not extracted	Partial

Not yet implemented

Feature	Description
fastembed	Replace `sentence-transformers` with the ONNX runtime via `fastembed`. Same `all-MiniLM-L6-v2` model, typically 3–5× faster on CPU. Would push embedding from ~134s → ~30–40s.
Community detection	Run Louvain/Leiden on the CALLS graph post-ingestion. Store `community_id` and an LLM-generated cluster label on each `FunctionNode`. Enables "show me the auth cluster."
Process tracing	BFS from entry points (route handlers, `__main__`, CLI commands) through CALLS edges. Store as `Process` nodes with `STEP_IN` edges. Enables "show me the full call chain for POST /users."
Incremental ingestion	Checkpoint file tracking `mtime` per file. Skip unchanged files, `DETACH DELETE` stale nodes for changed/deleted files. Makes re-indexing live repos practical.
TypeScript / Go extraction	Full entity extraction for TypeScript and Go (stubs exist, tree-sitter grammars are wired up).
CoIR baseline eval	Layer 1 eval against CodeSearchNet.
RAGAS synthetic eval	Layer 2 eval with auto-generated test triples.
MCP server	Wrap as an MCP tool for use inside Cursor / Claude Code / Open Code etc...
Embedded store option	Replace Neo4j + Qdrant with KuzuDB + ChromaDB for a single-container, zero-signup deployment (e.g. HuggingFace Spaces). The Protocol abstraction in `storage/` makes this two new files.
Frontend v2	React + Tailwind + shadcn/ui + Reagraph (WebGL) for larger repos and more interactivity.

Project structure

astragraph/
├── agent/
│   ├── graph.py              # LangGraph StateGraph (build_graph)
│   ├── nodes.py              # 5 agent nodes as closures
│   ├── rrf.py                # Reciprocal Rank Fusion
│   ├── state.py              # AgentState TypedDict
│   └── retrievers/
│       ├── graph_retriever.py
│       └── vector_retriever.py
├── api/
│   └── server.py             # FastAPI app, all endpoints
├── config.py                 # All config via env vars
├── evals/
│   ├── coir_eval.py          # CoIR baseline (stub)
│   ├── structural_eval.py    # Layer 3 eval runner
│   └── dataset/
│       ├── structural.json
│       └── semantic.json
├── graph/
│   └── schema.py             # Neo4j constraints + fulltext index
├── ingestion/
│   ├── embedder.py           # sentence-transformers wrapper
│   ├── models.py             # All dataclasses (FunctionNode, ClassNode, …)
│   ├── pipeline.py           # Two-pass ingestion orchestrator
│   ├── relationships.py      # Pass 1 + Pass 2 relationship resolution
│   ├── walker.py             # Repo file walker (respects .gitignore)
│   ├── extractors/
│   │   ├── python.py         # Full Python extraction
│   │   ├── typescript.py     # Stub
│   │   └── go.py             # Stub
│   └── writers/
│       ├── neo4j_writer.py   # Bulk Neo4j entity writes
│       └── vector_writer.py  # Qdrant upsert
├── static/
│   └── index.html            # Single-page UI (Cytoscape.js, highlight.js)
├── storage/
│   ├── protocols.py          # GraphStore + VectorStore Protocols
│   ├── neo4j_store.py        # Neo4jStore (read + write facade)
│   └── qdrant_store.py       # QdrantStore (read + write facade)
├── cli.py                    # CLI entry point
├── docker-compose.yml
└── Dockerfile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AstraGraph

Web interface

The Strength of Hybrid Retrieval

Architecture

Ingestion pipeline (two passes)

Graph schema

Agent (LangGraph)

Storage abstraction

UUIDs

Tech stack

Performance

Running locally

Prerequisites

1. Start the databases

2. Install dependencies

3. Configure

4. Ingest a repository

5. Start the API server

Running with Docker Compose (production)

Environment variables

API reference

POST /query

POST /ingest

Evaluation

Language support

Not yet implemented

Project structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
agent		agent
api		api
evals		evals
graph		graph
ingestion		ingestion
static		static
storage		storage
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
cli.py		cli.py
config.py		config.py
docker-compose.yml		docker-compose.yml
main.py		main.py
pyproject.toml		pyproject.toml
run_pipeline.py		run_pipeline.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

AstraGraph

Web interface

The Strength of Hybrid Retrieval

Architecture

Ingestion pipeline (two passes)

Graph schema

Agent (LangGraph)

Storage abstraction

UUIDs

Tech stack

Performance

Running locally

Prerequisites

1. Start the databases

2. Install dependencies

3. Configure

4. Ingest a repository

5. Start the API server

Running with Docker Compose (production)

Environment variables

API reference

POST /query

POST /ingest

Evaluation

Language support

Not yet implemented

Project structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages