Local-first RAG with verifiable citations, offline execution, and no required cloud APIs.
DocAgent Studio ingests PDFs, Markdown notes, and Notion exports into a local SQLite database, retrieves context with hybrid lexical plus vector search, and answers questions with traceable citations.
- CLI ingest for PDFs, Markdown, and exported Notion content
- SQLite-backed storage with FTS5 lexical search and local embedding sidecars
- FastAPI web UI for ingest, indexing, search, Q&A, and graph exploration
- Citation-grounded answers with local Ollama models
- Lightweight entity graph for GraphRAG-style navigation
- Eval commands for retrieval recall and citation coverage
- Optional MLX LoRA dataset export for on-device fine-tuning
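The hybrid retriever merges an FTS5 lexical ranking with a vector-similarity ranking. One common way to do that is reciprocal rank fusion; the sketch below is illustrative only and does not reflect the actual API in `src/docagent/index/retriever.py`.

```python
def rrf_fuse(lexical_ids, vector_ids, k=60):
    """Reciprocal rank fusion: merge two ranked chunk-id lists into one.

    Each list contributes 1 / (k + rank) per chunk; higher totals rank first.
    """
    scores = {}
    for ranking in (lexical_ids, vector_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Chunk 'c2' ranks near the top of both lists, so fusion puts it first.
print(rrf_fuse(["c1", "c2", "c3"], ["c2", "c4"]))  # -> ['c2', 'c1', 'c4', 'c3']
```

The constant `k=60` is the usual RRF smoothing value; it keeps a single top-1 hit from dominating chunks that rank well in both lists.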
- Web app: src/docagent/web/server.py
- Hybrid retrieval: src/docagent/index/retriever.py
- Eval runner: src/docagent/eval/run_eval.py
- Graph build and query flow: src/docagent/graph/build.py, src/docagent/graph/query.py
- CI workflow: .github/workflows/ci.yml
- Design notes and smoke metrics: docs/paper.md
```mermaid
flowchart LR
  A[PDF / Markdown / Notion export] --> B[Ingest + chunking]
  B --> C[(SQLite documents + chunks)]
  C --> D[FTS5 index]
  C --> E[Embedding sidecars .npy]
  D --> F[Hybrid retriever]
  E --> F
  F --> G[Local LLM answer with citations]
  C --> H[Entity graph]
  H --> F
  G --> I[CLI and FastAPI web UI]
```
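The FTS5 leg of the pipeline needs nothing beyond the Python standard library. The sketch below shows the general shape; the table and column names are made up here, not the project's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the real tool persists to ./data/docs.db
con.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id, text)")
con.executemany(
    "INSERT INTO chunks_fts VALUES (?, ?)",
    [("c1", "secure base and attachment theory"),
     ("c2", "notes on sqlite ranking")],
)
# bm25() is FTS5's built-in relevance function; lower values rank better.
rows = con.execute(
    "SELECT chunk_id FROM chunks_fts WHERE chunks_fts MATCH ? "
    "ORDER BY bm25(chunks_fts)",
    ("attachment",),
).fetchall()
print(rows)  # -> [('c1',)]
```

Keeping the lexical index inside the same SQLite file as the documents is what makes the whole store a single portable artifact.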
```
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
```

For the web UI:

```
pip install -e '.[web]'
```

Ingest your documents:

```
docagent ingest --input /path/to/your/docs --db ./data/docs.db
```

Supported inputs: `*.pdf`, `*.md`, `*.markdown`, `*.txt`
For Notion exports, unzip the Markdown export and point --input at the extracted folder.
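Conceptually, ingest is a recursive walk over the input folder plus chunking. A minimal sketch, assuming character-window chunking; the size and overlap values here are illustrative, not the defaults of `docagent ingest`:

```python
from pathlib import Path

SUPPORTED = {".pdf", ".md", ".markdown", ".txt"}

def iter_source_files(root):
    """Yield every supported file under root, including unzipped Notion exports."""
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() in SUPPORTED:
            yield path

def chunk_text(text, size=800, overlap=100):
    """Split text into overlapping character windows so no sentence is
    stranded at a chunk boundary."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

print(len(chunk_text("x" * 2000)))  # step of 700 chars -> 3 chunks
```

Overlap trades a little storage for retrieval robustness: a query term near a boundary still lands fully inside at least one chunk.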
```
docagent index --db ./data/docs.db
```

This creates:

- SQLite FTS data inside the database
- `docs.db.chunk_ids.npy`
- `docs.db.embeddings.npy`
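The two `.npy` sidecars are just a parallel id array and embedding matrix next to the SQLite file. A sketch of how they pair up at query time, assuming brute-force cosine similarity (the sidecar filenames match the README; the similarity code is a stand-in, not the project's retriever):

```python
import numpy as np

def save_sidecars(db_path, chunk_ids, embeddings):
    """Write the two .npy sidecars that sit next to the SQLite database."""
    np.save(db_path + ".chunk_ids.npy", np.array(chunk_ids))
    np.save(db_path + ".embeddings.npy", embeddings.astype(np.float32))

def top_k(query_vec, embeddings, chunk_ids, k=3):
    """Cosine similarity over the full matrix; fine at personal-corpus scale."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    sims = embeddings @ query_vec / np.where(norms == 0, 1, norms)
    order = np.argsort(-sims)[:k]
    return [chunk_ids[i] for i in order]

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_k(np.array([1.0, 0.1]), emb, ["a", "b", "c"], k=2))  # -> ['a', 'c']
```

Brute-force scan is a deliberate choice here: for a few thousand chunks it is faster and simpler than maintaining an approximate-nearest-neighbor index.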
Pull a local model and ask a question:

```
ollama pull llama3.2:1b
docagent ask --db ./data/docs.db "What did I write about attachment theory?"
```

Or run the web UI:

```
docagent serve --db ./data/docs.db
```

Then open http://127.0.0.1:8000.
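Citation grounding works by handing the model retrieved chunks tagged with stable source refs it must cite back. A hand-rolled sketch; the source-ref format matches what `docagent show` accepts below, but the prompt wording is invented, not the project's actual template:

```python
def build_prompt(question, chunks):
    """chunks: list of (source_ref, text) pairs from the retriever.

    The model is instructed to quote the refs verbatim so answers can be
    traced back to specific chunks.
    """
    context = "\n\n".join(f"[{ref}]\n{text}" for ref, text in chunks)
    return (
        "Answer using only the context below. "
        "Cite sources as [ref] after each claim.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is a secure base?",
    [("md:notes.md#L9", "A secure base lets a child explore safely.")],
)
print("[md:notes.md#L9]" in prompt)  # -> True
```

Because the refs appear verbatim in the context, any ref the model emits can be checked against the retrieved set rather than trusted blindly.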
Inspect a cited chunk:

```
docagent show --db ./data/docs.db --source-ref "md:notes.md#L9"
```

Check local dependencies:

```
docagent doctor --db ./data/docs.db
```

Inspect retrieval directly:

```
docagent search --db ./data/docs.db "secure base" --k 5
```

Run retrieval-only eval:

```
docagent eval --db ./data/docs.db --eval ./eval/sample_eval.jsonl --k 8
```

Run eval with answer generation:

```
docagent eval --db ./data/docs.db --eval ./eval/sample_eval.jsonl --k 8 --generate
```

Smoke results recorded in docs/paper.md:
- retrieval recall@3: 1.000
- citation coverage: 1.000
Those numbers come from a tiny bundled smoke set. They validate the pipeline, not broad benchmark performance.
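Both metrics reduce to simple set checks. The sketch below shows the idea; it is a hypothetical reimplementation, not the code in `src/docagent/eval/run_eval.py`:

```python
import re

def recall_at_k(expected_ids, retrieved_ids, k):
    """1.0 if any expected chunk appears in the top-k results, else 0.0."""
    return float(bool(set(expected_ids) & set(retrieved_ids[:k])))

def citation_coverage(answer, allowed_refs):
    """Fraction of [ref] citations in the answer that point at retrieved chunks."""
    cited = re.findall(r"\[([^\]]+)\]", answer)
    if not cited:
        return 0.0
    return sum(ref in allowed_refs for ref in cited) / len(cited)

print(recall_at_k(["c2"], ["c1", "c2", "c3"], k=3))                          # -> 1.0
print(citation_coverage("Safe base [md:notes.md#L9].", {"md:notes.md#L9"}))  # -> 1.0
```

Note that an answer with no citations at all scores 0.0 coverage here; whether that is the right penalty is an eval-design choice, not a given.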
Build the entity graph:

```
docagent graph build --db ./data/docs.db
docagent graph query --db ./data/docs.db "Attachment"
```

Export training data:

```
docagent make-trainset --db ./data/docs.db --out ./train.jsonl --n 500
docagent make-trainset-dir --db ./data/docs.db --out-dir ./data/trainset --n 2000
```

Run the tests:

```
python -m unittest discover -s tests -p 'test_*.py'
python -m compileall -q src
```

- Citation quality still depends on the local model you run through Ollama.
- The graph layer is heuristic and can miss or over-include entities.
- Vector search is intentionally simple and optimized for personal corpora, not large-scale multi-tenant workloads.
- This repo is focused on single-user local retrieval, not shared multi-user knowledge bases.
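To make the graph caveat concrete, a capitalized-token heuristic like the sketch below shows why such a layer both over- and under-includes entities. This is an assumed illustration, not the project's actual extractor:

```python
import itertools
import re

def extract_entities(text):
    """Naive heuristic: any capitalized word counts as an entity, so
    sentence-initial words are over-included and lowercase names are missed."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))

def cooccurrence_edges(chunks):
    """Link every pair of entities that appear in the same chunk."""
    edges = set()
    for chunk in chunks:
        for a, b in itertools.combinations(extract_entities(chunk), 2):
            edges.add((a, b))
    return edges

print(cooccurrence_edges(["Bowlby developed Attachment theory."]))
# -> {('Attachment', 'Bowlby')}
```

Even this crude co-occurrence graph is enough for GraphRAG-style hops: a query about one entity can pull in chunks about its neighbors.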
See docs/paper.md for the design notes, smoke training run, and evaluation summary.
MIT