Local-first RAG with verifiable citations, offline execution, and no required cloud APIs.
DocAgent Studio ingests PDFs, Markdown notes, and Notion exports into a local SQLite database, retrieves context with hybrid lexical plus vector search, and answers questions with traceable citations.
- CLI ingest for PDFs, Markdown, and exported Notion content
- SQLite-backed storage with FTS5 lexical search and local embedding sidecars
- FastAPI web UI for ingest, indexing, search, Q&A, and graph exploration
- Citation-grounded answers with local Ollama models
- Lightweight entity graph for GraphRAG-style navigation
- Eval commands for retrieval recall and citation coverage
- Optional MLX LoRA dataset export for on-device fine-tuning
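The hybrid retriever merges an FTS5 lexical ranking with a vector-similarity ranking. One common way to do that is reciprocal rank fusion; the sketch below is illustrative only and does not reflect the actual API in `src/docagent/index/retriever.py`.

```python
def rrf_fuse(lexical_ids, vector_ids, k=60):
    """Reciprocal rank fusion: merge two ranked chunk-id lists into one.

    Each list contributes 1 / (k + rank) per chunk; higher totals rank first.
    """
    scores = {}
    for ranking in (lexical_ids, vector_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Chunk 'c2' ranks near the top of both lists, so fusion puts it first.
print(rrf_fuse(["c1", "c2", "c3"], ["c2", "c4"]))  # -> ['c2', 'c1', 'c4', 'c3']
```

The constant `k=60` is the usual RRF smoothing value; it keeps a single top-1 hit from dominating chunks that rank well in both lists.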
- Web app: src/docagent/web/server.py
- Hybrid retrieval: src/docagent/index/retriever.py
- Eval runner: src/docagent/eval/run_eval.py
- Graph build and query flow: src/docagent/graph/build.py, src/docagent/graph/query.py
- CI workflow: .github/workflows/ci.yml
- Design notes and smoke metrics: docs/paper.md
```mermaid
flowchart LR
  A[PDF / Markdown / Notion export] --> B[Ingest + chunking]
  B --> C[(SQLite documents + chunks)]
  C --> D[FTS5 index]
  C --> E[Embedding sidecars .npy]
  D --> F[Hybrid retriever]
  E --> F
  F --> G[Local LLM answer with citations]
  C --> H[Entity graph]
  H --> F
  G --> I[CLI and FastAPI web UI]
```
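The FTS5 leg of the pipeline needs nothing beyond the Python standard library. The sketch below shows the general shape; the table and column names are made up here, not the project's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the real tool persists to ./data/docs.db
con.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id, text)")
con.executemany(
    "INSERT INTO chunks_fts VALUES (?, ?)",
    [("c1", "secure base and attachment theory"),
     ("c2", "notes on sqlite ranking")],
)
# bm25() is FTS5's built-in relevance function; lower values rank better.
rows = con.execute(
    "SELECT chunk_id FROM chunks_fts WHERE chunks_fts MATCH ? "
    "ORDER BY bm25(chunks_fts)",
    ("attachment",),
).fetchall()
print(rows)  # -> [('c1',)]
```

Keeping the lexical index inside the same SQLite file as the documents is what makes the whole store a single portable artifact.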
```
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
```

For the web UI:

```
pip install -e '.[web]'
```

Ingest your documents:

```
docagent ingest --input /path/to/your/docs --db ./data/docs.db
```

Supported inputs: `*.pdf`, `*.md`, `*.markdown`, `*.txt`
For Notion exports, unzip the Markdown export and point --input at the extracted folder.
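Conceptually, ingest is a recursive walk over the input folder plus chunking. A minimal sketch, assuming character-window chunking; the size and overlap values here are illustrative, not the defaults of `docagent ingest`:

```python
from pathlib import Path

SUPPORTED = {".pdf", ".md", ".markdown", ".txt"}

def iter_source_files(root):
    """Yield every supported file under root, including unzipped Notion exports."""
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() in SUPPORTED:
            yield path

def chunk_text(text, size=800, overlap=100):
    """Split text into overlapping character windows so no sentence is
    stranded at a chunk boundary."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

print(len(chunk_text("x" * 2000)))  # step of 700 chars -> 3 chunks
```

Overlap trades a little storage for retrieval robustness: a query term near a boundary still lands fully inside at least one chunk.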
```
docagent index --db ./data/docs.db
```

This creates:

- SQLite FTS data inside the database
- `docs.db.chunk_ids.npy`
- `docs.db.embeddings.npy`
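The two `.npy` sidecars are just a parallel id array and embedding matrix next to the SQLite file. A sketch of how they pair up at query time, assuming brute-force cosine similarity (the sidecar filenames match the README; the similarity code is a stand-in, not the project's retriever):

```python
import numpy as np

def save_sidecars(db_path, chunk_ids, embeddings):
    """Write the two .npy sidecars that sit next to the SQLite database."""
    np.save(db_path + ".chunk_ids.npy", np.array(chunk_ids))
    np.save(db_path + ".embeddings.npy", embeddings.astype(np.float32))

def top_k(query_vec, embeddings, chunk_ids, k=3):
    """Cosine similarity over the full matrix; fine at personal-corpus scale."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    sims = embeddings @ query_vec / np.where(norms == 0, 1, norms)
    order = np.argsort(-sims)[:k]
    return [chunk_ids[i] for i in order]

emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
print(top_k(np.array([1.0, 0.1]), emb, ["a", "b", "c"], k=2))  # -> ['a', 'c']
```

Brute-force scan is a deliberate choice here: for a few thousand chunks it is faster and simpler than maintaining an approximate-nearest-neighbor index.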
Pull a local model and ask a question:

```
ollama pull llama3.2:1b
docagent ask --db ./data/docs.db "What did I write about attachment theory?"
```

Or run the web UI:

```
docagent serve --db ./data/docs.db
```

Then open http://127.0.0.1:8000.
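Citation grounding works by handing the model retrieved chunks tagged with stable source refs it must cite back. A hand-rolled sketch; the source-ref format matches what `docagent show` accepts below, but the prompt wording is invented, not the project's actual template:

```python
def build_prompt(question, chunks):
    """chunks: list of (source_ref, text) pairs from the retriever.

    The model is instructed to quote the refs verbatim so answers can be
    traced back to specific chunks.
    """
    context = "\n\n".join(f"[{ref}]\n{text}" for ref, text in chunks)
    return (
        "Answer using only the context below. "
        "Cite sources as [ref] after each claim.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is a secure base?",
    [("md:notes.md#L9", "A secure base lets a child explore safely.")],
)
print("[md:notes.md#L9]" in prompt)  # -> True
```

Because the refs appear verbatim in the context, any ref the model emits can be checked against the retrieved set rather than trusted blindly.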
Inspect a cited chunk:

```
docagent show --db ./data/docs.db --source-ref "md:notes.md#L9"
```

Check local dependencies:

```
docagent doctor --db ./data/docs.db
```

Inspect retrieval directly:

```
docagent search --db ./data/docs.db "secure base" --k 5
```

Run retrieval-only eval:

```
docagent eval --db ./data/docs.db --eval ./eval/sample_eval.jsonl --k 8
```

Run eval with answer generation:

```
docagent eval --db ./data/docs.db --eval ./eval/sample_eval.jsonl --k 8 --generate
```

Smoke results recorded in docs/paper.md:
- retrieval recall@3: 1.000
- citation coverage: 1.000
Those numbers come from a tiny bundled smoke set. They validate the pipeline, not broad benchmark performance.
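Both metrics reduce to simple set checks. The sketch below shows the idea; it is a hypothetical reimplementation, not the code in `src/docagent/eval/run_eval.py`:

```python
import re

def recall_at_k(expected_ids, retrieved_ids, k):
    """1.0 if any expected chunk appears in the top-k results, else 0.0."""
    return float(bool(set(expected_ids) & set(retrieved_ids[:k])))

def citation_coverage(answer, allowed_refs):
    """Fraction of [ref] citations in the answer that point at retrieved chunks."""
    cited = re.findall(r"\[([^\]]+)\]", answer)
    if not cited:
        return 0.0
    return sum(ref in allowed_refs for ref in cited) / len(cited)

print(recall_at_k(["c2"], ["c1", "c2", "c3"], k=3))                          # -> 1.0
print(citation_coverage("Safe base [md:notes.md#L9].", {"md:notes.md#L9"}))  # -> 1.0
```

Note that an answer with no citations at all scores 0.0 coverage here; whether that is the right penalty is an eval-design choice, not a given.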
Build the entity graph:

```
docagent graph build --db ./data/docs.db
docagent graph query --db ./data/docs.db "Attachment"
```

Export training data:

```
docagent make-trainset --db ./data/docs.db --out ./train.jsonl --n 500
docagent make-trainset-dir --db ./data/docs.db --out-dir ./data/trainset --n 2000
```

Run the tests:

```
python -m unittest discover -s tests -p 'test_*.py'
python -m compileall -q src
```

- Citation quality still depends on the local model you run through Ollama.
- The graph layer is heuristic and can miss or over-include entities.
- Vector search is intentionally simple and optimized for personal corpora, not large-scale multi-tenant workloads.
- This repo is focused on single-user local retrieval, not shared multi-user knowledge bases.
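To make the graph caveat concrete, a capitalized-token heuristic like the sketch below shows why such a layer both over- and under-includes entities. This is an assumed illustration, not the project's actual extractor:

```python
import itertools
import re

def extract_entities(text):
    """Naive heuristic: any capitalized word counts as an entity, so
    sentence-initial words are over-included and lowercase names are missed."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", text)))

def cooccurrence_edges(chunks):
    """Link every pair of entities that appear in the same chunk."""
    edges = set()
    for chunk in chunks:
        for a, b in itertools.combinations(extract_entities(chunk), 2):
            edges.add((a, b))
    return edges

print(cooccurrence_edges(["Bowlby developed Attachment theory."]))
# -> {('Attachment', 'Bowlby')}
```

Even this crude co-occurrence graph is enough for GraphRAG-style hops: a query about one entity can pull in chunks about its neighbors.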
See docs/paper.md for the design notes, smoke training run, and evaluation summary.
MIT