GitHub - ENDEVSOLS/LongParser: Privacy-first document intelligence engine — parse PDFs, DOCX, PPTX, XLSX & CSV into AI-ready chunks for RAG pipelines. Includes HITL review, 3-layer memory chat, and a production FastAPI server.

Privacy-first document intelligence engine for production RAG pipelines.

Parse PDFs, DOCX, PPTX, XLSX & CSV → validated, AI-ready chunks with HITL review.

Features

Feature	Detail
Multi-format extraction	PDF, DOCX, PPTX, XLSX, CSV via Docling
Hybrid chunking	Token-aware, heading-hierarchy-aware, table-aware
HITL review	Human-in-the-Loop block & chunk editing before embedding
LangGraph HITL	`approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer
3-layer memory	Short-term turns + rolling summary + long-term facts
Multi-provider LLM	OpenAI, Gemini, Groq, OpenRouter
Multi-backend vectors	Chroma, FAISS, Qdrant
Production-ready API	FastAPI + Motor (MongoDB) + ARQ + Redis (Queue & Rate Limiting)
Enterprise Security	Tenant isolation, Role-Based Access Control (RBAC), and CORS
LangChain & LlamaIndex adapters	Drop-in LangChain `BaseRetriever` and LlamaIndex `QueryEngine`
Privacy-first	All processing runs locally; no data leaves your infra
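
To make the "3-layer memory" row concrete, here is a minimal sketch of how short-term turns, a rolling summary, and long-term facts could fit together. `ThreeLayerMemory` and all of its methods are hypothetical names chosen for illustration, not LongParser's actual chat engine; the summarizer is a stub where a real system would likely call an LLM.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ThreeLayerMemory:
    """Hypothetical illustration only; not LongParser's API."""

    max_turns: int = 4
    turns: deque = field(default_factory=deque)  # layer 1: recent turns
    summary: str = ""                            # layer 2: rolling summary
    facts: dict = field(default_factory=dict)    # layer 3: long-term facts

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        # When the short-term window overflows, fold the oldest turn
        # into the rolling summary instead of discarding it.
        while len(self.turns) > self.max_turns:
            old_role, old_text = self.turns.popleft()
            self.summary = self._summarize(self.summary, f"{old_role}: {old_text}")

    def remember(self, key: str, value: str) -> None:
        self.facts[key] = value

    @staticmethod
    def _summarize(summary: str, evicted: str) -> str:
        # Stub: concatenate and truncate. A real system would compress
        # this text with an LLM call.
        merged = f"{summary} | {evicted}" if summary else evicted
        return merged[-500:]

    def build_context(self) -> str:
        # Order: durable facts first, then the summary, then recent turns.
        parts = []
        if self.facts:
            pairs = "; ".join(f"{k}={v}" for k, v in sorted(self.facts.items()))
            parts.append("Facts: " + pairs)
        if self.summary:
            parts.append("Summary: " + self.summary)
        parts.extend(f"{role}: {text}" for role, text in self.turns)
        return "\n".join(parts)
```

The string returned by `build_context()` would be prepended to the prompt; keeping the layers separate lets each one be capped or persisted independently.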

Installation

Quick install (recommended)

pip install "longparser[gpu]"

Includes everything — server, embeddings, vector DB, OCR, LangChain, LlamaIndex. Works on CPU machines too; torch just runs in CPU mode automatically.

Core SDK only (no server, no torch)

pip install longparser

Pick only what you need

Extra	What it adds
`server`	FastAPI + MongoDB + Redis + LangChain chat
`embeddings-gpu`	`sentence-transformers` (GPU)
`embeddings-cpu`	`sentence-transformers` (CPU-only torch)
`faiss-gpu`	FAISS GPU vector store
`faiss-cpu`	FAISS CPU vector store
`chroma`	ChromaDB
`qdrant`	Qdrant
`latex-ocr-gpu`	`pix2tex` equation OCR (GPU)
`latex-ocr-cpu`	`pix2tex` equation OCR (CPU)
`langchain`	LangChain core adapter
`llamaindex`	LlamaIndex reader adapter
`gpu`	All of the above — one command
`cpu`	All of the above — CPU-only torch
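
After a partial install it can help to check which optional backends are actually importable. The sketch below uses only the standard library; the mapping from extras to import names is an assumption based on the usual module names of these packages, so verify it against `pyproject.toml`.

```python
from importlib.util import find_spec

# Assumed import names for the packages behind each extra (not taken
# from LongParser's pyproject.toml).
OPTIONAL_MODULES = {
    "server": "fastapi",
    "embeddings": "sentence_transformers",
    "faiss": "faiss",
    "chroma": "chromadb",
    "qdrant": "qdrant_client",
    "latex-ocr": "pix2tex",
    "langchain": "langchain_core",
    "llamaindex": "llama_index",
}


def available_extras(modules=OPTIONAL_MODULES):
    """Return {extra_name: True/False} without importing the heavy packages."""
    status = {}
    for extra, module_name in modules.items():
        try:
            status[extra] = find_spec(module_name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-installed packages.
            status[extra] = False
    return status


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:12} {'installed' if ok else 'missing'}")
```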

Advanced: CPU-only install (save ~1.8 GB)

For Docker images, edge devices, or CI environments where CUDA isn't needed:

# Step 1 — CPU torch (~230 MB vs ~2 GB for CUDA)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Step 2 — LongParser CPU bundle
pip install "longparser[cpu]"

Quick Start

Python SDK

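from longparser import PipelineOrchestrator, ProcessingConfig

# Build the pipeline with default settings
# (ProcessingConfig is not needed for a default run)
pipeline = PipelineOrchestrator()
result = pipeline.process_file("document.pdf")

# Inspect the parsed document and the resulting chunks
print(f"Pages: {result.document.metadata.total_pages}")
print(f"Chunks: {len(result.chunks)}")
if result.chunks:  # guard against documents that yield no chunks
    print(result.chunks[0].text)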
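
A common next step is to persist the chunks for inspection or a later embedding run. The helper below is a sketch: `chunks_to_jsonl` is a hypothetical name, and it relies only on the `.text` attribute shown in the example above, so any other chunk fields are deliberately left out.

```python
import json
from pathlib import Path
from typing import Iterable


def chunks_to_jsonl(chunks: Iterable, path: str) -> int:
    """Write one JSON object per chunk and return the number written.

    Only `chunk.text` is read; other attributes are not assumed to exist.
    """
    count = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for index, chunk in enumerate(chunks):
            record = {"index": index, "text": chunk.text}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

With the quick-start result in hand, this would be called as `chunks_to_jsonl(result.chunks, "chunks.jsonl")`.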

REST API

# 1. Copy and edit configuration
cp .env.example .env

# 2. Start services (MongoDB + Redis)
docker-compose up -d mongo redis

# 3. Start the API
uv run uvicorn longparser.server.app:app --reload --port 8000

# 4. Upload a document
curl -X POST http://localhost:8000/jobs \
  -H "X-API-Key: your-key" \
  -F "file=@document.pdf"

# 5. Check job status
curl http://localhost:8000/jobs/{job_id} -H "X-API-Key: your-key"

# 6. Finalize and embed
curl -X POST http://localhost:8000/jobs/{job_id}/finalize \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"finalize_policy": "approve_all_pending"}'

curl -X POST http://localhost:8000/jobs/{job_id}/embed \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"}'

# 7. Chat with the document
curl -X POST http://localhost:8000/chat/sessions \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "your-job-id"}'

curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "job_id": "...", "question": "What is the refund policy?"}'
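
The same flow can be scripted from Python with the standard library. This is a sketch under stated assumptions: the endpoint paths and the `X-API-Key` header come from the curl commands above, but the `status` field and the `completed` / `failed` state names are guesses about the response body, so check the interactive docs at `/docs` before relying on them.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def api_request(method, path, api_key, payload=None):
    """Build (but do not send) a request carrying the API key header."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(f"{BASE_URL}{path}", data=data, method=method)
    req.add_header("X-API-Key", api_key)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def send(req):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_job(job_id, api_key, fetch=send,
                 done=("completed", "failed"), interval=2.0, max_polls=30):
    """Poll GET /jobs/{job_id} until the job reaches a terminal state.

    ASSUMPTION: the response is a JSON object with a "status" key whose
    terminal values are listed in `done`.
    """
    for _ in range(max_polls):
        body = fetch(api_request("GET", f"/jobs/{job_id}", api_key))
        if body.get("status") in done:
            return body
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the polling loop testable without a running server.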

Architecture

Document → Extract → Validate → HITL Review → Chunk → Embed → Index
                                                              ↓
                                             Chat → RAG → LLM → Answer
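
The HITL Review step in the diagram acts as a gate: only content a reviewer accepts moves on to chunking. The sketch below illustrates that gate with the approve / edit / reject decisions named in the feature list. `ReviewBlock`, `Decision`, and `apply_review` are hypothetical names, and treating undecided blocks as approved is an assumption loosely modelled on the `approve_all_pending` finalize policy.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"


@dataclass(frozen=True)
class ReviewBlock:
    """Hypothetical stand-in for a reviewed block; not LongParser's Block."""
    block_id: str
    text: str


def apply_review(blocks, decisions, edits=None, default=Decision.APPROVE):
    """Return the blocks that survive review, with edits applied.

    decisions: {block_id: Decision}; blocks without an entry get `default`.
    edits:     {block_id: replacement_text} for blocks marked EDIT.
    """
    edits = edits or {}
    kept = []
    for block in blocks:
        decision = decisions.get(block.block_id, default)
        if decision is Decision.REJECT:
            continue  # rejected content never reaches the chunker
        if decision is Decision.EDIT:
            # Fall back to the original text if no replacement was supplied.
            block = replace(block, text=edits.get(block.block_id, block.text))
        kept.append(block)
    return kept
```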

Pipeline Stages

Extract — Docling converts PDF/DOCX/etc. into structured Block objects
Validate — Per-page confidence scoring and RTL detection
HITL Review — Human approves/edits/rejects blocks and chunks via the API
Chunk — HybridChunker builds token-aware RAG chunks with section hierarchy
Embed — Embedding engine (HuggingFace / OpenAI) generates vectors, which are stored in Chroma/FAISS/Qdrant
Chat — LCEL chain with 3-layer memory and citation validation
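
To illustrate the Chunk stage, here is a simplified sketch of token-aware, heading-aware chunking. It is not the `HybridChunker` implementation: `SimpleBlock`, `SimpleChunk`, and `chunk_blocks` are hypothetical names, tokens are approximated by whitespace-separated words, and table handling is omitted.

```python
from dataclasses import dataclass


@dataclass
class SimpleBlock:
    kind: str       # "heading" or "paragraph"
    text: str
    level: int = 1  # heading depth (1 = top level); ignored for paragraphs


@dataclass
class SimpleChunk:
    section_path: list  # heading hierarchy the chunk belongs to
    text: str


def count_tokens(text: str) -> int:
    # Whitespace split as a cheap proxy for a real tokenizer.
    return len(text.split())


def chunk_blocks(blocks, max_tokens=50):
    chunks, path, buffer, used = [], [], [], 0

    def flush():
        nonlocal buffer, used
        if buffer:
            chunks.append(SimpleChunk(list(path), "\n".join(buffer)))
        buffer, used = [], 0

    for block in blocks:
        if block.kind == "heading":
            flush()  # close the chunk before the section changes
            del path[max(block.level, 1) - 1:]  # drop deeper/sibling headings
            path.append(block.text)
            continue
        tokens = count_tokens(block.text)
        if buffer and used + tokens > max_tokens:
            flush()  # token budget exceeded: start a new chunk
        buffer.append(block.text)
        used += tokens
    flush()
    return chunks
```

A paragraph longer than `max_tokens` ends up as a single oversized chunk in this sketch; a production chunker would split it further.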

Project Structure

src/longparser/
├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
├── extractors/          ← Docling, LaTeX OCR backends
├── chunkers/            ← HybridChunker
├── pipeline/            ← PipelineOrchestrator
├── integrations/        ← LangChain loader & LlamaIndex reader
├── utils/               ← shared helpers (RTL detection, …)
└── server/              ← REST API layer
    ├── app.py           ← FastAPI application (all routes)
    ├── db.py            ← Motor async MongoDB
    ├── queue.py         ← ARQ/Redis job queue
    ├── worker.py        ← ARQ background worker
    ├── embeddings.py    ← HuggingFace / OpenAI embedding engine
    ├── vectorstores.py  ← Chroma / FAISS / Qdrant adapters
    └── chat/            ← RAG chat engine
        ├── engine.py    ← ChatEngine (LCEL + 3-layer memory)
        ├── graph.py     ← LangGraph HITL workflow
        ├── schemas.py   ← chat Pydantic models
        ├── retriever.py ← LangChain BaseRetriever adapter
        ├── llm_chain.py ← multi-provider LLM factory
        └── callbacks.py ← observability callbacks

LangChain Integration

from longparser.integrations.langchain import LongParserLoader

loader = LongParserLoader("report.pdf")
docs = loader.load()  # list[langchain_core.documents.Document]

LlamaIndex Integration

from longparser.integrations.llamaindex import LongParserReader

reader = LongParserReader()
docs = reader.load_data("report.pdf")

Configuration

Copy .env.example to .env and set:

Variable	Default	Description
`LONGPARSER_MONGO_URL`	`mongodb://localhost:27017`	MongoDB connection
`LONGPARSER_REDIS_URL`	`redis://localhost:6379`	Redis for job queue & rate limits
`LONGPARSER_LLM_PROVIDER`	`openai`	LLM provider
`LONGPARSER_LLM_MODEL`	`gpt-5.3`	Model name
`LONGPARSER_EMBED_PROVIDER`	`huggingface`	Embedding provider
`LONGPARSER_VECTOR_DB`	`chroma`	Vector store backend
`LONGPARSER_CORS_ORIGINS`	`*`	Allowed CORS origins
`LONGPARSER_RATE_LIMIT`	`60`	Max requests per minute (RPM) per tenant
`LONGPARSER_ADMIN_KEYS`	(empty)	Comma-separated admin API keys
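
For scripts that need the same settings outside the server, the variables above can be read into a typed object. This is a sketch, not LongParser's own settings loader: `Settings` and `load_settings` are hypothetical names, the defaults are copied from the table, and treating `LONGPARSER_CORS_ORIGINS` as comma-separated is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    mongo_url: str
    redis_url: str
    llm_provider: str
    llm_model: str
    embed_provider: str
    vector_db: str
    cors_origins: list
    rate_limit: int
    admin_keys: list


def _split_csv(value: str) -> list:
    # "a, b," -> ["a", "b"]; an empty string yields an empty list.
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(env=None) -> Settings:
    """Read LONGPARSER_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        mongo_url=env.get("LONGPARSER_MONGO_URL", "mongodb://localhost:27017"),
        redis_url=env.get("LONGPARSER_REDIS_URL", "redis://localhost:6379"),
        llm_provider=env.get("LONGPARSER_LLM_PROVIDER", "openai"),
        llm_model=env.get("LONGPARSER_LLM_MODEL", "gpt-5.3"),
        embed_provider=env.get("LONGPARSER_EMBED_PROVIDER", "huggingface"),
        vector_db=env.get("LONGPARSER_VECTOR_DB", "chroma"),
        # ASSUMPTION: multiple origins are comma-separated.
        cors_origins=_split_csv(env.get("LONGPARSER_CORS_ORIGINS", "*")),
        rate_limit=int(env.get("LONGPARSER_RATE_LIMIT", "60")),
        admin_keys=_split_csv(env.get("LONGPARSER_ADMIN_KEYS", "")),
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the process environment.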

Running with Docker

docker-compose up

API available at http://localhost:8000 · Docs at http://localhost:8000/docs

Testing

# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -v

# Run with coverage
uv run pytest tests/ --cov=src/longparser --cov-report=term-missing
