JuieZhao/MiniRAG

MiniRAG — Lightweight Document RAG

📄 Chinese documentation

Upload documents, ask questions in natural language, get cited answers grounded in your own files. Minimal models, maximum control.

Why MiniRAG?

  • Lightweight by design — zero GPU required, single embedding model, no Docker, no heavy dependencies
  • Your documents, your knowledge — answers are grounded in your uploaded files, not web search
  • Every answer is traceable — citations link to specific documents and text chunks
  • You control the knowledge base — add/remove files to curate what the AI knows
  • Hybrid search — BM25 keyword + dense vector retrieval fused via RRF for better recall
  • Multi-format support — PDF, DOCX, TXT, Markdown, CSV
  • Content dedup — SHA256 hashing prevents duplicate indexing
  • Flexible chunking — section-aware for reports, fixed-size for general docs
  • Metadata filters — filter by source, author, year
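
The content-dedup point above can be illustrated with a minimal sketch. It assumes the hash is taken over the raw file bytes; `content_hash` and `DedupIndex` are hypothetical names for illustration, not MiniRAG's actual API.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupIndex:
    """Hypothetical helper: remembers which content hashes were already indexed."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # digest -> first filename seen with it

    def add(self, filename: str, data: bytes) -> bool:
        """Return True if the content is new, False if it duplicates earlier content."""
        digest = content_hash(data)
        if digest in self._seen:
            return False  # same bytes already indexed, possibly under another name
        self._seen[digest] = filename
        return True


index = DedupIndex()
print(index.add("report.txt", b"quarterly results"))  # new content
print(index.add("copy.txt", b"quarterly results"))    # identical bytes, skipped
```

Because the hash depends only on content, a renamed copy of the same file is still detected, while a one-byte change produces a different digest and is indexed as a new document.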

Features

  • 📤 Upload documents → auto-parse, chunk, embed, index
  • 🔀 Hybrid retrieval: BM25 + Dense vector → RRF fusion
  • 🔍 Natural language QA → retrieve relevant chunks → LLM generates cited answers
  • 🌐 Bilingual search (Chinese + English)
  • 🎯 Cross-encoder reranking for better retrieval precision (optional)
  • ⚡ Streaming responses (answers appear token by token)
  • 🗑️ Batch document management (checkbox selection + bulk delete)
  • 📋 Export answers as Markdown with source citations
  • 🏷️ Metadata filtering by author / year / source
  • 📊 PDF table extraction (converted to Markdown)
  • 🔤 Auto GPU detection for embeddings
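
As a rough sketch of the BM25 + dense fusion step listed above, Reciprocal Rank Fusion scores each chunk by summing 1 / (k + rank) over the ranked lists it appears in. The function name `rrf_fuse` and the constant k = 60 (a commonly used default) are assumptions here; the value MiniRAG actually uses is not stated.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first. A chunk's fused score is the sum of
    1 / (k + rank) over every list that contains it (rank starts at 1).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["c1", "c2", "c3"]   # hypothetical keyword results
dense_ranking = ["c2", "c4", "c1"]  # hypothetical vector results
fused = rrf_fuse([bm25_ranking, dense_ranking])
print([chunk_id for chunk_id, _ in fused])  # ['c2', 'c1', 'c4', 'c3']
```

RRF uses only rank positions, so BM25 scores and cosine similarities never need to be normalized onto a common scale before fusing.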

Quick Start

1. Install dependencies

pip install -r requirements.txt

For semantic retrieval and optional reranking, install the model extras:

pip install -r requirements-optional.txt

Without the optional dependencies, run in BM25-only mode:

RETRIEVAL_MODE=bm25 streamlit run main.py

2. Run the app

streamlit run main.py

Open the URL that Streamlit prints in the terminal, paste your DeepSeek API key in the sidebar, upload your files, and start asking questions.

You can also put files directly in documents/ before indexing.

💡 Get a free API key at platform.deepseek.com. A key pasted into the sidebar stays in your browser session and is never saved to disk.

To keep settings between sessions, create a .env file (see .env.example).

Project Structure

MiniRAG/
├── main.py                  # Streamlit UI
├── requirements.txt
├── requirements-optional.txt # Local embedding/reranking models
├── .env.example
├── documents/               # Files to upload/index
├── test_pipeline.py         # End-to-end pipeline test
├── src/
│   ├── loader.py            # Document parsing + table extraction + chunking
│   ├── embedder.py          # Text embedding (GPU auto-detect)
│   ├── vector_store.py      # Chroma vector DB + metadata
│   ├── retriever.py         # Hybrid retrieval (BM25 + Dense + RRF)
│   ├── bm25_retriever.py    # BM25 keyword search engine
│   └── generator.py         # LLM answer generation (streaming)
├── data/                    # Local app data, such as Q&A history
└── chroma_db/               # Vector store persistence (git-ignored)
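
To make the chunking step attributed to loader.py more concrete, here is a minimal fixed-size chunker. The character-based window, the overlap between neighboring chunks, and the name `chunk_fixed` are all assumptions for illustration; the actual loader may split on tokens or sections instead.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_fixed("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```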

Tech Stack

| Layer | Technology | Notes |
| --- | --- | --- |
| UI | Streamlit | Single local frontend |
| Parsing | PyMuPDF + python-docx | PDF table extraction + DOCX paragraphs/tables |
| Embedding | BGE (multiple models) | GPU auto-detect, local inference |
| Keyword Search | BM25 (custom impl) | No external deps beyond numpy |
| Vector DB | Chroma | Persistent, zero-config |
| LLM | DeepSeek API | OpenAI-compatible SDK |
| Reranking | ms-marco-MiniLM-L-6-v2 | Cross-encoder for precision |
| Fusion | RRF | Reciprocal Rank Fusion |
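
For readers unfamiliar with the keyword-search layer, the following is a small pure-Python sketch of Okapi BM25 scoring. It is not the project's bm25_retriever.py (which reportedly relies on numpy); the class name `TinyBM25`, the parameters k1 = 1.5 and b = 0.75, and the pre-tokenized input are assumptions.

```python
import math
from collections import Counter


class TinyBM25:
    """Illustrative Okapi BM25 scorer over pre-tokenized documents."""

    def __init__(self, corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1 = k1
        self.b = b
        self.doc_len = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_len) / len(corpus) if corpus else 0.0
        self.term_freqs = [Counter(doc) for doc in corpus]
        n_docs = len(corpus)
        # Document frequency: number of documents containing each term.
        doc_freq = Counter(term for doc in corpus for term in set(doc))
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {
            term: math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def scores(self, query: list[str]) -> list[float]:
        """Return one BM25 score per document for the tokenized query."""
        out = [0.0] * len(self.term_freqs)
        for term in query:
            idf = self.idf.get(term)
            if idf is None:
                continue  # term never seen in the corpus
            for i, freqs in enumerate(self.term_freqs):
                tf = freqs[term]
                if tf == 0:
                    continue
                # Length normalization: long documents are penalized via b.
                norm = tf + self.k1 * (1.0 - self.b + self.b * self.doc_len[i] / self.avgdl)
                out[i] += idf * tf * (self.k1 + 1.0) / norm
        return out


corpus = [
    ["hybrid", "search", "bm25"],
    ["dense", "vector", "search"],
    ["streamlit", "ui"],
]
bm25 = TinyBM25(corpus)
print(bm25.scores(["bm25", "search"]))
```

Sorting document indices by these scores gives the keyword ranking that a fusion step such as RRF can combine with the dense-vector ranking.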

Configuration

Create a .env file:

DEEPSEEK_API_KEY=sk-xxx
EMBEDDING_MODEL=english     # Recommended for English documents
# RETRIEVAL_MODE=hybrid     # hybrid (default), dense, or bm25
# ENABLE_RERANK=true        # Optional cross-encoder reranking
# EMBEDDING_DEVICE=cuda     # Auto-detected by default
# HTTP_PROXY=http://127.0.0.1:7890  # Behind firewall
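
One way an app could read these settings is sketched below. It assumes the variables are already in the process environment (for example, loaded from .env by a tool such as python-dotenv; whether MiniRAG does this is not stated). The function `load_settings`, the returned keys, and the rerank-off default are hypothetical; only the variable names and the hybrid default come from the example above.

```python
import os
from typing import Mapping, Optional

VALID_MODES = {"hybrid", "dense", "bm25"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read MiniRAG-style settings from an environment mapping (illustrative only)."""
    env = os.environ if env is None else env

    mode = env.get("RETRIEVAL_MODE", "hybrid").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"RETRIEVAL_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}")

    return {
        "api_key": env.get("DEEPSEEK_API_KEY", ""),
        "embedding_model": env.get("EMBEDDING_MODEL"),       # None if unset
        "retrieval_mode": mode,
        # Assumed default: reranking stays off unless explicitly enabled.
        "enable_rerank": env.get("ENABLE_RERANK", "false").strip().lower() in {"1", "true", "yes"},
        "embedding_device": env.get("EMBEDDING_DEVICE") or None,  # None means auto-detect
    }


settings = load_settings({"RETRIEVAL_MODE": "bm25", "ENABLE_RERANK": "true"})
print(settings["retrieval_mode"], settings["enable_rerank"])  # bm25 True
```

Validating RETRIEVAL_MODE at startup turns a typo in .env into an immediate, readable error rather than a silent fallback.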

License

MIT

About

RAG-powered academic paper Q&A — upload PDFs, ask questions, get cited answers. Built with Streamlit + Chroma + DeepSeek.
