A production-ready Retrieval-Augmented Generation (RAG) system with multilingual support for scientific research and knowledge exploration. Built with robust error handling, structured logging, and enterprise-grade features.
- PDF extraction with PyMuPDF (context managers for resource safety)
- Intelligent text cleaning (preserves structure, removes noise)
- Semantic chunking with configurable overlap (see the sketch after this feature list)
- Persistent vector storage via ChromaDB
- 11 Indian languages plus English (Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Urdu)
- Automatic language detection
- Two RAG strategies:
- Strategy A: Direct multilingual reasoning (recommended)
- Strategy B: Translation-enhanced reasoning with NLLB-200
- Cross-lingual semantic search with E5 embeddings
- Google Gemini 3 Flash Preview integration
- Configurable safety settings and generation parameters
- Smart citation extraction from retrieved context
- Empty collection handling with graceful degradation
- Structured logging throughout (no `print()` statements)
- Robust error handling with detailed error shapes
- API key authentication with secure parsing
- Health checks and Prometheus metrics monitoring (`/metrics`)
- Ready for HTTPS reverse proxy (Nginx) and Windows Service deployment
- Pydantic v2 validation and type safety
- Safe directory creation with permission checks
- `purge.py` - CLI utility to safely clear PDFs, database, or model cache
- Web-based document management - Upload, ingest, and purge via UI
- Comprehensive ingestion pipeline with progress tracking
- Pre-flight checks before server startup
- Test suite for pipeline validation
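The chunking flagged in the feature list is what keeps retrieval robust near boundaries: consecutive chunks share text, so a sentence cut by one window survives intact in its neighbour. Below is a minimal fixed-window sketch using the CHUNK_SIZE and CHUNK_OVERLAP defaults from config.py; the real pipeline in ingest.py is sentence-aware, so treat this as an illustration only.

```python
# Fixed-window sketch of overlap chunking; ingest.py's version is sentence-aware.
CHUNK_SIZE = 1000    # characters per chunk (default from config.py)
CHUNK_OVERLAP = 300  # characters shared between consecutive chunks

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Windows of `size` chars; each starts `size - overlap` chars after the last."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

print(len(chunk_text("x" * 3000)))  # -> 4 overlapping chunks
```

With the defaults, neighbouring chunks share 300 characters, so text split by one boundary is usually intact in the adjacent chunk.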
- Python 3.11+
- Google Gemini API key (get one from Google AI Studio)
- 8GB+ RAM recommended
```bash
# Clone the repository
git clone https://github.com/DNSdecoded/IndicRAG.git
cd IndicRAG

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (macOS/Linux)
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Copy environment template
cp .env.example .env

# Edit .env and add your API key
# LLM_API_KEY=your_gemini_api_key_here

# Optional: Configure API authentication
# API_KEYS=key1,key2,key3
```

```bash
# Place PDFs in papers/ directory
# Then ingest them:
python ingest.py

# Or specify a directory:
python ingest.py path/to/pdfs
```

```bash
# With pre-flight checks
python start_server.py

# Skip checks (for production)
python start_server.py --skip-checks

# Development mode with auto-reload
python start_server.py --dev
```

🎉 That's it! Access the API at:
- Interactive docs: http://localhost:8080/api/docs
- Web Interface: http://localhost:8080
Open http://localhost:8080 and:
- Ask Questions - Enter queries in any supported language
- Manage Documents - Expand the panel to:
- Upload PDFs via drag-and-drop
- View uploaded papers list
- Ingest papers into the vector store
- Purge papers or database (with confirmation)
```python
import requests

response = requests.post('http://localhost:8080/query', json={
    "question": "యాంటెన్నాతో ml ను ఎలా అమలు చేయవచ్చు?",  # Telugu: "How can ML be implemented with antennas?"
    "strategy": "A",
    "top_k": 5
})
result = response.json()
print(result['answer'])
print(f"Citations: {len(result['citations'])}")
```

```python
import rag

result = rag.answer_question(
    "मधुमेह का इलाज क्या है?",  # Hindi: "What is the treatment for diabetes?"
    strategy="B",
    top_k=8
)
print(f"Answer ({result['language_name']}): {result['answer']}")
print(f"Used {result['chunks_used']} document chunks")
```

Ask a question and get an AI-powered answer with citations.
Request:

```json
{
  "question": "What is quantum computing?",
  "strategy": "A",
  "top_k": 5
}
```

Response:

```json
{
  "answer": "Quantum computing is...",
  "language": "en",
  "language_name": "English",
  "chunks_used": 4,
  "citations": [
    {"number": "1", "title": "Quantum Computing Basics", "section": "Introduction"}
  ],
  "processing_time": 1.23
}
```

- Ingest a PDF document (returns extracted title).
- Get vector store statistics.
- Health check endpoint.
- Upload a PDF file (multipart form).
- List all uploaded PDFs with sizes.
- Delete all uploaded PDF files.
- Clear the vector database (all chunks).
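Most of these endpoints are one-liners with `requests`. The sketch below assumes route paths like `/health`, `/stats`, and `/upload` and an `X-API-Key` header name based on the descriptions above; confirm the exact routes in the interactive docs at http://localhost:8080/api/docs.

```python
# Sketch of driving the maintenance endpoints. Route paths and the
# X-API-Key header name are assumptions; check /api/docs for the real ones.
import requests

BASE = "http://localhost:8080"
HEADERS = {"X-API-Key": "key1"}  # only needed if API_KEYS is configured

print(requests.get(f"{BASE}/health", headers=HEADERS).json())   # health check
print(requests.get(f"{BASE}/stats", headers=HEADERS).json())    # vector store statistics
print(requests.get(f"{BASE}/metrics").text[:200])               # Prometheus metrics (plain text)

# Upload a PDF as a multipart form
with open("papers/my_paper.pdf", "rb") as f:
    resp = requests.post(f"{BASE}/upload", headers=HEADERS,
                         files={"file": ("my_paper.pdf", f, "application/pdf")})
print(resp.json())
```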
Safely clear indexed data:
```bash
# Delete all PDFs
python purge.py --papers

# Clear vector database
python purge.py --db

# Remove cached models (will re-download)
python purge.py --models

# Clear everything (with confirmation)
python purge.py --all

# Non-interactive mode
python purge.py --all --yes
```

```bash
# Test with example queries
python examples/example_query.py
```

```
multilingual-rag/
├── api_server.py          # FastAPI app with authentication
├── config.py              # Configuration with ensure_directories()
├── embeddings.py          # E5 multilingual embeddings
├── ingest.py              # PDF ingestion pipeline
├── lang_utils.py          # Language detection
├── pdf_utils.py           # PDF processing
├── rag.py                 # Core RAG logic
├── translation.py         # NLLB-200 translation (Strategy B)
├── vector_store.py        # ChromaDB wrapper
├── start_server.py        # Server launcher with pre-flight checks
├── purge.py               # Cleanup utility (NEW!)
│
├── deploy/                # Deployment configurations
│   └── nginx.example.conf # Nginx reverse proxy template
│
├── static/                # Web frontend
│   └── index.html         # Modern Ocean UI
│
├── docs/                  # Documentation
│   ├── QUICKSTART.md
│   ├── DEPLOY.md
│   ├── ARCHITECTURE.md
│   └── CONTRIBUTING.md
│
├── PRODUCTION.md          # Production deployment guide
│
├── examples/              # Example scripts
│   ├── example_ingest.py
│   └── example_query.py
│
├── papers/                # Your PDF documents
├── chroma_db/             # Vector database
└── models/                # Cached ML models
```
Key settings in config.py:
```python
# Embedding model (multilingual E5)
EMBEDDING_MODEL_NAME = "intfloat/multilingual-e5-base"
EMBEDDING_DIMENSION = 768

# Translation models (Strategy B)
TRANSLATION_MODEL_EN_TO_INDIC = "facebook/nllb-200-distilled-600M"
TRANSLATION_MODEL_INDIC_TO_EN = "facebook/nllb-200-distilled-600M"

# Chunking
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 300

# RAG
DEFAULT_TOP_K = 12
MAX_CONTEXT_CHUNKS = 8
MAX_CONTEXT_LENGTH = 8000

# LLM
LLM_MODEL_NAME = "gemini-3-flash-preview"
LLM_TEMPERATURE = 0.3
LLM_MAX_TOKENS = 2048
```

| Language | Code | Native Name |
|---|---|---|
| English | en | English |
| Hindi | hi | हिंदी |
| Telugu | te | తెలుగు |
| Tamil | ta | தமிழ் |
| Bengali | bn | বাংলা |
| Marathi | mr | मराठी |
| Gujarati | gu | ગુજરાતી |
| Kannada | kn | ಕನ್ನಡ |
| Malayalam | ml | മലയാളം |
| Punjabi | pa | ਪੰਜਾਬੀ |
| Odia | or | ଓଡ଼ିଆ |
| Urdu | ur | اردو |
For detailed evaluation methodology, automated metrics, and per-query qualitative reports, see docs/evaluation.md.
| Metric | Final Score |
|---|---|
| Retrieval Precision | 0.93 |
| Retrieval Recall | 0.91 |
| Faithfulness (Grounding Accuracy) | 0.98 |
| Attribution Accuracy | 0.97 |
| Technical Depth | 0.88 |
| Convergence / Mechanistic Reasoning | 0.86 |
| Cross-Document Discipline | 0.95 |
| Hallucination Rate | < 2% |
| Formatting & Structural Compliance | 0.98 |
Typical query latency (on CPU):
- Strategy A (direct multilingual): ~1-2s
- Strategy B (with translation): ~3-6s (includes NLLB translation time)
- ChromaDB retrieval: <100ms for thousands of documents
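That sub-100ms figure is ChromaDB's approximate nearest-neighbour search. Here is a standalone toy sketch of the query path; the project itself goes through vector_store.py with E5 embeddings, and the demo path and collection name below are made up for illustration.

```python
# Standalone sketch of ChromaDB's query path (toy data, default embedder).
# The project instead stores E5 vectors through its vector_store.py wrapper.
import chromadb

client = chromadb.PersistentClient(path="chroma_db_demo")  # separate demo store
col = client.get_or_create_collection("demo")
col.add(
    ids=["c1", "c2"],
    documents=["Qubits exploit superposition.", "Antennas radiate EM waves."],
)
hits = col.query(query_texts=["What is quantum computing?"], n_results=1)
print(hits["documents"][0][0], hits["distances"][0][0])
```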
Memory usage:
- Base system: ~500MB
- With E5 embeddings: ~2GB
- With NLLB translation: ~4.5GB
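Strategy B's extra memory and latency come from loading NLLB-200. A minimal sketch of one translation hop with Hugging Face transformers is below; the project wraps this in translation.py, so details may differ.

```python
# Sketch of a Hindi -> English hop with NLLB-200 (model name from config.py).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="hin_Deva"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("मधुमेह का इलाज क्या है?", return_tensors="pt")
tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),  # target language
    max_length=128,
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```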
- API key authentication with secure parsing (sketched at the end of this list)
- Input validation with Pydantic v2
- CORS configuration
- Environment-based secrets
- Structured logging across all modules
- Request/response logging in API
- Processing time tracking
- Health check endpoint
- Graceful empty collection handling
- Resource safety (context managers)
- Permission-aware directory creation
- Comprehensive error responses
- Citation extraction from English answers (Strategy B)
- Context length enforcement
- Top-k bounds validation
- Newline preservation in PDF processing
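The API-key layer is a standard FastAPI dependency pattern. A minimal sketch of how such a check can look, assuming an `X-API-Key` header and a comma-separated `API_KEYS` variable; the real parsing lives in api_server.py and may differ.

```python
# Sketch of an API-key check as a FastAPI dependency. Header name and
# parsing details are assumptions; the project's version is in api_server.py.
import os

from fastapi import Depends, FastAPI, Header, HTTPException

VALID_KEYS = {k.strip() for k in os.getenv("API_KEYS", "").split(",") if k.strip()}

def require_api_key(x_api_key: str | None = Header(default=None)) -> None:
    # Open access when no keys are configured; otherwise require a match.
    if VALID_KEYS and x_api_key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

app = FastAPI()

@app.get("/stats", dependencies=[Depends(require_api_key)])
def stats() -> dict:
    return {"documents": 0}  # placeholder payload
```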
"API key not configured"
# Check .env file
cat .env | grep LLM_API_KEY"No documents indexed"
# Ingest PDFs
python ingest.py"Translation model gated/authentication required"
- The system now uses NLLB-200 which requires no authentication
- First use will download ~2.4GB automatically
- See documentation for manual download if needed
"Out of memory"
# Edit config.py to reduce memory usage
CHUNK_SIZE = 512 # Smaller chunks
MAX_CONTEXT_CHUNKS = 3 # Fewer chunks in contextContributions welcome! See CONTRIBUTING.md
Recent improvements:
- ✅ Complete UI/UX revamp with dark/light themes and robust markdown support
- ✅ Parallel PDF ingestion pipeline with MD5 hash caching (sketched at the end of this section)
- ✅ Bulk extraction endpoint (`/ingest/all`)
- ✅ Enhanced math-aware sentence chunking via the `regex` library
- ✅ Stricter system prompts enforcing epistemic honesty and mechanistic rigor
- ✅ Resilient lock-free vector database pruning
- ✅ Production logging migration
- ✅ Robust error handling
- ✅ Pydantic v2 compatibility
- ✅ Resource safety improvements
- ✅ Empty collection handling
- ✅ Citation extraction fixes
- ✅ Purge utility addition
- ✅ Web UI document management (upload, ingest, purge)
- ✅ Server concurrency fix with threadpool
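The MD5 caching mentioned above lets re-ingestion skip unchanged PDFs. A sketch of the idea follows, assuming a JSON manifest of file hashes; the manifest name and layout are made up, and the actual bookkeeping in ingest.py may differ.

```python
# Sketch of skip-if-unchanged ingestion via MD5 hashes. The manifest file
# name and layout are assumptions; see ingest.py for the real bookkeeping.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("chroma_db/ingest_manifest.json")

def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
for pdf in Path("papers").glob("*.pdf"):
    digest = md5_of(pdf)
    if seen.get(pdf.name) == digest:
        continue                      # unchanged since last run: skip
    # ... extract, chunk, embed, and store the document here ...
    seen[pdf.name] = digest
MANIFEST.write_text(json.dumps(seen, indent=2))
```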
Built with excellent open-source tools:
- Google Gemini - Multilingual LLM
- Sentence Transformers - E5 embeddings
- Facebook NLLB - Translation
- ChromaDB - Vector database
- FastAPI - API framework
- PyMuPDF - PDF processing
MIT License - see LICENSE file for details.
Built with ❤️ for multilingual scientific accessibility
⭐ Star this repo if you find it useful!
