
🌐 Multilingual Scientific RAG System

License: MIT · Python 3.11+ · FastAPI · Google Gemini · Production Ready

(Banner: INDICRAG.png)

A production-ready Retrieval-Augmented Generation (RAG) system with multilingual support for scientific research and knowledge exploration. Built with robust error handling, structured logging, and enterprise-grade features.


✨ Key Features

🧠 Advanced Document Processing

  • PDF extraction with PyMuPDF (context managers for resource safety)
  • Intelligent text cleaning (preserves structure, removes noise)
  • Semantic chunking with configurable overlap
  • Persistent vector storage via ChromaDB
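The chunking step can be pictured as a sliding window: each chunk shares its first `overlap` characters with the tail of the previous one, so a sentence split at a boundary still appears whole in at least one chunk. A minimal sketch, assuming a sliding-window strategy (the actual implementation lives in ingest.py; `chunk_text` here is a hypothetical name):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 300) -> list[str]:
    """Split text into overlapping windows (sliding-window chunking)."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    stride = size - overlap  # fresh characters contributed by each chunk
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), stride):
        chunks.append(text[start:start + size])
    return chunks
```

Because the last window always extends `size` characters past its start, every character of the document is covered by at least one chunk.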

🌍 True Multilingual Support

  • 10+ Indian languages + English (Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia)
  • Automatic language detection
  • Two RAG strategies:
    • Strategy A: Direct multilingual reasoning (recommended)
    • Strategy B: Translation-enhanced reasoning with NLLB-200
  • Cross-lingual semantic search with E5 embeddings
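Automatic detection for Indic languages can be approximated by counting which Unicode script block a query's characters fall into. The sketch below is illustrative only and is not the lang_utils.py implementation; note that script counting alone cannot separate Hindi from Marathi (both use Devanagari), which is why real detectors add statistical models:

```python
# Unicode block ranges for a subset of the supported scripts
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari (Hindi, also Marathi)
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}

def detect_language(text: str) -> str:
    """Guess a language code by counting characters per script; default to English."""
    counts = {code: 0 for code in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[code] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "en"
```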

🤖 LLM Integration

  • Google Gemini 3 Flash Preview integration
  • Configurable safety settings and generation parameters
  • Smart citation extraction from retrieved context
  • Empty collection handling with graceful degradation
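One common way to extract citations is to scan the generated answer for `[n]` markers and resolve them against the retrieved chunks' metadata, silently dropping markers that point past the context. A hedged sketch (`extract_citations` and the chunk-dict shape are assumptions, not the repo's actual code):

```python
import re

def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    """Collect [n] markers from the answer and resolve them against retrieved chunks."""
    numbers = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    citations = []
    for n in numbers:
        if 1 <= n <= len(chunks):  # ignore markers with no matching chunk
            meta = chunks[n - 1]
            citations.append({"number": str(n),
                              "title": meta.get("title", "Unknown"),
                              "section": meta.get("section", "")})
    return citations
```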

🛡️ Production-Ready Infrastructure

  • Structured logging throughout (no print() statements)
  • Robust error handling with detailed error shapes
  • API key authentication with secure parsing
  • Health checks and Prometheus metrics monitoring (/metrics)
  • Ready for HTTPS reverse proxy (Nginx) and Windows Service deployment
  • Pydantic v2 validation and type safety
  • Safe directory creation with permission checks

🧹 Operational Tools

  • purge.py - CLI utility to safely clear PDFs, database, or model cache
  • Web-based document management - Upload, ingest, and purge via UI
  • Comprehensive ingestion pipeline with progress tracking
  • Pre-flight checks before server startup
  • Test suite for pipeline validation

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Google Gemini API key
  • 8GB+ RAM recommended

Installation

# Clone the repository
git clone https://github.com/DNSdecoded/IndicRAG.git
cd IndicRAG

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate
# Activate (macOS/Linux)
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

# Copy environment template
cp .env.example .env

# Edit .env and add your API key
# LLM_API_KEY=your_gemini_api_key_here

# Optional: Configure API authentication
# API_KEYS=key1,key2,key3

Ingest Documents

# Place PDFs in papers/ directory
# Then ingest them:
python ingest.py

# Or specify a directory:
python ingest.py path/to/pdfs

Start Server

# With pre-flight checks
python start_server.py

# Skip checks (for production)
python start_server.py --skip-checks

# Development mode with auto-reload
python start_server.py --dev

🎉 That's it! Access the API at http://localhost:8080.


📖 Usage

Via Web UI

Open http://localhost:8080 and:

  1. Ask Questions - Enter queries in any supported language
  2. Manage Documents - Expand the panel to:
    • Upload PDFs via drag-and-drop
    • View uploaded papers list
    • Ingest papers into the vector store
    • Purge papers or database (with confirmation)

Via REST API

import requests

response = requests.post('http://localhost:8080/query', json={
    "question": "యాంటెన్నాతో ml ను ఎలా అమలు చేయవచ్చు?",  # Telugu: "How can ML be implemented with an antenna?"
    "strategy": "A",
    "top_k": 5
})

result = response.json()
print(result['answer'])
print(f"Citations: {len(result['citations'])}")

Via Python

import rag

result = rag.answer_question(
    "मधुमेह का इलाज क्या है?",  # Hindi: diabetes treatment
    strategy="B",
    top_k=8
)

print(f"Answer ({result['language_name']}): {result['answer']}")
print(f"Used {result['chunks_used']} document chunks")

🔧 API Reference

POST /query

Ask a question and get an AI-powered answer with citations.

Request:

{
  "question": "What is quantum computing?",
  "strategy": "A",
  "top_k": 5
}

Response:

{
  "answer": "Quantum computing is...",
  "language": "en",
  "language_name": "English",
  "chunks_used": 4,
  "citations": [
    {"number": "1", "title": "Quantum Computing Basics", "section": "Introduction"}
  ],
  "processing_time": 1.23
}

POST /ingest

Ingest a PDF document (returns extracted title).

GET /stats

Get vector store statistics.

GET /health

Health check endpoint.

POST /upload

Upload a PDF file (multipart form).

GET /papers

List all uploaded PDFs with sizes.

DELETE /purge/papers

Delete all uploaded PDF files.

DELETE /purge/database

Clear the vector database (all chunks).


🧹 Maintenance Tools

Purge Utility

Safely clear indexed data:

# Delete all PDFs
python purge.py --papers

# Clear vector database
python purge.py --db

# Remove cached models (will re-download)
python purge.py --models

# Clear everything (with confirmation)
python purge.py --all

# Non-interactive mode
python purge.py --all --yes
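The flag set above maps naturally onto `argparse` boolean switches. A minimal sketch of how such a CLI could be wired (illustrative only, not the repo's actual purge.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton mirroring the purge flags documented above."""
    parser = argparse.ArgumentParser(
        description="Safely clear PDFs, vector database, or model cache.")
    parser.add_argument("--papers", action="store_true", help="delete uploaded PDFs")
    parser.add_argument("--db", action="store_true", help="clear the vector database")
    parser.add_argument("--models", action="store_true", help="remove cached models")
    parser.add_argument("--all", action="store_true", help="clear everything")
    parser.add_argument("--yes", action="store_true", help="skip confirmation prompt")
    return parser
```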

Examples

# Test with example queries
python examples/example_query.py

📁 Project Structure

multilingual-rag/
├── api_server.py          # FastAPI app with authentication
├── config.py              # Configuration with ensure_directories()
├── embeddings.py          # E5 multilingual embeddings
├── ingest.py              # PDF ingestion pipeline
├── lang_utils.py          # Language detection
├── pdf_utils.py           # PDF processing
├── rag.py                 # Core RAG logic
├── translation.py         # NLLB-200 translation (Strategy B)
├── vector_store.py        # ChromaDB wrapper
├── start_server.py        # Server launcher with pre-flight checks
├── purge.py               # Cleanup utility (NEW!)
│
├── deploy/                # Deployment configurations
│   └── nginx.example.conf # Nginx reverse proxy template
│
├── static/                # Web frontend
│   └── index.html         # Modern Ocean UI
│
├── docs/                  # Documentation
│   ├── QUICKSTART.md
│   ├── DEPLOY.md
│   ├── ARCHITECTURE.md
│   └── CONTRIBUTING.md
│
├── PRODUCTION.md          # Production deployment guide
│
├── examples/              # Example scripts
│   ├── example_ingest.py
│   └── example_query.py
│
├── papers/                # Your PDF documents
├── chroma_db/             # Vector database
└── models/                # Cached ML models

⚙️ Configuration

Key settings in config.py:

# Embedding model (multilingual E5)
EMBEDDING_MODEL_NAME = "intfloat/multilingual-e5-base"
EMBEDDING_DIMENSION = 768

# Translation models (Strategy B)
TRANSLATION_MODEL_EN_TO_INDIC = "facebook/nllb-200-distilled-600M"
TRANSLATION_MODEL_INDIC_TO_EN = "facebook/nllb-200-distilled-600M"

# Chunking
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 300

# RAG
DEFAULT_TOP_K = 12
MAX_CONTEXT_CHUNKS = 8
MAX_CONTEXT_LENGTH = 8000

# LLM
LLM_MODEL_NAME = "gemini-3-flash-preview"
LLM_TEMPERATURE = 0.3
LLM_MAX_TOKENS = 2048
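With these defaults, each chunk after the first advances the window by `CHUNK_SIZE - CHUNK_OVERLAP = 700` characters, so the number of chunks a document produces is easy to estimate:

```python
import math

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 300
STRIDE = CHUNK_SIZE - CHUNK_OVERLAP  # 700 fresh characters per chunk

def num_chunks(doc_len: int) -> int:
    """Chunks needed to cover doc_len characters with a sliding window."""
    if doc_len <= CHUNK_SIZE:
        return 1
    return math.ceil((doc_len - CHUNK_SIZE) / STRIDE) + 1
```

A 5,000-character document, for example, yields 7 chunks rather than the 5 a naive `doc_len / CHUNK_SIZE` estimate suggests, so overlap roughly inflates index size by `CHUNK_SIZE / STRIDE`.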

🎯 Supported Languages

| Language  | Code | Native Name |
|-----------|------|-------------|
| English   | en   | English     |
| Hindi     | hi   | हिंदी        |
| Telugu    | te   | తెలుగు       |
| Tamil     | ta   | தமிழ்        |
| Bengali   | bn   | বাংলা        |
| Marathi   | mr   | मराठी        |
| Gujarati  | gu   | ગુજરાતી      |
| Kannada   | kn   | ಕನ್ನಡ        |
| Malayalam | ml   | മലയാളം      |
| Punjabi   | pa   | ਪੰਜਾਬੀ       |
| Odia      | or   | ଓଡ଼ିଆ        |
| Urdu      | ur   | اردو         |

📈 Final KPI Metrics

For detailed evaluation methodology, automated metrics, and per-query qualitative reports, see docs/evaluation.md.

| Metric                              | Final Score |
|-------------------------------------|-------------|
| Retrieval Precision                 | 0.93        |
| Retrieval Recall                    | 0.91        |
| Faithfulness (Grounding Accuracy)   | 0.98        |
| Attribution Accuracy                | 0.97        |
| Technical Depth                     | 0.88        |
| Convergence / Mechanistic Reasoning | 0.86        |
| Cross-Document Discipline           | 0.95        |
| Hallucination Rate                  | < 2%        |
| Formatting & Structural Compliance  | 0.98        |

📊 Performance

Typical query latency (on CPU):

  • Strategy A (direct multilingual): ~1-2s
  • Strategy B (with translation): ~3-6s (includes NLLB translation time)

ChromaDB retrieval: <100ms for 1000s of documents

Memory usage:

  • Base system: ~500MB
  • With E5 embeddings: ~2GB
  • With NLLB translation: ~4.5GB

🔒 Production Features

Security

  • API key authentication with secure parsing
  • Input validation with Pydantic v2
  • CORS configuration
  • Environment-based secrets

Observability

  • Structured logging across all modules
  • Request/response logging in API
  • Processing time tracking
  • Health check endpoint

Robustness

  • Graceful empty collection handling
  • Resource safety (context managers)
  • Permission-aware directory creation
  • Comprehensive error responses

Quality

  • Citation extraction from English answers (Strategy B)
  • Context length enforcement
  • Top-k bounds validation
  • Newline preservation in PDF processing
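Context length enforcement typically means taking chunks in rank order until either the chunk budget or the character budget would be exceeded. A sketch under the config defaults above (`build_context` is a hypothetical name):

```python
MAX_CONTEXT_CHUNKS = 8
MAX_CONTEXT_LENGTH = 8000  # characters

def build_context(chunks: list[str]) -> list[str]:
    """Take retrieved chunks in rank order until either budget is exhausted."""
    selected, total = [], 0
    for chunk in chunks[:MAX_CONTEXT_CHUNKS]:
        if total + len(chunk) > MAX_CONTEXT_LENGTH:
            break  # adding this chunk would overflow the prompt budget
        selected.append(chunk)
        total += len(chunk)
    return selected
```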

🐛 Common Issues & Solutions

"API key not configured"

# Check .env file
cat .env | grep LLM_API_KEY

"No documents indexed"

# Ingest PDFs
python ingest.py

"Translation model gated/authentication required"

  • The system now uses NLLB-200 which requires no authentication
  • First use will download ~2.4GB automatically
  • See documentation for manual download if needed

"Out of memory"

# Edit config.py to reduce memory usage
CHUNK_SIZE = 512  # Smaller chunks
MAX_CONTEXT_CHUNKS = 3  # Fewer chunks in context

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md

Recent improvements:

  • ✅ Complete UI/UX revamp with dark/light themes and robust markdown support
  • ✅ Parallel PDF ingestion pipeline with MD5 hash caching
  • ✅ Bulk extraction endpoint (/ingest/all)
  • ✅ Enhanced math-aware sentence chunking via regex library
  • ✅ Stricter system prompts enforcing epistemic honesty and mechanistic rigor
  • ✅ Resilient lock-free vector database pruning
  • ✅ Production logging migration
  • ✅ Robust error handling
  • ✅ Pydantic v2 compatibility
  • ✅ Resource safety improvements
  • ✅ Empty collection handling
  • ✅ Citation extraction fixes
  • ✅ Purge utility addition
  • ✅ Web UI document management (upload, ingest, purge)
  • ✅ Server concurrency fix with threadpool

🙏 Acknowledgments

Built with excellent open-source tools, including:

  • FastAPI - web framework
  • ChromaDB - vector storage
  • PyMuPDF - PDF extraction
  • multilingual-e5 (intfloat) - embeddings
  • NLLB-200 (Meta AI) - translation
  • Pydantic - validation


📄 License

MIT License - see LICENSE file for details.


🆘 Support


Built with ❤️ for multilingual scientific accessibility

⭐ Star this repo if you find it useful!

Report Bug · Request Feature · Documentation