
🌐 Multilingual Scientific RAG System

License: MIT · Python 3.11+ · FastAPI · Google Gemini · Production Ready

(Banner: INDICRAG.png)

A production-ready Retrieval-Augmented Generation (RAG) system with multilingual support for scientific research and knowledge exploration. Built with robust error handling, structured logging, and enterprise-grade features.


✨ Key Features

🧠 Advanced Document Processing

  • PDF extraction with PyMuPDF (context managers for resource safety)
  • Intelligent text cleaning (preserves structure, removes noise)
  • Semantic chunking with configurable overlap
  • Persistent vector storage via ChromaDB
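The chunking step can be pictured as a sliding window: each chunk shares its first `overlap` characters with the tail of the previous one, so a sentence split at a boundary still appears whole in at least one chunk. A minimal sketch, assuming a sliding-window strategy (the actual implementation lives in ingest.py; `chunk_text` here is a hypothetical name):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 300) -> list[str]:
    """Split text into overlapping windows (sliding-window chunking)."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    stride = size - overlap  # fresh characters contributed by each chunk
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), stride):
        chunks.append(text[start:start + size])
    return chunks
```

Because the last window always extends `size` characters past its start, every character of the document is covered by at least one chunk.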

🌍 True Multilingual Support

  • 10+ Indian languages + English (Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia)
  • Automatic language detection
  • Two RAG strategies:
    • Strategy A: Direct multilingual reasoning (recommended)
    • Strategy B: Translation-enhanced reasoning with NLLB-200
  • Cross-lingual semantic search with E5 embeddings
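Automatic detection for Indic languages can be approximated by counting which Unicode script block a query's characters fall into. The sketch below is illustrative only and is not the lang_utils.py implementation; note that script counting alone cannot separate Hindi from Marathi (both use Devanagari), which is why real detectors add statistical models:

```python
# Unicode block ranges for a subset of the supported scripts
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari (Hindi, also Marathi)
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}

def detect_language(text: str) -> str:
    """Guess a language code by counting characters per script; default to English."""
    counts = {code: 0 for code in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[code] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "en"
```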

🤖 LLM Integration

  • Google Gemini 3 Flash Preview integration
  • Configurable safety settings and generation parameters
  • Smart citation extraction from retrieved context
  • Empty collection handling with graceful degradation
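One common way to extract citations is to scan the generated answer for `[n]` markers and resolve them against the retrieved chunks' metadata, silently dropping markers that point past the context. A hedged sketch (`extract_citations` and the chunk-dict shape are assumptions, not the repo's actual code):

```python
import re

def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    """Collect [n] markers from the answer and resolve them against retrieved chunks."""
    numbers = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    citations = []
    for n in numbers:
        if 1 <= n <= len(chunks):  # ignore markers with no matching chunk
            meta = chunks[n - 1]
            citations.append({"number": str(n),
                              "title": meta.get("title", "Unknown"),
                              "section": meta.get("section", "")})
    return citations
```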

🛡️ Production-Ready Infrastructure

  • Structured logging throughout (no print() statements)
  • Robust error handling with detailed error shapes
  • API key authentication with secure parsing
  • Health checks and Prometheus metrics monitoring (/metrics)
  • Ready for HTTPS reverse proxy (Nginx) and Windows Service deployment
  • Pydantic v2 validation and type safety
  • Safe directory creation with permission checks

🧹 Operational Tools

  • purge.py - CLI utility to safely clear PDFs, database, or model cache
  • Web-based document management - Upload, ingest, and purge via UI
  • Comprehensive ingestion pipeline with progress tracking
  • Pre-flight checks before server startup
  • Test suite for pipeline validation

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Google Gemini API key
  • 8GB+ RAM recommended

Installation

# Clone the repository
git clone https://github.com/DNSdecoded/IndicRAG.git
cd IndicRAG

# Create virtual environment
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate
# Activate (macOS/Linux)
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Configuration

# Copy environment template
cp .env.example .env

# Edit .env and add your API key
# LLM_API_KEY=your_gemini_api_key_here

# Optional: Configure API authentication
# API_KEYS=key1,key2,key3

Ingest Documents

# Place PDFs in papers/ directory
# Then ingest them:
python ingest.py

# Or specify a directory:
python ingest.py path/to/pdfs

Start Server

# With pre-flight checks
python start_server.py

# Skip checks (for production)
python start_server.py --skip-checks

# Development mode with auto-reload
python start_server.py --dev

🎉 That's it! Access the API at http://localhost:8080.


📖 Usage

Via Web UI

Open http://localhost:8080 and:

  1. Ask Questions - Enter queries in any supported language
  2. Manage Documents - Expand the panel to:
    • Upload PDFs via drag-and-drop
    • View uploaded papers list
    • Ingest papers into the vector store
    • Purge papers or database (with confirmation)

Via REST API

import requests

response = requests.post('http://localhost:8080/query', json={
    "question": "యాంటెన్నాతో ml ను ఎలా అమలు చేయవచ్చు?",  # Telugu: "How can ML be implemented with an antenna?"
    "strategy": "A",
    "top_k": 5
})

result = response.json()
print(result['answer'])
print(f"Citations: {len(result['citations'])}")

Via Python

import rag

result = rag.answer_question(
    "मधुमेह का इलाज क्या है?",  # Hindi: diabetes treatment
    strategy="B",
    top_k=8
)

print(f"Answer ({result['language_name']}): {result['answer']}")
print(f"Used {result['chunks_used']} document chunks")

🔧 API Reference

POST /query

Ask a question and get an AI-powered answer with citations.

Request:

{
  "question": "What is quantum computing?",
  "strategy": "A",
  "top_k": 5
}

Response:

{
  "answer": "Quantum computing is...",
  "language": "en",
  "language_name": "English",
  "chunks_used": 4,
  "citations": [
    {"number": "1", "title": "Quantum Computing Basics", "section": "Introduction"}
  ],
  "processing_time": 1.23
}

POST /ingest

Ingest a PDF document (returns extracted title).

GET /stats

Get vector store statistics.

GET /health

Health check endpoint.

POST /upload

Upload a PDF file (multipart form).

GET /papers

List all uploaded PDFs with sizes.

DELETE /purge/papers

Delete all uploaded PDF files.

DELETE /purge/database

Clear the vector database (all chunks).


🧹 Maintenance Tools

Purge Utility

Safely clear indexed data:

# Delete all PDFs
python purge.py --papers

# Clear vector database
python purge.py --db

# Remove cached models (will re-download)
python purge.py --models

# Clear everything (with confirmation)
python purge.py --all

# Non-interactive mode
python purge.py --all --yes
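The flag set above maps naturally onto `argparse` boolean switches. A minimal sketch of how such a CLI could be wired (illustrative only, not the repo's actual purge.py):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton mirroring the purge flags documented above."""
    parser = argparse.ArgumentParser(
        description="Safely clear PDFs, vector database, or model cache.")
    parser.add_argument("--papers", action="store_true", help="delete uploaded PDFs")
    parser.add_argument("--db", action="store_true", help="clear the vector database")
    parser.add_argument("--models", action="store_true", help="remove cached models")
    parser.add_argument("--all", action="store_true", help="clear everything")
    parser.add_argument("--yes", action="store_true", help="skip confirmation prompt")
    return parser
```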

Examples

# Test with example queries
python examples/example_query.py

📁 Project Structure

multilingual-rag/
├── api_server.py          # FastAPI app with authentication
├── config.py              # Configuration with ensure_directories()
├── embeddings.py          # E5 multilingual embeddings
├── ingest.py              # PDF ingestion pipeline
├── lang_utils.py          # Language detection
├── pdf_utils.py           # PDF processing
├── rag.py                 # Core RAG logic
├── translation.py         # NLLB-200 translation (Strategy B)
├── vector_store.py        # ChromaDB wrapper
├── start_server.py        # Server launcher with pre-flight checks
├── purge.py               # Cleanup utility (NEW!)
│
├── deploy/                # Deployment configurations
│   └── nginx.example.conf # Nginx reverse proxy template
│
├── static/                # Web frontend
│   └── index.html         # Modern Ocean UI
│
├── docs/                  # Documentation
│   ├── QUICKSTART.md
│   ├── DEPLOY.md
│   ├── ARCHITECTURE.md
│   └── CONTRIBUTING.md
│
├── PRODUCTION.md          # Production deployment guide
│
├── examples/              # Example scripts
│   ├── example_ingest.py
│   └── example_query.py
│
├── papers/                # Your PDF documents
├── chroma_db/             # Vector database
└── models/                # Cached ML models

⚙️ Configuration

Key settings in config.py:

# Embedding model (multilingual E5)
EMBEDDING_MODEL_NAME = "intfloat/multilingual-e5-base"
EMBEDDING_DIMENSION = 768

# Translation models (Strategy B)
TRANSLATION_MODEL_EN_TO_INDIC = "facebook/nllb-200-distilled-600M"
TRANSLATION_MODEL_INDIC_TO_EN = "facebook/nllb-200-distilled-600M"

# Chunking
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 300

# RAG
DEFAULT_TOP_K = 12
MAX_CONTEXT_CHUNKS = 8
MAX_CONTEXT_LENGTH = 8000

# LLM
LLM_MODEL_NAME = "gemini-3-flash-preview"
LLM_TEMPERATURE = 0.3
LLM_MAX_TOKENS = 2048
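With these defaults, each chunk after the first advances the window by `CHUNK_SIZE - CHUNK_OVERLAP = 700` characters, so the number of chunks a document produces is easy to estimate:

```python
import math

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 300
STRIDE = CHUNK_SIZE - CHUNK_OVERLAP  # 700 fresh characters per chunk

def num_chunks(doc_len: int) -> int:
    """Chunks needed to cover doc_len characters with a sliding window."""
    if doc_len <= CHUNK_SIZE:
        return 1
    return math.ceil((doc_len - CHUNK_SIZE) / STRIDE) + 1
```

A 5,000-character document, for example, yields 7 chunks rather than the 5 a naive `doc_len / CHUNK_SIZE` estimate suggests, so overlap roughly inflates index size by `CHUNK_SIZE / STRIDE`.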

🎯 Supported Languages

| Language  | Code | Native Name |
|-----------|------|-------------|
| English   | en   | English     |
| Hindi     | hi   | हिंदी        |
| Telugu    | te   | తెలుగు       |
| Tamil     | ta   | தமிழ்        |
| Bengali   | bn   | বাংলা        |
| Marathi   | mr   | मराठी        |
| Gujarati  | gu   | ગુજરાતી      |
| Kannada   | kn   | ಕನ್ನಡ        |
| Malayalam | ml   | മലയാളം      |
| Punjabi   | pa   | ਪੰਜਾਬੀ       |
| Odia      | or   | ଓଡ଼ିଆ        |
| Urdu      | ur   | اردو         |

📈 Final KPI Metrics

For detailed evaluation methodology, automated metrics, and per-query qualitative reports, see docs/evaluation.md.

| Metric                              | Final Score |
|-------------------------------------|-------------|
| Retrieval Precision                 | 0.93        |
| Retrieval Recall                    | 0.91        |
| Faithfulness (Grounding Accuracy)   | 0.98        |
| Attribution Accuracy                | 0.97        |
| Technical Depth                     | 0.88        |
| Convergence / Mechanistic Reasoning | 0.86        |
| Cross-Document Discipline           | 0.95        |
| Hallucination Rate                  | < 2%        |
| Formatting & Structural Compliance  | 0.98        |

📊 Performance

Typical query latency (on CPU):

  • Strategy A (direct multilingual): ~1-2s
  • Strategy B (with translation): ~3-6s (includes NLLB translation time)

ChromaDB retrieval: <100ms for 1000s of documents

Memory usage:

  • Base system: ~500MB
  • With E5 embeddings: ~2GB
  • With NLLB translation: ~4.5GB

🔒 Production Features

Security

  • API key authentication with secure parsing
  • Input validation with Pydantic v2
  • CORS configuration
  • Environment-based secrets

Observability

  • Structured logging across all modules
  • Request/response logging in API
  • Processing time tracking
  • Health check endpoint

Robustness

  • Graceful empty collection handling
  • Resource safety (context managers)
  • Permission-aware directory creation
  • Comprehensive error responses

Quality

  • Citation extraction from English answers (Strategy B)
  • Context length enforcement
  • Top-k bounds validation
  • Newline preservation in PDF processing
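Context length enforcement typically means taking chunks in rank order until either the chunk budget or the character budget would be exceeded. A sketch under the config defaults above (`build_context` is a hypothetical name):

```python
MAX_CONTEXT_CHUNKS = 8
MAX_CONTEXT_LENGTH = 8000  # characters

def build_context(chunks: list[str]) -> list[str]:
    """Take retrieved chunks in rank order until either budget is exhausted."""
    selected, total = [], 0
    for chunk in chunks[:MAX_CONTEXT_CHUNKS]:
        if total + len(chunk) > MAX_CONTEXT_LENGTH:
            break  # adding this chunk would overflow the prompt budget
        selected.append(chunk)
        total += len(chunk)
    return selected
```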

🐛 Common Issues & Solutions

"API key not configured"

# Check .env file
cat .env | grep LLM_API_KEY

"No documents indexed"

# Ingest PDFs
python ingest.py

"Translation model gated/authentication required"

  • The system now uses NLLB-200 which requires no authentication
  • First use will download ~2.4GB automatically
  • See documentation for manual download if needed

"Out of memory"

# Edit config.py to reduce memory usage
CHUNK_SIZE = 512  # Smaller chunks
MAX_CONTEXT_CHUNKS = 3  # Fewer chunks in context

🤝 Contributing

Contributions welcome! See CONTRIBUTING.md

Recent improvements:

  • ✅ Complete UI/UX revamp with dark/light themes and robust markdown support
  • ✅ Parallel PDF ingestion pipeline with MD5 hash caching
  • ✅ Bulk extraction endpoint (/ingest/all)
  • ✅ Enhanced math-aware sentence chunking via regex library
  • ✅ Stricter system prompts enforcing epistemic honesty and mechanistic rigor
  • ✅ Resilient lock-free vector database pruning
  • ✅ Production logging migration
  • ✅ Robust error handling
  • ✅ Pydantic v2 compatibility
  • ✅ Resource safety improvements
  • ✅ Empty collection handling
  • ✅ Citation extraction fixes
  • ✅ Purge utility addition
  • ✅ Web UI document management (upload, ingest, purge)
  • ✅ Server concurrency fix with threadpool

🙏 Acknowledgments

Built with excellent open-source tools, including:

  • FastAPI - web framework
  • ChromaDB - vector storage
  • PyMuPDF - PDF extraction
  • multilingual-e5 (intfloat) - embeddings
  • NLLB-200 (Meta AI) - translation
  • Pydantic - validation


📄 License

MIT License - see LICENSE file for details.


🆘 Support


Built with ❤️ for multilingual scientific accessibility

⭐ Star this repo if you find it useful!

Report Bug · Request Feature · Documentation