Marwan947/ColPali-RAG

Multi-Modal Document Intelligence (RAG-Based QA System)

A dual-path Multi-Modal Retrieval-Augmented Generation system that combines ColPali visual retrieval with pdfplumber structured extraction, using Gemini for multimodal answer generation.

Architecture

                         ┌───────────────┐
                         │  PDF Document  │
                         └───────┬───────┘
                                 │
                 ┌───────────────┼───────────────┐
                 │                               │
         ┌───────▼───────┐               ┌───────▼───────┐
         │   Path A       │               │   Path B       │
         │  (Traditional) │               │   (ColPali)    │
         │                │               │                │
         │  pdfplumber    │               │  pdf2image     │
         │  ├─ Text       │               │  → page images │
         │  └─ Tables     │               │                │
         │       │        │               │  ColPali       │
         │       ▼        │               │  (PaliGemma)   │
         │  Semantic      │               │  multi-vector  │
         │  Chunking      │               │  embeddings    │
         │  (512 chars,   │               │       │        │
         │   64 overlap)  │               │       ▼        │
         │       │        │               │  Qdrant        │
         │       ▼        │               │  (MaxSim)      │
         │  Per-page text │               │  native        │
         │  store         │               │  multi-vector  │
         └───────┬───────┘               └───────┬───────┘
                 │                               │
                 │  ┌────────────────────────────┘
                 │  │  Query: ColPali retrieves top pages
                 │  │
                 ▼  ▼
         ┌───────────────────────────────────────┐
         │  Gemini (multimodal LLM)               │
         │                                        │
         │  Receives BOTH:                        │
         │  • Page images (visual: charts, layout)│
         │  • Extracted text/tables (structured)  │
         │                                        │
         │  → Answer with [Page X] citations      │
         └────────────────────────────────────────┘

Why dual-path?

| Path | Strength | Weakness |
|---|---|---|
| Path A (pdfplumber) | Precise text/table extraction, structured data | Loses layout, misses charts/figures |
| Path B (ColPali) | Understands full visual layout, charts, figures natively | No structured text for the LLM |
| Combined | Best of both: ColPali retrieves the right pages, pdfplumber provides clean text | |
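
A minimal sketch of how the two paths might meet at query time. It assumes the per-page text store is a plain dict keyed by page number and that retrieval returns (page number, score) pairs. The names `build_context` and `page_text_store`, and the sample data, are hypothetical and not taken from this repository.

```python
from typing import Dict, List, Tuple


def build_context(
    retrieved: List[Tuple[int, float]],
    page_text_store: Dict[int, str],
) -> str:
    """Join extracted text for the retrieved pages, labelled for [Page X] citations."""
    sections = []
    for page_number, _score in retrieved:
        text = page_text_store.get(page_number, "").strip()
        if not text:
            # Nothing was extracted (e.g. a full-page chart); the page image carries the content.
            text = "(no extracted text; rely on the page image)"
        sections.append(f"[Page {page_number}]\n{text}")
    return "\n\n".join(sections)


# Hypothetical example: visual retrieval ranked pages 3 and 7 highest.
store = {3: "Example text extracted from page 3.", 7: ""}
context = build_context([(3, 0.92), (7, 0.81)], store)
print(context)
```

The resulting string would be sent to the multimodal LLM together with the images of the same pages, so the model sees both the layout and the clean text.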

Modules

| Module | Purpose |
|---|---|
| src/ingestion/pdf_parser.py | Dual-path parser: pdfplumber text/tables + pdf2image pages |
| src/ingestion/chunker.py | Semantic chunking with sentence boundaries + table preservation |
| src/embeddings/colpali_embedder.py | ColPali/ColSmol page + query embedding |
| src/retrieval/colpali_retriever.py | Qdrant index with native MaxSim multi-vector search |
| src/generation/generator.py | Gemini multimodal generation (images + structured text) |
| src/evaluation/evaluator.py | Benchmark suite with keyword recall + ROUGE scoring |
| src/pipeline.py | Dual-path pipeline orchestrator |
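
To illustrate the chunking step, here is a rough sketch of sentence-boundary chunking with the 512-character limit and 64-character overlap shown in the architecture diagram. The function `chunk_text` and its regex-based sentence splitter are assumptions for illustration; the actual logic in src/ingestion/chunker.py may differ, and table preservation is not shown.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 512, overlap: int = 64) -> List[str]:
    """Greedily pack whole sentences into chunks, carrying a character overlap.

    A new chunk starts once adding the next sentence would pass max_chars.
    A single sentence longer than the limit is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = " ".join(f"Sentence number {i} ends here." for i in range(60))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])
```

Splitting only at sentence ends keeps each chunk readable on its own, while the overlap repeats a little trailing context at the start of the next chunk.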

Setup

Prerequisites

  • Python 3.10+
  • Poppler: brew install poppler (macOS) or apt install poppler-utils (Ubuntu)
  • Docker (for Qdrant)
  • GPU recommended (falls back to ColSmol-256M on CPU)

Installation

pip install -r requirements.txt

Start Qdrant

docker run -d -p 6333:6333 qdrant/qdrant

Configuration

cp .env.example .env
# Edit .env and add your Google API key (for Gemini)

Usage

Streamlit Web App

streamlit run app.py

CLI

# Ingest a document (both paths run automatically)
python cli.py ingest documents/report.pdf

# Ask a question
python cli.py query "What is the GDP growth rate?"

# Interactive chat
python cli.py chat

# Run evaluation
python cli.py evaluate --output eval_report.json

Design Choices

| Choice | Rationale |
|---|---|
| ColPali for retrieval | Embeds pages visually; handles text, tables, charts without extraction |
| MaxSim late interaction | ColBERT-style token-level matching via Qdrant native multi-vector support |
| pdfplumber for text | Provides structured text/tables to Gemini alongside page images |
| Semantic chunking | Sentence-boundary splitting with overlap; tables preserved as single chunks |
| Gemini for generation | Multimodal LLM that takes both images and text in a single prompt |
| ColSmol fallback | Lightweight CPU-compatible model when no GPU is available |
| Keyword + ROUGE eval | Keyword recall for heuristic scoring, ROUGE when reference answers exist |
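
For readers unfamiliar with MaxSim, the scoring rule can be sketched in a few lines of plain Python. In this project Qdrant computes the score natively, so the snippet is only an illustration; `maxsim_score` and the toy two-dimensional vectors are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late interaction: each query token keeps its best-matching page vector; the maxima are summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy multi-vector embeddings: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5]]

print(maxsim_score(query, page_a))  # 2.0
print(maxsim_score(query, page_b))  # 1.0
```

Because every query token is matched independently, a page scores highly when each part of the query finds a strong match somewhere on it, which is what lets a single page embedding cover text, tables, and figures.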

Evaluation

The evaluation suite benchmarks four question categories (difficulty levels in parentheses):

  • Text comprehension (easy/medium)
  • Table data extraction (easy/medium)
  • Image/chart interpretation (medium)
  • Cross-modal reasoning (hard)

Metrics:

  • Overall score (heuristic: answer presence + citations + keyword recall)
  • ROUGE-1/2/L (when reference answers provided)
  • Keyword recall rate
  • Citation rate
  • Extracted text usage rate
  • Per-query latency
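
Two of these metrics are simple enough to sketch. The helpers below are assumptions for illustration, not the code in src/evaluation/evaluator.py: `keyword_recall` does case-insensitive substring matching, and `has_citation` looks for the `[Page X]` pattern the generator is asked to emit. How the overall heuristic score weights its components is not stated here, so it is not reproduced.

```python
import re
from typing import List


def keyword_recall(answer: str, expected_keywords: List[str]) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords)


def has_citation(answer: str) -> bool:
    """True if the answer contains at least one [Page X] style citation."""
    return re.search(r"\[Page \d+\]", answer) is not None


# Hypothetical answer and keyword list for a single benchmark query.
answer = "Revenue rose in the third quarter [Page 4]."
recall = keyword_recall(answer, ["revenue", "third quarter", "margin"])
cited = has_citation(answer)
print(recall, cited)
```

Averaging `has_citation` over all benchmark queries would give a citation rate, and averaging `keyword_recall` would give the keyword recall rate.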
