Name	Name	Last commit message	Last commit date
parent directory ..
data	data
results	results
.env.example	.env.example
.gitignore	.gitignore
README.md	README.md
pyproject.toml	pyproject.toml
uv.lock	uv.lock

Proxy-Pointer MultiModal -- Answer with Visuals, not just text 🖼️

MultiModal Retrieval without Multimodal Embeddings RAG Pipeline — An extension of the Proxy-Pointer architecture that performs unified reasoning across text and graphics. It retrieves full document sections and automatically surfaces the associated figures, tables, and charts for multimodal synthesis.

How It Works

graph TD
    A[Multimodal Documents] -->|Adobe Extract| B[Markdown + Figures/Tables]
    B -->|Skeleton Tree Builder| C[Structure Trees w/ Image Anchors]
    C -->|LLM Noise Filter| D[Content Nodes]
    B --> E[Text Chunks]
    D --> E
    E -->|Gemini Embed 1536d| F[FAISS Index]

    G[User Query] -->|Gemini Embed| H[Vector Search k=200]
    F --> H
    H -->|Dedup by doc_id,node_id| I[Top 50 Unique Candidates]
    I -->|Semantic Re-Ranker w/ Snippets| J[Top 5 Finalist Nodes]
    J -->|Load Full Text + Image Paths| K[Rich Multimodal Context]
    K -->|Gemini 3.1 Synthesis| L[Grounded Answer + Visuals]
    L --> M[Streamlit UI Rendering]

Instead of retrieving small, context-less chunks, Proxy-Pointer MultiModal:

Builds a structural tree of each document, automatically mapping local image paths (figures/tables) to their specific parent sections.
Filters noise (TOC, abbreviations, etc.) using an LLM to ensure the re-ranker stays focused on technical content.
Uses Semantic Snippets: The re-ranker sees a 150-character preview of each section, allowing it to find technical topics (like "QK-Normalization") even if they aren't in the heading.
Performs Anchor-Aware Retrieval: If a query mentions "Figure 5" or "Table I", the system prioritizes sections that physically contain those anchors.
Loads Full Multimodal Payloads: The synthesizer sees the complete section text and the actual images paths, enabling high-fidelity "Visual Grounding" without using multimodal embeddings or a Vision model.

Architecture Deep Dive

For the full technical story behind the architecture:

Proxy-Pointer RAG: Multimodal Answers Without Multimodal Embeddings — Structure is all you need
Proxy-Pointer RAG: Structure Meets Scale — 100% Accuracy with Smarter Retrieval — Scaling to multi-document, LLM re-ranking, and benchmark results
Proxy-Pointer RAG: Achieving Vectorless Accuracy at Vector RAG Scale and Cost — Core architecture & the pointer-based retrieval idea

5-Minute Quickstart

A sample dataset containing five complex research papers (CLIP, GaLore, NemoBot, VectorFusion, and VectorPainter) is included. You just need to build the index and start querying.

1. Clone

git clone https://github.com/Proxy-Pointer/Proxy-Pointer-RAG.git
cd Proxy-Pointer-RAG
cd MultiModal

2. Create Virtual Environment & Install Dependencies

We strongly recommend creating a virtual environment first:

python -m venv venv
# Windows: venv\Scripts\activate | macOS/Linux: source venv/bin/activate

You can then install dependencies using standard pip or using uv (recommended for developers).

Option A: Standard pip

Install the package:

pip install "pprag[multimodal]"

Option B: For Developers (using uv)

If you want to tinker with the code, this project uses uv for lightning-fast dependency management.

pip install uv
uv sync --project MultiModal

# Remember to prefix commands with `uv run` if you use this method!

3. Configure API keys

cp .env.example .env
# Edit .env → add:
# 1. GOOGLE_API_KEY
# Note: Also review other commented variables, especially the FAISS trust settings required for local index loading!

4. Build the index

# This builds the FAISS index and structure trees for all 5 papers
# (Prefix with `uv run` if you installed via Option B)
pprag multimodal index --fresh

5. Start the UI

# Prefix with `uv run` if you installed via Option B
pprag multimodal serve

Try a query like:

User >> Tell me about the VectorPainter pipeline
or a more advanced comparative one:
User >> Compare GaLore's approach to memory efficiency with the standard Low-Rank Adaptation (LoRA) mentioned in generative model fine-tuning.

Running Benchmarks

The MultiModal version includes an automated evaluation script that runs a 20-query technical suite across all 5 papers.

# Run the technical evaluation
# (Prefix with `uv run` if you installed via Option B)
pprag multimodal benchmark

Benchmark Results (20 Queries · 5 Papers · 0 Failures)

Q	Paper(s)	Category	Text	Images	Score	Reason
1	Galore	Precise Data	🟢	🟢	100%	Exact hyperparams extracted; Table 7 surfaced
2	VectorPainter	Precise Data	🟢	🟢	100%	Stroke extraction + baseline comparison; pipeline + human eval table
3	Galore	Precise Data	🟢	🟢	100%	Exact perplexity values (15.56 / 15.64 / 19.21); Table 2 retrieved
4	CLIP	Precise Data	🟢	🟢	100%	Exact HM scores (81.06 / 78.55 / 71.66); Table 1 retrieved
5	VectorFusion	Visual	🟢	🟢	100%	All 3 pipeline stages described; Figure 3 + Figure 5 retrieved
6	CLIP	Visual	🟢	🟢	100%	All 3 loss components correct; Figure 2 framework diagram retrieved
7	NemoBot	Visual	🟢	🟢	100%	System flow + UI described; Figure 3 + Figure 4 retrieved
8	VectorFusion	Visual	🟢	🟢	100%	Exact CLIP R-Precision scores; Table 1 comparison retrieved
9	NemoBot	Visual	🟢	🔴	75%	Michie's algorithm correctly explained; no images retrieved (relevant images were in a sub-section that missed the Top-5 node cutoff)
10	Galore	Structural	🟢	🟢	100%	LLaMA 7B results + memory breakdown; Table 3 + Table 1 retrieved
11	VectorPainter	Structural	🟢	🟢	95%	Pipeline explained; honestly states "Vector Dropout not found"; Figure 3 + Figure 4
12	CLIP	Structural	🟢	🟢	100%	KL divergence formula correct; Figure 5 equation image retrieved
13	VectorFusion	Complex	🟢	🟢	100%	Model details
14	NemoBot	Complex	🟢	🟢	100%	All 3 Shannon categories with game examples; Table I + 5 supporting figures
15	CLIP + VF	Cross-Doc	🟢	🟡	90%	CLIP metrics compared
16	Galore + CLIP	Cross-Doc	🟢	🟢	100%	Gradient projection vs fine-tuning contrasted; GaLore Table 1 + CLIP Table 5
17	VP + VF	Cross-Doc	🟢	🟡	90%	Diffusion + SVG strategies contrasted
18	NemoBot	Anchor-Aware	🟢	🟢	100%	All games listed with Shannon categories; Table I + 5 supporting figures
19	VectorFusion	Anchor-Aware	🟢	🟢	100%	Bezier path effects correctly described; Figure 9 ablation chart retrieved
20	NemoBot + Galore	Cross-Doc	🟢	🟢	100%	LLM interaction vs memory efficiency contrasted; Figure 3 + Table 1

20 / 20 text responses correct · 17 / 20 image retrievals perfect · Weighted average: 96%

Full results are available in results/test_log.json.

Adding Your Own Documents

Option A: You already have processed folders

If you have folders containing .md files and associated images (e.g., from a previous extraction):

Place the folders inside data/extracted_papers/.

Run the indexer to process the new text and anchors:

# Prefix with `uv run` if you installed via Option B
pprag multimodal index

Option B: Start from PDFs

Add your ADOBE_CLIENT_ID and ADOBE_CLIENT_SECRET to your .env file.
Place your raw PDFs in data/pdf/.

Extract to vision-ready Markdown and Figures:

# This uses Adobe Extract PDF API to generate MD + Image assets
# (Prefix with `uv run` if you installed via Option B)
pprag multimodal extract

Build the index:

# Prefix with `uv run` if you installed via Option B
pprag multimodal index

Project Structure

Proxy-Pointer/
├── MultiModal/                        # Workspace subproject
│   ├── data/                          # Unified Data Hub
│   │   ├── extracted_papers/          # Processed Markdown & Figures
│   │   └── pdf/                       # Original Source PDFs
│   ├── results/                       # Benchmarking Hub
│   │   ├── test_log.json              # 20-query results & metrics
│   │   └── test_queries.json          # Benchmark questions
│   ├── README.md                      # This file
│   └── pyproject.toml                 # Workspace package configuration
├── src/
│   └── pprag_multimodal/              # Package source code (installed via pip)
│       ├── config.py                  # Model selection (Gemini 3.1 Flash Lite)
│       ├── agent/
│       │   └── mm_rag_bot.py          # MultiModal RAG Logic
│       ├── indexing/
│       │   ├── md_tree_builder.py     # Structure Tree generator
│       │   └── build_md_index.py      # Vector index builder
│       ├── extraction/
│       │   └── extract_pdf.py         # Adobe PDF Extraction to MD logic
│       └── run_test_suite.py          # Automated benchmark runner

Configuration

All configuration is centralized in src/pprag_multimodal/config.py. Override via environment variables:

Variable	Default	Description
`GOOGLE_API_KEY`	(required)	Gemini API key
`ADOBE_CLIENT_ID`	(required)	Adobe PDF Services ID for extraction
`ADOBE_CLIENT_SECRET`	(required)	Adobe PDF Services Secret
`PP_DATA_DIR`	`data/extracted_papers/`	Markdown source directory
`PP_TREES_DIR`	`data/trees/`	Structure tree directory
`PP_INDEX_DIR`	`data/index/`	FAISS index directory
`PP_RESULTS_DIR`	`results/`	Benchmark results directory

Design Decisions

Why Semantic Snippets?

Standard vector RAG returns chunks by embedding similarity. However, technical deep-dives (like "gradient projection" in the GaLore paper) are often buried in sections with generic headers like "Memory-Efficient Training". By giving the re-ranker a Semantic Snippet (first 150 chars), the LLM can identify the technical answer even when the header is unhelpful.

Why Anchor-Aware Re-ranking?

Users often ask about specific "ghost anchors" (e.g., "What does Table 1 show?"). Since multiple papers might have a "Table 1," the re-ranker performs a structural sweep of all candidates to find the specific section that physically contains the anchor mentioned in the query.

Why Full-Section Loading?

The indexed chunk is just a fragment. The synthesizer needs the complete section (including headers, full tables, and every associated image) for accurate multimodal synthesis. The chunk acts as a pointer; the full section + its images is the payload.

Author

Partha Sarkar

Contact

GitHub Issues: For bug reports.
General Questions: For general questions, ideas, and enhancement requests, reach out to me on LinkedIn or Email.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Proxy-Pointer MultiModal -- Answer with Visuals, not just text 🖼️

How It Works

Architecture Deep Dive

5-Minute Quickstart

1. Clone

2. Create Virtual Environment & Install Dependencies

Option A: Standard pip

Option B: For Developers (using uv)

3. Configure API keys

4. Build the index

5. Start the UI

Running Benchmarks

Benchmark Results (20 Queries · 5 Papers · 0 Failures)

Adding Your Own Documents

Option A: You already have processed folders

Option B: Start from PDFs

Project Structure

Configuration

Design Decisions

Why Semantic Snippets?

Why Anchor-Aware Re-ranking?

Why Full-Section Loading?

Author

Contact

License

FilesExpand file tree

MultiModal

Directory actions

More options

Directory actions

More options

Latest commit

History

MultiModal

Folders and files

parent directory

README.md

Proxy-Pointer MultiModal -- Answer with Visuals, not just text 🖼️

How It Works

Architecture Deep Dive

5-Minute Quickstart

1. Clone

2. Create Virtual Environment & Install Dependencies

Option A: Standard pip

Option B: For Developers (using uv)

3. Configure API keys

4. Build the index

5. Start the UI

Running Benchmarks

Benchmark Results (20 Queries · 5 Papers · 0 Failures)

Adding Your Own Documents

Option A: You already have processed folders

Option B: Start from PDFs

Project Structure

Configuration

Design Decisions

Why Semantic Snippets?

Why Anchor-Aware Re-ranking?

Why Full-Section Loading?

Author

Contact

License