PGVectorRAGIndexer v2.11.7

MCP Compatible

Start here:

Commercial use?
PGVectorRAGIndexer is free for personal use, education, research, and evaluation.
Production, CI/CD, or ongoing internal company use requires a commercial license.
See COMMERCIAL.md.


🤔 In Plain English (What does this actually do?)

Imagine a magic bookshelf that reads and understands every book, document, and note you put on it. When you have a question, you don't have to search for keywords yourself—you just ask the bookshelf in plain English (like "How do I fix the printer?" or "What was our revenue last year?"), and it instantly hands you the exact page with the answer.

Best of all? It lives 100% on your own hardware. Your documents never get sent to the "cloud" or some stranger's server. Whether on your laptop or your company's server, your data stays yours.

Most search bars are dumb—they only find exact word matches. If you search for "dog", they won't find "puppy".

This system is smart. It reads your documents, understands the meaning behind the text, and lets you search using natural language.
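The difference can be made concrete with a tiny sketch: semantic search compares meaning vectors (embeddings) rather than raw words. The three-dimensional vectors below are made-up illustrative values — the real system uses 384-dimensional sentence-transformer embeddings — but the scoring math is the same:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: close to 1.0 = related meaning, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative values only)
vec = {
    "dog":     [0.9, 0.8, 0.1],
    "puppy":   [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.0, 0.9],
}

print(cosine_similarity(vec["dog"], vec["puppy"]))    # high: related meaning
print(cosine_similarity(vec["dog"], vec["invoice"]))  # low: unrelated
```

A keyword match between "dog" and "puppy" scores zero; their embeddings score close to 1.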


🔒 Designed for Trust, Not the Cloud

What this system does not do is just as important as what it does.

PGVectorRAGIndexer does not upload your documents to external services or rely on any external AI service. Everything runs on your own infrastructure — your laptop for personal use, or your company's server for teams.

This design is intentional. It means:

  • Your files never leave your infrastructure
  • The application runtime requires no mandatory accounts, cloud subscriptions, or hidden data flows
  • You can use it safely with personal, private, or sensitive documents

Policy: PGVectorRAGIndexer is local-only by design — your data stays on your hardware, and we do not offer hosted indexing or storage.

Unlike chat-based tools, this app focuses on search and discovery, not conversation.
It helps you find and rediscover documents by meaning — even when you don’t remember the exact words.


🖥️ Choose Your Deployment

For teams and organizations, run PGVectorRAGIndexer on a shared server and connect Windows/macOS/Linux desktop clients remotely. No Docker needed on client machines.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Desktop    │     │  Desktop    │     │  Desktop    │
│  (Windows)  │     │  (macOS)    │     │  (Linux)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │ HTTP
                  ┌────────▼────────┐
                  │  Shared Server  │
                  │  (Docker)       │
                  │  Centralized    │
                  │  indexing,      │
                  │  search, audit  │
                  └─────────────────┘
                 Personal / Desktop         Team / Organization Server
Best For         Individual use             Teams, departments, companies
Architecture     Everything on one machine  Shared server + desktop clients
Setup            One-click installer        Deploy server, install desktop clients
Docker           Required locally           Server only — clients need no Docker
Admin Controls   Single user                Users, roles, audit logs, retention
Data             Your machine only          Centralized — IT-managed, backed up
Guide            INSTALL_DESKTOP_APP.md     DEPLOYMENT.md

Both modes run entirely on your hardware — no cloud, no external services, your data never leaves your network.

💡 CLI is also available in both modes for power users:

  • Batch processing
  • Scripting and automation
  • Headless usage

CLI requires the backend running (via Docker or Desktop App).


✨ What's New in v2.11.7

🆕 Latest Features (v2.11.7)

  • ✅ First-Run Onboarding Wizard: Guided 5-step setup shown automatically on first launch — connect, verify, license, index sample docs, and run a first search. Re-accessible at any time from Settings → Run Setup Wizard. Re-shown automatically after each version upgrade.
  • ✅ Wizard Tab Hints: Completion pages link to the relevant app tab with contextual hints; wizard auto-navigates to Search on finish.
  • ✅ Windows Installer Ref Pinning: Release MSI now installs the exact release tag (e.g. v2.11.7), not main — users installing months later always get the correct versioned code.
  • ✅ App Version Everywhere: Version shown in the title bar, in the Settings header, and tracked per-install for upgrade detection.
  • ✅ Check for Updates: Pulls latest app code (git pull) and Docker images in one click from the Settings tab.
  • ✅ Organization Console: Full admin console for server-side governance — users, roles, permissions, retention, audit logs, API keys, and SCIM provisioning. Adapts to server capabilities with 4-state detection. Admin write operations: user CRUD, API key lifecycle (create/revoke/rotate), retention manual cleanup, and compliance export. Permission-aware gating matches backend authorization.
  • ✅ Identity Endpoint (GET /me): Server returns the current user's identity, role, and resolved permissions. Loopback mode returns effective admin authority.
  • ✅ Settings Integration: Backend URL or API key changes automatically invalidate cached capabilities and trigger a re-probe.

Recent Features (v2.7)

  • ✅ License Validation Fixed: Reverted incorrect public key update that caused valid RSA license signatures to be rejected.
  • ✅ Deep Observability: JSONFormatter for structured JSON logging, system metrics on /health.
  • ✅ Desktop API Client Facade: Decoupled monolithic APIClient into domain clients (SystemClient, DocumentClient, etc.).
  • ✅ Data Retention Orchestration: Automated background execution of retention policies.

Recent Milestones (v2.6)

  • ✅ Windows Installer Parity: Full feature parity with legacy PowerShell scripts, including auto-installation of Rancher/Docker environments.
  • ✅ API Modularization: Refactored monolithic backend into a scalable, domain-driven package architecture.
  • ✅ Asymmetric Licensing (RS256): Zero-config desktop activation using embedded public keys.
  • ✅ Installer Self-Healing: Automated detection and recovery from stuck Docker daemons and system reboots.

Architecture & Scale (v2.4 - v2.5)

  • ✅ Hierarchical Document Tree: Scalable, lazy-loaded tree view replacing the legacy list for 100k+ document navigation.
  • ✅ Privacy-First Analytics: Opt-in anonymous usage tracking with a local audit log and transparency dashboard.
  • ✅ Split-Backend Testing: E2E validation suite for multi-server production deployments.
  • ✅ WSL Native Dialogs: Seamless integration of Windows native file pickers when running under WSL/Linux.
  • ✅ Database Hardening: Context-managed connection pooling across all 14 backend modules to eliminate leaks.

Intelligence & Integration (v2.1 - v2.3)

  • ✅ MCP Server: AI agents (Claude CLI, Claude Desktop, Cursor) can connect directly to your local database

    • Uses stdio transport — zero network exposure, no open ports
    • Exposes search_documents, index_document, list_documents tools
    • See AI Agent Integration (MCP) for setup
  • ✅ Security Documentation: New SECURITY.md with network config guidance + friendly notes in docs

  • ✅ Encrypted PDF Detection: Password-protected PDFs are gracefully detected and skipped

    • Returns 403 with error_type: encrypted_pdf for clear error handling
    • GET /documents/encrypted endpoint to list all skipped encrypted PDFs
    • CLI logs encrypted PDFs to encrypted_pdfs.log for headless mode tracking
  • ✅ Improved Error Panel: Desktop app now shows a resizable dialog with filter tabs (All/Encrypted/Other) and CSV export

  • ✅ Incremental Indexing: Smart content change detection using xxHash. Skips unchanged files, saving bandwidth and processing time.

  • ✅ Wildcard Search: Document type filter now supports * for "All Types".

  • ✅ Dynamic UI: Upload tab automatically fetches available document types from the database.

  • ✅ Document Type System: Organize documents with custom types (policy, resume, report, etc.)

  • ✅ Bulk Delete with Preview: Safely delete multiple documents with preview before action

  • ✅ Export/Backup System: Export documents as JSON backup before deletion

  • ✅ Undo/Restore Functionality: Restore deleted documents from backup

  • ✅ Desktop App Manage Tab: Full GUI for bulk operations with backup/restore

  • ✅ Legacy Word Support: Added .doc (Office 97-2003) file support

ℹ️ Legacy .doc conversion: Automatic conversion relies on LibreOffice. Docker users (recommended) have this pre-installed. Manual/Local developers must install it and expose the soffice binary (e.g., export LIBREOFFICE_PATH in .env).
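Incremental indexing (listed above) rests on a simple idea: hash each file's content, remember the hash, and skip re-embedding when nothing changed. A minimal sketch — the project uses xxHash, but hashlib's BLAKE2 stands in here so the snippet needs no third-party package, and a plain dict stands in for the real database:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Fingerprint of the file's content (the project uses xxHash instead)."""
    return hashlib.blake2b(data, digest_size=16).hexdigest()

def needs_reindex(path: str, data: bytes, seen: dict) -> bool:
    """Return True (and record the new hash) only when content changed."""
    h = content_hash(data)
    if seen.get(path) == h:
        return False          # unchanged: skip chunking and embedding entirely
    seen[path] = h
    return True

seen = {}
assert needs_reindex("a.txt", b"v1", seen) is True    # first sight: index
assert needs_reindex("a.txt", b"v1", seen) is False   # unchanged: skip
assert needs_reindex("a.txt", b"v2", seen) is True    # edited: re-index
```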

Major Improvements (v2.0)

  • ✅ Modern Web UI: User-friendly web interface for search, upload, and document management
  • ✅ Modular Architecture: Clean separation of concerns with dedicated modules for config, database, embeddings, and processing
  • ✅ Configuration Management: Pydantic-based configuration with validation and environment variable support
  • ✅ Connection Pooling: Efficient database connection management with automatic retry and health checks
  • ✅ Comprehensive Testing: Full test suite with unit, integration, and end-to-end tests (143 tests!)
  • ✅ Document Deduplication: Automatic detection and prevention of duplicate documents
  • ✅ Metadata Support: Rich metadata storage and filtering capabilities
  • ✅ Hybrid Search: Combine vector similarity with full-text search for better results
  • ✅ REST API: FastAPI-based HTTP API for easy integration
  • ✅ Embedding Cache: In-memory caching for faster repeated queries
  • ✅ Batch Processing: Efficient batch operations for indexing and retrieval
  • ✅ Enhanced Schema: Improved database schema with indexes, triggers, and views
  • ✅ Better Error Handling: Comprehensive error handling with detailed logging
  • ✅ CLI Improvements: Enhanced command-line interface with subcommands

📋 Features

Core Capabilities

  • Multi-Format Support: PDF, DOC, DOCX, XLSX, TXT, HTML, PPTX, CSV, and web URLs
  • OCR Support: Extract text from scanned PDFs and images (PNG, JPG, TIFF, BMP) using Tesseract
  • Semantic Search: Vector similarity search using sentence transformers
  • Hybrid Search: Combine vector and full-text search with configurable weights
  • Document Management: Full CRUD operations for indexed documents
  • Bulk Operations: Preview, export, delete, and restore multiple documents
  • Backup/Restore: Export documents as JSON and restore with undo functionality
  • Connection Pooling: Efficient database connection management
  • Embedding Cache: Speed up repeated queries with in-memory cache
  • Health Monitoring: Built-in health checks and statistics

User Interface

  • Search Interface: Intuitive search with semantic and hybrid options
  • Drag & Drop Upload: Easy file upload with progress tracking
  • Document Browser: View and manage all indexed documents
  • Statistics Dashboard: Real-time system health and metrics
  • Organization Console: Server-side governance console — users, roles, permissions, retention, audit log, API keys, and SCIM visibility with permission-aware write controls

API Features

  • RESTful API: FastAPI-based HTTP API with automatic OpenAPI documentation
  • CORS Support: Configurable CORS for web applications
  • Rate Limiting: Built-in rate limiting support
  • Error Handling: Comprehensive error responses with detailed messages
  • Async Support: Asynchronous operations for better performance

Enterprise Capabilities

  • RBAC and Users: Server-side users, admin/user roles, and permission-checked endpoints
  • SSO/SAML: Optional SAML login flow for enterprise deployments (Okta-oriented configuration)
  • SCIM Provisioning: SCIM 2.0 user provisioning endpoints for enterprise identity workflows
  • Audit and Compliance Controls: Audit events, data retention orchestration, and compliance export support

🔐 Encrypted PDF Handling

Password-protected PDFs cannot be indexed without decryption. Instead of crashing or failing silently, PGVectorRAGIndexer handles them gracefully:

How it works:

  1. When a password-protected PDF is encountered during upload, it's detected immediately
  2. The file is skipped (not indexed) and added to a special "encrypted PDFs" list
  3. Processing continues with the remaining files — one encrypted file won't stop your batch
  4. You can review all encrypted PDFs after the upload completes

Desktop App:

  • After upload, an "🔒 Encrypted PDFs (N)" button appears if any were detected
  • Click to see a filterable list of all encrypted files with their paths
  • You can copy paths or open the folder to manually decrypt files
  • Re-upload the decrypted versions when ready

API:

  • GET /documents/encrypted — List all encrypted PDFs encountered
  • Upload returns 403 with error_type: encrypted_pdf for individual encrypted files

CLI (Headless Mode):

  • Encrypted PDFs are logged to encrypted_pdfs.log for later review
  • Processing continues without interruption

Note: The app cannot decrypt PDFs for you — you'll need the password and a PDF tool to unlock them first.
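The skip-and-report flow above amounts to a small loop around the batch. In this sketch the `is_encrypted` check is injected as a callable so the snippet stays self-contained; in practice it would come from a PDF library (e.g. pypdf's `PdfReader(...).is_encrypted`):

```python
from dataclasses import dataclass, field

@dataclass
class BatchResult:
    indexed: list = field(default_factory=list)
    encrypted: list = field(default_factory=list)

def process_batch(files, is_encrypted):
    """Skip encrypted PDFs without aborting the batch; report them afterwards.

    is_encrypted: callable(path) -> bool, e.g. backed by pypdf.
    """
    result = BatchResult()
    for path in files:
        if path.endswith(".pdf") and is_encrypted(path):
            result.encrypted.append(path)   # skipped, listed after the batch
            continue
        result.indexed.append(path)         # normal indexing would happen here
    return result
```

One locked file never stops the rest of the upload — it just lands in the `encrypted` list.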

🚀 Getting Started (Desktop App)

This is the recommended way to use the app. It gives you the full experience with the native interface.

🛡️ Security note (public Wi-Fi): If you use the app on public or shared Wi-Fi (cafés, airports, hotels), we recommend connecting through a VPN. On a private home network, no special setup is needed.

Most users only need this guide. The additional documents are for advanced deployment, customization, or maintenance scenarios.

See INSTALL_DESKTOP_APP.md for detailed steps for Windows, macOS, and Linux.

Windows Quick Install:

  1. Download PGVectorRAGIndexer-Setup.exe
  2. Double-click to install
  3. Done!

🤖 AI Agent Integration (MCP)

New in v2.3+: You can connect desktop AI agents (Claude Desktop, Cursor, etc.) directly to your local database using the Model Context Protocol (MCP).

PGVectorRAGIndexer acts as a local MCP tool provider — it does not run an LLM, but exposes your indexed documents as searchable tools to compatible AI agents.

Why use this?

  • 🔒 Zero network exposure: Uses stdio pipes, so no ports are opened.
  • 🧠 Context-aware AI: Your AI assistant can fuzzy-search your private documents to answer questions.

Other compatible clients

  • Cursor (agent mode)
  • Claude CLI
  • Custom MCP-compatible agents

Exposed MCP tools

  • search_documents — semantic / hybrid search over indexed files
  • index_document — add or re-index documents
  • list_documents — enumerate indexed sources

Setup for Claude Desktop:

  1. Add this to your claude_desktop_config.json:
    {
      "mcpServers": {
        "local-rag": {
          "command": "/path/to/your/venv/bin/python",
          "args": ["/path/to/PGVectorRAGIndexer/mcp_server.py"]
        }
      }
    }
  2. Restart Claude. You will see a 🔌 icon indicating the connection.
  3. Ask: "Search my documents for 'financial report' and summarize the key points."

🛠️ Technical Capability: Web Server Mode (Advanced)

NOTE: Use this ONLY if you want to run the application as a headless server or strictly use the Web UI. For the normal desktop experience, see the section above.

Prerequisites

  • Docker (required)

One-line Command (Linux/macOS/WSL)

curl -fsSL https://raw.githubusercontent.com/valginer0/PGVectorRAGIndexer/main/docker-run.sh | bash

Services will be available at:

  • 🌐 Web UI: http://localhost:8000
  • 📚 API Docs: http://localhost:8000/docs

Basic Usage

1. Check system health:

curl http://localhost:8000/health

2. Index a document:

# Create a text document in the documents directory
cat > ~/pgvector-rag/documents/sample.txt << 'EOF'
Artificial Intelligence and Machine Learning

Machine learning is a subset of AI that enables computers to learn from data.
Deep learning uses neural networks to process complex patterns.
EOF

# Index it via API
curl -X POST "http://localhost:8000/index" \
  -H "Content-Type: application/json" \
  -d '{"source_uri": "/app/documents/sample.txt"}'

3. Or upload from ANY location:

# Index files from any Windows directory - no copying needed!
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@C:\Users\YourName\Documents\report.pdf"

4. Search for similar content:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "your search query",
    "top_k": 5
  }'

5. View logs and manage:

cd ~/pgvector-rag

# View logs
docker compose logs -f

# Stop services
docker compose down

# Restart services
docker compose restart

# Update to latest version
docker compose pull && docker compose up -d

See USAGE_GUIDE.md for more examples and QUICK_START.md for detailed setup.
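The curl calls above translate directly to Python. A stdlib-only sketch against the same endpoints (the request body fields match the documented API; the JSON response shape is an assumption):

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def search_payload(query, top_k=5, hybrid=False, filters=None):
    """Build the JSON body for POST /search, fields as documented above."""
    body = {"query": query, "top_k": top_k, "use_hybrid": hybrid}
    if filters:
        body["filters"] = filters
    return body

def search(query, **kw):
    """POST the search request and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE}/search",
        data=json.dumps(search_payload(query, **kw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the backend running, `search("machine learning", top_k=3)` returns the same results as the curl example.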


Alternative: Developer Setup (For Contributing)

If you want to modify the code:

git clone https://github.com/valginer0/PGVectorRAGIndexer.git
cd PGVectorRAGIndexer
./setup.sh

This sets up a local development environment with Python virtual environment.

Benefits of the Docker-only install (above):

  • ✅ No repository clone needed
  • ✅ No Python environment setup
  • ✅ Everything in containers
  • ✅ Easy updates: docker compose pull && docker compose up -d

Manual Setup (Advanced)

  1. Clone and navigate to the project:
git clone https://github.com/valginer0/PGVectorRAGIndexer.git
cd PGVectorRAGIndexer
  2. Configure environment:
cp .env.example .env
$EDITOR .env  # Update PROJECT_DIR with your absolute path
  3. Create virtual environment:
python3 -m venv venv
source venv/bin/activate
  4. Install dependencies:
pip install -r requirements.txt
  5. Start database:
./start_database.sh  # Auto-initializes pgvector
  6. Verify setup:
docker ps
docker logs vector_rag_db

📖 Usage

Command-Line Interface

Indexing Documents

# Index a single document
python indexer_v2.py index document.pdf

# Index with Windows path (auto-converted)
python indexer_v2.py index "C:\Users\Name\document.pdf"

# Index a web URL
python indexer_v2.py index https://example.com/article

# Force reindex existing document
python indexer_v2.py index document.pdf --force

# Index with OCR mode (for scanned documents)
python indexer_v2.py index scanned_doc.pdf --ocr-mode auto  # Smart fallback (default)
python indexer_v2.py index native_doc.pdf --ocr-mode skip   # Skip OCR (fastest)
python indexer_v2.py index image.jpg --ocr-mode only        # OCR-only files

# List indexed documents
python indexer_v2.py list

# Show statistics
python indexer_v2.py stats

# Delete a document
python indexer_v2.py delete <document_id>
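The three `--ocr-mode` settings boil down to one decision per file: should OCR run at all? A sketch of that decision — the "auto" threshold used here is illustrative, not the project's actual cutoff:

```python
def should_ocr(native_text: str, ocr_mode: str) -> bool:
    """Decision behind --ocr-mode.

    'only' always OCRs, 'skip' never does, and 'auto' falls back to OCR
    only when native text extraction produced (almost) nothing.
    """
    if ocr_mode == "only":
        return True
    if ocr_mode == "skip":
        return False
    # 'auto': the 20-character threshold is illustrative
    return len(native_text.strip()) < 20
```

This is why "skip" is fastest (scanned documents yield no text and are passed over) and "auto" is the safe default.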

Searching Documents

# Basic search
python retriever_v2.py "What is machine learning?"

# Search with custom parameters
python retriever_v2.py "Python programming" --top-k 10 --min-score 0.8

# Hybrid search (vector + full-text)
python retriever_v2.py "database optimization" --hybrid

# Verbose output (show full text)
python retriever_v2.py "data science" --verbose

# Get context for RAG
python retriever_v2.py "explain transformers" --context

# Filter by document
python retriever_v2.py "query" --filter-doc doc_id_here

REST API

Start API Server

python api.py

Or with uvicorn:

uvicorn api:app --host 0.0.0.0 --port 8000 --reload

API Documentation

Once running, access the interactive API docs at http://localhost:8000/docs.

Example API Calls

Index a document:

curl -X POST "http://localhost:8000/index" \
  -H "Content-Type: application/json" \
  -d '{
    "source_uri": "/path/to/document.pdf",
    "force_reindex": false
  }'

Search documents:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "top_k": 5,
    "use_hybrid": false
  }'

List documents:

curl "http://localhost:8000/documents?limit=10"

Get statistics:

curl "http://localhost:8000/stats"

Health check:

curl "http://localhost:8000/health"

New Metadata & Bulk Operations API (v2.1)

Upload with document type:

curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@document.pdf" \
  -F "document_type=policy"

Upload with OCR mode (for scanned documents/images):

# Auto mode (default) - uses OCR only when native text extraction fails
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@scanned_contract.pdf" \
  -F "ocr_mode=auto"

# Skip mode - faster, never uses OCR (skips scanned docs)
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@native_doc.pdf" \
  -F "ocr_mode=skip"

# Only mode - process only files that require OCR
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@photo_of_text.jpg" \
  -F "ocr_mode=only"

Discover metadata keys:

curl "http://localhost:8000/metadata/keys"
# Returns: ["type", "author", "file_type", "upload_method", ...]

Get values for a metadata key:

curl "http://localhost:8000/metadata/values?key=type"
# Returns: ["policy", "resume", "report", ...]

Search with metadata filter:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "security requirements",
    "top_k": 5,
    "filters": {"type": "policy"}
  }'

Preview bulk delete:

curl -X POST "http://localhost:8000/documents/bulk-delete" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": {"type": "draft"},
    "preview": true
  }'

Export backup before delete:

curl -X POST "http://localhost:8000/documents/export" \
  -H "Content-Type: application/json" \
  -d '{"filters": {"type": "draft"}}' > backup.json

Bulk delete documents:

curl -X POST "http://localhost:8000/documents/bulk-delete" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": {"type": "draft"},
    "preview": false
  }'

Restore from backup (undo):

curl -X POST "http://localhost:8000/documents/restore" \
  -H "Content-Type: application/json" \
  -d @backup.json
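The preview → export → delete sequence above is worth automating so the backup can never be forgotten. A sketch with the HTTP call abstracted as a `post(path, body)` callable (a hypothetical helper, not part of the project's API client):

```python
def safe_bulk_delete(post, filters):
    """Always preview and export before the destructive call.

    post(path, body) performs the HTTP POST and returns parsed JSON.
    Keep the export result: it is the input to POST /documents/restore.
    """
    preview = post("/documents/bulk-delete", {"filters": filters, "preview": True})
    backup = post("/documents/export", {"filters": filters})
    deleted = post("/documents/bulk-delete", {"filters": filters, "preview": False})
    return preview, backup, deleted
```

Ordering the calls inside one function guarantees the destructive request only ever runs after a backup exists.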

Database Management

Manual Database Inspection

Use the interactive database inspection tool:

./inspect_db.sh

Options include:

  • List all documents
  • Count chunks per document
  • Show recent chunks
  • Search for text in chunks
  • Show database statistics
  • List extensions
  • Show table schema
  • Interactive psql session
  • Custom SQL query

Direct Database Access

# View all documents
docker exec vector_rag_db psql -U rag_user -d rag_vector_db -c \
  "SELECT DISTINCT document_id, source_uri, COUNT(*) as chunks 
   FROM document_chunks GROUP BY document_id, source_uri;"

# Search for text
docker exec vector_rag_db psql -U rag_user -d rag_vector_db -c \
  "SELECT document_id, LEFT(text_content, 100) 
   FROM document_chunks WHERE text_content ILIKE '%search_term%';"

# Get statistics
docker exec vector_rag_db psql -U rag_user -d rag_vector_db -c \
  "SELECT COUNT(DISTINCT document_id) as docs, COUNT(*) as chunks 
   FROM document_chunks;"

🔍 Advanced Features

Hybrid Search

Combine vector similarity with full-text search for better results:

# CLI
python retriever_v2.py "query" --hybrid --alpha 0.7

# API
{
  "query": "machine learning",
  "use_hybrid": true,
  "alpha": 0.7  # 0.7 vector + 0.3 full-text
}

v2.2.4+ Improvements:

  • Optimized for Large Databases: Scales to 150K+ chunks efficiently
  • Exact-Match Boost: Full-text matches are automatically boosted to top of results
  • Phrase Support: Mix quoted and unquoted terms in your search:
    Master Card "Simplicity 9112"
    
    Quoted subphrases ("Simplicity 9112") must match as adjacent words; other terms (Master Card) are matched individually.
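The blending behind `alpha` can be sketched in a few lines. The exact-match boost value below is illustrative, not the project's actual constant:

```python
def hybrid_score(vector_score, text_score, alpha=0.7, exact_match=False):
    """alpha weights vector similarity; (1 - alpha) weights full-text rank.

    Exact full-text matches get an additive boost so they rank first.
    """
    score = alpha * vector_score + (1 - alpha) * text_score
    if exact_match:
        score += 1.0   # illustrative boost value
    return score

results = [("fuzzy hit", hybrid_score(0.9, 0.2)),
           ("exact hit", hybrid_score(0.6, 0.9, exact_match=True))]
results.sort(key=lambda r: r[1], reverse=True)
# The exact match outranks the fuzzier vector hit despite a lower vector score.
```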

Metadata Filtering

Add and filter by custom metadata:

# Index with metadata
{
  "source_uri": "/path/to/doc.pdf",
  "metadata": {
    "author": "John Doe",
    "category": "research",
    "year": 2024
  }
}

# Search with filters
{
  "query": "neural networks",
  "filters": {
    "document_id": "abc123"
  }
}
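Conceptually, a chunk matches a filter when every filter key/value pair appears in its metadata — the same semantics as JSONB containment on the Postgres side (a sketch of the idea; the SQL the project actually generates may differ):

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True when every filter key/value appears in the chunk's metadata.

    Roughly the Python equivalent of Postgres: metadata @> filters.
    """
    return all(metadata.get(k) == v for k, v in filters.items())
```

So `{"author": "John Doe"}` matches the example document above, while `{"author": "Jane"}` does not.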

Batch Processing

Process multiple documents efficiently:

from indexer_v2 import DocumentIndexer

indexer = DocumentIndexer()
results = indexer.index_batch([
    "/path/to/doc1.pdf",
    "/path/to/doc2.pdf",
    "https://example.com/article"
])

🧪 Testing

Run All Tests

bash scripts/run_all_tests.sh

Run Specific Test Categories

# Unit tests only
pytest -m unit

# Integration tests
pytest -m integration

# With coverage
pytest --cov=. --cov-report=html

Testing Rules

  • Write tests first: For each bug or feature, add or update tests before changing code.
  • Separate test database: Tests run against rag_vector_db_test to avoid polluting development data. This is configured in tests/conftest.py.
  • No manual installs: Do not run pip install <pkg>. Always add dependencies to requirements.txt and install via pip install -r requirements.txt for reproducibility.

Test Structure

tests/
├── __init__.py
├── conftest.py           # Shared fixtures
├── test_config.py        # Configuration tests
├── test_database.py      # Database tests
└── test_embeddings.py    # Embedding tests

🏗️ Architecture

Module Overview

PGVectorRAGIndexer/
├── config.py              # Configuration management
├── database.py            # Database operations & pooling
├── embeddings.py          # Embedding service
├── document_processor.py  # Document loading & chunking
├── indexer_v2.py         # Indexing CLI
├── retriever_v2.py       # Search CLI
├── api.py                # REST API (Orchestrator)
├── api_models.py         # Pydantic models for API
├── routers/              # Modular API router package
├── services.py           # API service layer / factories
├── init-db.sql           # Database schema
├── requirements.txt      # Dependencies
└── tests/                # Test suite

Database Schema

-- Main table with enhanced features
CREATE TABLE document_chunks (
    chunk_id BIGSERIAL PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    text_content TEXT NOT NULL,
    source_uri TEXT NOT NULL,
    embedding VECTOR(384),
    metadata JSONB DEFAULT '{}',
    indexed_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(document_id, chunk_index)
);

-- Indexes for performance
CREATE INDEX idx_chunks_embedding_hnsw ON document_chunks 
    USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_text_search ON document_chunks 
    USING gin(to_tsvector('english', text_content));
CREATE INDEX idx_chunks_metadata ON document_chunks 
    USING gin(metadata);
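Given this schema, a similarity search orders by pgvector's cosine-distance operator `<=>` (which the HNSW index above accelerates). A sketch of the query, plus a small helper that formats a Python list as a vector literal — the helper name and query text are illustrative, not the project's internals:

```python
def to_pgvector(vec):
    """Format [0.1, 0.2] as the pgvector literal '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

SIMILARITY_QUERY = """
SELECT document_id, text_content,
       1 - (embedding <=> %(q)s::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s;
"""

# Usage with a psycopg cursor (query_embedding: 384-float list):
#   cur.execute(SIMILARITY_QUERY,
#               {"q": to_pgvector(query_embedding), "k": 5})
```

`<=>` returns cosine distance, so `1 - distance` yields the similarity score reported by the API.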

Key Design Patterns

  • Repository Pattern: Clean data access layer
  • Service Layer: Business logic separation
  • Factory Pattern: Service instantiation
  • Singleton Pattern: Global configuration and services
  • Strategy Pattern: Pluggable document loaders

📊 Performance Optimization

Database Tuning

For production, consider these PostgreSQL settings:

-- Increase work memory for vector operations
SET work_mem = '256MB';

-- Tune HNSW index parameters
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);  -- Higher values = better recall, slower build

Embedding Cache

The system caches embeddings in memory. Monitor cache size:

from embeddings import get_embedding_service

service = get_embedding_service()
print(f"Cache size: {service.get_cache_size()}")
service.clear_cache()  # Clear if needed

Connection Pooling

Adjust pool size based on workload:

DB_POOL_SIZE=20
DB_MAX_OVERFLOW=40

🔒 Security Considerations

  1. Database Credentials: Store in .env file (gitignored)
  2. API Authentication: Add authentication middleware for production
  3. Input Validation: All inputs validated via Pydantic
  4. SQL Injection: Protected via parameterized queries
  5. File Upload: Validate file types and sizes

🐛 Troubleshooting

Database Connection Issues

# Check container status
docker ps

# Check logs
docker logs vector_rag_db

# Restart container
docker compose restart

Import Errors

# Ensure virtual environment is activated
source venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt --upgrade

Memory Issues

# Clear embedding cache
python -c "from embeddings import get_embedding_service; get_embedding_service().clear_cache()"

# Reduce batch size
export EMBEDDING_BATCH_SIZE=16

📈 Monitoring

Health Checks

# CLI
python indexer_v2.py stats

# API
curl http://localhost:8000/health

Database Statistics

-- View document statistics
SELECT * FROM document_stats;

-- Check index usage
SELECT * FROM pg_stat_user_indexes WHERE schemaname = 'public';

-- Database size
SELECT pg_size_pretty(pg_database_size(current_database()));

🔄 Migration from v1

The v2 system is backward compatible with v1 data. To migrate:

  1. Backup existing data:
docker exec vector_rag_db pg_dump -U rag_user rag_vector_db > backup.sql
  2. Update schema:
docker exec -i vector_rag_db psql -U rag_user -d rag_vector_db < init-db.sql
  3. Use new CLI:
# Old: python indexer.py document.pdf
# New: python indexer_v2.py index document.pdf

📝 Development

Code Style

# Format code
black .

# Lint code
ruff check .

# Type checking
mypy .

Adding New Document Loaders

Extend DocumentLoader class in document_processor.py:

class CustomLoader(DocumentLoader):
    def can_load(self, source_uri: str) -> bool:
        return source_uri.endswith('.custom')

    def load(self, source_uri: str) -> List[Document]:
        with open(source_uri, encoding='utf-8') as f:
            text = f.read()
        return [Document(page_content=text)]  # adapt to your Document fields

💬 Feedback & Suggestions

This project does not accept pull requests or code contributions.

We welcome:

  • 🐛 Bug reports - Create an issue with details
  • 💡 Feature suggestions - Share your ideas via issues
  • 📝 Feedback - Let us know what could be better
  • 📧 Direct contact - valginer0@gmail.com for specific inquiries

See CONTRIBUTING.md for details on how to provide feedback.

All development is maintained exclusively by the copyright holder to ensure code quality and full ownership.

🙏 Acknowledgments

  • PostgreSQL and pgvector teams
  • Sentence Transformers library
  • LangChain community
  • FastAPI framework

📞 Support

For issues and questions:

  • Check documentation
  • Review test examples
  • Check API docs at /docs
  • Review logs for error details

Version: 2.11.7
Last Updated: 2026
Status: Production Ready ✅


License

PGVectorRAGIndexer is dual-licensed.

Community License

Free for personal use, education, research, and non-commercial evaluation.
Evaluation within a company is permitted for testing and assessment purposes.

Commercial License

A commercial license is required for:

  • production use
  • CI/CD or automated workflows
  • ongoing internal company use
  • use as part of a paid, hosted, or client-facing service

See COMMERCIAL.md for details on when a commercial license is required and how to obtain one.

If you are unsure whether your use case requires a commercial license, feel free to reach out.


Supporting the project

If you use PGVectorRAGIndexer personally or for learning and would like to support ongoing development, you can sponsor the project via GitHub Sponsors.

Sponsorships are optional and do not grant commercial usage rights.

  • ⭐ Star the repository
  • ❤️ Sponsor on GitHub
  • 📢 Share it with others who may find it useful

Thank you for supporting independent open-source development.

About

Private, local semantic document search (RAG) using pgvector — desktop app + Docker. Commercial licensing available. A robust Retrieval-Augmented Generation (RAG) system that turns PostgreSQL into a high-performance vector database using the pgvector extension, indexing documents from local files, URLs, and other sources for fast contextual search.
