PGVectorRAGIndexer v2.11.7

MCP Compatible

Start here:

Commercial use?
PGVectorRAGIndexer is free for personal use, education, research, and evaluation.
Production, CI/CD, or ongoing internal company use requires a commercial license.
See COMMERCIAL.md.


🤔 In Plain English (What does this actually do?)

Imagine a magic bookshelf that reads and understands every book, document, and note you put on it. When you have a question, you don't have to search for keywords yourself—you just ask the bookshelf in plain English (like "How do I fix the printer?" or "What was our revenue last year?"), and it instantly hands you the exact page with the answer.

Best of all? It lives 100% on your own hardware. Your documents never get sent to the "cloud" or some stranger's server. Whether on your laptop or your company's server, your data stays yours.

Most search bars are dumb—they only find exact word matches. If you search for "dog", they won't find "puppy".

This system is smart. It reads your documents, understands the meaning behind the text, and lets you search using natural language.
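The difference can be made concrete with a tiny sketch: semantic search compares meaning vectors (embeddings) rather than raw words. The three-dimensional vectors below are made-up illustrative values — the real system uses 384-dimensional sentence-transformer embeddings — but the scoring math is the same:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: close to 1.0 = related meaning, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative values only)
vec = {
    "dog":     [0.9, 0.8, 0.1],
    "puppy":   [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.0, 0.9],
}

print(cosine_similarity(vec["dog"], vec["puppy"]))    # high: related meaning
print(cosine_similarity(vec["dog"], vec["invoice"]))  # low: unrelated
```

A keyword match between "dog" and "puppy" scores zero; their embeddings score close to 1.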


🔒 Designed for Trust, Not the Cloud

What this system does not do is just as important as what it does.

PGVectorRAGIndexer does not upload your documents to external services or rely on any external AI service. Everything runs on your own infrastructure — your laptop for personal use, or your company's server for teams.

This design is intentional. It means:

  • Your files never leave your infrastructure
  • The application runtime requires no mandatory accounts, cloud subscriptions, or hidden data flows
  • You can use it safely with personal, private, or sensitive documents

Policy: PGVectorRAGIndexer is local-only by design — your data stays on your hardware, and we do not offer hosted indexing or storage.

Unlike chat-based tools, this app focuses on search and discovery, not conversation.
It helps you find and rediscover documents by meaning — even when you don’t remember the exact words.


🖥️ Choose Your Deployment

For teams and organizations, run PGVectorRAGIndexer on a shared server and connect Windows/macOS/Linux desktop clients remotely. No Docker needed on client machines.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Desktop    │     │  Desktop    │     │  Desktop    │
│  (Windows)  │     │  (macOS)    │     │  (Linux)    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │ HTTP
                  ┌────────▼────────┐
                  │  Shared Server  │
                  │  (Docker)       │
                  │  Centralized    │
                  │  indexing,      │
                  │  search, audit  │
                  └─────────────────┘
                 Personal / Desktop         Team / Organization Server
Best For         Individual use             Teams, departments, companies
Architecture     Everything on one machine  Shared server + desktop clients
Setup            One-click installer        Deploy server, install desktop clients
Docker           Required locally           Server only — clients need no Docker
Admin Controls   Single user                Users, roles, audit logs, retention
Data             Your machine only          Centralized — IT-managed, backed up
Guide            INSTALL_DESKTOP_APP.md     DEPLOYMENT.md

Both modes run entirely on your hardware — no cloud, no external services, your data never leaves your network.

💡 CLI is also available in both modes for power users:

  • Batch processing
  • Scripting and automation
  • Headless usage

CLI requires the backend running (via Docker or Desktop App).


✨ What's New in v2.11.7

🆕 Latest Features (v2.11.7)

  • ✅ First-Run Onboarding Wizard: Guided 5-step setup shown automatically on first launch — connect, verify, license, index sample docs, and run a first search. Re-accessible at any time from Settings → Run Setup Wizard. Re-shown automatically after each version upgrade.
  • ✅ Wizard Tab Hints: Completion pages link to the relevant app tab with contextual hints; wizard auto-navigates to Search on finish.
  • ✅ Windows Installer Ref Pinning: Release MSI now installs the exact release tag (e.g. v2.11.7), not main — users installing months later always get the correct versioned code.
  • ✅ App Version Everywhere: Version shown in the title bar, in the Settings header, and tracked per-install for upgrade detection.
  • ✅ Check for Updates: Pulls latest app code (git pull) and Docker images in one click from the Settings tab.
  • ✅ Organization Console: Full admin console for server-side governance — users, roles, permissions, retention, audit logs, API keys, and SCIM provisioning. Adapts to server capabilities with 4-state detection. Admin write operations: user CRUD, API key lifecycle (create/revoke/rotate), retention manual cleanup, and compliance export. Permission-aware gating matches backend authorization.
  • ✅ Identity Endpoint (GET /me): Server returns the current user's identity, role, and resolved permissions. Loopback mode returns effective admin authority.
  • ✅ Settings Integration: Backend URL or API key changes automatically invalidate cached capabilities and trigger a re-probe.

Recent Features (v2.7)

  • ✅ License Validation Fixed: Reverted incorrect public key update that caused valid RSA license signatures to be rejected.
  • ✅ Deep Observability: JSONFormatter for structured JSON logging, system metrics on /health.
  • ✅ Desktop API Client Facade: Decoupled monolithic APIClient into domain clients (SystemClient, DocumentClient, etc.).
  • ✅ Data Retention Orchestration: Automated background execution of retention policies.

Recent Milestones (v2.6)

  • ✅ Windows Installer Parity: Full feature parity with legacy PowerShell scripts, including auto-installation of Rancher/Docker environments.
  • ✅ API Modularization: Refactored monolithic backend into a scalable, domain-driven package architecture.
  • ✅ Asymmetric Licensing (RS256): Zero-config desktop activation using embedded public keys.
  • ✅ Installer Self-Healing: Automated detection and recovery from stuck Docker daemons and system reboots.

Architecture & Scale (v2.4 - v2.5)

  • ✅ Hierarchical Document Tree: Scalable, lazy-loaded tree view replacing the legacy list for 100k+ document navigation.
  • ✅ Privacy-First Analytics: Opt-in anonymous usage tracking with a local audit log and transparency dashboard.
  • ✅ Split-Backend Testing: E2E validation suite for multi-server production deployments.
  • ✅ WSL Native Dialogs: Seamless integration of Windows native file pickers when running under WSL/Linux.
  • ✅ Database Hardening: Context-managed connection pooling across all 14 backend modules to eliminate leaks.

Intelligence & Integration (v2.1 - v2.3)

  • ✅ MCP Server: AI agents (Claude CLI, Claude Desktop, Cursor) can connect directly to your local database

    • Uses stdio transport — zero network exposure, no open ports
    • Exposes search_documents, index_document, list_documents tools
    • See AI Agent Integration (MCP) for setup
  • ✅ Security Documentation: New SECURITY.md with network config guidance + friendly notes in docs

  • ✅ Encrypted PDF Detection: Password-protected PDFs are gracefully detected and skipped

    • Returns 403 with error_type: encrypted_pdf for clear error handling
    • GET /documents/encrypted endpoint to list all skipped encrypted PDFs
    • CLI logs encrypted PDFs to encrypted_pdfs.log for headless mode tracking
  • ✅ Improved Error Panel: Desktop app now shows a resizable dialog with filter tabs (All/Encrypted/Other) and CSV export

  • ✅ Incremental Indexing: Smart content change detection using xxHash. Skips unchanged files, saving bandwidth and processing time.

  • ✅ Wildcard Search: Document type filter now supports * for "All Types".

  • ✅ Dynamic UI: Upload tab automatically fetches available document types from the database.

  • ✅ Document Type System: Organize documents with custom types (policy, resume, report, etc.)

  • ✅ Bulk Delete with Preview: Safely delete multiple documents with preview before action

  • ✅ Export/Backup System: Export documents as JSON backup before deletion

  • ✅ Undo/Restore Functionality: Restore deleted documents from backup

  • ✅ Desktop App Manage Tab: Full GUI for bulk operations with backup/restore

  • ✅ Legacy Word Support: Added .doc (Office 97-2003) file support

ℹ️ Legacy .doc conversion: Automatic conversion relies on LibreOffice. Docker users (recommended) have this pre-installed. Manual/Local developers must install it and expose the soffice binary (e.g., export LIBREOFFICE_PATH in .env).
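Incremental indexing (listed above) rests on a simple idea: hash each file's content, remember the hash, and skip re-embedding when nothing changed. A minimal sketch — the project uses xxHash, but hashlib's BLAKE2 stands in here so the snippet needs no third-party package, and a plain dict stands in for the real database:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Fingerprint of the file's content (the project uses xxHash instead)."""
    return hashlib.blake2b(data, digest_size=16).hexdigest()

def needs_reindex(path: str, data: bytes, seen: dict) -> bool:
    """Return True (and record the new hash) only when content changed."""
    h = content_hash(data)
    if seen.get(path) == h:
        return False          # unchanged: skip chunking and embedding entirely
    seen[path] = h
    return True

seen = {}
assert needs_reindex("a.txt", b"v1", seen) is True    # first sight: index
assert needs_reindex("a.txt", b"v1", seen) is False   # unchanged: skip
assert needs_reindex("a.txt", b"v2", seen) is True    # edited: re-index
```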

Major Improvements (v2.0)

  • ✅ Modern Web UI: User-friendly web interface for search, upload, and document management
  • ✅ Modular Architecture: Clean separation of concerns with dedicated modules for config, database, embeddings, and processing
  • ✅ Configuration Management: Pydantic-based configuration with validation and environment variable support
  • ✅ Connection Pooling: Efficient database connection management with automatic retry and health checks
  • ✅ Comprehensive Testing: Full test suite with unit, integration, and end-to-end tests (143 tests!)
  • ✅ Document Deduplication: Automatic detection and prevention of duplicate documents
  • ✅ Metadata Support: Rich metadata storage and filtering capabilities
  • ✅ Hybrid Search: Combine vector similarity with full-text search for better results
  • ✅ REST API: FastAPI-based HTTP API for easy integration
  • ✅ Embedding Cache: In-memory caching for faster repeated queries
  • ✅ Batch Processing: Efficient batch operations for indexing and retrieval
  • ✅ Enhanced Schema: Improved database schema with indexes, triggers, and views
  • ✅ Better Error Handling: Comprehensive error handling with detailed logging
  • ✅ CLI Improvements: Enhanced command-line interface with subcommands

📋 Features

Core Capabilities

  • Multi-Format Support: PDF, DOC, DOCX, XLSX, TXT, HTML, PPTX, CSV, and web URLs
  • OCR Support: Extract text from scanned PDFs and images (PNG, JPG, TIFF, BMP) using Tesseract
  • Semantic Search: Vector similarity search using sentence transformers
  • Hybrid Search: Combine vector and full-text search with configurable weights
  • Document Management: Full CRUD operations for indexed documents
  • Bulk Operations: Preview, export, delete, and restore multiple documents
  • Backup/Restore: Export documents as JSON and restore with undo functionality
  • Connection Pooling: Efficient database connection management
  • Embedding Cache: Speed up repeated queries with in-memory cache
  • Health Monitoring: Built-in health checks and statistics

User Interface

  • Search Interface: Intuitive search with semantic and hybrid options
  • Drag & Drop Upload: Easy file upload with progress tracking
  • Document Browser: View and manage all indexed documents
  • Statistics Dashboard: Real-time system health and metrics
  • Organization Console: Server-side governance console — users, roles, permissions, retention, audit log, API keys, and SCIM visibility with permission-aware write controls

API Features

  • RESTful API: FastAPI-based HTTP API with automatic OpenAPI documentation
  • CORS Support: Configurable CORS for web applications
  • Rate Limiting: Built-in rate limiting support
  • Error Handling: Comprehensive error responses with detailed messages
  • Async Support: Asynchronous operations for better performance

Enterprise Capabilities

  • RBAC and Users: Server-side users, admin/user roles, and permission-checked endpoints
  • SSO/SAML: Optional SAML login flow for enterprise deployments (Okta-oriented configuration)
  • SCIM Provisioning: SCIM 2.0 user provisioning endpoints for enterprise identity workflows
  • Audit and Compliance Controls: Audit events, data retention orchestration, and compliance export support

🔐 Encrypted PDF Handling

Password-protected PDFs cannot be indexed without decryption. Instead of crashing or failing silently, PGVectorRAGIndexer handles them gracefully:

How it works:

  1. When a password-protected PDF is encountered during upload, it's detected immediately
  2. The file is skipped (not indexed) and added to a special "encrypted PDFs" list
  3. Processing continues with the remaining files — one encrypted file won't stop your batch
  4. You can review all encrypted PDFs after the upload completes

Desktop App:

  • After upload, an "🔒 Encrypted PDFs (N)" button appears if any were detected
  • Click to see a filterable list of all encrypted files with their paths
  • You can copy paths or open the folder to manually decrypt files
  • Re-upload the decrypted versions when ready

API:

  • GET /documents/encrypted — List all encrypted PDFs encountered
  • Upload returns 403 with error_type: encrypted_pdf for individual encrypted files

CLI (Headless Mode):

  • Encrypted PDFs are logged to encrypted_pdfs.log for later review
  • Processing continues without interruption

Note: The app cannot decrypt PDFs for you — you'll need the password and a PDF tool to unlock them first.
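The skip-and-report flow above amounts to a small loop around the batch. In this sketch the `is_encrypted` check is injected as a callable so the snippet stays self-contained; in practice it would come from a PDF library (e.g. pypdf's `PdfReader(...).is_encrypted`):

```python
from dataclasses import dataclass, field

@dataclass
class BatchResult:
    indexed: list = field(default_factory=list)
    encrypted: list = field(default_factory=list)

def process_batch(files, is_encrypted):
    """Skip encrypted PDFs without aborting the batch; report them afterwards.

    is_encrypted: callable(path) -> bool, e.g. backed by pypdf.
    """
    result = BatchResult()
    for path in files:
        if path.endswith(".pdf") and is_encrypted(path):
            result.encrypted.append(path)   # skipped, listed after the batch
            continue
        result.indexed.append(path)         # normal indexing would happen here
    return result
```

One locked file never stops the rest of the upload — it just lands in the `encrypted` list.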

🚀 Getting Started (Desktop App)

This is the recommended way to use the app. It gives you the full experience with the native interface.

🛡️ Security note (public Wi-Fi): If you use the app on public or shared Wi-Fi (cafés, airports, hotels), we recommend connecting through a VPN. On a private home network, no special setup is needed.

Most users only need this guide. The additional documents are for advanced deployment, customization, or maintenance scenarios.

See INSTALL_DESKTOP_APP.md for detailed steps for Windows, macOS, and Linux.

Windows Quick Install:

  1. Download PGVectorRAGIndexer-Setup.exe
  2. Double-click to install
  3. Done!

🤖 AI Agent Integration (MCP)

New in v2.3+: You can connect desktop AI agents (Claude Desktop, Cursor, etc.) directly to your local database using the Model Context Protocol (MCP).

PGVectorRAGIndexer acts as a local MCP tool provider — it does not run an LLM, but exposes your indexed documents as searchable tools to compatible AI agents.

Why use this?

  • 🔒 Zero network exposure: Uses stdio pipes, so no ports are opened.
  • 🧠 Context-aware AI: Your AI assistant can fuzzy-search your private documents to answer questions.

Other compatible clients

  • Cursor (agent mode)
  • Claude CLI
  • Custom MCP-compatible agents

Exposed MCP tools

  • search_documents — semantic / hybrid search over indexed files
  • index_document — add or re-index documents
  • list_documents — enumerate indexed sources

Setup for Claude Desktop:

  1. Add this to your claude_desktop_config.json:
    {
      "mcpServers": {
        "local-rag": {
          "command": "/path/to/your/venv/bin/python",
          "args": ["/path/to/PGVectorRAGIndexer/mcp_server.py"]
        }
      }
    }
  2. Restart Claude. You will see a 🔌 icon indicating the connection.
  3. Ask: "Search my documents for 'financial report' and summarize the key points."

🛠️ Technical Capability: Web Server Mode (Advanced)

NOTE: Use this ONLY if you want to run the application as a headless server or strictly use the Web UI. For the normal desktop experience, see the section above.

Prerequisites

  • Docker (required)

One-line Command (Linux/macOS/WSL)

curl -fsSL https://raw.githubusercontent.com/valginer0/PGVectorRAGIndexer/main/docker-run.sh | bash

Services will be available at:

  • 🌐 Web UI: http://localhost:8000
  • 📚 API Docs: http://localhost:8000/docs

Basic Usage

1. Check system health:

curl http://localhost:8000/health

2. Index a document:

# Create a text document in the documents directory
cat > ~/pgvector-rag/documents/sample.txt << 'EOF'
Artificial Intelligence and Machine Learning

Machine learning is a subset of AI that enables computers to learn from data.
Deep learning uses neural networks to process complex patterns.
EOF

# Index it via API
curl -X POST "http://localhost:8000/index" \
  -H "Content-Type: application/json" \
  -d '{"source_uri": "/app/documents/sample.txt"}'

3. Or upload from ANY location:

# Index files from any Windows directory - no copying needed!
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@C:\Users\YourName\Documents\report.pdf"

4. Search for similar content:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "your search query",
    "top_k": 5
  }'

5. View logs and manage:

cd ~/pgvector-rag

# View logs
docker compose logs -f

# Stop services
docker compose down

# Restart services
docker compose restart

# Update to latest version
docker compose pull && docker compose up -d

See USAGE_GUIDE.md for more examples and QUICK_START.md for detailed setup.
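The curl calls above translate directly to Python. A stdlib-only sketch against the same endpoints (the request body fields match the documented API; the JSON response shape is an assumption):

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def search_payload(query, top_k=5, hybrid=False, filters=None):
    """Build the JSON body for POST /search, fields as documented above."""
    body = {"query": query, "top_k": top_k, "use_hybrid": hybrid}
    if filters:
        body["filters"] = filters
    return body

def search(query, **kw):
    """POST the search request and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE}/search",
        data=json.dumps(search_payload(query, **kw)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the backend running, `search("machine learning", top_k=3)` returns the same results as the curl example.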


Alternative: Developer Setup (For Contributing)

If you want to modify the code:

git clone https://github.com/valginer0/PGVectorRAGIndexer.git
cd PGVectorRAGIndexer
./setup.sh

This sets up a local development environment with Python virtual environment.

Benefits of the Docker-only install (above):

  • ✅ No repository clone needed
  • ✅ No Python environment setup
  • ✅ Everything in containers
  • ✅ Easy updates: docker compose pull && docker compose up -d

Manual Setup (Advanced)

  1. Clone and navigate to the project:
git clone https://github.com/valginer0/PGVectorRAGIndexer.git
cd PGVectorRAGIndexer
  2. Configure environment:
cp .env.example .env
$EDITOR .env  # Update PROJECT_DIR with your absolute path
  3. Create virtual environment:
python3 -m venv venv
source venv/bin/activate
  4. Install dependencies:
pip install -r requirements.txt
  5. Start database:
./start_database.sh  # Auto-initializes pgvector
  6. Verify setup:
docker ps
docker logs vector_rag_db

📖 Usage

Command-Line Interface

Indexing Documents

# Index a single document
python indexer_v2.py index document.pdf

# Index with Windows path (auto-converted)
python indexer_v2.py index "C:\Users\Name\document.pdf"

# Index a web URL
python indexer_v2.py index https://example.com/article

# Force reindex existing document
python indexer_v2.py index document.pdf --force

# Index with OCR mode (for scanned documents)
python indexer_v2.py index scanned_doc.pdf --ocr-mode auto  # Smart fallback (default)
python indexer_v2.py index native_doc.pdf --ocr-mode skip   # Skip OCR (fastest)
python indexer_v2.py index image.jpg --ocr-mode only        # OCR-only files

# List indexed documents
python indexer_v2.py list

# Show statistics
python indexer_v2.py stats

# Delete a document
python indexer_v2.py delete <document_id>
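The three `--ocr-mode` settings boil down to one decision per file: should OCR run at all? A sketch of that decision — the "auto" threshold used here is illustrative, not the project's actual cutoff:

```python
def should_ocr(native_text: str, ocr_mode: str) -> bool:
    """Decision behind --ocr-mode.

    'only' always OCRs, 'skip' never does, and 'auto' falls back to OCR
    only when native text extraction produced (almost) nothing.
    """
    if ocr_mode == "only":
        return True
    if ocr_mode == "skip":
        return False
    # 'auto': the 20-character threshold is illustrative
    return len(native_text.strip()) < 20
```

This is why "skip" is fastest (scanned documents yield no text and are passed over) and "auto" is the safe default.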

Searching Documents

# Basic search
python retriever_v2.py "What is machine learning?"

# Search with custom parameters
python retriever_v2.py "Python programming" --top-k 10 --min-score 0.8

# Hybrid search (vector + full-text)
python retriever_v2.py "database optimization" --hybrid

# Verbose output (show full text)
python retriever_v2.py "data science" --verbose

# Get context for RAG
python retriever_v2.py "explain transformers" --context

# Filter by document
python retriever_v2.py "query" --filter-doc doc_id_here

REST API

Start API Server

python api.py

Or with uvicorn:

uvicorn api:app --host 0.0.0.0 --port 8000 --reload

API Documentation

Once running, access the interactive API docs at http://localhost:8000/docs.

Example API Calls

Index a document:

curl -X POST "http://localhost:8000/index" \
  -H "Content-Type: application/json" \
  -d '{
    "source_uri": "/path/to/document.pdf",
    "force_reindex": false
  }'

Search documents:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "top_k": 5,
    "use_hybrid": false
  }'

List documents:

curl "http://localhost:8000/documents?limit=10"

Get statistics:

curl "http://localhost:8000/stats"

Health check:

curl "http://localhost:8000/health"

New Metadata & Bulk Operations API (v2.1)

Upload with document type:

curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@document.pdf" \
  -F "document_type=policy"

Upload with OCR mode (for scanned documents/images):

# Auto mode (default) - uses OCR only when native text extraction fails
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@scanned_contract.pdf" \
  -F "ocr_mode=auto"

# Skip mode - faster, never uses OCR (skips scanned docs)
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@native_doc.pdf" \
  -F "ocr_mode=skip"

# Only mode - process only files that require OCR
curl -X POST "http://localhost:8000/upload-and-index" \
  -F "file=@photo_of_text.jpg" \
  -F "ocr_mode=only"

Discover metadata keys:

curl "http://localhost:8000/metadata/keys"
# Returns: ["type", "author", "file_type", "upload_method", ...]

Get values for a metadata key:

curl "http://localhost:8000/metadata/values?key=type"
# Returns: ["policy", "resume", "report", ...]

Search with metadata filter:

curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "security requirements",
    "top_k": 5,
    "filters": {"type": "policy"}
  }'

Preview bulk delete:

curl -X POST "http://localhost:8000/documents/bulk-delete" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": {"type": "draft"},
    "preview": true
  }'

Export backup before delete:

curl -X POST "http://localhost:8000/documents/export" \
  -H "Content-Type: application/json" \
  -d '{"filters": {"type": "draft"}}' > backup.json

Bulk delete documents:

curl -X POST "http://localhost:8000/documents/bulk-delete" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": {"type": "draft"},
    "preview": false
  }'

Restore from backup (undo):

curl -X POST "http://localhost:8000/documents/restore" \
  -H "Content-Type: application/json" \
  -d @backup.json
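The preview → export → delete sequence above is worth automating so the backup can never be forgotten. A sketch with the HTTP call abstracted as a `post(path, body)` callable (a hypothetical helper, not part of the project's API client):

```python
def safe_bulk_delete(post, filters):
    """Always preview and export before the destructive call.

    post(path, body) performs the HTTP POST and returns parsed JSON.
    Keep the export result: it is the input to POST /documents/restore.
    """
    preview = post("/documents/bulk-delete", {"filters": filters, "preview": True})
    backup = post("/documents/export", {"filters": filters})
    deleted = post("/documents/bulk-delete", {"filters": filters, "preview": False})
    return preview, backup, deleted
```

Ordering the calls inside one function guarantees the destructive request only ever runs after a backup exists.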

Database Management

Manual Database Inspection

Use the interactive database inspection tool:

./inspect_db.sh

Options include:

  • List all documents
  • Count chunks per document
  • Show recent chunks
  • Search for text in chunks
  • Show database statistics
  • List extensions
  • Show table schema
  • Interactive psql session
  • Custom SQL query

Direct Database Access

# View all documents
docker exec vector_rag_db psql -U rag_user -d rag_vector_db -c \
  "SELECT DISTINCT document_id, source_uri, COUNT(*) as chunks 
   FROM document_chunks GROUP BY document_id, source_uri;"

# Search for text
docker exec vector_rag_db psql -U rag_user -d rag_vector_db -c \
  "SELECT document_id, LEFT(text_content, 100) 
   FROM document_chunks WHERE text_content ILIKE '%search_term%';"

# Get statistics
docker exec vector_rag_db psql -U rag_user -d rag_vector_db -c \
  "SELECT COUNT(DISTINCT document_id) as docs, COUNT(*) as chunks 
   FROM document_chunks;"

🔍 Advanced Features

Hybrid Search

Combine vector similarity with full-text search for better results:

# CLI
python retriever_v2.py "query" --hybrid --alpha 0.7

# API
{
  "query": "machine learning",
  "use_hybrid": true,
  "alpha": 0.7  # 0.7 vector + 0.3 full-text
}

v2.2.4+ Improvements:

  • Optimized for Large Databases: Scales to 150K+ chunks efficiently
  • Exact-Match Boost: Full-text matches are automatically boosted to top of results
  • Phrase Support: Mix quoted and unquoted terms in your search:
    Master Card "Simplicity 9112"
    
    Quoted subphrases ("Simplicity 9112") must match as adjacent words; other terms (Master Card) are matched individually.
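The blending behind `alpha` can be sketched in a few lines. The exact-match boost value below is illustrative, not the project's actual constant:

```python
def hybrid_score(vector_score, text_score, alpha=0.7, exact_match=False):
    """alpha weights vector similarity; (1 - alpha) weights full-text rank.

    Exact full-text matches get an additive boost so they rank first.
    """
    score = alpha * vector_score + (1 - alpha) * text_score
    if exact_match:
        score += 1.0   # illustrative boost value
    return score

results = [("fuzzy hit", hybrid_score(0.9, 0.2)),
           ("exact hit", hybrid_score(0.6, 0.9, exact_match=True))]
results.sort(key=lambda r: r[1], reverse=True)
# The exact match outranks the fuzzier vector hit despite a lower vector score.
```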

Metadata Filtering

Add and filter by custom metadata:

# Index with metadata
{
  "source_uri": "/path/to/doc.pdf",
  "metadata": {
    "author": "John Doe",
    "category": "research",
    "year": 2024
  }
}

# Search with filters
{
  "query": "neural networks",
  "filters": {
    "document_id": "abc123"
  }
}
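Conceptually, a chunk matches a filter when every filter key/value pair appears in its metadata — the same semantics as JSONB containment on the Postgres side (a sketch of the idea; the SQL the project actually generates may differ):

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True when every filter key/value appears in the chunk's metadata.

    Roughly the Python equivalent of Postgres: metadata @> filters.
    """
    return all(metadata.get(k) == v for k, v in filters.items())
```

So `{"author": "John Doe"}` matches the example document above, while `{"author": "Jane"}` does not.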

Batch Processing

Process multiple documents efficiently:

from indexer_v2 import DocumentIndexer

indexer = DocumentIndexer()
results = indexer.index_batch([
    "/path/to/doc1.pdf",
    "/path/to/doc2.pdf",
    "https://example.com/article"
])

🧪 Testing

Run All Tests

bash scripts/run_all_tests.sh

Run Specific Test Categories

# Unit tests only
pytest -m unit

# Integration tests
pytest -m integration

# With coverage
pytest --cov=. --cov-report=html

Testing Rules

  • Write tests first: For each bug or feature, add or update tests before changing code.
  • Separate test database: Tests run against rag_vector_db_test to avoid polluting development data. This is configured in tests/conftest.py.
  • No manual installs: Do not run pip install <pkg>. Always add dependencies to requirements.txt and install via pip install -r requirements.txt for reproducibility.

Test Structure

tests/
├── __init__.py
├── conftest.py           # Shared fixtures
├── test_config.py        # Configuration tests
├── test_database.py      # Database tests
└── test_embeddings.py    # Embedding tests

🏗️ Architecture

Module Overview

PGVectorRAGIndexer/
├── config.py              # Configuration management
├── database.py            # Database operations & pooling
├── embeddings.py          # Embedding service
├── document_processor.py  # Document loading & chunking
├── indexer_v2.py         # Indexing CLI
├── retriever_v2.py       # Search CLI
├── api.py                # REST API (Orchestrator)
├── api_models.py         # Pydantic models for API
├── routers/              # Modular API router package
├── services.py           # API service layer / factories
├── init-db.sql           # Database schema
├── requirements.txt      # Dependencies
└── tests/                # Test suite

Database Schema

-- Main table with enhanced features
CREATE TABLE document_chunks (
    chunk_id BIGSERIAL PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    text_content TEXT NOT NULL,
    source_uri TEXT NOT NULL,
    embedding VECTOR(384),
    metadata JSONB DEFAULT '{}',
    indexed_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(document_id, chunk_index)
);

-- Indexes for performance
CREATE INDEX idx_chunks_embedding_hnsw ON document_chunks 
    USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_text_search ON document_chunks 
    USING gin(to_tsvector('english', text_content));
CREATE INDEX idx_chunks_metadata ON document_chunks 
    USING gin(metadata);
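Given this schema, a similarity search orders by pgvector's cosine-distance operator `<=>` (which the HNSW index above accelerates). A sketch of the query, plus a small helper that formats a Python list as a vector literal — the helper name and query text are illustrative, not the project's internals:

```python
def to_pgvector(vec):
    """Format [0.1, 0.2] as the pgvector literal '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"

SIMILARITY_QUERY = """
SELECT document_id, text_content,
       1 - (embedding <=> %(q)s::vector) AS similarity
FROM document_chunks
ORDER BY embedding <=> %(q)s::vector
LIMIT %(k)s;
"""

# Usage with a psycopg cursor (query_embedding: 384-float list):
#   cur.execute(SIMILARITY_QUERY,
#               {"q": to_pgvector(query_embedding), "k": 5})
```

`<=>` returns cosine distance, so `1 - distance` yields the similarity score reported by the API.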

Key Design Patterns

  • Repository Pattern: Clean data access layer
  • Service Layer: Business logic separation
  • Factory Pattern: Service instantiation
  • Singleton Pattern: Global configuration and services
  • Strategy Pattern: Pluggable document loaders

📊 Performance Optimization

Database Tuning

For production, consider these PostgreSQL settings:

-- Increase work memory for vector operations
SET work_mem = '256MB';

-- Tune HNSW index parameters
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);  -- Higher values = better recall, slower build

Embedding Cache

The system caches embeddings in memory. Monitor cache size:

from embeddings import get_embedding_service

service = get_embedding_service()
print(f"Cache size: {service.get_cache_size()}")
service.clear_cache()  # Clear if needed

Connection Pooling

Adjust pool size based on workload:

DB_POOL_SIZE=20
DB_MAX_OVERFLOW=40

🔒 Security Considerations

  1. Database Credentials: Store in .env file (gitignored)
  2. API Authentication: Add authentication middleware for production
  3. Input Validation: All inputs validated via Pydantic
  4. SQL Injection: Protected via parameterized queries
  5. File Upload: Validate file types and sizes

🐛 Troubleshooting

Database Connection Issues

# Check container status
docker ps

# Check logs
docker logs vector_rag_db

# Restart container
docker compose restart

Import Errors

# Ensure virtual environment is activated
source venv/bin/activate

# Reinstall dependencies
pip install -r requirements.txt --upgrade

Memory Issues

# Clear embedding cache
python -c "from embeddings import get_embedding_service; get_embedding_service().clear_cache()"

# Reduce batch size
export EMBEDDING_BATCH_SIZE=16

📈 Monitoring

Health Checks

# CLI
python indexer_v2.py stats

# API
curl http://localhost:8000/health

Database Statistics

-- View document statistics
SELECT * FROM document_stats;

-- Check index usage
SELECT * FROM pg_stat_user_indexes WHERE schemaname = 'public';

-- Database size
SELECT pg_size_pretty(pg_database_size(current_database()));

🔄 Migration from v1

The v2 system is backward compatible with v1 data. To migrate:

  1. Backup existing data:
docker exec vector_rag_db pg_dump -U rag_user rag_vector_db > backup.sql
  2. Update schema:
docker exec -i vector_rag_db psql -U rag_user -d rag_vector_db < init-db.sql
  3. Use new CLI:
# Old: python indexer.py document.pdf
# New: python indexer_v2.py index document.pdf

📝 Development

Code Style

# Format code
black .

# Lint code
ruff check .

# Type checking
mypy .

Adding New Document Loaders

Extend DocumentLoader class in document_processor.py:

class CustomLoader(DocumentLoader):
    def can_load(self, source_uri: str) -> bool:
        return source_uri.endswith('.custom')

    def load(self, source_uri: str) -> List[Document]:
        with open(source_uri, encoding='utf-8') as f:
            text = f.read()
        return [Document(page_content=text)]  # adapt to your Document fields

💬 Feedback & Suggestions

This project does not accept pull requests or code contributions.

We welcome:

  • 🐛 Bug reports - Create an issue with details
  • 💡 Feature suggestions - Share your ideas via issues
  • 📝 Feedback - Let us know what could be better
  • 📧 Direct contact - valginer0@gmail.com for specific inquiries

See CONTRIBUTING.md for details on how to provide feedback.

All development is maintained exclusively by the copyright holder to ensure code quality and full ownership.

🙏 Acknowledgments

  • PostgreSQL and pgvector teams
  • Sentence Transformers library
  • LangChain community
  • FastAPI framework

📞 Support

For issues and questions:

  • Check documentation
  • Review test examples
  • Check API docs at /docs
  • Review logs for error details

Version: 2.11.7
Last Updated: 2026
Status: Production Ready ✅


License

PGVectorRAGIndexer is dual-licensed.

Community License

Free for personal use, education, research, and non-commercial evaluation.
Evaluation within a company is permitted for testing and assessment purposes.

Commercial License

A commercial license is required for:

  • production use
  • CI/CD or automated workflows
  • ongoing internal company use
  • use as part of a paid, hosted, or client-facing service

See COMMERCIAL.md for details on when a commercial license is required and how to obtain one.

If you are unsure whether your use case requires a commercial license, feel free to reach out.


Supporting the project

If you use PGVectorRAGIndexer personally or for learning and would like to support ongoing development, you can sponsor the project via GitHub Sponsors.

Sponsorships are optional and do not grant commercial usage rights.

  • ⭐ Star the repository
  • ❤️ Sponsor on GitHub
  • 📢 Share it with others who may find it useful

Thank you for supporting independent open-source development.

About

Private, local semantic document search (RAG) using pgvector — desktop app + Docker. Commercial licensing available. A robust Retrieval-Augmented Generation (RAG) system that turns PostgreSQL into a high-performance vector database using the pgvector extension, indexing documents from local files, URLs, and other sources for fast contextual search.
