supriya-gdptl/upload_validator

Upload Validator — AI-Powered Content Matching

Validates that a user's text description actually matches what they uploaded.
100% local — no ChatGPT, no external AI APIs. All models run on your machine.

┌─────────────────────────────────────────┐
│  Score ≥ 75 %  →  ✅ Approved            │
│  50 – 74 %     →  ⚠️  Warning (user can  │
│                      improve or submit) │
│  Score < 50 %  →  ❌ Rejected            │
└─────────────────────────────────────────┘

Tech Stack

| Layer      | Technology                        | Purpose                         |
|------------|-----------------------------------|---------------------------------|
| Backend    | Python · FastAPI · Uvicorn        | REST API, file handling         |
| CV/OCR     | Tesseract · PyMuPDF · OpenCV      | Text extraction from files      |
| Embeddings | CLIP (ViT-B/32, local weights)    | Visual ↔ text similarity        |
| NLP        | Sentence-BERT (all-MiniLM-L6-v2)  | Semantic text ↔ OCR similarity  |
| Frontend   | React 18 · Vite · JSX             | Drag-and-drop upload UI         |
| Container  | Docker · Docker Compose           | One-command deployment          |

Project Layout

upload-validator/
├── backend/
│   ├── main.py              # FastAPI app — all CV/NLP logic
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── UploadValidator.jsx   # Main component
│   │   └── main.jsx
│   ├── index.html
│   ├── package.json
│   └── vite.config.js
├── tests/
│   ├── test_validator.py    # Pytest suite (unit + integration)
│   └── generate_samples.py  # Creates sample_data/ test files
├── sample_data/             # Generated by generate_samples.py
└── docker/
    ├── Dockerfile.backend
    ├── Dockerfile.frontend
    └── docker-compose.yml

Quick Start (Docker — recommended)

Prerequisites

  • Docker Desktop ≥ 4.x

# 1. Clone / unzip the project
cd upload-validator

# 2. Build and start both services
docker compose -f docker/docker-compose.yml up --build

# Frontend → http://localhost:3000
# Backend  → http://localhost:8000

The first run downloads about 600 MB of model weights (CLIP + SBERT).
The weights are cached in the container layer, so later starts skip the download.


Quick Start (Local / No Docker)

Prerequisites

| Tool          | Version | Install    |
|---------------|---------|------------|
| Python        | 3.10+   | python.org |
| Tesseract OCR | 5.x     | see below  |
| Node.js       | 18+     | nodejs.org |

Install Tesseract:

# macOS
brew install tesseract

# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-eng

# Windows — download installer from:
# https://github.com/UB-Mannheim/tesseract/wiki
# then add install dir to PATH

Backend

cd upload-validator/backend

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate

pip install -r requirements.txt

uvicorn main:app --reload --port 8000
# → http://localhost:8000
# → http://localhost:8000/docs   (Swagger UI)

Frontend

cd upload-validator/frontend

npm install
npm run dev
# → http://localhost:3000

Running the Tests

cd upload-validator

# Install test deps (if not already done)
pip install pytest httpx PyMuPDF

# Run the full suite
pytest tests/test_validator.py -v

What the tests cover

| Class                | What it tests                                            |
|----------------------|----------------------------------------------------------|
| TestCombineScores    | Score weighting math                                     |
| TestThresholdLogic   | approved / warning / rejected at exact boundaries        |
| TestHealthEndpoint   | /health returns 200                                      |
| TestValidateEndpoint | Empty desc → 400, bad MIME → 415, PNG/PDF/MP4 routing    |
| TestScoreAccuracy    | High similarity → high score, low → low (mocked models)  |

All tests mock CLIP and SBERT so they run without GPU or internet access.


Manual Testing with Sample Files

# Generate sample files
python tests/generate_samples.py
# Creates: sample_data/red_square.png, invoice.pdf, sample_video.mp4, etc.

Open http://localhost:3000 and try these combinations:

| File            | Good description (expect ≥75%)              | Bad description (expect <50%) |
|-----------------|---------------------------------------------|-------------------------------|
| red_square.png  | "a red coloured square image"               | "quarterly financial report"  |
| invoice.pdf     | "an invoice with payment amount"            | "a photo of a sunset"         |
| contract.pdf    | "a service agreement between two companies" | "cat video compilation"       |
| blue_square.png | "a blue square illustration"                | "legal contract document"     |

API test with curl

# Good match — should be approved
curl -X POST http://localhost:8000/validate \
  -F "file=@sample_data/red_square.png" \
  -F "description=a red square image"

# Bad match — should be rejected
curl -X POST http://localhost:8000/validate \
  -F "file=@sample_data/invoice.pdf" \
  -F "description=a video about cats"
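
The same request can be sent from Python without extra dependencies. The sketch below builds the multipart body by hand and posts it with the standard library. The helper names `build_multipart` and `validate` are made up for illustration, and it assumes the endpoint answers with JSON (FastAPI's default), which this README does not spell out.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(file_path, description):
    """Encode the `description` and `file` form fields as multipart/form-data.

    Returns (body_bytes, content_type_header). Hypothetical helper.
    """
    path = Path(file_path)
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"

    description_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="description"\r\n\r\n'
        f"{description}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()

    body = description_part + file_header + path.read_bytes() + b"\r\n" + closing
    return body, f"multipart/form-data; boundary={boundary}"


def validate(file_path, description, base_url="http://localhost:8000"):
    """POST a file and description to /validate and return the parsed reply.

    Assumes a JSON response body; adjust if the backend returns something else.
    """
    body, content_type = build_multipart(file_path, description)
    request = urllib.request.Request(
        f"{base_url}/validate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the backend running, `validate("sample_data/red_square.png", "a red square image")` mirrors the first curl call above. Non-2xx replies such as 400 or 415 surface as `urllib.error.HTTPError`.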

How the Scoring Works

Image / PDF:
  CLIP score  (visual content ↔ description)   × 0.6
+ SBERT score (OCR text ↔ description)         × 0.4
= final %

Video (.mp4, .mov, .avi, .mkv, .webm):
  Keyframes extracted (OpenCV) → same pipeline as images
  CLIP applied to each keyframe → max similarity used

If CLIP is unavailable (import error), the system falls back to OCR + SBERT only.
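
The weighting and the fallback can be sketched in a few lines. This is not the code from backend/main.py. The `combine_scores` signature is borrowed from the snippet in this README, but the body, the `None` convention for a missing CLIP score, and the helper `video_clip_score` are assumptions. It also assumes both similarities have already been scaled to the 0 to 1 range.

```python
from typing import Optional, Sequence


def combine_scores(clip_sim: Optional[float], text_sim: float,
                   clip_weight: float = 0.6, text_weight: float = 0.4) -> float:
    """Blend two similarities in [0, 1] into a percentage.

    Assumption: clip_sim is None when CLIP could not be loaded.
    """
    if clip_sim is None:
        # Fallback path: only the OCR + SBERT similarity is available.
        return round(100 * text_sim, 1)
    blended = clip_weight * clip_sim + text_weight * text_sim
    return round(100 * blended, 1)


def video_clip_score(keyframe_sims: Sequence[float]) -> float:
    """Take the best CLIP similarity across keyframes (0.0 if there are none)."""
    return max(keyframe_sims, default=0.0)


# Image or PDF: visual similarity 0.9, OCR text similarity 0.5.
print(combine_scores(0.9, 0.5))
# Video: the best keyframe drives the visual part of the score.
print(combine_scores(video_clip_score([0.2, 0.8, 0.5]), 0.5))
# CLIP missing: the score rests on the text similarity alone.
print(combine_scores(None, 0.5))
```

Taking the maximum over keyframes means one matching frame is enough for a high visual score, so a long video with one relevant scene still passes.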


Adjusting Thresholds

Edit the decision block in backend/main.py:

if score >= 75:      # ← change to e.g. 80 for stricter approval
    decision = "approved"
elif score >= 50:    # ← change to e.g. 60 for stricter warning
    decision = "warning"
else:
    decision = "rejected"
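
If the thresholds change often, one option is to lift them into parameters instead of editing the literals. The function below is a hypothetical refactoring of the block above, not code that exists in backend/main.py. The names `decide`, `approve_at`, and `warn_at` are made up for illustration.

```python
def decide(score: float, approve_at: float = 75.0, warn_at: float = 50.0) -> str:
    """Map a percentage score to a decision using inclusive lower bounds."""
    if warn_at > approve_at:
        raise ValueError("warn_at must not exceed approve_at")
    if score >= approve_at:
        return "approved"
    if score >= warn_at:
        return "warning"
    return "rejected"


# Default thresholds: the boundaries themselves fall into the higher band.
print(decide(75.0))                   # approved
print(decide(50.0))                   # warning
print(decide(49.9))                   # rejected
# Stricter approval: the same score of 78 now only earns a warning.
print(decide(78.0, approve_at=80.0))  # warning
```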

Adjust model weights:

# In analyse_image / analyse_pdf / analyse_video
score = combine_scores(clip_sim, text_sim,
                       clip_weight=0.6,   # ← visual weight
                       text_weight=0.4)   # ← OCR/text weight

Supported File Types

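| Type  | Extensions                        | Analysis method                               |
|-------|-----------------------------------|-----------------------------------------------|
| Image | .jpg .jpeg .png .webp .gif .bmp   | CLIP + Tesseract OCR                          |
| PDF   | .pdf                              | PyMuPDF text extraction + page renders → CLIP |
| Video | .mp4 .mov .avi .mkv .webm         | OpenCV keyframes → CLIP                       |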
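
As an illustration of how an upload might be routed to one of these three paths, here is a minimal sketch that classifies by file extension. It is an assumption, not the backend's logic. The test suite mentions a 415 for a bad MIME type, so the real service may inspect the content type instead, and the name `classify_upload` is hypothetical.

```python
from pathlib import Path

# Extension sets copied from the table above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".bmp"}
PDF_EXTS = {".pdf"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".webm"}


def classify_upload(filename: str) -> str:
    """Return "image", "pdf", or "video"; raise ValueError for anything else."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in PDF_EXTS:
        return "pdf"
    if ext in VIDEO_EXTS:
        return "video"
    # A web handler would typically translate this into HTTP 415.
    raise ValueError(f"unsupported file type: {ext or '(none)'}")


print(classify_upload("red_square.PNG"))    # image
print(classify_upload("invoice.pdf"))       # pdf
print(classify_upload("sample_video.mp4"))  # video
```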
