Multi-Engine OCR, Document Parsing, and NLP Analysis Platform
TEXTOR is a document intelligence platform that extracts, analyzes, and summarizes text from images, PDFs, and handwritten documents. It combines multiple OCR engines with NLP pipelines to handle printed text, handwriting, and complex document layouts.
- PaddleOCR — Primary engine for printed text (multilingual)
- EasyOCR — Secondary engine with broad language support
- Tesseract — Fallback engine (English + Hindi)
- Automatic engine selection based on document type
- Kraken — Page segmentation and baseline detection
- TrOCR (Microsoft) — Transformer-based handwriting recognition
- Specialized for doctor prescriptions and handwritten notes
- Docling — PDF/document layout analysis
- Outputs structured Markdown and JSON
- Handles multi-column layouts, tables, and figures
- Sumy TextRank — Extractive text summarization
- KeyBERT + YAKE — Keyword extraction (hybrid approach)
- Gemini AI — Intelligent text correction and analysis
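
The automatic engine selection listed above could be approached along these lines. This is a minimal sketch, not TEXTOR's actual routing code: the `ENGINE_ORDER` table, the document-type labels, and the `run_ocr` helper are all hypothetical names, and each engine is modelled as a plain callable rather than a real PaddleOCR, EasyOCR, or Tesseract call. Only the printed-text order (PaddleOCR, then EasyOCR, then Tesseract) comes from the list above.

```python
from typing import Callable, Dict, List, Optional

# Assumption: an engine is any callable mapping an image path to recognised text.
Engine = Callable[[str], str]

# Hypothetical routing table. The printed-text order follows the README
# (primary, secondary, fallback); the other labels are illustrative.
ENGINE_ORDER: Dict[str, List[str]] = {
    "printed": ["paddleocr", "easyocr", "tesseract"],
    "handwritten": ["trocr"],
    "pdf": ["docling"],
}


def run_ocr(path: str, doc_type: str, engines: Dict[str, Engine]) -> Dict[str, Optional[str]]:
    """Try each engine registered for doc_type in order; return the first non-empty result."""
    # Unknown document types fall back to the printed-text chain.
    for name in ENGINE_ORDER.get(doc_type, ENGINE_ORDER["printed"]):
        engine = engines.get(name)
        if engine is None:
            continue  # engine not installed or not configured
        try:
            text = engine(path)
        except Exception:
            continue  # treat any engine error as "try the next one"
        if text.strip():
            return {"engine": name, "text": text}
    return {"engine": None, "text": None}


if __name__ == "__main__":
    # Stub engines stand in for the real libraries.
    def broken(_path: str) -> str:
        raise RuntimeError("model weights missing")

    stubs = {"paddleocr": broken, "easyocr": lambda p: "", "tesseract": lambda p: "hello"}
    print(run_ocr("scan.png", "printed", stubs))
```

Keeping the engines behind a common callable signature means a failing or missing engine only costs one loop iteration, which is what makes a fallback chain cheap to extend.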
| Layer | Technology |
|---|---|
| Backend API | FastAPI, Uvicorn, Python |
| OCR Engines | PaddleOCR, EasyOCR, Tesseract |
| Handwriting | Kraken, TrOCR (HuggingFace Transformers) |
| Document AI | Docling (PDF parsing) |
| NLP | Sumy, KeyBERT, YAKE |
| AI | Google Gemini (text correction) |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS, shadcn/ui |
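
One way the hybrid KeyBERT + YAKE keyword extraction might combine its two result lists is sketched below. The merge strategy (min-max normalisation followed by averaging) and the `merge_keywords` helper are assumptions for illustration, not the project's confirmed method. The sketch takes already-computed `(keyword, score)` pairs as input, so it calls neither library. It relies on the fact that KeyBERT scores are similarities (higher is better) while YAKE scores are better when lower.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every key maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def merge_keywords(keybert: Scored, yake: Scored, top_n: int = 5) -> Scored:
    """Average normalised KeyBERT and YAKE scores; a keyword missing from one list scores 0 there."""
    kb = _min_max({k.lower(): s for k, s in keybert})
    # YAKE ranks lower scores as more relevant, so negate before normalising.
    yk = _min_max({k.lower(): -s for k, s in yake})
    combined = {k: (kb.get(k, 0.0) + yk.get(k, 0.0)) / 2 for k in set(kb) | set(yk)}
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Made-up scores, purely to exercise the function.
    keybert_hits = [("ocr engine", 0.8), ("handwriting", 0.6), ("pdf", 0.4)]
    yake_hits = [("ocr engine", 0.02), ("layout", 0.05), ("pdf", 0.10)]
    for keyword, score in merge_keywords(keybert_hits, yake_hits):
        print(f"{keyword}: {score:.3f}")
```

Averaging rewards keywords that both extractors agree on, while still letting a strong hit from a single extractor appear lower in the ranking.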
TEXTOR/
├── api/                      # FastAPI backend
│   ├── app/
│   │   ├── main.py           # API routes and endpoints
│   │   └── settings.py       # Configuration
│   ├── scripts/
│   │   └── fetch_models.py   # Download model weights
│   └── requirements.txt
├── app/                      # React frontend
│   ├── src/
│   │   ├── components/       # UI components
│   │   ├── pages/            # Route pages
│   │   └── hooks/            # Custom hooks
│   └── package.json
└── notebooks/                # Research & experiments
    ├── trocr.ipynb           # TrOCR handwriting recognition
    └── doctors-handwritten-prescription.ipynb
cd api
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

cd app
npm install
npm run dev

API docs at http://localhost:8000/docs. Frontend at http://localhost:5173.