rupeshbharambe24/TEXTOR

TEXTOR

Multi-Engine OCR, Document Parsing, and NLP Analysis Platform


Overview

TEXTOR is a document intelligence platform that extracts, analyzes, and summarizes text from images, PDFs, and handwritten documents. It combines multiple OCR engines with NLP pipelines to handle printed text, handwriting, and complex document layouts.

Features

Multi-Engine OCR

  • PaddleOCR — Primary engine for printed text (multilingual)
  • EasyOCR — Secondary engine with broad language support
  • Tesseract — Fallback engine (English + Hindi)
  • Automatic engine selection based on document type
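
The primary/secondary/fallback ordering above can be sketched as a simple priority chain. This is a minimal illustration, not TEXTOR's actual selection logic: the engine callables are hypothetical stand-ins that return `(text, confidence)`, and the confidence threshold is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical engine interface: takes an image path, returns (text, confidence in [0, 1]).
# Real PaddleOCR / EasyOCR / Tesseract calls would be wrapped to fit this shape.
Engine = Callable[[str], Tuple[str, float]]


def run_with_fallback(
    image_path: str,
    engines: List[Tuple[str, Engine]],
    min_confidence: float = 0.6,  # assumed threshold, not taken from the project
) -> Tuple[str, str, float]:
    """Try engines in priority order and return (engine_name, text, confidence).

    The first result at or above min_confidence wins. If none qualifies,
    the most confident result seen is returned instead.
    """
    best: Optional[Tuple[str, str, float]] = None
    for name, engine in engines:
        try:
            text, confidence = engine(image_path)
        except Exception:
            # An engine that crashes is skipped so the next one can try.
            continue
        if confidence >= min_confidence:
            return name, text, confidence
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
    if best is None:
        raise RuntimeError("all OCR engines failed")
    return best
```

Passing the engines as `[("paddleocr", ...), ("easyocr", ...), ("tesseract", ...)]` reproduces the order listed above. Selecting by document type would add a lookup step before this loop.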

Handwriting Recognition

  • Kraken — Page segmentation and baseline detection
  • TrOCR (Microsoft) — Transformer-based handwriting recognition
  • Specialized for doctor prescriptions and handwritten notes
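
As a rough sketch of the recognition step, the function below runs TrOCR on a single cropped line image through HuggingFace Transformers. TrOCR expects one text line per image, which is why a segmentation stage such as Kraken comes first. The function name and the `microsoft/trocr-base-handwritten` checkpoint are assumptions; the project may use a different or fine-tuned model.

```python
def recognize_line(
    image_path: str,
    checkpoint: str = "microsoft/trocr-base-handwritten",  # assumed checkpoint
) -> str:
    """Return the text TrOCR predicts for one cropped handwriting line."""
    # Imported here so this module loads even when the heavy
    # dependencies (Pillow, transformers, torch) are not installed.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # TrOCR expects an RGB image of a single text line.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

In a real service the processor and model would be loaded once at startup rather than on every call.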

Document Parsing

  • Docling — PDF/document layout analysis
  • Outputs structured Markdown and JSON
  • Handles multi-column layouts, tables, and figures
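
For illustration, a PDF can be converted to Markdown and a JSON-ready dictionary with Docling's `DocumentConverter`. This is a minimal sketch that assumes default converter settings; `parse_document` is a hypothetical helper name, not a function from the TEXTOR codebase.

```python
from typing import Tuple


def parse_document(source: str) -> Tuple[str, dict]:
    """Convert a document (local path or URL) into (markdown, dict) form."""
    # Imported lazily so the module loads without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    document = result.document
    # Markdown for display, a plain dict for JSON serialization.
    return document.export_to_markdown(), document.export_to_dict()
```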

NLP Analysis

  • Sumy TextRank — Extractive text summarization
  • KeyBERT + YAKE — Keyword extraction (hybrid approach)
  • Gemini AI — Intelligent text correction and analysis
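
One possible way to combine the two keyword extractors is to rescale their scores to a common range and take a weighted average. The sketch below is an assumption about what "hybrid" could mean here, not the project's confirmed method. It relies on the fact that KeyBERT scores are similarities (higher is better) while YAKE scores are lower for more relevant keywords. Inputs are plain dictionaries, such as `dict(...)` applied to each library's list of `(keyword, score)` pairs.

```python
from typing import Dict, List, Tuple


def _normalize(scores: Dict[str, float], higher_is_better: bool) -> Dict[str, float]:
    """Rescale scores to [0, 1] so that 1.0 always means most relevant."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    normalized = {}
    for keyword, score in scores.items():
        if span == 0:
            # All scores are equal, so treat every keyword as equally relevant.
            normalized[keyword.lower()] = 1.0
            continue
        unit = (score - low) / span
        normalized[keyword.lower()] = unit if higher_is_better else 1.0 - unit
    return normalized


def merge_keywords(
    keybert_scores: Dict[str, float],
    yake_scores: Dict[str, float],
    top_n: int = 5,
    keybert_weight: float = 0.5,  # assumed equal weighting
) -> List[Tuple[str, float]]:
    """Blend both rankings into one list of (keyword, combined_score)."""
    keybert = _normalize(keybert_scores, higher_is_better=True)
    yake = _normalize(yake_scores, higher_is_better=False)
    combined = {
        keyword: keybert_weight * keybert.get(keyword, 0.0)
        + (1.0 - keybert_weight) * yake.get(keyword, 0.0)
        for keyword in set(keybert) | set(yake)
    }
    # Highest score first, ties broken alphabetically for stable output.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

A keyword found by only one extractor contributes zero from the other, so terms both methods agree on rise to the top.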

Tech Stack

| Layer | Technology |
|---|---|
| Backend API | FastAPI, Uvicorn, Python |
| OCR Engines | PaddleOCR, EasyOCR, Tesseract |
| Handwriting | Kraken, TrOCR (HuggingFace Transformers) |
| Document AI | Docling (PDF parsing) |
| NLP | Sumy, KeyBERT, YAKE |
| AI | Google Gemini (text correction) |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS, shadcn/ui |

Project Structure

TEXTOR/
├── api/                    # FastAPI backend
│   ├── app/
│   │   ├── main.py         # API routes and endpoints
│   │   └── settings.py     # Configuration
│   ├── scripts/
│   │   └── fetch_models.py # Download model weights
│   └── requirements.txt
├── app/                    # React frontend
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Route pages
│   │   └── hooks/          # Custom hooks
│   └── package.json
└── notebooks/              # Research & experiments
    ├── trocr.ipynb         # TrOCR handwriting recognition
    └── doctors-handwritten-prescription.ipynb

Getting Started

Backend

cd api
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Frontend

cd app
npm install
npm run dev

With both servers running, the API docs are available at http://localhost:8000/docs and the frontend at http://localhost:5173.
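
To check which endpoints the backend exposes without opening the browser, the OpenAPI schema can be read directly. This sketch assumes FastAPI's default schema URL (`/openapi.json`) has not been changed in the project settings. The helper names are illustrative.

```python
import json
from typing import List, Tuple
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_routes(schema: dict) -> List[Tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)


def fetch_schema(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI schema from a running backend."""
    with urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)


if __name__ == "__main__":
    for method, path in list_routes(fetch_schema()):
        print(method, path)
```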
