Converts scanned PDF files to plain text using Tesseract OCR.
- PDF → individual JPEG frames (via pdf2image)
- Each frame → Tesseract OCR → raw text
- Text chunks assembled into a single output file per PDF
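The three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of ocr.py: the function names `assemble_text` and `ocr_pdf`, the `dpi=300` default, and the blank-line page separator are all assumptions made for the example. The OCR libraries are imported inside the function so the text-joining helper can be used on its own.

```python
from pathlib import Path


def assemble_text(chunks):
    """Join per-page text chunks into one string, separated by a blank line."""
    return "\n\n".join(chunk.strip() for chunk in chunks)


def ocr_pdf(pdf_path, temp_dir="temp", output_dir="output", dpi=300):
    """Hypothetical pipeline: PDF -> JPEG frames -> Tesseract -> one .txt file."""
    # Imported lazily; both packages must be installed for this function to run.
    from pdf2image import convert_from_path
    import pytesseract

    pdf_path = Path(pdf_path)
    temp_dir, output_dir = Path(temp_dir), Path(output_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: render each page to a JPEG in temp_dir and get the file paths back.
    frame_paths = convert_from_path(
        str(pdf_path),
        dpi=dpi,
        output_folder=str(temp_dir),
        fmt="jpeg",
        paths_only=True,
    )

    # Step 2: run Tesseract on each frame, in page order.
    chunks = [pytesseract.image_to_string(str(p)) for p in frame_paths]

    # Step 3: write one text file per PDF.
    out_path = output_dir / (pdf_path.stem + ".txt")
    out_path.write_text(assemble_text(chunks), encoding="utf-8")
    return out_path
```

A call such as `ocr_pdf("input/scan.pdf")` (with a hypothetical file name) would then write `output/scan.txt`.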
.
├── input/ # place input PDF files here
├── temp/ # intermediate JPEGs (auto-cleaned after run)
├── output/ # extracted .txt files land here
├── ocr.py # main script
└── README.txt # setup notes and common errors
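The auto-clean of temp/ could look something like the sketch below. The helper name `clean_temp` is hypothetical, and the assumption that intermediate frames end in .jpg or .jpeg is made for illustration; the actual cleanup in ocr.py may differ.

```python
from pathlib import Path


def clean_temp(temp_dir="temp"):
    """Delete intermediate JPEG frames and return how many files were removed."""
    removed = 0
    for pattern in ("*.jpg", "*.jpeg"):
        for frame in Path(temp_dir).glob(pattern):
            frame.unlink()
            removed += 1
    return removed
```

Only JPEG files are touched, so anything else placed in temp/ is left alone.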
sudo apt-get install tesseract-ocr poppler-utils # Linux
brew install tesseract poppler # macOS
pip install pytesseract pdf2image Pillow

# Place PDFs in input/, then:
python3 ocr.py
# Output .txt files appear in output/

- Works best on scanned documents with clean, high-contrast text
- Multi-column layouts may need post-processing
- Language pack: defaults to English (eng); set TESSDATA_PREFIX for others
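As a hedged illustration of the language note, the sketch below passes a language code to pytesseract and optionally points TESSDATA_PREFIX at a directory of language packs. The helper names `lang_arg` and `ocr_image` and the `tessdata_dir` parameter are assumptions for this example, not part of ocr.py. It assumes the requested language pack (for example deu) is already installed.

```python
import os


def lang_arg(langs):
    """Build Tesseract's language string, e.g. ["eng", "deu"] -> "eng+deu"."""
    return "+".join(langs)


def ocr_image(image_path, langs=("eng",), tessdata_dir=None):
    """Hypothetical helper: OCR one image with the given language codes."""
    # Imported lazily; pytesseract and the tesseract binary must be installed.
    import pytesseract

    if tessdata_dir is not None:
        # Tesseract reads this variable to locate its language data files.
        os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
    return pytesseract.image_to_string(str(image_path), lang=lang_arg(langs))
```

Tesseract accepts several languages joined with `+`, so `ocr_image("temp/page.jpg", langs=("eng", "deu"))` (with a hypothetical file name) would try both packs on the same frame.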