gkalidas/python-ocr-pdftotext

python-ocr-pdftotext

Converts scanned PDF files to plain text using Tesseract OCR.

How it works

  1. PDF → individual JPEG frames (via pdf2image)
  2. Each frame → Tesseract OCR → raw text
  3. Text chunks assembled into a single output file per PDF
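The three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of ocr.py: the helper names (join_pages, ocr_pdf, convert_folder) and the 300 dpi setting are assumptions, and pages are kept in memory here rather than written as JPEGs to temp/.

```python
from pathlib import Path


def join_pages(page_texts):
    """Assemble per-page OCR chunks into one document, separated by blank lines."""
    return "\n\n".join(chunk.strip() for chunk in page_texts)


def ocr_pdf(pdf_path, dpi=300):
    """Render each page of a PDF and OCR it (hypothetical helper, not from ocr.py)."""
    # Imported lazily so join_pages can be used without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    return join_pages(pytesseract.image_to_string(page) for page in pages)


def convert_folder(input_dir="input", output_dir="output"):
    """Write one .txt file per PDF found in input_dir."""
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        (out / f"{pdf.stem}.txt").write_text(ocr_pdf(pdf), encoding="utf-8")
```

Calling convert_folder() with the default arguments would mirror the input/ and output/ layout described in this README, assuming Tesseract and Poppler are installed.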

Project layout

.
├── input/          # place input PDF files here
├── temp/           # intermediate JPEGs (auto-cleaned after run)
├── output/         # extracted .txt files land here
├── ocr.py          # main script
└── README.txt      # setup notes and common errors

Requirements

sudo apt-get install tesseract-ocr poppler-utils   # Linux
brew install tesseract poppler                      # macOS
pip install pytesseract pdf2image Pillow

Usage

# Place PDFs in input/, then:
python3 ocr.py
# Output .txt files appear in output/

Notes

  • Works best on scanned documents with clean, high-contrast text
  • Multi-column layouts may need post-processing
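  • Language pack: defaults to English (eng); other languages need the matching Tesseract language data installed. TESSDATA_PREFIX only tells Tesseract where the .traineddata files live; the language itself is chosen through the lang argument of pytesseract.image_to_string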
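As a hedged illustration of the language note, the sketch below shows one way to pass a non-English language to pytesseract. The helper names (tesseract_lang, ocr_image) and the tessdata_dir parameter are hypothetical and not part of ocr.py. Tesseract accepts several languages joined with "+", and TESSDATA_PREFIX is only needed when the language data sits outside Tesseract's default directory.

```python
import os


def tesseract_lang(codes=("eng",)):
    """Join language codes the way Tesseract expects, e.g. 'eng+deu'."""
    return "+".join(codes)


def ocr_image(image, codes=("eng",), tessdata_dir=None):
    """OCR one PIL image in the given languages (hypothetical helper)."""
    import pytesseract  # lazy import: only needed when OCR actually runs

    if tessdata_dir is not None:
        # Point Tesseract at a non-default directory of .traineddata files.
        os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
    return pytesseract.image_to_string(image, lang=tesseract_lang(codes))
```

For example, ocr_image(page, codes=("eng", "deu")) would request English and German together, provided both language packs are installed.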

About

PDF to text converter using Tesseract.
