python-ocr-pdftotext

Converts scanned PDF files to plain text using Tesseract OCR.

How it works

1. PDF → one JPEG image per page (via pdf2image)
2. Each page image → Tesseract OCR → raw text
3. Per-page text assembled into a single output file per PDF
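
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of ocr.py: the function names (render_pages, ocr_page, assemble, pdf_to_text), the 300 DPI setting, and the blank-line separator between pages are all assumptions. Only convert_from_path (pdf2image) and image_to_string (pytesseract) are real library calls.

```python
from pathlib import Path
from typing import Callable, Iterable


def render_pages(pdf_path: Path, dpi: int = 300) -> list:
    """Rasterize each PDF page to a PIL image (requires poppler on the system)."""
    from pdf2image import convert_from_path  # imported lazily so the module loads without it

    return convert_from_path(str(pdf_path), dpi=dpi)


def ocr_page(image) -> str:
    """Run Tesseract on one page image and return the raw text."""
    import pytesseract  # imported lazily; needs the tesseract binary at call time

    return pytesseract.image_to_string(image, lang="eng")


def assemble(chunks: Iterable[str]) -> str:
    """Join per-page text, separated by one blank line (an assumed convention)."""
    return "\n\n".join(chunk.strip() for chunk in chunks)


def pdf_to_text(
    pdf_path: Path,
    render: Callable = render_pages,
    ocr: Callable = ocr_page,
) -> str:
    """PDF -> page images -> OCR text per page -> one combined string."""
    return assemble(ocr(page) for page in render(pdf_path))
```

Passing render and ocr as parameters is a design choice made here so the assembly logic can be exercised without Tesseract or poppler installed; the real script may be structured differently.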

Project layout

.
├── input/          # place input PDF files here
├── temp/           # intermediate JPEGs (auto-cleaned after run)
├── output/         # extracted .txt files land here
├── ocr.py          # main script
└── README.txt      # setup notes and common errors
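
As an illustration of how these directories might fit together, here is a hedged sketch of a batch loop. The names run_batch and clear_temp, the "*.jpg" pattern, and the convert callback are hypothetical; the actual ocr.py may handle cleanup and naming differently.

```python
from pathlib import Path
from typing import Callable, List


def clear_temp(temp_dir: Path) -> None:
    """Delete intermediate JPEGs but keep the temp/ directory itself."""
    for jpeg in temp_dir.glob("*.jpg"):
        jpeg.unlink()


def run_batch(
    input_dir: Path,
    temp_dir: Path,
    output_dir: Path,
    convert: Callable[[Path, Path], str],
) -> List[Path]:
    """Convert every PDF in input_dir and write <name>.txt into output_dir.

    convert(pdf_path, temp_dir) is a caller-supplied function (an assumption
    of this sketch) that returns the extracted text for one PDF.
    """
    temp_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            text = convert(pdf_path, temp_dir)
        finally:
            # Clean up even if OCR fails partway through a document.
            clear_temp(temp_dir)
        target = output_dir / f"{pdf_path.stem}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Note that glob("*.pdf") is case-sensitive on Linux, so a file named SCAN.PDF would be skipped by this sketch.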

Requirements

sudo apt-get install tesseract-ocr poppler-utils   # Linux
brew install tesseract poppler                      # macOS
pip install pytesseract pdf2image Pillow
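
To confirm the installs above took effect, a small check along these lines may help. It is a sketch, not part of the project: missing_dependencies is a hypothetical helper, and it assumes that pdftoppm (shipped with poppler) and tesseract should be on PATH.

```python
import importlib.util
import shutil


def missing_dependencies(
    binaries=("tesseract", "pdftoppm"),
    modules=("pytesseract", "pdf2image", "PIL"),
):
    """Return the names of required executables and Python modules not found."""
    # shutil.which returns None when an executable is not on PATH.
    missing = [name for name in binaries if shutil.which(name) is None]
    # find_spec returns None when a top-level module cannot be imported.
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

Pillow installs under the module name PIL, which is why the check looks for PIL rather than Pillow.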

Usage

# Place PDFs in input/, then:
python3 ocr.py
# Output .txt files appear in output/

Notes

Works best on scanned documents with clean, high-contrast text
Multi-column layouts may need post-processing
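Language: defaults to English (eng). Other languages need the matching Tesseract language pack installed; set TESSDATA_PREFIX if the language data lives outside the default location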
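
For non-English documents, the language is passed to Tesseract by code, and several codes can be combined with "+". The sketch below shows one way this could be wired up with pytesseract; build_lang_arg and ocr_with_languages are hypothetical helpers, and whether ocr.py exposes a language setting at all is not stated here.

```python
import os
from typing import Iterable, Optional


def build_lang_arg(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', e.g. ['eng', 'deu'] -> 'eng+deu'."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_with_languages(image, codes=("eng",), tessdata_dir: Optional[str] = None) -> str:
    """OCR one image with the given languages.

    tessdata_dir, if given, is exported as TESSDATA_PREFIX so Tesseract can
    locate language data stored outside its default directory. The exact
    directory it should point to depends on the Tesseract version.
    """
    import pytesseract  # imported lazily; needs the tesseract binary at call time

    if tessdata_dir is not None:
        os.environ["TESSDATA_PREFIX"] = tessdata_dir
    return pytesseract.image_to_string(image, lang=build_lang_arg(codes))
```

Each code must correspond to an installed language pack (for example, the tesseract-ocr-deu package on Debian-based systems for German).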
