Break multi-page PDFs into clean, OCR-ready images — from the command line or a self-hosted web UI.
- CLI-first — run `python main.py document.pdf` with no server required
- Watch folder — drop PDFs into an inbox directory; PageForge processes them automatically on each sync cycle
- Web UI with drag-and-drop upload, file browser, coverage calendar, and per-file ZIP download
- Google Drive optional — connect a Drive folder to pull PDFs automatically (requires a GCP service account)
- High-contrast B&W output — autocontrast, double sharpen, and binarize for maximum OCR accuracy
- 300 DPI by default — configurable from 72 to 600 DPI
- Configurable — all settings adjustable via environment variables, a JSON config file, or the web Settings tab
- Self-hosted — runs entirely on your machine; no external services required unless you enable Drive sync

```bash
# Single file — output next to the source
python main.py document.pdf

# Glob — batch convert
python main.py *.pdf

# Custom output directory
python main.py scan.pdf --output /path/to/images/

# JPEG output with quality control (much smaller files)
python main.py scan.pdf --format jpeg --quality 85

# Grayscale — no binarization, good for photos or mixed documents
python main.py scan.pdf --mode grayscale

# Full color output
python main.py scan.pdf --mode color --format jpeg

# Override DPI and threshold inline
python main.py scan.pdf --dpi 150 --threshold 180
```

Output images are written to `{stem}_pages/` by default, or to the directory specified with `--output`.

```bash
# Quote the volume argument so paths containing spaces still work
docker run --rm -v "$(pwd)/data:/data" -p 8000:8000 pageforge
```

Place PDFs in `./data/inbox/` and open http://localhost:8000 in your browser.

```bash
cp docker-compose.example.yml docker-compose.yml
docker compose up -d
```

Edit `docker-compose.yml` to uncomment and set environment variables as needed, then run `docker compose up -d` again to apply the changes.
Each PDF page passes through a five-step image processing pipeline:
- Grayscale render — PyMuPDF renders the page at the configured DPI directly into a grayscale pixel buffer, skipping color conversion overhead.
- Autocontrast — Pillow's `ImageOps.autocontrast` stretches the histogram with a 2% cutoff, compensating for yellowed paper or uneven scan exposure.
- Double sharpen — Two passes of `ImageFilter.SHARPEN` enhance edge definition so character strokes are crisp.
- Binarize — A point transform converts every pixel below the threshold to pure black and every pixel at or above it to pure white, producing a true 1-bit-style image in an 8-bit container.
- PNG optimize — The result is saved with `optimize=True`, letting the PNG encoder find the smallest lossless encoding of the high-contrast data.
The output is a sequence of small, high-contrast PNG files that OCR engines and AI vision models handle reliably.
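
The enhancement steps above can be sketched with Pillow roughly as follows. This is an illustration under stated assumptions, not PageForge's actual source: the helper names `binarize_lut` and `enhance_page` are hypothetical, and the input is assumed to be an 8-bit grayscale (`L` mode) Pillow image that has already been rendered from the PDF.

```python
def binarize_lut(threshold: int) -> list[int]:
    """Build a 256-entry lookup table: below threshold -> 0, at or above -> 255."""
    return [0 if value < threshold else 255 for value in range(256)]


def enhance_page(gray_image, threshold: int = 160):
    """Autocontrast, sharpen twice, then binarize an 8-bit grayscale Pillow image.

    Hypothetical helper. Pillow is imported inside the function so that
    binarize_lut stays usable on its own.
    """
    from PIL import ImageFilter, ImageOps

    # Clip 2% of pixels at each end of the histogram, then stretch the rest.
    image = ImageOps.autocontrast(gray_image, cutoff=2)
    # Two passes of the built-in sharpen kernel.
    for _ in range(2):
        image = image.filter(ImageFilter.SHARPEN)
    # Map every pixel through the lookup table (the point transform).
    return image.point(binarize_lut(threshold))
```

Saving the returned image with `image.save(path, optimize=True)` would correspond to the final step of the pipeline.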
All settings are optional. PageForge works with zero configuration.
| Name | Default | Description |
|---|---|---|
| `PAGEFORGE_INBOX` | `/data/inbox` | Local folder watched for incoming PDFs |
| `PAGEFORGE_OUTPUT` | `/data/output` | Root directory where page images are saved |
| `PAGEFORGE_DPI` | `300` | Render resolution; higher values increase file size and quality |
| `PAGEFORGE_THRESHOLD` | `160` | Binarize cutoff (0–255); pixels below it become black, so higher values turn more pixels black |
| `PAGEFORGE_MODE` | `bw` | `bw` (binarized, best for OCR), `grayscale` (no binarize), or `color` (full color) |
| `PAGEFORGE_FORMAT` | `png` | Output format: `png` (lossless) or `jpeg` (smaller files) |
| `PAGEFORGE_JPEG_QUALITY` | `85` | JPEG quality 1–95; only applies when format is `jpeg` |
| `PAGEFORGE_ENABLE_UPLOAD` | `true` | Set to `false` to disable the web upload endpoint |
| `SYNC_INTERVAL_MINUTES` | `30` | How often (in minutes) to poll the inbox and Drive folder |
| `DRIVE_FOLDER_ID` | (empty) | Google Drive folder ID; leave blank to disable Drive sync |
| `PAGEFORGE_CONFIG` | `/data/config.json` | Path to the JSON config file (overrides defaults, overridden by env vars) |

Settings can also be changed at runtime via the web Settings tab, which writes to config.json.
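
The precedence described above (defaults, then the JSON config file, then environment variables) can be sketched as below. This is a minimal illustration, not PageForge's loader: the function name `load_settings` is hypothetical, only a subset of settings is shown, and the sketch assumes the JSON file is a flat object keyed by the same names as the environment variables, which the documentation above does not confirm.

```python
import json
import os
from pathlib import Path

# Defaults copied from the settings table (subset, for illustration).
DEFAULTS = {
    "PAGEFORGE_DPI": "300",
    "PAGEFORGE_THRESHOLD": "160",
    "PAGEFORGE_MODE": "bw",
    "PAGEFORGE_FORMAT": "png",
}


def load_settings(config_path, environ=None):
    """Merge defaults, then the JSON config file, then environment variables."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)

    path = Path(config_path)
    if path.is_file():
        # ASSUMPTION: flat JSON object using the same keys as the env vars.
        file_values = json.loads(path.read_text(encoding="utf-8"))
        settings.update(
            {key: str(value) for key, value in file_values.items() if key in DEFAULTS}
        )

    # Environment variables win over both the defaults and the config file.
    settings.update({key: environ[key] for key in DEFAULTS if key in environ})
    return settings
```

Passing `environ` explicitly keeps the function easy to exercise without touching the real process environment.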
You supply your own credentials. PageForge includes no Google auth of any kind. You create a Google Cloud project, generate a service account key, and share whichever Drive folder you want with that account. Your credentials stay on your machine.
See `docs/google-drive-setup.md` for the full step-by-step walkthrough, including screenshot guidance, troubleshooting, and how to disable Drive sync without removing the key file.
Short version:
- Create a GCP project and enable the Google Drive API.
- Create a Service Account and download its JSON key.
- Place the key at `./data/credentials.json` (mounted as `/data/credentials.json` in the container).
- Share your Drive folder with the service account email as Editor.
- Set `DRIVE_FOLDER_ID` in your environment or via the web Settings tab.
PageForge will poll the folder on each sync cycle, download new PDFs, convert them, and delete the originals from Drive.
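
One way a polling step like this could select PDFs is with a Drive API v3 search query. The sketch below only builds that query string. The function name `drive_pdf_query` is hypothetical, and whether PageForge filters files this way is an assumption.

```python
def drive_pdf_query(folder_id: str) -> str:
    """Return a Drive API v3 `q` expression matching non-trashed PDFs in one folder."""
    if not folder_id:
        # A blank folder ID means Drive sync is disabled, so there is nothing to query.
        raise ValueError("DRIVE_FOLDER_ID is empty; Drive sync is disabled")
    # Escape backslashes and single quotes as the Drive query syntax requires.
    escaped = folder_id.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"'{escaped}' in parents"
        " and mimeType = 'application/pdf'"
        " and trashed = false"
    )
```

With `google-api-python-client`, a string like this would typically be passed as the `q` argument of `service.files().list(...)`.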
The web interface is available at http://localhost:8000 when running as a server.
- Files tab — shows summary statistics (total files, pages, days, years), an interactive monthly coverage calendar, and a filterable table of all processed documents. Each row has a ZIP download button.
- Upload tab — drag-and-drop or click-to-browse PDF upload. Files are processed in the background; the UI polls for completion and updates the status in real time. Hidden when `PAGEFORGE_ENABLE_UPLOAD` is `false`.
- Settings tab — live configuration editor for all settings. Changes are saved to `config.json` and take effect immediately without restarting the server.
For a PDF named 2024-03-15_invoice.pdf, PageForge creates:

```
2024-03-15_invoice_pages/
  2024-03-15_invoice_p001.png
  2024-03-15_invoice_p002.png
  2024-03-15_invoice_p003.png
  ...
```

- Images are named `{stem}_p{page:03d}.png` (zero-padded to three digits).
- Each directory is self-contained and can be zipped for download via the API or web UI.
- When files are named with a `YYYY-MM-DD` prefix, the web UI groups and filters them by date, month, and year.
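
The naming rules above can be expressed in a few lines of Python. This is a sketch for scripts that need to predict output paths, not PageForge's own code: `page_image_path` and `date_prefix` are hypothetical names, and the sketch assumes that a custom output directory receives the image files directly.

```python
import re
from datetime import date
from pathlib import Path


def page_image_path(pdf_path, page, output_dir=None):
    """Return the PNG path for a 1-based page number of the given PDF."""
    pdf = Path(pdf_path)
    # Default directory is {stem}_pages/ next to the source PDF.
    directory = Path(output_dir) if output_dir else pdf.with_name(f"{pdf.stem}_pages")
    # Page numbers are zero-padded to three digits.
    return directory / f"{pdf.stem}_p{page:03d}.png"


def date_prefix(stem):
    """Parse a leading YYYY-MM-DD from a file stem, or return None."""
    match = re.match(r"(\d{4})-(\d{2})-(\d{2})", stem)
    if not match:
        return None
    try:
        return date(*map(int, match.groups()))
    except ValueError:  # digits present but not a real calendar date
        return None
```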
- OCR pipelines — feed the output PNGs directly into Tesseract, EasyOCR, or a cloud OCR API
- AI document ingestion — send high-contrast page images to vision-capable language models for extraction or summarization
- Digitizing paper records — scan documents to PDF, drop them in the inbox, and get clean per-page images automatically
- Batch archival — process hundreds of PDFs overnight using the CLI glob mode or the watch-folder scheduler
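
For the OCR and ingestion use cases, a downstream script usually needs the page images in page order. A minimal sketch, assuming the `{stem}_p{page}.png` naming described earlier (the helper name `pages_in_order` is hypothetical):

```python
import re
from pathlib import Path

# Matches the trailing page number in names such as invoice_p001.png.
PAGE_NAME = re.compile(r"_p(\d+)\.png$")


def pages_in_order(pages_dir):
    """Return page images from one output directory, sorted by page number."""
    numbered = []
    for path in Path(pages_dir).iterdir():
        match = PAGE_NAME.search(path.name)
        if match:  # skip anything that is not a page image
            numbered.append((int(match.group(1)), path))
    # Sort numerically so p1000 comes after p999, unlike a plain name sort.
    numbered.sort(key=lambda item: item[0])
    return [path for _, path in numbered]
```

Each returned path can then be handed to an OCR engine, for example `pytesseract.image_to_string(Image.open(path))` if pytesseract and Pillow are installed.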
Licensed under the MIT License.