Pull tables out of any PDF — text-based, scanned, or mixed — and get clean JSON and CSV files back.
No complicated setup. No system installs needed for regular PDFs. Just Python and one pip install.
flowchart TD
A([You upload or drop a PDF]) --> B{Does the page have text?}
B -- Yes --> C[Read text directly\nusing pdfplumber]
B -- No / Scanned image --> D[Convert page to image\nusing PyMuPDF]
B -- Has text AND embedded images --> C
B -- Has text AND embedded images --> D
C --> E[Find tables by\nborders or word positions]
D --> F[Find tables by\nimage line detection]
F --> G[Read text in each cell\nusing EasyOCR]
E --> H{Table spans\nmultiple pages?}
G --> H
H -- Yes --> I[Merge into one table]
H -- No --> J[Keep as-is]
I --> K([Save as JSON + CSV])
J --> K
In plain English:
- If the PDF has real text → we read it directly. Fast and accurate.
- If the PDF is a scanned image (or has images inside it) → we recognise the text using AI (EasyOCR).
- If a table runs across two pages → we automatically join them into one.
- The final output is always clean JSON and CSV files.
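The routing rules above can be sketched as a small function. This is a minimal illustration, not the project's actual code: `choose_paths` and its `"digital"` / `"ocr"` labels are hypothetical names, and the 50-character cutoff is borrowed from the `SCANNED_PAGE_TEXT_THRESHOLD` default.

```python
from typing import List

# Assumed default; mirrors SCANNED_PAGE_TEXT_THRESHOLD in config.py.
SCANNED_PAGE_TEXT_THRESHOLD = 50


def choose_paths(page_text: str, has_embedded_images: bool,
                 threshold: int = SCANNED_PAGE_TEXT_THRESHOLD) -> List[str]:
    """Return which extraction paths to run for one page (hypothetical helper).

    "digital" = read the text directly, "ocr" = render to an image and
    recognise the text.
    """
    has_text = len(page_text.strip()) >= threshold
    if not has_text:
        return ["ocr"]              # scanned page: too little real text
    if has_embedded_images:
        return ["digital", "ocr"]   # mixed page: run both paths
    return ["digital"]              # ordinary text page
```

For example, `choose_paths("", True)` selects only the OCR path, while a text-heavy page with embedded images gets both.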
AiProject/
├── app.py               ← Web frontend (Flask)
├── main.py              ← Command-line tool
├── config.py            ← Settings you can tweak
├── requirements.txt
│
├── input/               ← Drop your PDFs here (for CLI use)
├── output/              ← Results appear here (for CLI use)
├── templates/
│   └── index.html       ← The web page
│
├── extractors/
│   ├── pipeline.py      ← Decides digital vs OCR per page
│   ├── digital.py       ← Reads text-based PDF tables
│   ├── ocr.py           ← Reads image/scanned tables
│   └── base.py          ← Shared data structures
│
├── exporters/
│   ├── json_exp.py      ← Writes JSON output
│   └── csv_exp.py       ← Writes CSV output
│
└── utils/
    ├── analyzer.py      ← Detects if a page is scanned
    ├── image_proc.py    ← Image cleanup (deskew, denoise)
    ├── merger.py        ← Joins multi-page tables
    └── table_utils.py   ← Cell cleaning and validation
Requirements: Python 3.9 or higher.

pip install -r requirements.txt

That installs everything. No Tesseract, no Ghostscript, no Poppler needed.
For scanned PDFs only: EasyOCR will download its language model (~100 MB) the first time it runs. This is automatic — no action needed from you.

python app.py

Then open your browser and go to: http://localhost:5000
- Click Choose File and pick your PDF
- Click Extract Tables
- Wait a moment — progress shows on screen
- Download your files:
  - Click the blue button for JSON
  - Click the grey button(s) for CSV (one per table)
- Scroll down to see a live JSON preview of the data

python main.py --input ./input --output ./output

This reads every .pdf in the input/ folder and writes results to output/.
Process a single file:

python main.py --input ./input/my_report.pdf --output ./output

JSON output only:

python main.py --input ./input --output ./output --format json

CSV output only:

python main.py --input ./input --output ./output --format csv

Force OCR:

python main.py --input ./input --output ./output --ocr-force

For a file called report.pdf, the output will be:
output/
└── report/
    ├── report_tables.json   ← All tables in one JSON file
    ├── report_table_1.csv   ← Table 1 as a spreadsheet
    ├── report_table_2.csv   ← Table 2 as a spreadsheet
    └── ...
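Once a run finishes, the JSON file can be read back with the standard library alone. The sketch below is only an illustration: `load_tables` is a hypothetical helper, and it assumes the `tables`, `table_id`, `columns` and `rows` keys of the JSON format documented in this README.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_tables(json_path: Path) -> Dict[int, List[Dict[str, str]]]:
    """Map each table_id to its rows, with every row keyed by column name."""
    document = json.loads(Path(json_path).read_text(encoding="utf-8"))
    tables = {}
    for table in document["tables"]:
        columns = table["columns"]
        # Pair each cell with its column header.
        tables[table["table_id"]] = [dict(zip(columns, row)) for row in table["rows"]]
    return tables
```

Calling `load_tables(Path("output/report/report_tables.json"))` would then give one list of row dictionaries per table.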
{
  "document_id": "report",
  "tables": [
    {
      "table_id": 1,
      "page_start": 1,
      "page_end": 1,
      "title": "Monthly Summary",
      "columns": ["Month", "Sales", "Expenses"],
      "rows": [
        ["January", "12,000", "8,500"],
        ["February", "13,400", "9,100"]
      ]
    }
  ]
}

You rarely need to change these, but they are here if you want to tune behaviour:
| Setting | Default | What it does |
|---|---|---|
| `SCANNED_PAGE_TEXT_THRESHOLD` | 50 | If fewer than 50 characters are found on a page, treat it as scanned |
| `MIN_ROWS` | 2 | Ignore anything with fewer than 2 rows |
| `MIN_COLS` | 2 | Ignore anything with fewer than 2 columns |
| `MIN_FILL` | 0.25 | At least 25% of cells must have a value |
| `OCR_DPI` | 300 | Image resolution for scanned pages (higher = slower but more accurate) |
| `MERGE_SIM` | 0.70 | How similar column headers must be to merge multi-page tables |
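
To make the thresholds concrete, here is a rough sketch of how `MIN_ROWS`, `MIN_COLS`, `MIN_FILL` and `MERGE_SIM` might be applied. It is not the code in `table_utils.py` or `merger.py`: the function names are hypothetical, the header row is assumed to count as a row, the similarity measure (`difflib.SequenceMatcher`) is a stand-in, and a continuation page is assumed to repeat the header.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Defaults copied from the settings table.
MIN_ROWS, MIN_COLS, MIN_FILL, MERGE_SIM = 2, 2, 0.25, 0.70

Row = List[Optional[str]]


def is_valid_table(rows: List[Row]) -> bool:
    """Apply the MIN_ROWS, MIN_COLS and MIN_FILL checks to a candidate table."""
    if len(rows) < MIN_ROWS:
        return False
    n_cols = max(len(row) for row in rows)
    if n_cols < MIN_COLS:
        return False
    filled = sum(1 for row in rows for cell in row if cell and cell.strip())
    return filled / (len(rows) * n_cols) >= MIN_FILL


def headers_match(left: List[str], right: List[str]) -> bool:
    """Compare two header rows (assumed measure: character-level ratio)."""
    if len(left) != len(right):
        return False
    a = "|".join(h.strip().lower() for h in left)
    b = "|".join(h.strip().lower() for h in right)
    return SequenceMatcher(None, a, b).ratio() >= MERGE_SIM


def merge_if_continuation(first: List[Row], second: List[Row]) -> Optional[List[Row]]:
    """Join two tables when their headers match, dropping the repeated header."""
    if first and second and headers_match(first[0], second[0]):
        return first + second[1:]
    return None  # headers differ too much: keep the tables separate
```

Raising `MERGE_SIM` in this sketch makes merging stricter, which helps when unrelated tables on consecutive pages share similar headers.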