PDF Table Extractor

Pull tables out of any PDF — text-based, scanned, or mixed — and get clean JSON and CSV files back.
No complicated setup. No system installs needed for regular PDFs. Just Python and one pip install.

How It Works

flowchart TD
    A([You upload or drop a PDF]) --> B{Does the page have text?}

    B -- Yes --> C[Read text directly\nusing pdfplumber]
    B -- No / Scanned image --> D[Convert page to image\nusing PyMuPDF]

    B -- Has text AND embedded images --> C
    B -- Has text AND embedded images --> D

    C --> E[Find tables by\nborders or word positions]
    D --> F[Find tables by\nimage line detection]
    F --> G[Read text in each cell\nusing EasyOCR]

    E --> H{Table spans\nmultiple pages?}
    G --> H

    H -- Yes --> I[Merge into one table]
    H -- No  --> J[Keep as-is]

    I --> K([Save as JSON + CSV])
    J --> K

In plain English:

- If the PDF has real text → we read it directly. Fast and accurate.
- If the PDF is a scanned image (or has images inside it) → we recognise the text using AI (EasyOCR).
- If a table runs across two pages → we automatically join them into one.
- The final output is always clean JSON and CSV files.
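
The per-page decision described above can be sketched roughly as follows. This is an illustration, not the code in `extractors/pipeline.py`: the function name `choose_routes` and the route labels are hypothetical, and the threshold is assumed to match the `SCANNED_PAGE_TEXT_THRESHOLD` default of 50.

```python
from typing import List

# Assumed to mirror SCANNED_PAGE_TEXT_THRESHOLD in config.py (default 50).
SCANNED_PAGE_TEXT_THRESHOLD = 50


def choose_routes(page_text: str, has_images: bool,
                  threshold: int = SCANNED_PAGE_TEXT_THRESHOLD) -> List[str]:
    """Return the extraction routes to try for one page (hypothetical helper).

    "digital" means read the text layer directly; "ocr" means render the
    page to an image and recognise the text in it.
    """
    char_count = len(page_text.strip())
    if char_count < threshold:
        # Too little extractable text: treat the page as scanned.
        return ["ocr"]
    if has_images:
        # Real text plus embedded images: try both routes.
        return ["digital", "ocr"]
    return ["digital"]


print(choose_routes("", has_images=True))
print(choose_routes("x" * 200, has_images=False))
print(choose_routes("x" * 200, has_images=True))
```

A page with no text goes to OCR only, a text-only page is read directly, and a page with both text and embedded images is sent down both paths, matching the flowchart.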

Project Structure

AiProject/
├── app.py              ← Web frontend (Flask)
├── main.py             ← Command-line tool
├── config.py           ← Settings you can tweak
├── requirements.txt
│
├── input/              ← Drop your PDFs here (for CLI use)
├── output/             ← Results appear here (for CLI use)
├── templates/
│   └── index.html      ← The web page
│
├── extractors/
│   ├── pipeline.py     ← Decides digital vs OCR per page
│   ├── digital.py      ← Reads text-based PDF tables
│   ├── ocr.py          ← Reads image/scanned tables
│   └── base.py         ← Shared data structures
│
├── exporters/
│   ├── json_exp.py     ← Writes JSON output
│   └── csv_exp.py      ← Writes CSV output
│
└── utils/
    ├── analyzer.py     ← Detects if a page is scanned
    ├── image_proc.py   ← Image cleanup (deskew, denoise)
    ├── merger.py       ← Joins multi-page tables
    └── table_utils.py  ← Cell cleaning and validation

Installation

Requirements: Python 3.9 or higher.

pip install -r requirements.txt

That installs everything. No Tesseract, no Ghostscript, no Poppler needed.

For scanned PDFs only: EasyOCR will download its language model (~100 MB) the first time it runs. This is automatic — no action needed from you.

Way 1 — Web Frontend (easier)

Start the server

python app.py

Then open your browser and go to: http://localhost:5000

Steps

1. Click Choose File and pick your PDF
2. Click Extract Tables
3. Wait a moment; progress shows on screen
4. Download your files:
   - Click the blue button for JSON
   - Click the grey button(s) for CSV (one per table)
5. Scroll down to see a live JSON preview of the data

Way 2 — Command Line

Process all PDFs in a folder

python main.py --input ./input --output ./output

This reads every .pdf in the input/ folder and writes results to output/.

Process a single file

python main.py --input ./input/my_report.pdf --output ./output

Output only JSON (skip CSV)

python main.py --input ./input --output ./output --format json

Output only CSV

python main.py --input ./input --output ./output --format csv

Force OCR on every page (use if tables are missing)

python main.py --input ./input --output ./output --ocr-force
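
For readers who want to see how these flags fit together, here is a minimal `argparse` sketch. The flag names come from the commands above; everything else (the helper names, the help strings, and the assumption that omitting `--format` writes both JSON and CSV) is a guess and may differ from the real `main.py`.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract tables from PDFs.")
    parser.add_argument("--input", required=True,
                        help="A single PDF file or a folder of PDFs")
    parser.add_argument("--output", required=True,
                        help="Folder where results are written")
    # Assumption: only "json" and "csv" are accepted values.
    parser.add_argument("--format", choices=["json", "csv"], default=None,
                        help="Write only this format (default: both)")
    parser.add_argument("--ocr-force", action="store_true",
                        help="Run OCR on every page")
    return parser


def formats_to_write(fmt: Optional[str]) -> List[str]:
    """Assumed behaviour: no --format means write both JSON and CSV."""
    return [fmt] if fmt else ["json", "csv"]


args = build_parser().parse_args(
    ["--input", "./input", "--output", "./output", "--format", "json"]
)
print(args.input, args.output, formats_to_write(args.format), args.ocr_force)
```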

Output Files

For a file called report.pdf, the output will be:

output/
└── report/
    ├── report_tables.json      ← All tables in one JSON file
    ├── report_table_1.csv      ← Table 1 as a spreadsheet
    ├── report_table_2.csv      ← Table 2 as a spreadsheet
    └── ...
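
If you need to locate these files from your own script, the naming pattern above can be reproduced with `pathlib`. The helper `expected_outputs` is hypothetical and only encodes the layout shown in this section.

```python
from pathlib import Path
from typing import List


def expected_outputs(pdf_path: str, output_dir: str, table_count: int) -> List[Path]:
    """List the files expected for one PDF, following the layout above."""
    stem = Path(pdf_path).stem            # "report" for "report.pdf"
    doc_dir = Path(output_dir) / stem     # one subfolder per document
    paths = [doc_dir / f"{stem}_tables.json"]
    # CSV files are numbered from 1, one per extracted table.
    paths += [doc_dir / f"{stem}_table_{i}.csv" for i in range(1, table_count + 1)]
    return paths


for path in expected_outputs("input/report.pdf", "output", table_count=2):
    print(path.as_posix())
```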

JSON structure

{
  "document_id": "report",
  "tables": [
    {
      "table_id": 1,
      "page_start": 1,
      "page_end": 1,
      "title": "Monthly Summary",
      "columns": ["Month", "Sales", "Expenses"],
      "rows": [
        ["January", "12,000", "8,500"],
        ["February", "13,400", "9,100"]
      ]
    }
  ]
}
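
As a rough example of consuming this structure, the snippet below turns each table into a list of row dictionaries keyed by column name. It parses an inline copy of the sample above so it runs on its own; in practice you would load `output/report/report_tables.json` with `json.load` instead. The helper name `table_to_records` is made up for this example.

```python
import json
from typing import Dict, List

# Inline copy of the sample document shown above.
SAMPLE = """
{
  "document_id": "report",
  "tables": [
    {
      "table_id": 1,
      "page_start": 1,
      "page_end": 1,
      "title": "Monthly Summary",
      "columns": ["Month", "Sales", "Expenses"],
      "rows": [
        ["January", "12,000", "8,500"],
        ["February", "13,400", "9,100"]
      ]
    }
  ]
}
"""


def table_to_records(table: dict) -> List[Dict[str, str]]:
    """Pair every row with the column names to get one dict per row."""
    columns = table["columns"]
    return [dict(zip(columns, row)) for row in table["rows"]]


document = json.loads(SAMPLE)
for table in document["tables"]:
    records = table_to_records(table)
    print(table["table_id"], table["title"], records[0])
```

Note that cell values are strings in the sample (for example `"12,000"`), so convert them yourself if you need numbers.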

Settings (`config.py`)

You rarely need to change these, but they are here if you want to tune behaviour:

| Setting | Default | What it does |
|---|---|---|
| `SCANNED_PAGE_TEXT_THRESHOLD` | 50 | If fewer than 50 characters found, treat page as scanned |
| `MIN_ROWS` | 2 | Ignore anything with fewer than 2 rows |
| `MIN_COLS` | 2 | Ignore anything with fewer than 2 columns |
| `MIN_FILL` | 0.25 | At least 25% of cells must have a value |
| `OCR_DPI` | 300 | Image resolution for scanned pages (higher = slower but more accurate) |
| `MERGE_SIM` | 0.70 | How similar column headers must be to merge multi-page tables |
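
To make the thresholds more concrete, here is one way such checks could work. This is a sketch under assumptions, not the logic in `utils/table_utils.py` or `utils/merger.py`: the function names are hypothetical, the row count is assumed to include the header row, and header similarity is measured here with `difflib.SequenceMatcher`, which may not be the metric the project uses.

```python
from difflib import SequenceMatcher
from typing import List

# Defaults copied from the settings above.
MIN_ROWS = 2
MIN_COLS = 2
MIN_FILL = 0.25
MERGE_SIM = 0.70

Table = List[List[str]]


def is_valid_table(rows: Table) -> bool:
    """Reject candidates that are too small or mostly empty (assumed rule)."""
    if len(rows) < MIN_ROWS:
        return False
    width = max(len(row) for row in rows)
    if width < MIN_COLS:
        return False
    total_cells = len(rows) * width
    filled = sum(1 for row in rows for cell in row if cell and cell.strip())
    return filled / total_cells >= MIN_FILL


def headers_match(a: List[str], b: List[str]) -> bool:
    """Compare two header rows; average per-column similarity must reach MERGE_SIM."""
    if not a or len(a) != len(b):
        return False
    scores = [
        SequenceMatcher(None, x.strip().lower(), y.strip().lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores) >= MERGE_SIM


print(is_valid_table([["Month", "Sales"], ["January", "12,000"]]))
print(headers_match(["Month", "Sales", "Expenses"], ["Month", "Sales", "Expense"]))
```

Raising `MERGE_SIM` makes merging stricter (fewer accidental joins), while lowering `MIN_FILL` keeps sparser tables.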
