TheDhruv0710/DataExtractor

PDF Table Extractor

Pull tables out of any PDF — text-based, scanned, or mixed — and get clean JSON and CSV files back.
No complicated setup and no system-level installs: just Python and one pip install. Scanned PDFs additionally trigger a one-time, automatic OCR model download.


How It Works

flowchart TD
    A([You upload or drop a PDF]) --> B{Does the page have text?}

    B -- Yes --> C[Read text directly\nusing pdfplumber]
    B -- No / Scanned image --> D[Convert page to image\nusing PyMuPDF]

    B -- Has text AND embedded images --> C
    B -- Has text AND embedded images --> D

    C --> E[Find tables by\nborders or word positions]
    D --> F[Find tables by\nimage line detection]
    F --> G[Read text in each cell\nusing EasyOCR]

    E --> H{Table spans\nmultiple pages?}
    G --> H

    H -- Yes --> I[Merge into one table]
    H -- No  --> J[Keep as-is]

    I --> K([Save as JSON + CSV])
    J --> K

In plain English:

  • If a page has real (selectable) text → we read it directly. Fast and accurate.
  • If a page is a scanned image (or has images embedded in it) → we recognise the text with OCR (EasyOCR).
  • If a table runs across two or more pages → we automatically join the parts into one table.
  • The final output is always clean JSON and CSV files.
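
The per-page routing described above can be sketched in a few lines. This is a minimal illustration, not the code in extractors/pipeline.py: the function name `choose_routes` and its arguments are hypothetical, and the threshold value simply mirrors the `SCANNED_PAGE_TEXT_THRESHOLD` default listed under Settings.

```python
from typing import Set

# Assumed default, mirroring SCANNED_PAGE_TEXT_THRESHOLD in config.py.
SCANNED_PAGE_TEXT_THRESHOLD = 50


def choose_routes(page_text: str, has_embedded_images: bool,
                  threshold: int = SCANNED_PAGE_TEXT_THRESHOLD) -> Set[str]:
    """Return the extraction routes for one page: 'digital', 'ocr', or both.

    Hypothetical helper: a page with too little text is treated as scanned,
    and a page with text plus embedded images goes down both routes.
    """
    has_text = len(page_text.strip()) >= threshold
    routes: Set[str] = set()
    if has_text:
        routes.add("digital")
    if not has_text or has_embedded_images:
        routes.add("ocr")
    return routes


text_page = choose_routes("x" * 200, has_embedded_images=False)
scanned_page = choose_routes("", has_embedded_images=False)
mixed_page = choose_routes("x" * 200, has_embedded_images=True)
```

Returning a set rather than a single label keeps the "mixed" case explicit: both extractors run and their tables are combined afterwards.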

Project Structure

AiProject/
├── app.py              ← Web frontend (Flask)
├── main.py             ← Command-line tool
├── config.py           ← Settings you can tweak
├── requirements.txt
│
├── input/              ← Drop your PDFs here (for CLI use)
├── output/             ← Results appear here (for CLI use)
├── templates/
│   └── index.html      ← The web page
│
├── extractors/
│   ├── pipeline.py     ← Decides digital vs OCR per page
│   ├── digital.py      ← Reads text-based PDF tables
│   ├── ocr.py          ← Reads image/scanned tables
│   └── base.py         ← Shared data structures
│
├── exporters/
│   ├── json_exp.py     ← Writes JSON output
│   └── csv_exp.py      ← Writes CSV output
│
└── utils/
    ├── analyzer.py     ← Detects if a page is scanned
    ├── image_proc.py   ← Image cleanup (deskew, denoise)
    ├── merger.py       ← Joins multi-page tables
    └── table_utils.py  ← Cell cleaning and validation

Installation

Requirements: Python 3.9 or higher.

pip install -r requirements.txt

That installs everything. No Tesseract, no Ghostscript, no Poppler needed.

For scanned PDFs only: EasyOCR will download its language model (~100 MB) the first time it runs. This is automatic — no action needed from you.


Way 1 — Web Frontend (easier)

Start the server

python app.py

Then open your browser and go to: http://localhost:5000

Steps

  1. Click Choose File and pick your PDF
  2. Click Extract Tables
  3. Wait a moment — progress shows on screen
  4. Download your files:
    • Click the blue button for JSON
    • Click the grey button(s) for CSV (one per table)
  5. Scroll down to see a live JSON preview of the data

Way 2 — Command Line

Process all PDFs in a folder

python main.py --input ./input --output ./output

This reads every .pdf in the input/ folder and writes results to output/.

Process a single file

python main.py --input ./input/my_report.pdf --output ./output
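
Since `--input` accepts either a folder or a single file, the tool has to turn that argument into a list of PDFs. One way this could work is sketched below; `collect_pdfs` is a hypothetical helper, and the case-insensitive `.pdf` match and the sorted order are assumptions rather than confirmed behaviour of main.py.

```python
from pathlib import Path
from typing import List


def collect_pdfs(input_path: str) -> List[Path]:
    """Resolve an --input value to a list of PDF paths (hypothetical helper)."""
    path = Path(input_path)
    if path.is_file():
        # A single file: accept it only if it looks like a PDF.
        return [path] if path.suffix.lower() == ".pdf" else []
    if path.is_dir():
        # A folder: every PDF directly inside it, in a stable order.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    # Path does not exist.
    return []
```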

Output only JSON (skip CSV)

python main.py --input ./input --output ./output --format json

Output only CSV

python main.py --input ./input --output ./output --format csv

Force OCR on every page (use if tables are missing from the output)

python main.py --input ./input --output ./output --ocr-force
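
For reference, the flags used in the commands above map naturally onto an `argparse` parser. The sketch below only illustrates that mapping; the parser in main.py is not shown in this README, so the help texts, the `required` settings, and the "both formats when `--format` is omitted" default are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags shown in this README (illustrative)."""
    parser = argparse.ArgumentParser(description="Extract tables from PDFs")
    parser.add_argument("--input", required=True,
                        help="A PDF file or a folder of PDFs")
    parser.add_argument("--output", required=True,
                        help="Folder where results are written")
    parser.add_argument("--format", choices=["json", "csv"], default=None,
                        help="Write only this format (assumed default: both)")
    parser.add_argument("--ocr-force", action="store_true",
                        help="Run OCR on every page")
    return parser


args = build_parser().parse_args(
    ["--input", "./input", "--output", "./output", "--format", "json"]
)
# With no --format flag, fall back to writing both formats.
formats = [args.format] if args.format else ["json", "csv"]
```

Note that `argparse` exposes `--ocr-force` as `args.ocr_force`, with the hyphen converted to an underscore.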

Output Files

For a file called report.pdf, the output will be:

output/
└── report/
    ├── report_tables.json      ← All tables in one JSON file
    ├── report_table_1.csv      ← Table 1 as a spreadsheet
    ├── report_table_2.csv      ← Table 2 as a spreadsheet
    └── ...
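
If you script around the tool, the layout above makes output locations predictable. The helper below derives them from the PDF name; `output_paths` is a hypothetical function written for this README, and it assumes table numbering starts at 1 as in the listing.

```python
from pathlib import Path
from typing import Dict, List, Union


def output_paths(pdf_path: str, output_dir: str,
                 n_tables: int) -> Dict[str, Union[Path, List[Path]]]:
    """Return the expected JSON path and CSV paths for one PDF."""
    stem = Path(pdf_path).stem          # "report" for "report.pdf"
    doc_dir = Path(output_dir) / stem   # one sub-folder per document
    return {
        "json": doc_dir / f"{stem}_tables.json",
        "csv": [doc_dir / f"{stem}_table_{i}.csv"
                for i in range(1, n_tables + 1)],
    }


paths = output_paths("input/report.pdf", "output", n_tables=2)
```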

JSON structure

{
  "document_id": "report",
  "tables": [
    {
      "table_id": 1,
      "page_start": 1,
      "page_end": 1,
      "title": "Monthly Summary",
      "columns": ["Month", "Sales", "Expenses"],
      "rows": [
        ["January", "12,000", "8,500"],
        ["February", "13,400", "9,100"]
      ]
    }
  ]
}
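
A short sketch of consuming this structure with the standard library follows. It parses the sample above and rebuilds one table as CSV text; `table_to_csv` is a hypothetical helper, not the code in exporters/csv_exp.py.

```python
import csv
import io
import json

sample = """
{
  "document_id": "report",
  "tables": [
    {
      "table_id": 1,
      "page_start": 1,
      "page_end": 1,
      "title": "Monthly Summary",
      "columns": ["Month", "Sales", "Expenses"],
      "rows": [
        ["January", "12,000", "8,500"],
        ["February", "13,400", "9,100"]
      ]
    }
  ]
}
"""


def table_to_csv(table: dict) -> str:
    """Render one table entry as CSV text: header row, then data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(table["columns"])
    writer.writerows(table["rows"])
    return buffer.getvalue()


document = json.loads(sample)
csv_text = table_to_csv(document["tables"][0])
```

Cell values are strings, so a figure such as "12,000" keeps its thousands separator, and the CSV writer quotes it because it contains a comma. Convert to numbers yourself if you need arithmetic, for example with `int(value.replace(",", ""))`.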

Settings (config.py)

You rarely need to change these, but they are here if you want to tune behaviour:

| Setting | Default | What it does |
|---|---|---|
| SCANNED_PAGE_TEXT_THRESHOLD | 50 | If fewer than 50 characters are found, treat the page as scanned |
| MIN_ROWS | 2 | Ignore anything with fewer than 2 rows |
| MIN_COLS | 2 | Ignore anything with fewer than 2 columns |
| MIN_FILL | 0.25 | At least 25% of cells must have a value |
| OCR_DPI | 300 | Image resolution for scanned pages (higher = slower but more accurate) |
| MERGE_SIM | 0.70 | How similar column headers must be to merge multi-page tables |
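
To make the thresholds concrete, here is a rough sketch of how `MIN_ROWS`, `MIN_COLS`, `MIN_FILL`, and `MERGE_SIM` could be applied. It is an illustration under assumptions, not the logic in utils/table_utils.py or utils/merger.py: the function names are hypothetical, the header row is counted as a row, and header similarity is measured with `difflib.SequenceMatcher`, which may differ from the metric the project actually uses.

```python
from difflib import SequenceMatcher
from typing import List

# Defaults copied from the settings table.
MIN_ROWS = 2
MIN_COLS = 2
MIN_FILL = 0.25
MERGE_SIM = 0.70


def passes_filters(rows: List[List[str]]) -> bool:
    """Keep a candidate table only if it is big enough and not mostly empty."""
    if len(rows) < MIN_ROWS:
        return False
    n_cols = max((len(r) for r in rows), default=0)
    if n_cols < MIN_COLS:
        return False
    total_cells = len(rows) * n_cols
    filled = sum(1 for r in rows for cell in r if cell and cell.strip())
    return filled / total_cells >= MIN_FILL


def headers_similar(a: List[str], b: List[str]) -> bool:
    """Decide whether two header rows are close enough to merge their tables."""
    if len(a) != len(b):
        return False
    joined_a = " | ".join(h.strip().lower() for h in a)
    joined_b = " | ".join(h.strip().lower() for h in b)
    return SequenceMatcher(None, joined_a, joined_b).ratio() >= MERGE_SIM
```

Raising `MIN_FILL` discards sparse grids that are often layout artefacts rather than tables; lowering `MERGE_SIM` makes merging across pages more permissive, which can help when OCR slightly garbles a repeated header.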
