ClearExtract API

ClearExtract API is a backend API for extracting structured data from invoices and receipts. It accepts document uploads, processes them through a background queue, extracts text using local OCR/PDF parsing, and returns clean JSON output through a simple REST API.

This project is currently built as an MVP using local OCR and heuristic parsing. It is designed so the extraction layer can later be upgraded to a cloud document intelligence provider such as Google Document AI, AWS Textract, or Azure AI Document Intelligence.

Overview

ClearExtract API converts business documents into structured JSON.

The first supported document types are:

Invoices
Receipts

The API accepts PDF and image files, stores the uploaded document, creates a processing job, and lets a background worker extract the data.

Basic flow:

Client uploads invoice/receipt
        ↓
API validates request and API key
        ↓
Document record is saved in PostgreSQL
        ↓
Job is added to Redis queue
        ↓
Worker processes the document
        ↓
OCR/PDF parser extracts text
        ↓
Normalizer converts text into JSON
        ↓
Client retrieves final result

Features

REST API built with Express.js
API key authentication
Invoice and receipt upload support
File validation using Multer
Background job processing with BullMQ
Redis queue integration
PostgreSQL database storage
Usage tracking by monthly page count
Free-plan page limit support
Local OCR using Tesseract
PDF text extraction using pdf-parse
Clean JSON response format
Worker-based extraction architecture
Modular extraction provider structure
Ready for future cloud OCR integration

Tech Stack

| Area | Technology |
| --- | --- |
| Runtime | Node.js |
| API Framework | Express.js |
| Database | PostgreSQL |
| Queue | BullMQ |
| Queue Storage | Redis |
| File Uploads | Multer |
| OCR | Tesseract OCR |
| PDF Text Extraction | pdf-parse |
| Security Middleware | Helmet |
| Environment Config | dotenv |

System Architecture

┌──────────────┐
│   Client     │
│ Postman/App  │
└──────┬───────┘
       │
       │ HTTP Request
       ▼
┌────────────────────┐
│  Express API        │
│  localhost:4000     │
└──────┬─────────────┘
       │
       ├── Validates API Key
       ├── Accepts File Upload
       ├── Creates Document Record
       │
       ▼
┌────────────────────┐
│ PostgreSQL Database │
│ Documents / Usage   │
└────────────────────┘
       │
       ▼
┌────────────────────┐
│ Redis + BullMQ      │
│ Job Queue           │
└──────┬─────────────┘
       │
       ▼
┌────────────────────┐
│ Background Worker   │
│ OCR + Normalization │
└──────┬─────────────┘
       │
       ▼
┌────────────────────┐
│ Updated JSON Result │
│ Saved in Database   │
└────────────────────┘

Project Structure

ClearExtract-API/
│
├── migrations/
│   └── 001_init.sql
│
├── scripts/
│   ├── createDevApiKey.js
│   └── migrate.js
│
├── src/
│   ├── app.js
│   ├── server.js
│   │
│   ├── config/
│   │   └── env.js
│   │
│   ├── controllers/
│   │   ├── documents.controller.js
│   │   └── usage.controller.js
│   │
│   ├── db/
│   │   └── pool.js
│   │
│   ├── middleware/
│   │   ├── auth.middleware.js
│   │   ├── error.middleware.js
│   │   └── upload.middleware.js
│   │
│   ├── queues/
│   │   └── document.queue.js
│   │
│   ├── routes/
│   │   ├── documents.routes.js
│   │   ├── health.routes.js
│   │   └── usage.routes.js
│   │
│   ├── services/
│   │   ├── extractor.service.js
│   │   ├── usage.service.js
│   │   │
│   │   ├── providers/
│   │   │   └── localOcr.provider.js
│   │   │
│   │   └── normalizers/
│   │       └── localDocument.normalizer.js
│   │
│   ├── utils/
│   │   ├── apiKey.js
│   │   └── httpError.js
│   │
│   └── workers/
│       └── document.worker.js
│
├── uploads/
├── .env.example
├── .gitignore
├── package.json
├── package-lock.json
└── README.md

Requirements

Before running the project, install:

Node.js
PostgreSQL
Redis
Docker (optional; recommended for running Redis)
Tesseract OCR (required for image OCR)

Environment Variables

Create a .env file in the project root.

Use this structure:

PORT=4000
DATABASE_URL=postgres://postgres:YOUR_PASSWORD@localhost:5432/clear_extract
UPLOAD_DIR=uploads
FREE_PLAN_PAGE_LIMIT=50

REDIS_HOST=127.0.0.1
REDIS_PORT=6379

EXTRACTION_PROVIDER=local

Also keep a .env.example file in the repository with the same keys and placeholder values only (no real passwords or secrets).

Do not commit your real .env file.
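
The variables above could be read into a single config object at startup. The following is a minimal sketch, not the project's actual src/config/env.js: the function name loadConfig, the returned property names, and the choice to treat only DATABASE_URL as mandatory are assumptions made for illustration. The default values mirror the example .env shown above.

```javascript
// Hypothetical config loader (assumption: not the real src/config/env.js).
// Takes an env-like object so it can be exercised without touching process.env.
function loadConfig(env = process.env) {
  // Parse an integer variable, falling back when it is missing or not numeric.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  // Assumption: the database URL has no sensible default, so fail fast.
  if (!env.DATABASE_URL) {
    throw new Error('DATABASE_URL is required');
  }

  return {
    port: toInt(env.PORT, 4000),
    databaseUrl: env.DATABASE_URL,
    uploadDir: env.UPLOAD_DIR || 'uploads',
    freePlanPageLimit: toInt(env.FREE_PLAN_PAGE_LIMIT, 50),
    redis: {
      host: env.REDIS_HOST || '127.0.0.1',
      port: toInt(env.REDIS_PORT, 6379),
    },
    extractionProvider: env.EXTRACTION_PROVIDER || 'local',
  };
}

// Usage with an explicit object instead of the real environment:
const exampleConfig = loadConfig({
  DATABASE_URL: 'postgres://postgres:YOUR_PASSWORD@localhost:5432/clear_extract',
});
console.log(exampleConfig.port); // 4000
```

Failing fast on a missing DATABASE_URL surfaces configuration mistakes at startup rather than on the first query.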

Installation

Clone the repository:

git clone https://github.com/HirulaAbesignha/ClearExtract-API.git
cd ClearExtract-API

Install dependencies:

npm install

Database Setup

Create a PostgreSQL database named:

clear_extract

Example using psql:

CREATE DATABASE clear_extract;

Then run the migration:

npm run db:migrate

Expected output:

Database migrated successfully.

Create Development API Key

Run:

npm run seed:key

This creates a demo user and API key.

Example output:

User: demo@example.com
API Key: sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Save this key. You will not see it again.

Copy this API key. It is required when calling protected endpoints.
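
Because the key is shown only once, a seed script like this typically stores a hash of the key rather than the key itself. The sketch below illustrates that idea with Node's built-in crypto module. It is an assumption about how scripts/createDevApiKey.js and src/utils/apiKey.js might work, not a copy of them: the function names, the 32-character hex suffix, and the use of SHA-256 are all illustrative.

```javascript
const crypto = require('node:crypto');

// Assumption: keys are a fixed prefix plus 32 random hex characters.
function generateApiKey(prefix = 'sk_test_') {
  return prefix + crypto.randomBytes(16).toString('hex');
}

// Assumption: only this SHA-256 digest is stored in the database,
// which is why the plain key cannot be displayed again later.
function hashApiKey(apiKey) {
  return crypto.createHash('sha256').update(apiKey).digest('hex');
}

// Compare a presented key against a stored hash in constant time.
function verifyApiKey(candidateKey, storedHash) {
  const candidate = Buffer.from(hashApiKey(candidateKey), 'hex');
  const expected = Buffer.from(storedHash, 'hex');
  return candidate.length === expected.length
    && crypto.timingSafeEqual(candidate, expected);
}

// Usage: print the key once, persist only the hash.
const demoKey = generateApiKey();
const demoHash = hashApiKey(demoKey);
console.log('API Key:', demoKey);
console.log('Valid:', verifyApiKey(demoKey, demoHash));
```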

Redis Setup

The project uses Redis for background jobs.

Recommended Docker command:

docker run -d --name clearextract-redis -p 6379:6379 redis

If the container already exists:

docker start clearextract-redis

Check if Redis is running:

docker ps

Tesseract OCR Setup

Tesseract OCR is required for image-based documents such as JPG, PNG, JPEG, and WEBP.

After installation, check it using:

tesseract --version

If the command is not recognized on Windows, add the Tesseract installation folder to your system PATH.

Common Windows path:

C:\Program Files\Tesseract-OCR

Running the Project

You need two terminals.

Terminal 1: Start the API server

npm run dev

Expected output:

ClearExtract API running on http://localhost:4000

Terminal 2: Start the background worker

npm run worker

Expected output:

Document worker is running...

Redis must also be running.

API Authentication

Protected endpoints require an API key.

Use Bearer Token authentication:

Authorization: Bearer sk_test_your_api_key_here

In Postman:

Go to the Authorization tab.
Select Bearer Token.
Paste only the API key.
Do not include the word Bearer manually.
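
On the server side, the header has to be split into its scheme and token before the key can be looked up. The sketch below shows one way this could be done. It is not the project's auth.middleware.js: the helper names, the lookupApiKey callback, the 401 status code, and the "Invalid API key" message are assumptions, while the error body shape follows the error example in the Usage Tracking section.

```javascript
// Hypothetical helper: pull the token out of "Authorization: Bearer <token>".
function extractBearerToken(authorizationHeader) {
  if (typeof authorizationHeader !== 'string') return null;
  const match = /^Bearer\s+(\S+)$/i.exec(authorizationHeader.trim());
  return match ? match[1] : null;
}

// Express-style middleware factory built on the helper.
// `lookupApiKey` is a placeholder for the project's own database lookup.
function requireApiKey(lookupApiKey) {
  return async function apiKeyMiddleware(req, res, next) {
    const token = extractBearerToken(req.headers.authorization);
    if (!token) {
      // Assumption: 401 is used for a missing key.
      return res.status(401).json({ error: { message: 'Missing API key', status: 401 } });
    }
    const user = await lookupApiKey(token);
    if (!user) {
      return res.status(401).json({ error: { message: 'Invalid API key', status: 401 } });
    }
    req.user = user; // make the key's owner available to later handlers
    return next();
  };
}

console.log(extractBearerToken('Bearer sk_test_your_api_key_here'));
// sk_test_your_api_key_here
```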

API Endpoints

Base URL:

http://localhost:4000/v1

1. Health Check

GET /health

Full URL:

http://localhost:4000/v1/health

Authentication required: No

Example response:

{
  "status": "ok",
  "service": "clear-extract-api"
}

2. Extract Document

POST /documents/extract

Full URL:

http://localhost:4000/v1/documents/extract

Authentication required: Yes

Body type:

form-data

Fields:

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| document_type | Text | Yes | `invoice` or `receipt` |
| file | File | Yes | PDF, JPG, JPEG, PNG, or WEBP |

Example response:

{
  "document_id": "b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
  "status": "processing",
  "message": "Document accepted and queued for extraction.",
  "result_url": "/v1/documents/b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
  "created_at": "2026-05-27T05:20:00.000Z"
}
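
Outside Postman, the same upload can be sent from a script. The sketch below builds the multipart request with the fetch, FormData, and Blob globals available in Node.js 18 and later. The function names and parameter object are illustrative assumptions; the URL path, field names, and Bearer header come from this README.

```javascript
// Build the multipart request without sending it, so it can be inspected.
function buildExtractRequest({ baseUrl, apiKey, documentType, fileBytes, filename, mimeType }) {
  const form = new FormData();
  form.append('document_type', documentType); // "invoice" or "receipt"
  form.append('file', new Blob([fileBytes], { type: mimeType }), filename);
  return {
    url: `${baseUrl}/documents/extract`,
    options: {
      method: 'POST',
      // Do not set Content-Type manually; fetch adds the multipart boundary.
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Send the request and return the parsed JSON body (document_id, status, ...).
async function uploadDocument(params) {
  const { url, options } = buildExtractRequest(params);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  return response.json();
}

// Usage (requires the API server to be running):
// const fs = require('node:fs/promises');
// const fileBytes = await fs.readFile('invoice.pdf');
// const job = await uploadDocument({
//   baseUrl: 'http://localhost:4000/v1',
//   apiKey: 'sk_test_your_api_key_here',
//   documentType: 'invoice',
//   fileBytes,
//   filename: 'invoice.pdf',
//   mimeType: 'application/pdf',
// });
```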

3. Get Document Result

GET /documents/:id

Full URL:

http://localhost:4000/v1/documents/YOUR_DOCUMENT_ID

Authentication required: Yes

Example response for invoice:

{
  "document_id": "b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
  "document_type": "invoice",
  "status": "completed",
  "pages": 1,
  "confidence": 0.75,
  "data": {
    "vendor_name": "ABC Supplies",
    "invoice_number": "INV-1001",
    "invoice_date": "2026-05-21",
    "due_date": "2026-06-21",
    "currency": "USD",
    "subtotal": 420,
    "tax": 33.6,
    "total": 453.6,
    "line_items": [
      {
        "description": "Cotton fabric roll",
        "quantity": null,
        "unit_price": null,
        "amount": 420,
        "confidence": 0.45
      }
    ]
  },
  "warnings": [
    "Local OCR uses heuristic parsing. Review important fields before using in production."
  ],
  "error_message": null,
  "created_at": "2026-05-27T05:20:00.000Z",
  "completed_at": "2026-05-27T05:20:03.000Z"
}

Example response for receipt:

{
  "document_id": "0c1e5d5e-1d92-45ab-b53c-e832df276221",
  "document_type": "receipt",
  "status": "completed",
  "pages": 1,
  "confidence": 0.67,
  "data": {
    "merchant_name": "Demo Grocery Store",
    "receipt_number": "RCPT-1001",
    "purchase_date": "2026-05-21",
    "currency": "USD",
    "subtotal": 42.5,
    "tax": 3.4,
    "total": 45.9,
    "items": [
      {
        "description": "Notebook",
        "quantity": null,
        "unit_price": null,
        "amount": 20,
        "confidence": 0.45
      }
    ]
  },
  "warnings": [
    "Local OCR uses heuristic parsing. Review important fields before using in production."
  ],
  "error_message": null,
  "created_at": "2026-05-27T05:20:00.000Z",
  "completed_at": "2026-05-27T05:20:03.000Z"
}
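
Since extraction runs in the background, a client usually has to poll this endpoint until the status leaves "processing". The following is a minimal polling sketch. The function name, the injected fetchJson callback, and the retry settings are assumptions; only the "processing" and "completed" statuses appear in this README, so any other status is simply treated as final here.

```javascript
// Poll GET /v1/documents/:id until the document is no longer "processing".
// `fetchJson(path)` is injected so the loop works with any HTTP client.
async function waitForResult(fetchJson, documentId, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const doc = await fetchJson(`/v1/documents/${documentId}`);
    // Assumption: every status other than "processing" is terminal.
    if (doc.status !== 'processing') return doc;
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Document ${documentId} still processing after ${maxAttempts} attempts`);
}

// Usage with the built-in fetch (Node.js 18+), assuming a running server:
// const fetchJson = async (path) => {
//   const res = await fetch(`http://localhost:4000${path}`, {
//     headers: { Authorization: 'Bearer sk_test_your_api_key_here' },
//   });
//   return res.json();
// };
// const result = await waitForResult(fetchJson, 'YOUR_DOCUMENT_ID');
```

A fixed interval keeps the sketch short; a real client might back off between attempts or switch to webhooks once they are available.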

4. Get Usage

GET /usage

Full URL:

http://localhost:4000/v1/usage

Authentication required: Yes

Example response:

{
  "plan": "free",
  "pages_used_this_month": 1,
  "monthly_page_limit": 50,
  "remaining_pages": 49
}

Testing with Postman

Health Check

Method:

GET

URL:

http://localhost:4000/v1/health

Upload Invoice or Receipt

Method:

POST

URL:

http://localhost:4000/v1/documents/extract

Authorization:

Bearer Token

Token:

sk_test_your_api_key_here

Body:

form-data

Fields:

document_type = invoice
file = select a PDF/image file

After sending, copy the document_id.

Get Extraction Result

Method:

GET

URL:

http://localhost:4000/v1/documents/YOUR_DOCUMENT_ID

Authorization:

Bearer Token

Token:

sk_test_your_api_key_here

Get Usage

Method:

GET

URL:

http://localhost:4000/v1/usage

Authorization:

Bearer Token

Token:

sk_test_your_api_key_here

Example Invoice JSON Schema

{
  "vendor_name": "string | null",
  "invoice_number": "string | null",
  "invoice_date": "string | null",
  "due_date": "string | null",
  "currency": "string | null",
  "subtotal": "number | null",
  "tax": "number | null",
  "total": "number | null",
  "line_items": [
    {
      "description": "string",
      "quantity": "number | null",
      "unit_price": "number | null",
      "amount": "number | null",
      "confidence": "number"
    }
  ]
}

Example Receipt JSON Schema

{
  "merchant_name": "string | null",
  "receipt_number": "string | null",
  "purchase_date": "string | null",
  "currency": "string | null",
  "subtotal": "number | null",
  "tax": "number | null",
  "total": "number | null",
  "items": [
    {
      "description": "string",
      "quantity": "number | null",
      "unit_price": "number | null",
      "amount": "number | null",
      "confidence": "number"
    }
  ]
}

Usage Tracking

Each processed document records page usage in the usage_events table.

The free plan limit is controlled by:

FREE_PLAN_PAGE_LIMIT=50

Once the monthly page limit is used up, the API returns an HTTP 402 (Payment Required) error.

Example:

{
  "error": {
    "message": "Monthly page limit exceeded. Used 50/50 pages.",
    "status": 402
  }
}
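
The limit check itself is simple arithmetic on the monthly page count. The sketch below is one possible shape for it, not the project's usage.service.js: the function name and the exact comparison are assumptions, chosen so that the error text and the remaining_pages value match the examples in this README.

```javascript
// Hypothetical limit check: throw a 402-style error once the limit is used up,
// otherwise report how many pages remain this month.
function assertWithinPageLimit(pagesUsed, pageLimit) {
  if (pagesUsed >= pageLimit) {
    const error = new Error(`Monthly page limit exceeded. Used ${pagesUsed}/${pageLimit} pages.`);
    error.status = 402; // Payment Required
    throw error;
  }
  return { remaining_pages: pageLimit - pagesUsed };
}

console.log(assertWithinPageLimit(1, 50)); // { remaining_pages: 49 }
```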

Local OCR Notes

This MVP uses local extraction:

pdf-parse for text-based PDFs
tesseract for image files
heuristic parsing for invoice and receipt fields
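
To give a sense of what heuristic parsing means here, the sketch below pulls a few invoice fields out of raw OCR text with regular expressions. It is illustrative only and is not the project's localDocument.normalizer.js: the function names, the patterns, and the label wording they expect are all assumptions.

```javascript
// Convert a matched amount such as "1,234.50" into a number, or null.
function parseAmount(raw) {
  const value = Number.parseFloat(raw.replace(/,/g, ''));
  return Number.isNaN(value) ? null : value;
}

// Find "<label>: $123.45" at the start of a line. `label` must be plain text,
// because it is inserted into the pattern without escaping.
function findAmount(text, label) {
  const pattern = new RegExp(`^\\s*${label}\\s*:?\\s*[^\\d\\n-]*([\\d,]+(?:\\.\\d+)?)`, 'im');
  const match = pattern.exec(text);
  return match ? parseAmount(match[1]) : null;
}

// Map raw text onto a subset of the invoice fields; missing fields become null.
function normalizeInvoiceText(text) {
  const invoiceNumber = /invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)/i.exec(text);
  return {
    invoice_number: invoiceNumber ? invoiceNumber[1] : null,
    subtotal: findAmount(text, 'Subtotal'),
    tax: findAmount(text, 'Tax'),
    total: findAmount(text, 'Total'),
  };
}

const sampleText = [
  'ABC Supplies',
  'Invoice Number: INV-1001',
  'Subtotal: $420.00',
  'Tax: $33.60',
  'Total: $453.60',
].join('\n');

console.log(normalizeInvoiceText(sampleText));
// { invoice_number: 'INV-1001', subtotal: 420, tax: 33.6, total: 453.6 }
```

Anchoring each label to the start of a line keeps "Total" from matching inside "Subtotal", which is the kind of edge case that makes heuristic parsing fragile on unusual layouts.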

This works well for:

simple invoices
simple receipts
clean images
text-based PDFs
MVP demos

It may be less accurate for:

blurry images
handwritten receipts
complex invoice layouts
image-based scanned PDFs (pdf-parse reads only embedded text; scanned PDF support is on the roadmap)
complicated line-item tables
low-resolution documents

For production accuracy, the extraction provider should later be upgraded to a dedicated document intelligence service.
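
One way to keep that upgrade path open is to hide each extraction backend behind a common interface and pick one by name from EXTRACTION_PROVIDER. The sketch below shows the idea. It is a hypothetical registry, not the project's extractor.service.js: the extractText method, the returned fields, and the registerProvider and getProvider helpers are assumptions.

```javascript
// Hypothetical provider registry keyed by the EXTRACTION_PROVIDER value.
const providers = {
  local: {
    name: 'local',
    // Placeholder body: the real provider would call pdf-parse or Tesseract.
    async extractText(filePath) {
      return { text: '', pages: 1, source: filePath };
    },
  },
};

// Add another backend (for example a cloud OCR client) under a new name.
function registerProvider(name, provider) {
  if (typeof provider.extractText !== 'function') {
    throw new TypeError(`Provider "${name}" must implement extractText(filePath)`);
  }
  providers[name] = { name, ...provider };
}

// Resolve the configured provider, defaulting to "local".
function getProvider(name = process.env.EXTRACTION_PROVIDER || 'local') {
  const provider = providers[name];
  if (!provider) {
    throw new Error(`Unknown extraction provider: ${name}`);
  }
  return provider;
}

console.log(getProvider('local').name); // local
```

With this shape, the worker only calls getProvider().extractText(...), so switching to a cloud service would mean registering a new provider and changing one environment variable.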

Common Errors

`database "clear_extract" does not exist`

Create the database first:

CREATE DATABASE clear_extract;

Then run:

npm run db:migrate

`ECONNREFUSED 127.0.0.1:6379`

Redis is not running.

Fix:

docker start clearextract-redis

Document stays in `processing`

The worker is most likely not running, so queued jobs are never picked up.

Fix:

npm run worker

`Tesseract OCR failed`

Tesseract is not installed or not added to PATH.

Check:

tesseract --version

`Missing API key`

Add your API key in the Authorization header:

Authorization: Bearer sk_test_your_api_key_here

Available Scripts

| Script | Description |
| --- | --- |
| `npm run dev` | Starts the Express API with nodemon |
| `npm run worker` | Starts the background worker with nodemon |
| `npm run start` | Starts the API server using Node |
| `npm run start:worker` | Starts the worker using Node |
| `npm run db:migrate` | Runs database migration |
| `npm run seed:key` | Creates a demo user and development API key |

Roadmap

Planned improvements:

Better OCR preprocessing
Scanned PDF support
More accurate line-item extraction
File URL upload support
Batch document extraction
Webhook support
Stripe billing integration
API dashboard
OpenAPI/Swagger documentation
Rate limiting
Cloud OCR provider option
Docker Compose setup
Production deployment guide

Future API Ideas

Possible future endpoints:

POST /v1/documents/batch
POST /v1/documents/from-url
POST /v1/documents/validate
GET  /v1/documents
GET  /v1/usage/events
GET  /v1/plans
POST /v1/webhooks

Security Notes

Do not commit:

.env
API keys
service account credentials
uploaded documents
database dumps
cloud provider credentials

Recommended .gitignore entries:

node_modules/
.env
uploads/
google-credentials.json
*.key
*.pem
logs/
*.log

Current Status

This project is currently an MVP.

Completed:

API server
PostgreSQL setup
API key authentication
File upload
Redis queue
BullMQ worker
Usage tracking
Mock extraction
Local OCR extraction
Basic invoice/receipt normalization

Not yet completed:

Production billing
Cloud OCR integration
Public API dashboard
Advanced OCR accuracy
Deployment setup

License

MIT License

Author

Built as a document extraction API MVP for converting invoices and receipts into structured JSON.
