ClearExtract API

ClearExtract API is a backend API for extracting structured data from invoices and receipts. It accepts document uploads, processes them through a background queue, extracts text using local OCR/PDF parsing, and returns clean JSON output through a simple REST API.

This project is currently built as an MVP using local OCR and heuristic parsing. It is designed so the extraction layer can later be upgraded to a cloud document intelligence provider such as Google Document AI, AWS Textract, or Azure AI Document Intelligence.



Overview

ClearExtract API converts business documents into structured JSON.

The first supported document types are:

  • Invoices
  • Receipts

The API accepts PDF and image files, stores the uploaded document, creates a processing job, and lets a background worker extract the data.

Basic flow:

Client uploads invoice/receipt
        ↓
API validates request and API key
        ↓
Document record is saved in PostgreSQL
        ↓
Job is added to Redis queue
        ↓
Worker processes the document
        ↓
OCR/PDF parser extracts text
        ↓
Normalizer converts text into JSON
        ↓
Client retrieves final result

Features

  • REST API built with Express.js
  • API key authentication
  • Invoice and receipt upload support
  • File validation using Multer
  • Background job processing with BullMQ
  • Redis queue integration
  • PostgreSQL database storage
  • Usage tracking by monthly page count
  • Free-plan page limit support
  • Local OCR using Tesseract
  • PDF text extraction using pdf-parse
  • Clean JSON response format
  • Worker-based extraction architecture
  • Modular extraction provider structure
  • Ready for future cloud OCR integration

Tech Stack

Area                  Technology
Runtime               Node.js
API Framework         Express.js
Database              PostgreSQL
Queue                 BullMQ
Queue Storage         Redis
File Uploads          Multer
OCR                   Tesseract OCR
PDF Text Extraction   pdf-parse
Security Middleware   Helmet
Environment Config    dotenv

System Architecture

┌──────────────┐
│   Client     │
│ Postman/App  │
└──────┬───────┘
       │
       │ HTTP Request
       ▼
┌────────────────────┐
│  Express API        │
│  localhost:4000     │
└──────┬─────────────┘
       │
       ├── Validates API Key
       ├── Accepts File Upload
       ├── Creates Document Record
       │
       ▼
┌────────────────────┐
│ PostgreSQL Database │
│ Documents / Usage   │
└────────────────────┘
       │
       ▼
┌────────────────────┐
│ Redis + BullMQ      │
│ Job Queue           │
└──────┬─────────────┘
       │
       ▼
┌────────────────────┐
│ Background Worker   │
│ OCR + Normalization │
└──────┬─────────────┘
       │
       ▼
┌────────────────────┐
│ Updated JSON Result │
│ Saved in Database   │
└────────────────────┘

Project Structure

ClearExtract-API/
│
├── migrations/
│   └── 001_init.sql
│
├── scripts/
│   ├── createDevApiKey.js
│   └── migrate.js
│
├── src/
│   ├── app.js
│   ├── server.js
│   │
│   ├── config/
│   │   └── env.js
│   │
│   ├── controllers/
│   │   ├── documents.controller.js
│   │   └── usage.controller.js
│   │
│   ├── db/
│   │   └── pool.js
│   │
│   ├── middleware/
│   │   ├── auth.middleware.js
│   │   ├── error.middleware.js
│   │   └── upload.middleware.js
│   │
│   ├── queues/
│   │   └── document.queue.js
│   │
│   ├── routes/
│   │   ├── documents.routes.js
│   │   ├── health.routes.js
│   │   └── usage.routes.js
│   │
│   ├── services/
│   │   ├── extractor.service.js
│   │   ├── usage.service.js
│   │   │
│   │   ├── providers/
│   │   │   └── localOcr.provider.js
│   │   │
│   │   └── normalizers/
│   │       └── localDocument.normalizer.js
│   │
│   ├── utils/
│   │   ├── apiKey.js
│   │   └── httpError.js
│   │
│   └── workers/
│       └── document.worker.js
│
├── uploads/
├── .env.example
├── .gitignore
├── package.json
├── package-lock.json
└── README.md
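
The providers/ folder suggests that extraction backends are swappable. As a rough sketch of how a provider could be selected by name, assuming a hypothetical provider shape of { name, extractText } (this is not the repository's actual code, and the placeholder body stands in for the real pdf-parse/Tesseract calls):

```javascript
// Hypothetical sketch: choose an extraction provider by name.
// The { name, extractText } shape is an assumption for illustration only.
const localProvider = {
  name: "local",
  // Placeholder: a real provider would call pdf-parse or Tesseract here.
  async extractText(filePath) {
    return `text extracted locally from ${filePath}`;
  },
};

// Registry of known providers; a cloud provider could be added here later.
const providers = new Map([[localProvider.name, localProvider]]);

function getProvider(name = process.env.EXTRACTION_PROVIDER || "local") {
  const provider = providers.get(name);
  if (!provider) {
    throw new Error(`Unknown extraction provider: ${name}`);
  }
  return provider;
}
```

With a registry like this, moving to a cloud service would mostly mean registering another provider and changing EXTRACTION_PROVIDER.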

Requirements

Before running the project, install:

  • Node.js
  • PostgreSQL
  • Redis
  • Docker (recommended for running Redis)
  • Tesseract OCR (required for image OCR)

Environment Variables

Create a .env file in the project root.

Use this structure:

PORT=4000
DATABASE_URL=postgres://postgres:YOUR_PASSWORD@localhost:5432/clear_extract
UPLOAD_DIR=uploads
FREE_PLAN_PAGE_LIMIT=50

REDIS_HOST=127.0.0.1
REDIS_PORT=6379

EXTRACTION_PROVIDER=local

Also keep a safe .env.example file in the repository:

PORT=4000
DATABASE_URL=postgres://postgres:YOUR_PASSWORD@localhost:5432/clear_extract
UPLOAD_DIR=uploads
FREE_PLAN_PAGE_LIMIT=50

REDIS_HOST=127.0.0.1
REDIS_PORT=6379

EXTRACTION_PROVIDER=local

Do not commit your real .env file.
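
For illustration, here is one way these variables could be read into a typed config object. This is a minimal sketch, not the contents of src/config/env.js; the property names and the fallback values (copied from the example above) are assumptions:

```javascript
// Sketch of a config loader. Property names and defaults are illustrative.
function loadConfig(env = process.env) {
  // Parse an integer variable, falling back when it is missing or invalid.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  // Fail fast: the API cannot work without a database connection string.
  if (!env.DATABASE_URL) {
    throw new Error("DATABASE_URL is required");
  }

  return {
    port: toInt(env.PORT, 4000),
    databaseUrl: env.DATABASE_URL,
    uploadDir: env.UPLOAD_DIR || "uploads",
    freePlanPageLimit: toInt(env.FREE_PLAN_PAGE_LIMIT, 50),
    redis: {
      host: env.REDIS_HOST || "127.0.0.1",
      port: toInt(env.REDIS_PORT, 6379),
    },
    extractionProvider: env.EXTRACTION_PROVIDER || "local",
  };
}
```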


Installation

Clone the repository:

git clone https://github.com/HirulaAbesignha/ClearExtract-API.git
cd ClearExtract-API

Install dependencies:

npm install

Database Setup

Create a PostgreSQL database named:

clear_extract

Example using psql:

CREATE DATABASE clear_extract;

Then run the migration:

npm run db:migrate

Expected output:

Database migrated successfully.

Create Development API Key

Run:

npm run seed:key

This creates a demo user and API key.

Example output:

User: demo@example.com
API Key: sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Save this key. You will not see it again.

Copy this API key. It is required when calling protected endpoints.


Redis Setup

The project uses Redis for background jobs.

Recommended Docker command:

docker run -d --name clearextract-redis -p 6379:6379 redis

If the container already exists:

docker start clearextract-redis

Check if Redis is running:

docker ps

Tesseract OCR Setup

Tesseract OCR is required for image-based documents such as JPG, PNG, JPEG, and WEBP.

After installation, check it using:

tesseract --version

If the command is not recognized on Windows, add the Tesseract installation folder to your system PATH.

Common Windows path:

C:\Program Files\Tesseract-OCR

Running the Project

You need two terminals.

Terminal 1: Start the API server

npm run dev

Expected output:

ClearExtract API running on http://localhost:4000

Terminal 2: Start the background worker

npm run worker

Expected output:

Document worker is running...

Redis must also be running.


API Authentication

Protected endpoints require an API key.

Use Bearer Token authentication:

Authorization: Bearer sk_test_your_api_key_here

In Postman:

  1. Go to the Authorization tab.
  2. Select Bearer Token.
  3. Paste only the API key.
  4. Do not type the word Bearer yourself; Postman adds the prefix automatically.
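
On the server side, the header has to be split into its scheme and token. A minimal sketch of that parsing step, assuming a hypothetical helper name (the actual auth.middleware.js may do this differently):

```javascript
// Hypothetical helper: pull the API key out of an Authorization header value.
// Returns null when the header is missing or not in "Bearer <token>" form.
function extractBearerToken(headerValue) {
  if (typeof headerValue !== "string") return null;
  const match = /^Bearer\s+(\S+)$/i.exec(headerValue.trim());
  return match ? match[1] : null;
}
```

A request that sends only the raw key without the Bearer prefix yields null here, which a middleware would typically treat as a missing API key.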

API Endpoints

Base URL:

http://localhost:4000/v1

1. Health Check

GET /health

Full URL:

http://localhost:4000/v1/health

Authentication required: No

Example response:

{
  "status": "ok",
  "service": "clear-extract-api"
}

2. Extract Document

POST /documents/extract

Full URL:

http://localhost:4000/v1/documents/extract

Authentication required: Yes

Body type:

form-data

Fields:

Key            Type  Required  Description
document_type  Text  Yes       invoice or receipt
file           File  Yes       PDF, JPG, JPEG, PNG, or WEBP
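
The constraints on these two fields can be expressed as a small validation function. The sketch below is an assumption about how such a check might look, not the project's Multer configuration; the function name and error strings are hypothetical, while the MIME types are the standard ones for the listed formats:

```javascript
// Allowed values taken from the fields described above.
const ALLOWED_DOCUMENT_TYPES = new Set(["invoice", "receipt"]);
const ALLOWED_MIME_TYPES = new Set([
  "application/pdf",
  "image/jpeg", // covers both .jpg and .jpeg
  "image/png",
  "image/webp",
]);

// Hypothetical validator: returns a list of problems, empty when valid.
function validateExtractRequest({ documentType, mimeType }) {
  const errors = [];
  if (!ALLOWED_DOCUMENT_TYPES.has(documentType)) {
    errors.push("document_type must be invoice or receipt");
  }
  if (!ALLOWED_MIME_TYPES.has(mimeType)) {
    errors.push("file must be a PDF, JPG, JPEG, PNG, or WEBP");
  }
  return errors;
}
```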

Example response:

{
  "document_id": "b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
  "status": "processing",
  "message": "Document accepted and queued for extraction.",
  "result_url": "/v1/documents/b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
  "created_at": "2026-05-27T05:20:00.000Z"
}
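
Outside Postman, the same upload can be made from code. The following is a sketch of a client call, assuming Node.js 18 or newer (for the built-in fetch, FormData, and Blob); the function and parameter names are illustrative:

```javascript
// Build the multipart request for POST /documents/extract.
// Requires Node.js 18+ for the global FormData and Blob.
function buildExtractRequest({ baseUrl, apiKey, documentType, fileBytes, fileName, mimeType }) {
  const form = new FormData();
  form.append("document_type", documentType);
  form.append("file", new Blob([fileBytes], { type: mimeType }), fileName);

  return {
    url: `${baseUrl}/documents/extract`,
    options: {
      method: "POST",
      // Do not set Content-Type manually; fetch adds the multipart boundary.
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Send the upload and return the parsed JSON body (document_id, status, ...).
async function uploadDocument(params) {
  const { url, options } = buildExtractRequest(params);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

A caller would pass baseUrl as http://localhost:4000/v1, read the file into a Buffer, and keep the returned document_id for the result lookup.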

3. Get Document Result

GET /documents/:id

Full URL:

http://localhost:4000/v1/documents/YOUR_DOCUMENT_ID

Authentication required: Yes

Example response for invoice:

{
  "document_id": "b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
  "document_type": "invoice",
  "status": "completed",
  "pages": 1,
  "confidence": 0.75,
  "data": {
    "vendor_name": "ABC Supplies",
    "invoice_number": "INV-1001",
    "invoice_date": "2026-05-21",
    "due_date": "2026-06-21",
    "currency": "USD",
    "subtotal": 420,
    "tax": 33.6,
    "total": 453.6,
    "line_items": [
      {
        "description": "Cotton fabric roll",
        "quantity": null,
        "unit_price": null,
        "amount": 420,
        "confidence": 0.45
      }
    ]
  },
  "warnings": [
    "Local OCR uses heuristic parsing. Review important fields before using in production."
  ],
  "error_message": null,
  "created_at": "2026-05-27T05:20:00.000Z",
  "completed_at": "2026-05-27T05:20:03.000Z"
}

Example response for receipt:

{
  "document_id": "0c1e5d5e-1d92-45ab-b53c-e832df276221",
  "document_type": "receipt",
  "status": "completed",
  "pages": 1,
  "confidence": 0.67,
  "data": {
    "merchant_name": "Demo Grocery Store",
    "receipt_number": "RCPT-1001",
    "purchase_date": "2026-05-21",
    "currency": "USD",
    "subtotal": 42.5,
    "tax": 3.4,
    "total": 45.9,
    "items": [
      {
        "description": "Notebook",
        "quantity": null,
        "unit_price": null,
        "amount": 20,
        "confidence": 0.45
      }
    ]
  },
  "warnings": [
    "Local OCR uses heuristic parsing. Review important fields before using in production."
  ],
  "error_message": null,
  "created_at": "2026-05-27T05:20:00.000Z",
  "completed_at": "2026-05-27T05:20:03.000Z"
}
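
Because extraction runs in the background, a client usually has to poll this endpoint until the status leaves "processing". A minimal polling sketch follows; the "failed" status, the helper names, and the default timings are assumptions, and fetchDocument stands for any function that performs GET /documents/:id and returns the parsed JSON:

```javascript
// "completed" appears in the responses above; "failed" is an assumed status.
const TERMINAL_STATUSES = new Set(["completed", "failed"]);

function isTerminalStatus(status) {
  return TERMINAL_STATUSES.has(status);
}

// Poll until the document reaches a terminal status or attempts run out.
async function waitForResult(fetchDocument, documentId, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const doc = await fetchDocument(documentId);
    if (isTerminalStatus(doc.status)) {
      return doc;
    }
    // Wait before asking again so the API is not flooded with requests.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Document ${documentId} still processing after ${maxAttempts} attempts`);
}
```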

4. Get Usage

GET /usage

Full URL:

http://localhost:4000/v1/usage

Authentication required: Yes

Example response:

{
  "plan": "free",
  "pages_used_this_month": 1,
  "monthly_page_limit": 50,
  "remaining_pages": 49
}

Testing with Postman

Health Check

Method:

GET

URL:

http://localhost:4000/v1/health

Upload Invoice or Receipt

Method:

POST

URL:

http://localhost:4000/v1/documents/extract

Authorization:

Bearer Token

Token:

sk_test_your_api_key_here

Body:

form-data

Fields:

document_type = invoice
file = select a PDF/image file

After sending, copy the document_id.


Get Extraction Result

Method:

GET

URL:

http://localhost:4000/v1/documents/YOUR_DOCUMENT_ID

Authorization:

Bearer Token

Token:

sk_test_your_api_key_here

Get Usage

Method:

GET

URL:

http://localhost:4000/v1/usage

Authorization:

Bearer Token

Token:

sk_test_your_api_key_here

Example Invoice JSON Schema

{
  "vendor_name": "string | null",
  "invoice_number": "string | null",
  "invoice_date": "string | null",
  "due_date": "string | null",
  "currency": "string | null",
  "subtotal": "number | null",
  "tax": "number | null",
  "total": "number | null",
  "line_items": [
    {
      "description": "string",
      "quantity": "number | null",
      "unit_price": "number | null",
      "amount": "number | null",
      "confidence": "number"
    }
  ]
}

Example Receipt JSON Schema

{
  "merchant_name": "string | null",
  "receipt_number": "string | null",
  "purchase_date": "string | null",
  "currency": "string | null",
  "subtotal": "number | null",
  "tax": "number | null",
  "total": "number | null",
  "items": [
    {
      "description": "string",
      "quantity": "number | null",
      "unit_price": "number | null",
      "amount": "number | null",
      "confidence": "number"
    }
  ]
}

Usage Tracking

Each processed document records page usage in the usage_events table.

The free plan limit is controlled by:

FREE_PLAN_PAGE_LIMIT=50

When monthly usage reaches or exceeds the limit, the API responds with HTTP 402 (Payment Required).

Example:

{
  "error": {
    "message": "Monthly page limit exceeded. Used 50/50 pages.",
    "status": 402
  }
}
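
The limit check itself is simple arithmetic. The sketch below shows one plausible version; the function name, its parameters, and the choice to reject a request whose pages would push usage past the limit are assumptions, while the error shape mirrors the example above:

```javascript
// Hypothetical limit check, run before a document is accepted.
function checkPageLimit({ pagesUsed, pagesRequested, limit }) {
  if (pagesUsed + pagesRequested > limit) {
    return {
      allowed: false,
      error: {
        message: `Monthly page limit exceeded. Used ${pagesUsed}/${limit} pages.`,
        status: 402,
      },
    };
  }
  return { allowed: true, remainingPages: limit - pagesUsed - pagesRequested };
}
```

For example, with 50 pages already used and a limit of 50, a one-page upload is rejected with the message shown above.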

Local OCR Notes

This MVP uses local extraction:

  • pdf-parse for text-based PDFs
  • tesseract for image files
  • heuristic parsing for invoice and receipt fields

This works well for:

  • simple invoices
  • simple receipts
  • clean images
  • text-based PDFs
  • MVP demos

It may be less accurate for:

  • blurry images
  • handwritten receipts
  • complex invoice layouts
  • image-based scanned PDFs
  • complicated line-item tables
  • low-resolution documents

For production accuracy, the extraction provider should later be upgraded to a dedicated document intelligence service.
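
To give a sense of what heuristic parsing means in practice, here is a small regex-based sketch that pulls two fields out of raw OCR text. The patterns and function names are illustrative assumptions, not the rules used in localDocument.normalizer.js:

```javascript
// Convert "1,234.50" style strings to numbers; null when unparseable.
function parseAmount(raw) {
  const value = Number.parseFloat(raw.replace(/,/g, ""));
  return Number.isNaN(value) ? null : value;
}

// Illustrative heuristics for two invoice fields.
function extractInvoiceFields(text) {
  // Matches "Invoice No: INV-1001", "Invoice Number INV-1001", "Invoice # 42".
  const invoiceNumber = /invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)/i.exec(text);
  // Matches a line starting with "Total" (so "Subtotal" is skipped),
  // allowing a currency code or symbol before the amount.
  const total = /^\s*total\s*[:\-]?\s*[^\d\n]*([\d,]+(?:\.\d+)?)/im.exec(text);

  return {
    invoice_number: invoiceNumber ? invoiceNumber[1] : null,
    total: total ? parseAmount(total[1]) : null,
  };
}
```

Patterns like these break easily when labels or layouts vary, which is why the weaker cases listed above are better served by a dedicated service.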


Common Errors

database "clear_extract" does not exist

Create the database first:

CREATE DATABASE clear_extract;

Then run:

npm run db:migrate

ECONNREFUSED 127.0.0.1:6379

Redis is not running.

Fix:

docker start clearextract-redis

Document stays in processing

This usually means the background worker is not running.

Fix:

npm run worker

Tesseract OCR failed

Tesseract is not installed or not added to PATH.

Check:

tesseract --version

Missing API key

Add your API key in the Authorization header:

Authorization: Bearer sk_test_your_api_key_here

Available Scripts

Script                Description
npm run dev           Starts the Express API with nodemon
npm run worker        Starts the background worker with nodemon
npm run start         Starts the API server using Node
npm run start:worker  Starts the worker using Node
npm run db:migrate    Runs database migration
npm run seed:key      Creates a demo user and development API key

Roadmap

Planned improvements:

  • Better OCR preprocessing
  • Scanned PDF support
  • More accurate line-item extraction
  • File URL upload support
  • Batch document extraction
  • Webhook support
  • Stripe billing integration
  • API dashboard
  • OpenAPI/Swagger documentation
  • Rate limiting
  • Cloud OCR provider option
  • Docker Compose setup
  • Production deployment guide

Future API Ideas

Possible future endpoints:

POST /v1/documents/batch
POST /v1/documents/from-url
POST /v1/documents/validate
GET  /v1/documents
GET  /v1/usage/events
GET  /v1/plans
POST /v1/webhooks

Security Notes

Do not commit:

  • .env
  • API keys
  • service account credentials
  • uploaded documents
  • database dumps
  • cloud provider credentials

Recommended .gitignore entries:

node_modules/
.env
uploads/
google-credentials.json
*.key
*.pem
logs/
*.log

Current Status

This project is currently an MVP.

Completed:

  • API server
  • PostgreSQL setup
  • API key authentication
  • File upload
  • Redis queue
  • BullMQ worker
  • Usage tracking
  • Mock extraction
  • Local OCR extraction
  • Basic invoice/receipt normalization

Not yet completed:

  • Production billing
  • Cloud OCR integration
  • Public API dashboard
  • Advanced OCR accuracy
  • Deployment setup

License

MIT License


Author

Built as a document extraction API MVP for converting invoices and receipts into structured JSON.
