ClearExtract API is a backend API for extracting structured data from invoices and receipts. It accepts document uploads, processes them through a background queue, extracts text using local OCR/PDF parsing, and returns clean JSON output through a simple REST API.
This project is currently built as an MVP using local OCR and heuristic parsing. It is designed so the extraction layer can later be upgraded to a cloud document intelligence provider such as Google Document AI, AWS Textract, or Azure AI Document Intelligence.
- Overview
- Features
- Tech Stack
- System Architecture
- Project Structure
- Requirements
- Environment Variables
- Installation
- Database Setup
- Redis Setup
- Running the Project
- API Authentication
- API Endpoints
- Testing with Postman
- Example Responses
- Usage Tracking
- Local OCR Notes
- Roadmap
- Security Notes
- License
ClearExtract API converts business documents into structured JSON.
The first supported document types are:
- Invoices
- Receipts
The API accepts PDF and image files, stores the uploaded document, creates a processing job, and lets a background worker extract the data.
Basic flow:
Client uploads invoice/receipt
↓
API validates request and API key
↓
Document is saved in PostgreSQL
↓
Job is added to Redis queue
↓
Worker processes the document
↓
OCR/PDF parser extracts text
↓
Normalizer converts text into JSON
↓
Client retrieves final result

- REST API built with Express.js
- API key authentication
- Invoice and receipt upload support
- File validation using Multer
- Background job processing with BullMQ
- Redis queue integration
- PostgreSQL database storage
- Usage tracking by monthly page count
- Free-plan page limit support
- Local OCR using Tesseract
- PDF text extraction using pdf-parse
- Clean JSON response format
- Worker-based extraction architecture
- Modular extraction provider structure
- Ready for future cloud OCR integration
| Area | Technology |
|---|---|
| Runtime | Node.js |
| API Framework | Express.js |
| Database | PostgreSQL |
| Queue | BullMQ |
| Queue Storage | Redis |
| File Uploads | Multer |
| OCR | Tesseract OCR |
| PDF Text Extraction | pdf-parse |
| Security Middleware | Helmet |
| Environment Config | dotenv |
┌──────────────┐
│ Client │
│ Postman/App │
└──────┬───────┘
│
│ HTTP Request
▼
┌────────────────────┐
│ Express API │
│ localhost:4000 │
└──────┬─────────────┘
│
├── Validates API Key
├── Accepts File Upload
├── Creates Document Record
│
▼
┌────────────────────┐
│ PostgreSQL Database │
│ Documents / Usage │
└────────────────────┘
│
▼
┌────────────────────┐
│ Redis + BullMQ │
│ Job Queue │
└──────┬─────────────┘
│
▼
┌────────────────────┐
│ Background Worker │
│ OCR + Normalization │
└──────┬─────────────┘
│
▼
┌────────────────────┐
│ Updated JSON Result │
│ Saved in Database │
└────────────────────┘

ClearExtract-API/
│
├── migrations/
│ └── 001_init.sql
│
├── scripts/
│ ├── createDevApiKey.js
│ └── migrate.js
│
├── src/
│ ├── app.js
│ ├── server.js
│ │
│ ├── config/
│ │ └── env.js
│ │
│ ├── controllers/
│ │ ├── documents.controller.js
│ │ └── usage.controller.js
│ │
│ ├── db/
│ │ └── pool.js
│ │
│ ├── middleware/
│ │ ├── auth.middleware.js
│ │ ├── error.middleware.js
│ │ └── upload.middleware.js
│ │
│ ├── queues/
│ │ └── document.queue.js
│ │
│ ├── routes/
│ │ ├── documents.routes.js
│ │ ├── health.routes.js
│ │ └── usage.routes.js
│ │
│ ├── services/
│ │ ├── extractor.service.js
│ │ ├── usage.service.js
│ │ │
│ │ ├── providers/
│ │ │ └── localOcr.provider.js
│ │ │
│ │ └── normalizers/
│ │ └── localDocument.normalizer.js
│ │
│ ├── utils/
│ │ ├── apiKey.js
│ │ └── httpError.js
│ │
│ └── workers/
│ └── document.worker.js
│
├── uploads/
├── .env.example
├── .gitignore
├── package.json
├── package-lock.json
└── README.md

Before running the project, install:
- Node.js
- PostgreSQL
- Redis
- Docker, recommended for Redis
- Tesseract OCR, required for image OCR
Create a .env file in the project root.
Use this structure:
PORT=4000
DATABASE_URL=postgres://postgres:YOUR_PASSWORD@localhost:5432/clear_extract
UPLOAD_DIR=uploads
FREE_PLAN_PAGE_LIMIT=50
REDIS_HOST=127.0.0.1
REDIS_PORT=6379
EXTRACTION_PROVIDER=local

Also keep a safe .env.example file in the repository:
PORT=4000
DATABASE_URL=postgres://postgres:YOUR_PASSWORD@localhost:5432/clear_extract
UPLOAD_DIR=uploads
FREE_PLAN_PAGE_LIMIT=50
REDIS_HOST=127.0.0.1
REDIS_PORT=6379
EXTRACTION_PROVIDER=local

Do not commit your real .env file.
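To illustrate how these variables might be consumed, here is a minimal sketch of a config loader. The function name `loadConfig` and the shape of the returned object are assumptions for illustration; the project's actual src/config/env.js may be organized differently. The defaults mirror the example values above.

```javascript
// Hypothetical helper (not taken from the project): reads the variables
// listed above from an env-like object and applies the documented defaults.
function loadConfig(env = process.env) {
  // Parse an integer setting, falling back when it is missing or invalid.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  // The database URL contains a password, so no default is provided for it.
  if (!env.DATABASE_URL) {
    throw new Error('DATABASE_URL is required');
  }

  return {
    port: toInt(env.PORT, 4000),
    databaseUrl: env.DATABASE_URL,
    uploadDir: env.UPLOAD_DIR || 'uploads',
    freePlanPageLimit: toInt(env.FREE_PLAN_PAGE_LIMIT, 50),
    redis: {
      host: env.REDIS_HOST || '127.0.0.1',
      port: toInt(env.REDIS_PORT, 6379),
    },
    extractionProvider: env.EXTRACTION_PROVIDER || 'local',
  };
}
```

Failing fast on a missing DATABASE_URL is a design choice in this sketch: it surfaces a misconfigured .env at startup rather than on the first query.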
Clone the repository:
git clone https://github.com/HirulaAbesignha/ClearExtract-API.git
cd ClearExtract-API
Install dependencies:
npm install
Create a PostgreSQL database named:
clear_extract
Example using psql:
CREATE DATABASE clear_extract;
Then run the migration:
npm run db:migrate
Expected output:
Database migrated successfully.
Then create a development API key:
npm run seed:key
This creates a demo user and API key.
Example output:
User: demo@example.com
API Key: sk_test_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Save this key. You will not see it again.

Copy this API key. It is required when calling protected endpoints.
The project uses Redis for background jobs.
Recommended Docker command:
docker run -d --name clearextract-redis -p 6379:6379 redis
If the container already exists:
docker start clearextract-redis
Check if Redis is running:
docker ps

Tesseract OCR is required for image-based documents such as JPG, JPEG, PNG, and WEBP.
After installation, check it using:
tesseract --version
If the command is not recognized on Windows, add the Tesseract installation folder to your system PATH.
Common Windows path:
C:\Program Files\Tesseract-OCR

You need two terminals.
Terminal 1 (API server):
npm run dev
Expected output:
ClearExtract API running on http://localhost:4000
Terminal 2 (background worker):
npm run worker
Expected output:
Document worker is running...
Redis must also be running.
Protected endpoints require an API key.
Use Bearer Token authentication:
Authorization: Bearer sk_test_your_api_key_here
In Postman:
- Go to the Authorization tab.
- Select Bearer Token.
- Paste only the API key.
- Do not type the word Bearer yourself. Postman adds it to the header automatically.
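The header format above can be checked with a small parser. The following is a sketch only: `parseBearerToken` is a hypothetical helper, and the project's auth.middleware.js may validate the header differently. It also shows why typing Bearer manually in Postman fails, since the header would then contain the word twice.

```javascript
// Hypothetical helper (not taken from the project): extracts the API key
// from an Authorization header of the form "Bearer <key>".
function parseBearerToken(headerValue) {
  if (typeof headerValue !== 'string') {
    return null; // header missing
  }
  // Exactly one scheme word followed by one token with no spaces.
  const match = /^Bearer\s+(\S+)$/i.exec(headerValue.trim());
  return match ? match[1] : null;
}

// Example calls:
// parseBearerToken('Bearer sk_test_your_api_key_here') returns the key
// parseBearerToken('Bearer Bearer sk_test_your_api_key_here') returns null
```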
Base URL:
http://localhost:4000/v1

GET /health
Full URL:
http://localhost:4000/v1/health
Authentication required: No
Example response:
{
"status": "ok",
"service": "clear-extract-api"
}

POST /documents/extract
Full URL:
http://localhost:4000/v1/documents/extract
Authentication required: Yes
Body type:
form-data
Fields:
| Key | Type | Required | Description |
|---|---|---|---|
| document_type | Text | Yes | invoice or receipt |
| file | File | Yes | PDF, JPG, JPEG, PNG, or WEBP |
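The upload request described by the fields above could be built from a Node.js client roughly as follows. This is a sketch under stated assumptions: it requires Node.js 18 or later (for the global `FormData`, `Blob`, and `fetch`), and `buildExtractRequest` with its parameter names is hypothetical, not part of the project.

```javascript
// Hypothetical client helper (not taken from the project).
// Assumes Node.js 18+, where FormData and Blob are globals.
const ALLOWED_DOCUMENT_TYPES = ['invoice', 'receipt'];

function buildExtractRequest({ baseUrl, apiKey, documentType, fileBytes, filename, mimeType }) {
  // Reject unsupported document types before sending anything.
  if (!ALLOWED_DOCUMENT_TYPES.includes(documentType)) {
    throw new Error(`document_type must be one of: ${ALLOWED_DOCUMENT_TYPES.join(', ')}`);
  }

  // Field names match the form-data table: document_type and file.
  const form = new FormData();
  form.append('document_type', documentType);
  form.append('file', new Blob([fileBytes], { type: mimeType }), filename);

  return {
    url: `${baseUrl}/documents/extract`,
    options: {
      method: 'POST',
      // Content-Type is left unset so fetch can add the multipart boundary.
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}

// Usage, inside an async function:
//   const { url, options } = buildExtractRequest({ ... });
//   const response = await fetch(url, options);
//   const { document_id } = await response.json();
```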
Example response:
{
"document_id": "b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
"status": "processing",
"message": "Document accepted and queued for extraction.",
"result_url": "/v1/documents/b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
"created_at": "2026-05-27T05:20:00.000Z"
}

GET /documents/:id
Full URL:
http://localhost:4000/v1/documents/YOUR_DOCUMENT_ID
Authentication required: Yes
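Because extraction runs in a background worker, a client usually has to call this endpoint repeatedly until the status is no longer processing. The sketch below shows one way to do that. `waitForResult` and its options are hypothetical names, and the sketch assumes that any status other than processing (for example completed) is final; the full set of status values is not documented here.

```javascript
// Hypothetical polling helper (not taken from the project).
// fetchDocument is an async function supplied by the caller that returns
// the parsed JSON body of GET /v1/documents/:id.
async function waitForResult(fetchDocument, { intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const doc = await fetchDocument();

    // Assumption: anything other than "processing" is a final state.
    if (doc.status !== 'processing') {
      return doc;
    }

    // Pause before the next attempt, but not after the last one.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Document still processing after ${maxAttempts} attempts`);
}

// Usage, inside an async function (Node.js 18+ for global fetch):
//   const doc = await waitForResult(async () => {
//     const res = await fetch(resultUrl, { headers: { Authorization: `Bearer ${apiKey}` } });
//     return res.json();
//   });
```

Passing the fetch logic in as a function keeps the loop independent of any HTTP library and easy to exercise with a stub.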
Example response for invoice:
{
"document_id": "b5b4d8e1-45e4-4a88-8e70-c9b24d08489e",
"document_type": "invoice",
"status": "completed",
"pages": 1,
"confidence": 0.75,
"data": {
"vendor_name": "ABC Supplies",
"invoice_number": "INV-1001",
"invoice_date": "2026-05-21",
"due_date": "2026-06-21",
"currency": "USD",
"subtotal": 420,
"tax": 33.6,
"total": 453.6,
"line_items": [
{
"description": "Cotton fabric roll",
"quantity": null,
"unit_price": null,
"amount": 420,
"confidence": 0.45
}
]
},
"warnings": [
"Local OCR uses heuristic parsing. Review important fields before using in production."
],
"error_message": null,
"created_at": "2026-05-27T05:20:00.000Z",
"completed_at": "2026-05-27T05:20:03.000Z"
}

Example response for receipt:
{
"document_id": "0c1e5d5e-1d92-45ab-b53c-e832df276221",
"document_type": "receipt",
"status": "completed",
"pages": 1,
"confidence": 0.67,
"data": {
"merchant_name": "Demo Grocery Store",
"receipt_number": "RCPT-1001",
"purchase_date": "2026-05-21",
"currency": "USD",
"subtotal": 42.5,
"tax": 3.4,
"total": 45.9,
"items": [
{
"description": "Notebook",
"quantity": null,
"unit_price": null,
"amount": 20,
"confidence": 0.45
}
]
},
"warnings": [
"Local OCR uses heuristic parsing. Review important fields before using in production."
],
"error_message": null,
"created_at": "2026-05-27T05:20:00.000Z",
"completed_at": "2026-05-27T05:20:03.000Z"
}

GET /usage
Full URL:
http://localhost:4000/v1/usage
Authentication required: Yes
Example response:
{
"plan": "free",
"pages_used_this_month": 1,
"monthly_page_limit": 50,
"remaining_pages": 49
}

Health check:
- Method: GET
- URL: http://localhost:4000/v1/health

Upload a document:
- Method: POST
- URL: http://localhost:4000/v1/documents/extract
- Authorization: Bearer Token
- Token: sk_test_your_api_key_here
- Body: form-data
- Fields:
  - document_type = invoice
  - file = select a PDF/image file

After sending, copy the document_id.

Get the extraction result:
- Method: GET
- URL: http://localhost:4000/v1/documents/YOUR_DOCUMENT_ID
- Authorization: Bearer Token
- Token: sk_test_your_api_key_here

Check usage:
- Method: GET
- URL: http://localhost:4000/v1/usage
- Authorization: Bearer Token
- Token: sk_test_your_api_key_here

Invoice data fields:
{
"vendor_name": "string | null",
"invoice_number": "string | null",
"invoice_date": "string | null",
"due_date": "string | null",
"currency": "string | null",
"subtotal": "number | null",
"tax": "number | null",
"total": "number | null",
"line_items": [
{
"description": "string",
"quantity": "number | null",
"unit_price": "number | null",
"amount": "number | null",
"confidence": "number"
}
]
}

Receipt data fields:
{
"merchant_name": "string | null",
"receipt_number": "string | null",
"purchase_date": "string | null",
"currency": "string | null",
"subtotal": "number | null",
"tax": "number | null",
"total": "number | null",
"items": [
{
"description": "string",
"quantity": "number | null",
"unit_price": "number | null",
"amount": "number | null",
"confidence": "number"
}
]
}

Each processed document records page usage in the usage_events table.
The free plan limit is controlled by:
FREE_PLAN_PAGE_LIMIT=50
When monthly usage exceeds the limit, the API returns an error with status 402 (Payment Required).
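The limit check could be sketched as below. `checkPageLimit` is a hypothetical function, and the rule it applies (reject when the pages already used plus the pages in the new document would exceed the limit) is an assumption; the project's usage.service.js may apply the limit differently. The error shape and message follow the 402 response format used by the API.

```javascript
// Hypothetical limit check (not taken from the project).
// Assumption: a request is rejected when it would push the monthly
// total above the plan limit.
function checkPageLimit(pagesUsed, pagesRequested, monthlyLimit) {
  if (pagesUsed + pagesRequested > monthlyLimit) {
    return {
      allowed: false,
      error: {
        message: `Monthly page limit exceeded. Used ${pagesUsed}/${monthlyLimit} pages.`,
        status: 402,
      },
    };
  }
  return {
    allowed: true,
    remainingPages: monthlyLimit - pagesUsed - pagesRequested,
  };
}
```

With a limit of 50, a user who has already used 50 pages and uploads a one-page document gets `allowed: false`, while a user at 1 page is allowed and has 48 pages left afterwards.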
Example:
{
"error": {
"message": "Monthly page limit exceeded. Used 50/50 pages.",
"status": 402
}
}

This MVP uses local extraction:
- pdf-parse for text-based PDFs
- tesseract for image files
- heuristic parsing for invoice and receipt fields
This works well for:
- simple invoices
- simple receipts
- clean images
- text-based PDFs
- MVP demos
It may be less accurate for:
- blurry images
- handwritten receipts
- complex invoice layouts
- image-based scanned PDFs
- complicated line-item tables
- low-resolution documents
For production accuracy, the extraction provider should later be upgraded to a dedicated document intelligence service.
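As an illustration of the split between PDF parsing and image OCR described in these notes, the sketch below routes a file to an extractor by MIME type. The routing table and the `chooseExtractor` name are hypothetical. The strings 'pdf-parse' and 'tesseract' are only labels for the two tools, and the real localOcr.provider.js may select them another way.

```javascript
// Hypothetical routing table (not taken from the project): maps the MIME
// types of the supported upload formats to the tool that handles them.
const EXTRACTOR_BY_MIME = new Map([
  ['application/pdf', 'pdf-parse'], // text-based PDFs
  ['image/jpeg', 'tesseract'],      // JPG and JPEG
  ['image/png', 'tesseract'],
  ['image/webp', 'tesseract'],
]);

function chooseExtractor(mimeType) {
  const extractor = EXTRACTOR_BY_MIME.get(mimeType);
  if (!extractor) {
    throw new Error(`Unsupported file type: ${mimeType}`);
  }
  return extractor;
}
```

Note that routing by MIME type alone sends image-based scanned PDFs to the text parser, which is one reason they are listed among the less accurate cases.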
Create the database first:
CREATE DATABASE clear_extract;
Then run:
npm run db:migrate

Redis is not running.
Fix:
docker start clearextract-redis

The worker is not running.
Fix:
npm run worker

Tesseract is not installed or not added to PATH.
Check:
tesseract --version

Add your API key in the Authorization header:
Authorization: Bearer sk_test_your_api_key_here

| Script | Description |
|---|---|
| npm run dev | Starts the Express API with nodemon |
| npm run worker | Starts the background worker with nodemon |
| npm run start | Starts the API server using Node |
| npm run start:worker | Starts the worker using Node |
| npm run db:migrate | Runs database migration |
| npm run seed:key | Creates a demo user and development API key |
Planned improvements:
- Better OCR preprocessing
- Scanned PDF support
- More accurate line-item extraction
- File URL upload support
- Batch document extraction
- Webhook support
- Stripe billing integration
- API dashboard
- OpenAPI/Swagger documentation
- Rate limiting
- Cloud OCR provider option
- Docker Compose setup
- Production deployment guide
Possible future endpoints:
POST /v1/documents/batch
POST /v1/documents/from-url
POST /v1/documents/validate
GET /v1/documents
GET /v1/usage/events
GET /v1/plans
POST /v1/webhooks

Do not commit:
- .env
- API keys
- service account credentials
- uploaded documents
- database dumps
- cloud provider credentials
Recommended .gitignore entries:
node_modules/
.env
uploads/
google-credentials.json
*.key
*.pem
logs/
*.log

This project is currently an MVP.
Completed:
- API server
- PostgreSQL setup
- API key authentication
- File upload
- Redis queue
- BullMQ worker
- Usage tracking
- Mock extraction
- Local OCR extraction
- Basic invoice/receipt normalization
Not yet completed:
- Production billing
- Cloud OCR integration
- Public API dashboard
- Advanced OCR accuracy
- Deployment setup
MIT License
Built as a document extraction API MVP for converting invoices and receipts into structured JSON.