DeepSeek OCR PDF Service

Complete OCR service for PDF documents with layout detection, powered by DeepSeek-OCR and vLLM.

Quick Start

Prerequisites

  • NVIDIA GPU with CUDA 11.8 support
  • Python 3.12 virtual environment at .venv/
  • At least 8GB GPU memory

Installation

  1. Run the installation script:
./install/install.sh

This will install:

  • PyTorch 2.6.0 with CUDA 11.8
  • vLLM 0.8.5
  • flash-attn 2.7.3
  • All required dependencies

Start the Service

./run.sh

The service will be available at http://localhost:8000

Check Status

./status.sh

Stop the Service

./stop.sh

Installation

System Requirements

Hardware:

  • NVIDIA GPU (tested on A40 with 44GB memory)
  • CUDA 11.8 compatible GPU
  • Minimum 8GB GPU memory

Software:

  • Ubuntu 24.04 (or compatible)
  • Python 3.12
  • CUDA 11.8
  • nvidia-smi

Installation Steps

  1. Create the virtual environment (if it does not already exist):
python3.12 -m venv .venv
  2. Run the installation script:
chmod +x install/install.sh
./install/install.sh

The script will:

  • Activate the virtual environment
  • Install PyTorch with CUDA 11.8 support
  • Install the vLLM wheel
  • Install all dependencies
  • Install flash-attn
  • Display authentication setup instructions
  3. Verify the installation:
.venv/bin/python -c "import torch, vllm; print(f'PyTorch: {torch.__version__}, vLLM: {vllm.__version__}')"

Redis Setup (Optional)

Redis can be used as a message broker to ensure tasks are processed sequentially (one at a time), preventing concurrent GPU access issues. This is optional but recommended for production use.

Quick Redis Installation

Option 1: Docker (Recommended)

chmod +x install/install_redis_docker.sh
./install/install_redis_docker.sh

Option 2: Standalone System Installation

chmod +x install/install_redis_standalone.sh
sudo ./install/install_redis_standalone.sh

Configuration

After installing Redis, update your .env file:

# Copy example if you haven't already
cp .env.example .env

# Edit .env and uncomment Redis settings:
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0
QUEUE_NAME=deepseek_ocr_tasks
MAX_WORKERS=1  # Process one task at a time
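With MAX_WORKERS=1, jobs are pulled off the queue and executed strictly one at a time, so only one task ever touches the GPU. The effect can be illustrated with a stdlib-only sketch (no Redis required; `process_pdf_job` is a stand-in for the real OCR task, not the service's actual code):

```python
import queue
import threading

def process_pdf_job(job_id):
    # Stand-in for the real OCR task; only one runs at a time.
    return f"{job_id}: done"

tasks = queue.Queue()
results = []

def worker():
    # A single worker drains the queue sequentially, mirroring
    # MAX_WORKERS=1 in the Redis/RQ setup above.
    while True:
        job_id = tasks.get()
        if job_id is None:  # sentinel: shut down
            break
        results.append(process_pdf_job(job_id))
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()
for jid in ["job-1", "job-2", "job-3"]:
    tasks.put(jid)
tasks.put(None)
t.join()
print(results)  # jobs complete strictly in submission order
```

In the real deployment, RQ plays the role of this in-process queue, with Redis as the broker between the API process and the worker.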

Install Python Dependencies

.venv/bin/pip install redis rq

Verify Redis

# Test connection
redis-cli ping
# Should return: PONG

# Or with Docker
docker exec -it deepseek-redis redis-cli ping

For detailed Redis setup instructions, see REDIS_SETUP.md.


Service Management

Three scripts provide complete service lifecycle management:

run.sh - Start Service

./run.sh

Features:

  • Checks if service is already running (prevents duplicates)
  • Validates environment and port availability
  • Creates PID file for reliable tracking
  • Starts service in background with logging
  • Waits for service to be ready (up to 60 seconds)
  • Displays service URLs and status
  • Checks authentication configuration

Output:

=== DeepSeek OCR PDF Service ===

✓ Service started successfully
  PID: 12345
  Port: 8000

Service URLs:
  • Health check: http://localhost:8000/health
  • API docs: http://localhost:8000/docs
  • Base URL: http://localhost:8000/

Useful commands:
  • View logs: tail -f /tmp/deepseek_ocr.log
  • Stop service: ./stop.sh

status.sh - Check Status

./status.sh

Displays:

  • Process status (PID, uptime, CPU/memory usage)
  • GPU metrics (memory, utilization, temperature)
  • Authentication status
  • Recent log entries (last 5 lines)
  • Service URLs (if running)

stop.sh - Stop Service

./stop.sh

Features:

  • Graceful shutdown (SIGTERM, waits 10 seconds)
  • Force-kill if necessary (SIGKILL)
  • Cleans up PID file
  • Stops orphaned processes
  • Displays GPU memory status
  • Shows log file location

Common Workflows

Start and monitor:

./run.sh
tail -f /tmp/deepseek_ocr.log

Restart service:

./stop.sh && ./run.sh

Check if running:

./status.sh | grep -q "Service is RUNNING" && echo "Running" || echo "Not running"

File Locations

File       Location                 Purpose
PID file   /tmp/deepseek_ocr.pid    Process ID tracking
Log file   /tmp/deepseek_ocr.log    Service logs
Config     .env                     Authentication token
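status.sh presumably determines liveness by reading the PID file and probing the process; the equivalent check in Python is a sketch like this (signal 0 tests for existence without sending anything):

```python
import os

def is_running(pid_file="/tmp/deepseek_ocr.pid"):
    """Return True if the PID recorded in pid_file refers to a live process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
        return True
    except (FileNotFoundError, ValueError, ProcessLookupError):
        return False  # no PID file, garbage content, or dead process
    except PermissionError:
        return True   # process exists but belongs to another user

print(is_running())
```

A stale PID file (see Troubleshooting) is exactly the case where the file exists but `os.kill` raises ProcessLookupError.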

Authentication

The service supports token-based authentication using Bearer tokens.

Setup Authentication

  1. Create a .env file:
cp .env.example .env
  2. Generate a secure token:
# Using Python
python -c "import secrets; print(secrets.token_hex(32))"

# Using OpenSSL
openssl rand -hex 32
  3. Edit .env and set your token:
AUTH_TOKEN=your-generated-token-here
  4. Restart the service (if running):
./stop.sh && ./run.sh
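Step 2 in Python, plus the comparison a server typically performs on each request (a sketch; `verify_token` is illustrative, not the service's actual code):

```python
import hmac
import secrets

# Generate a 64-hex-character token, as in step 2.
token = secrets.token_hex(32)

def verify_token(presented: str, expected: str) -> bool:
    # Constant-time comparison avoids leaking how many leading
    # characters matched through response timing.
    return hmac.compare_digest(presented, expected)

print(verify_token(token, token))          # True
print(verify_token("wrong-token", token))  # False
```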

Disable Authentication

To disable authentication (development only):

  • Remove or comment out AUTH_TOKEN in .env, or
  • Don't create a .env file

Protected Endpoints

When AUTH_TOKEN is set, these endpoints require authentication:

  • POST /process_pdf - Upload and process PDF
  • GET /result/{job_id}/markdown - Get markdown output
  • GET /result/{job_id}/markdown_det - Get markdown with detections
  • GET /result/{job_id}/layout_pdf - Download layout PDF
  • GET /result/{job_id}/images - List extracted images
  • GET /result/{job_id}/images/{image_name} - Get specific image
  • DELETE /result/{job_id} - Delete job files

Public Endpoints

These endpoints are always accessible without authentication:

  • GET / - API information
  • GET /health - Health check

API Usage

Using curl

Upload and process PDF:

curl -X POST "http://localhost:8000/process_pdf" \
  -H "Authorization: Bearer your-token-here" \
  -F "file=@document.pdf"

Response:

{
  "job_id": "abc123-def456-789...",
  "status": "completed",
  "message": "PDF processed successfully"
}

Get markdown result:

curl -X GET "http://localhost:8000/result/{job_id}/markdown" \
  -H "Authorization: Bearer your-token-here"

Download layout PDF:

curl -X GET "http://localhost:8000/result/{job_id}/layout_pdf" \
  -H "Authorization: Bearer your-token-here" \
  -o layout.pdf

Health check (no auth required):

curl http://localhost:8000/health

Using Python

import requests

# Configure
API_URL = "http://localhost:8000"
AUTH_TOKEN = "your-token-here"
headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}

# Upload PDF (processing is synchronous; large PDFs need a generous timeout)
with open("document.pdf", "rb") as f:
    response = requests.post(
        f"{API_URL}/process_pdf",
        headers=headers,
        files={"file": f},
        timeout=300,
    )
response.raise_for_status()
job_id = response.json()["job_id"]
print(f"Job ID: {job_id}")

# Get markdown result
response = requests.get(
    f"{API_URL}/result/{job_id}/markdown",
    headers=headers,
    timeout=30,
)
response.raise_for_status()
print(response.json()["content"])

# Download layout PDF
response = requests.get(
    f"{API_URL}/result/{job_id}/layout_pdf",
    headers=headers,
    timeout=60,
)
response.raise_for_status()
with open("layout.pdf", "wb") as f:
    f.write(response.content)

# Clean up
requests.delete(f"{API_URL}/result/{job_id}", headers=headers, timeout=30)

Using JavaScript

const API_URL = "http://localhost:8000";
const AUTH_TOKEN = "your-token-here";

// Upload PDF
const formData = new FormData();
formData.append("file", pdfFile);

const uploadResponse = await fetch(`${API_URL}/process_pdf`, {
    method: "POST",
    headers: {
        "Authorization": `Bearer ${AUTH_TOKEN}`
    },
    body: formData
});

const { job_id } = await uploadResponse.json();

// Get markdown result
const resultResponse = await fetch(
    `${API_URL}/result/${job_id}/markdown`,
    {
        headers: {
            "Authorization": `Bearer ${AUTH_TOKEN}`
        }
    }
);

const { content } = await resultResponse.json();
console.log(content);

API Endpoints Reference

Method   Endpoint                                Auth   Description
GET      /                                       No     API information
GET      /health                                 No     Health check
POST     /process_pdf                            Yes*   Upload and process PDF
GET      /result/{job_id}/markdown               Yes*   Get markdown output
GET      /result/{job_id}/markdown_det           Yes*   Get markdown with detections
GET      /result/{job_id}/layout_pdf             Yes*   Download layout PDF
GET      /result/{job_id}/images                 Yes*   List extracted images
GET      /result/{job_id}/images/{image_name}    Yes*   Get specific image
DELETE   /result/{job_id}                        Yes*   Delete job files

*Auth required only if AUTH_TOKEN is configured in .env

Interactive API Documentation:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc
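The curl calls above translate directly to the standard library as well; a minimal sketch of building an authenticated request with urllib (no third-party client required; the token value is a placeholder):

```python
import urllib.request

API_URL = "http://localhost:8000"
AUTH_TOKEN = "your-token-here"  # placeholder; substitute your real token

def authed_request(path: str) -> urllib.request.Request:
    # Attach the Bearer token exactly as the protected endpoints expect.
    return urllib.request.Request(
        f"{API_URL}{path}",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    )

req = authed_request("/result/abc123/markdown")
# urllib.request.urlopen(req) would perform the call once the service is up.
print(req.get_header("Authorization"))
```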

Configuration

Environment Variables

Create a .env file in the project root:

# Authentication Token
# Set this to enable token-based authentication
# If not set, API will be accessible without authentication
AUTH_TOKEN=your-secret-token-here
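The service presumably loads this file with python-dotenv or similar; what that loading amounts to can be sketched in a few lines (simplified: quoting and `export` prefixes are ignored):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments and blanks skipped."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                # setdefault: real environment variables win over the file
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env means authentication stays disabled

load_env()
print("auth enabled:", "AUTH_TOKEN" in os.environ)
```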

Service Configuration

Edit config.py to modify:

  • MODEL_PATH - Model location
  • PROMPT - OCR prompt template
  • SKIP_REPEAT - Skip repeated content
  • MAX_CONCURRENCY - Max concurrent requests
  • NUM_WORKERS - Number of worker threads
  • CROP_MODE - Image cropping mode
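Put together, the defaults in config.py presumably resemble the following (values are illustrative, drawn from the settings quoted elsewhere in this README; the file itself is authoritative, and the PROMPT value is deliberately elided):

```python
# config.py -- illustrative sketch; see the real file for actual values
MODEL_PATH = "deepseek-ai/DeepSeek-OCR"  # model location (per Model Configuration)
PROMPT = "..."           # OCR prompt template; see the real config.py
SKIP_REPEAT = True       # skip repeated content
MAX_CONCURRENCY = 8      # max concurrent requests (per Performance Tuning)
NUM_WORKERS = 4          # worker threads (per Performance Tuning)
CROP_MODE = True         # image cropping mode
```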

Model Configuration

The service uses DeepSeek-OCR model with these settings:

  • Model: deepseek-ai/DeepSeek-OCR
  • Max sequence length: 8192 tokens
  • GPU memory utilization: 90%
  • Tensor parallel size: 1
  • Block size: 256

Troubleshooting

Service Won't Start

Check if already running:

./status.sh

Check for port conflicts:

netstat -tuln | grep 8000
lsof -i :8000

Check logs:

tail -100 /tmp/deepseek_ocr.log

Check virtual environment:

ls -la .venv/bin/python
.venv/bin/python --version

Service Won't Stop

Force stop:

./stop.sh
# If that doesn't work:
pkill -9 -f "serve_pdf.py"
rm -f /tmp/deepseek_ocr.pid

Authentication Issues

401 Unauthorized Error:

  • Verify token matches AUTH_TOKEN in .env
  • Check "Bearer " prefix in Authorization header
  • Ensure .env file is loaded (restart service)

Disable authentication:

# Comment out or remove AUTH_TOKEN from .env
sed -i 's/^AUTH_TOKEN=/#AUTH_TOKEN=/' .env
./stop.sh && ./run.sh

GPU Memory Issues

Check GPU status:

nvidia-smi

Free GPU memory:

./stop.sh
# If memory not released:
pkill -9 -f "python.*serve_pdf"
nvidia-smi

Reduce memory usage: Edit serve_pdf.py:

llm = LLM(
    ...
    gpu_memory_utilization=0.7,  # Reduce from 0.9
    max_num_seqs=4,  # Reduce from MAX_CONCURRENCY
)

Import Errors

Missing modules:

# Reinstall dependencies
.venv/bin/pip install -r requirements.txt
.venv/bin/pip install PyMuPDF img2pdf easydict addict

Check Python version:

.venv/bin/python --version  # Should be 3.12.x

Log Management

Log file too large:

./stop.sh
mv /tmp/deepseek_ocr.log /tmp/deepseek_ocr.log.old
./run.sh

Monitor logs:

# Real-time
tail -f /tmp/deepseek_ocr.log

# Last 50 lines
tail -50 /tmp/deepseek_ocr.log

# Search for errors
grep -i error /tmp/deepseek_ocr.log

Stale PID File

rm -f /tmp/deepseek_ocr.pid
./status.sh  # Verify clean state
./run.sh     # Start fresh

Advanced Topics

Systemd Integration

Create /etc/systemd/system/deepseek-ocr.service:

[Unit]
Description=DeepSeek OCR PDF Service
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/root/dpsk
ExecStart=/root/dpsk/.venv/bin/python /root/dpsk/serve_pdf.py
ExecStop=/root/dpsk/stop.sh
Restart=on-failure
RestartSec=10s
StandardOutput=append:/tmp/deepseek_ocr.log
StandardError=append:/tmp/deepseek_ocr.log

[Install]
WantedBy=multi-user.target

Enable and start:

systemctl daemon-reload
systemctl enable deepseek-ocr
systemctl start deepseek-ocr
systemctl status deepseek-ocr

Automated Health Checks

Cron job (check every 5 minutes):

# Add to crontab
*/5 * * * * /root/dpsk/status.sh | grep -q "NOT running" && /root/dpsk/run.sh

Monitoring script:

#!/bin/bash
# check_service.sh

if ! /root/dpsk/status.sh | grep -q "Service is RUNNING"; then
    echo "ALERT: Service is DOWN" | mail -s "Service Alert" admin@example.com
    /root/dpsk/run.sh
fi

Performance Tuning

Adjust concurrency: Edit config.py:

MAX_CONCURRENCY = 8  # Adjust based on GPU memory
NUM_WORKERS = 4      # Adjust based on CPU cores

Optimize GPU usage: Edit serve_pdf.py:

llm = LLM(
    ...
    gpu_memory_utilization=0.85,  # Adjust (0.7-0.95)
    max_num_seqs=MAX_CONCURRENCY,
    tensor_parallel_size=1,        # Increase for multi-GPU
)

Enable CUDA graphs for better performance:

llm = LLM(
    ...
    enforce_eager=False,  # Use CUDA graphs
)

Security Best Practices

  1. Use strong tokens (at least 32 characters)
  2. Rotate tokens regularly
  3. Use HTTPS in production (reverse proxy with nginx/caddy)
  4. Limit token sharing to authorized users only
  5. Never commit .env to version control
  6. Set up firewall rules to restrict access
  7. Monitor access logs for suspicious activity

Reverse Proxy Setup (nginx)

server {
    listen 443 ssl http2;
    server_name ocr.example.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Increase timeout for long processing
        proxy_read_timeout 300s;
        proxy_connect_timeout 300s;

        # Increase max body size for large PDFs
        client_max_body_size 100M;
    }
}

Project Structure

/root/dpsk/
├── serve_pdf.py              # Main service application
├── pdf_utils.py              # PDF conversion utilities
├── processing_utils.py       # Image processing utilities
├── deepseek_ocr.py           # DeepSeek OCR model
├── config.py                 # Configuration
├── requirements.txt          # Python dependencies
├── install/                  # Installation scripts
│   └── install.sh            # Installation script
├── run.sh                    # Start service script
├── stop.sh                   # Stop service script
├── status.sh                 # Status check script
├── .env.example              # Environment template
├── .env                      # Your configuration (not in git)
├── .venv/                    # Virtual environment
├── process/                  # Processing modules
│   ├── ngram_norepeat.py
│   └── image_process.py
├── deepencoder/              # Encoder modules
│   ├── clip_sdpa.py
│   └── sam_vary_sdpa.py
└── README.md                 # This file

Technical Details

Environment:

  • Virtual environment: .venv/
  • Python: 3.12
  • PyTorch: 2.6.0 with CUDA 11.8
  • vLLM: 0.8.5
  • Model: DeepSeek-OCR

GPU Support:

  • Tested on NVIDIA A40 (44GB)
  • Requires CUDA 11.8
  • Uses 90% GPU memory by default

Features:

  • PDF to image conversion (high quality, 144 DPI)
  • OCR with layout detection
  • Bounding box extraction and visualization
  • Image region extraction
  • Markdown output with/without layout annotations
  • Token-based authentication
  • RESTful API with OpenAPI docs
  • Concurrent request processing

Support and Contribution

Check Status:

./status.sh

View Logs:

tail -f /tmp/deepseek_ocr.log

Report Issues: include the following in your report:

  • Service status output
  • Last 50 lines of log
  • GPU status (nvidia-smi)
  • Error messages

License

This service uses DeepSeek-OCR model. Please refer to the model's license for usage terms.


Quick Reference

Common Commands

# Installation
./install/install.sh

# Service Management
./run.sh          # Start service
./stop.sh         # Stop service
./status.sh       # Check status

# Logs
tail -f /tmp/deepseek_ocr.log       # Follow logs
grep error /tmp/deepseek_ocr.log    # Find errors

# API Testing
curl http://localhost:8000/health   # Health check
curl http://localhost:8000/docs     # API documentation

# GPU Monitoring
nvidia-smi                           # Check GPU status
watch -n1 nvidia-smi                 # Monitor GPU continuously

Environment Setup

# Create .env
cp .env.example .env

# Generate token
python -c "import secrets; print(secrets.token_hex(32))"

# Edit .env
nano .env

Troubleshooting Commands

# Check if running
ps aux | grep serve_pdf.py

# Check port
netstat -tuln | grep 8000
lsof -i :8000

# Force stop all
pkill -9 -f serve_pdf.py
rm -f /tmp/deepseek_ocr.pid

# Clean restart
./stop.sh && rm -f /tmp/deepseek_ocr.log && ./run.sh

Version: 1.0.0
Last Updated: 2025-10-31
