Deehlusa/doc2md

📄 doc2md — Offline Document ETL for LLM Context Optimization

Convert PDFs, DOCX, and images into clean, LLM-ready Markdown — fully offline, no AI calls.


Why

Most LLM orchestrators spend context-window tokens on raw, unstructured documents.
doc2md pre-converts your files into structured Markdown with metadata headers,
so your local model (Ollama, OpenClaw, LM Studio) reads signal, not noise.


What it does

  • Recursively scans a directory for .pdf, .docx, .jpg, .png, .tiff and more
  • Smart routing: digital parser first, OCR fallback if the file is a scanned copy
  • 100% offline — no OpenAI, no Claude, no API calls. Ever.
  • Caching — skips files already converted
  • Data lineage — every .md starts with a YAML header tracking origin, path, method and timestamp
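
The scan and cache steps above could look roughly like the sketch below. It is not taken from etl_to_markdown.py: the function name `find_pending`, the `SUPPORTED` set, and the cache rule (a sibling .md file means "already converted") are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical extension set; the README lists these "and more".
SUPPORTED = {".pdf", ".docx", ".jpg", ".png", ".tiff"}


def find_pending(root: Path) -> list[Path]:
    """Return supported files under root that have no sibling .md yet."""
    pending = []
    for path in sorted(root.rglob("*")):
        # Skip directories and unsupported file types (case-insensitive match).
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        # Assumed cache rule: "<name>.md" next to the source means it is done.
        if path.with_suffix(".md").exists():
            continue
        pending.append(path)
    return pending
```

A real cache might also compare modification times or content hashes so that edited sources get reconverted; this sketch only checks for existence.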

Stack

Layer            Tool
Primary parser   markitdown (Microsoft)
PDF fallback     pymupdf4llm
OCR engine       pytesseract + pdf2image (Tesseract + Poppler)
Runtime          Python 3.10+ · macOS arm64
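
As a rough illustration of how these layers might be chained, the sketch below tries a digital extractor first and falls back to OCR when little or no text comes back. The extractors are passed in as plain callables so the example does not depend on any library API. The `route` function, the `MIN_CHARS` threshold, and the method labels are hypothetical and not taken from the project.

```python
from typing import Callable

# Hypothetical threshold: below this many characters, assume a scanned copy.
MIN_CHARS = 50


def route(
    path: str,
    digital: Callable[[str], str],
    ocr: Callable[[str], str],
) -> tuple[str, str]:
    """Return (text, method), preferring the digital parser over OCR."""
    try:
        text = digital(path)
    except Exception:
        # A parser failure is treated the same as "no text layer found".
        text = ""
    if len(text.strip()) >= MIN_CHARS:
        return text, "digital"
    # Too little text: assume the file is a scan and run OCR instead.
    return ocr(path), "ocr"
```

In practice `digital` would wrap the primary parser or the PDF fallback and `ocr` would wrap the Tesseract step. Injecting them also keeps the routing logic testable without Tesseract or Poppler installed.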

Output example

---
source_file: "contract_2024.pdf"
source_relative_path: "docs/contract_2024.pdf"
extraction_method: "tesseract-ocr-pdf"
converted_at_utc: "2025-04-11T10:23:45+00:00"
---

# Service Agreement — January 2024
...
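
A header like the one above can be produced with the standard library alone. This is a minimal sketch, not the project's code: `build_header` and its parameters are hypothetical names. It also assumes that `source_relative_path` is computed relative to some root directory chosen by the caller.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_header(source: Path, root: Path, method: str) -> str:
    """Render YAML front matter recording where a Markdown file came from."""
    fields = {
        "source_file": source.name,
        "source_relative_path": source.relative_to(root).as_posix(),
        "extraction_method": method,
        # Timezone-aware UTC timestamp, e.g. 2025-04-11T10:23:45+00:00.
        "converted_at_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    lines = ["---"]
    for key, value in fields.items():
        # json.dumps emits a double-quoted, escaped string, which is valid YAML.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Quoting every value keeps filenames containing colons or `#` characters from breaking the YAML.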

Quick start

# 1. System dependencies (macOS)
brew install tesseract tesseract-lang poppler

# 2. Python environment
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Run
python etl_to_markdown.py "/path/to/your/documents"

Full installation guide and troubleshooting → RUNBOOK.md


Options

usage: etl_to_markdown.py [-h] [--lang LANG] [--debug] input_dir

positional arguments:
  input_dir    Root directory containing source documents

options:
  --lang LANG  Tesseract language(s) (default: por+eng)
  --debug      Verbose logging
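
For reference, an `argparse` setup that produces this usage text might look like the sketch below. The flags, default, and help strings are copied from the usage output above. The `build_parser` function name is an assumption, and the script's actual parser may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="etl_to_markdown.py")
    parser.add_argument(
        "input_dir", help="Root directory containing source documents"
    )
    parser.add_argument(
        "--lang", default="por+eng", help="Tesseract language(s) (default: por+eng)"
    )
    parser.add_argument("--debug", action="store_true", help="Verbose logging")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["/path/to/your/documents", "--lang", "eng"])
print(args.input_dir, args.lang, args.debug)
```

On the command line, the equivalent call would be `python etl_to_markdown.py --lang eng "/path/to/your/documents"`. Adding `--debug` turns on verbose logging.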

Project structure

doc2md/
├── etl_to_markdown.py   # ETL pipeline (10 SRP functions)
├── requirements.txt     # Python dependencies
└── RUNBOOK.md           # Install guide + troubleshooting

Design principles

  • Zero-AI Rule — extraction is algorithmic, not generative
  • SRP — one function, one responsibility
  • Offline-first — built for air-gapped and privacy-sensitive environments
  • LLM-agnostic — output works with any local model

License

MIT
