Convert PDFs, DOCX, and images into clean, LLM-ready Markdown — fully offline, no AI calls.
Most LLM orchestrators waste context-window tokens on raw documents.
doc2md pre-converts your files into structured Markdown with metadata headers,
so your local model (Ollama, OpenClaw, LM Studio) reads signal — not noise.
- Recursively scans a directory for `.pdf`, `.docx`, `.jpg`, `.png`, `.tiff` and more
- Smart routing: digital parser first, OCR fallback if the file is a scanned copy
- 100% offline: no OpenAI, no Claude, no API calls. Ever.
- Caching: skips files already converted
- Data lineage: every `.md` starts with a YAML header tracking origin, path, method and timestamp
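As a rough illustration of the scanning, caching and lineage features listed above, the sketch below shows one way they could fit together. It is not taken from `etl_to_markdown.py`: the function names `pending_files` and `front_matter`, the `SUPPORTED` set, the choice to write each `.md` next to its source, and the modification-time cache rule are all assumptions made for the example. Only the four header field names come from this README.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumed extension list; the real script may support more formats.
SUPPORTED = {".pdf", ".docx", ".jpg", ".jpeg", ".png", ".tiff"}


def pending_files(root: Path) -> list[Path]:
    """Return supported files under root that still need conversion."""
    pending = []
    for src in sorted(root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED:
            continue
        out = src.with_suffix(".md")  # assumption: output sits next to the source
        # Assumed cache rule: skip when the .md exists and is at least as new.
        if out.exists() and out.stat().st_mtime >= src.stat().st_mtime:
            continue
        pending.append(src)
    return pending


def front_matter(src: Path, root: Path, method: str) -> str:
    """Build the YAML lineage header placed at the top of each .md file."""
    now = datetime.now(timezone.utc).replace(microsecond=0).isoformat()
    return (
        "---\n"
        f'source_file: "{src.name}"\n'
        # Assumption: the path is recorded relative to the scanned root.
        f'source_relative_path: "{src.relative_to(root).as_posix()}"\n'
        f'extraction_method: "{method}"\n'
        f'converted_at_utc: "{now}"\n'
        "---\n"
    )
```

A modification-time check is only one possible cache rule; a content hash would also catch files that were replaced without a timestamp change, at the cost of reading every file on each run.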
| Layer | Tool |
|---|---|
| Primary parser | markitdown (Microsoft) |
| PDF fallback | pymupdf4llm |
| OCR engine | pytesseract + pdf2image (Tesseract + Poppler) |
| Runtime | Python 3.10+ · macOS arm64 |
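The layers in the table above could be chained roughly as follows. This is a minimal sketch of the "digital parser first, OCR fallback" idea, not the script's actual code: the `route` function, the `Extractor` alias, the returned method labels and the 50-character threshold for deciding that a file is a scanned copy are all hypothetical. The extractors are passed in as plain callables so the sketch runs without markitdown, pymupdf4llm or Tesseract installed.

```python
from pathlib import Path
from typing import Callable

# An extractor takes a file path and returns the text it could recover.
Extractor = Callable[[Path], str]

MIN_CHARS = 50  # assumed threshold below which a file is treated as scanned


def route(
    path: Path,
    digital: Extractor,
    ocr: Extractor,
    min_chars: int = MIN_CHARS,
) -> tuple[str, str]:
    """Try the digital parser first; fall back to OCR when it yields too little text."""
    try:
        text = digital(path)
    except Exception:
        # A parser failure is handled like an empty result, so OCR gets a chance.
        text = ""
    if len(text.strip()) >= min_chars:
        return "digital", text
    return "ocr", ocr(path)
```

In a real pipeline, `digital` would wrap the primary parser and the PDF fallback, `ocr` would wrap the Tesseract step, and the first element of the returned tuple would feed the `extraction_method` header field.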
---
source_file: "contract_2024.pdf"
source_relative_path: "docs/contract_2024.pdf"
extraction_method: "tesseract-ocr-pdf"
converted_at_utc: "2025-04-11T10:23:45+00:00"
---
# Service Agreement — January 2024
...
# 1. System dependencies (macOS)
brew install tesseract tesseract-lang poppler
# 2. Python environment
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 3. Run
python etl_to_markdown.py "/path/to/your/documents"
Full installation guide and troubleshooting → RUNBOOK.md
usage: etl_to_markdown.py [-h] [--lang LANG] [--debug] input_dir
positional arguments:
input_dir Root directory containing source documents
options:
--lang LANG Tesseract language(s) (default: por+eng)
--debug Verbose logging
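For readers who want to see how a command line like the one above is typically declared, here is a small `argparse` sketch that would produce an equivalent interface. The argument names, default and help strings are copied from the usage text; the `build_parser` function name is an assumption, and the real script may structure this differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI shown in the usage text (hypothetical helper name)."""
    parser = argparse.ArgumentParser(prog="etl_to_markdown.py")
    parser.add_argument(
        "input_dir",
        help="Root directory containing source documents",
    )
    parser.add_argument(
        "--lang",
        default="por+eng",
        help="Tesseract language(s) (default: por+eng)",
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Verbose logging",
    )
    return parser
```

For example, `build_parser().parse_args(["docs", "--lang", "eng"])` yields a namespace with `input_dir="docs"`, `lang="eng"` and `debug=False`.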
doc2md/
├── etl_to_markdown.py # ETL pipeline (10 SRP functions)
├── requirements.txt # Python dependencies
└── RUNBOOK.md # Install guide + troubleshooting
- Zero-AI Rule — extraction is algorithmic, not generative
- SRP — one function, one responsibility
- Offline-first — built for air-gapped and privacy-sensitive environments
- LLM-agnostic — output works with any local model
MIT