📄 doc2md — Offline Document ETL for LLM Context Optimization

Convert PDFs, DOCX, and images into clean, LLM-ready Markdown — fully offline, no AI calls.

Why

Most LLM orchestrators waste context-window space on raw, unprocessed documents.
doc2md pre-converts your files into structured Markdown with metadata headers,
so your local model (Ollama, OpenClaw, LM Studio) reads signal — not noise.

What it does

Recursively scans a directory for .pdf, .docx, .jpg, .png, .tiff and more
Smart routing: digital parser first, OCR fallback if the file is a scanned copy
100% offline — no OpenAI, no Claude, no API calls. Ever.
Caching — skips files already converted
Data lineage — every .md starts with a YAML header tracking origin, path, method and timestamp
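
The scan and cache steps above can be sketched roughly as follows. This is an illustration only, not the code in etl_to_markdown.py: the names `SUPPORTED`, `output_path` and `iter_pending`, the mirrored output tree, and the modification-time comparison are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterator

# Only the extensions the README names explicitly; the real tool handles more.
SUPPORTED = {".pdf", ".docx", ".jpg", ".png", ".tiff"}


def output_path(src: Path, root: Path, out_root: Path) -> Path:
    """Mirror the source tree under out_root, appending .md to the file name."""
    rel = src.relative_to(root)
    return out_root / rel.parent / (rel.name + ".md")


def iter_pending(root: Path, out_root: Path) -> Iterator[Path]:
    """Yield supported files whose Markdown output is missing or stale."""
    for src in sorted(root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED:
            continue
        dst = output_path(src, root, out_root)
        # Hypothetical cache rule: skip when the output is at least as new
        # as the source file.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        yield src
```

Comparing modification times is only one possible cache rule; a content hash stored alongside the output would survive file copies that reset timestamps, at the cost of reading every source file.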

Stack

| Layer | Tool |
| --- | --- |
| Primary parser | `markitdown` (Microsoft) |
| PDF fallback | `pymupdf4llm` |
| OCR engine | `pytesseract` + `pdf2image` (Tesseract + Poppler) |
| Runtime | Python 3.10+ · macOS arm64 |
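
One way these layers might be chained is shown below, as a sketch under stated assumptions rather than the project's actual routing code. The `route` function, the `min_chars` threshold, and the idea of treating near-empty text as a sign of a scanned file are hypothetical. The library calls themselves (`MarkItDown().convert`, `pymupdf4llm.to_markdown`, `convert_from_path`, `image_to_string`) are the public entry points of the listed packages and are imported lazily so the sketch loads without them installed.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple

Extractor = Callable[[Path], str]


def extract_markitdown(path: Path) -> str:
    from markitdown import MarkItDown  # primary parser
    return MarkItDown().convert(str(path)).text_content


def extract_pymupdf4llm(path: Path) -> str:
    import pymupdf4llm  # PDF fallback
    return pymupdf4llm.to_markdown(str(path))


def extract_ocr_pdf(path: Path, lang: str = "por+eng") -> str:
    import pytesseract
    from pdf2image import convert_from_path
    pages = convert_from_path(str(path))  # rasterise pages via Poppler
    return "\n\n".join(pytesseract.image_to_string(p, lang=lang) for p in pages)


def route(path: Path, chain: Iterable[Tuple[str, Extractor]],
          min_chars: int = 50) -> Tuple[str, str]:
    """Return (method_name, text) from the first extractor that yields
    at least min_chars non-whitespace characters (hypothetical heuristic)."""
    for name, extractor in chain:
        try:
            text = extractor(path)
        except Exception:
            continue  # a real pipeline would log the failure here
        if len("".join(text.split())) >= min_chars:
            return name, text
    raise ValueError(f"no extractor produced usable text for {path}")
```

A PDF chain under these assumptions could be `[("markitdown", extract_markitdown), ("pymupdf4llm", extract_pymupdf4llm), ("tesseract-ocr-pdf", extract_ocr_pdf)]`. Only the last method name appears in the README; the other two are placeholders.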

Output example

---
source_file: "contract_2024.pdf"
source_relative_path: "docs/contract_2024.pdf"
extraction_method: "tesseract-ocr-pdf"
converted_at_utc: "2025-04-11T10:23:45+00:00"
---

# Service Agreement — January 2024
...
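
A header like the one above could be produced along these lines. This is a minimal sketch: `build_header` and its parameters are hypothetical names, and the use of `json.dumps` for quoting is an assumption (a JSON string is also a valid double-quoted YAML scalar), not necessarily how doc2md does it.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def build_header(src: Path, root: Path, method: str,
                 now: Optional[datetime] = None) -> str:
    """Build the YAML front matter that records where a .md file came from."""
    now = now or datetime.now(timezone.utc)
    fields = {
        "source_file": src.name,
        "source_relative_path": src.relative_to(root).as_posix(),
        "extraction_method": method,
        "converted_at_utc": now.replace(microsecond=0).isoformat(),
    }
    # json.dumps gives double-quoted, escaped strings that YAML accepts as-is.
    lines = [f"{key}: {json.dumps(value, ensure_ascii=False)}"
             for key, value in fields.items()]
    return "---\n" + "\n".join(lines) + "\n---\n\n"
```

The converted Markdown body would then be appended directly after the returned string.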

Quick start

# 1. System dependencies (macOS)
brew install tesseract tesseract-lang poppler

# 2. Python environment
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Run
python etl_to_markdown.py "/path/to/your/documents"

Full installation guide and troubleshooting → RUNBOOK.md

Options

usage: etl_to_markdown.py [-h] [--lang LANG] [--debug] input_dir

positional arguments:
  input_dir    Root directory containing source documents

options:
  --lang LANG  Tesseract language(s) (default: por+eng)
  --debug      Verbose logging
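
For reference, an `argparse` setup that would accept the same arguments might look like this. It is reconstructed from the usage text above, not copied from the script, and the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented CLI: one positional, two options."""
    parser = argparse.ArgumentParser(prog="etl_to_markdown.py")
    parser.add_argument("input_dir",
                        help="Root directory containing source documents")
    parser.add_argument("--lang", default="por+eng",
                        help="Tesseract language(s) (default: por+eng)")
    parser.add_argument("--debug", action="store_true",
                        help="Verbose logging")
    return parser
```

With this shape, an English-only run with verbose logging would be `python etl_to_markdown.py --lang eng --debug "/path/to/your/documents"`.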

Project structure

doc2md/
├── etl_to_markdown.py   # ETL pipeline (10 SRP functions)
├── requirements.txt     # Python dependencies
└── RUNBOOK.md           # Install guide + troubleshooting

Design principles

Zero-AI Rule — extraction is algorithmic, not generative
SRP — one function, one responsibility
Offline-first — built for air-gapped and privacy-sensitive environments
LLM-agnostic — output works with any local model

License

MIT
