Merged
57 changes: 33 additions & 24 deletions README.md
@@ -13,7 +13,7 @@

Fast, accurate Markdown from PDFs — locally, with no cleanup required. Built for Claude, Codex, RAG pipelines, and document-heavy automation where noisy extraction burns tokens and makes downstream results less reliable.

- **How fast is it?** — 0.008s per page. 176x faster than docling, 10x faster than opendataloader. ([benchmarks](#benchmarks))
- **How fast is it?** — 0.007s per page. 90x faster than docling, 37x faster than pymupdf4llm. ([benchmarks](#benchmarks))
- **How accurate is it?** — 0.92 reading order (best in class), 0.88 overall extraction accuracy, 0.81 heading detection. ([benchmarks](#benchmarks))
- **Where do my PDFs go?** — Nowhere. The CLI runs locally. Your documents are not uploaded to Nutrient. ([trust & licensing](#trust-and-licensing))
- **What does it cost?** — Free for up to 1,000 documents per calendar month. No license key, no signup, no API token. ([license](LICENSE.md))
@@ -86,45 +86,54 @@ When both arguments are directories, the CLI converts every PDF in the input dir

## Benchmarks

Published benchmark values from [Nutrient's PDF-to-Markdown page](https://www.nutrient.io/ai/skills/pdf-to-markdown/), recorded on `AMD EPYC 9454`.
Benchmark results from 200 PDF documents with hand-annotated Markdown ground truth, evaluated using NID (reading order), TEDS (table structure), and MHS (heading hierarchy) metrics. Benchmarked on `2026-04-02`.

### Visual Snapshot

![Extraction accuracy benchmark](docs/assets/extraction-accuracy.svg)
![Extraction accuracy](docs/assets/extraction-accuracy.png)

![Extraction speed benchmark](docs/assets/extraction-speed.svg)
![Reading order](docs/assets/reading-order.png)

![Structure quality benchmark](docs/assets/structure-quality.svg)
![Table structure](docs/assets/table-structure.png)

![Relative speedup benchmark](docs/assets/faster-with-nutrient.svg)
![Heading level](docs/assets/heading-level.png)

![Extraction speed](docs/assets/extraction-speed.png)

![Faster with Nutrient](docs/assets/faster-with-nutrient.png)

### Accuracy

| Metric | Nutrient | Best competitor | MarkItDown |
| --- | ---: | ---: | ---: |
| Extraction accuracy | 0.88 | 0.89 (docling) | 0.58 |
| Reading order (NID) | 0.92 | 0.91 | 0.88 |
| Table structure (TEDS) | 0.66 | 0.93 (docling) | 0.00 |
| Heading level (MHS) | 0.81 | 0.83 (docling) | 0.00 |
| Solution | Overall | Reading Order (NID) | Table Structure (TEDS) | Heading Level (MHS) |
| --- | ---: | ---: | ---: | ---: |
| docling | **0.88** | 0.90 | **0.89** | **0.82** |
| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |

### Speed

| Solution | Seconds per page |
| --- | ---: |
| Nutrient | 0.008 |
| opendataloader | 0.056 |
| markitdown | 0.058 |
| pymupdf4llm | 0.083 |
| opendataloader-hybrid | 1.412 |
| docling | 1.473 |
| **Nutrient** | **0.007** |
| opendataloader | 0.014 |
| pypdf | 0.019 |
| markitdown | 0.106 |
| liteparse | 0.233 |
| pymupdf4llm | 0.252 |
| docling | 0.618 |

### Faster with Nutrient

- `176x` faster than `docling`
- `172x` faster than `opendataloader-hybrid`
- `10x` faster than `opendataloader`
- `7x` faster than `pymupdf4llm`
- `7x` faster than `markitdown`
- `90x` faster than `docling`
- `37x` faster than `pymupdf4llm`
- `34x` faster than `liteparse`
- `15x` faster than `markitdown`
- `3x` faster than `pypdf`
- `2x` faster than `opendataloader`

For the full comparison table, see [docs/benchmarks.md](docs/benchmarks.md).

@@ -142,7 +151,7 @@ See [LICENSE.md](LICENSE.md) for the full terms and [docs/distribution-model.md]

### What makes this different from other PDF extractors?

Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (markitdown, pymupdf4llm) or accurate but slow (docling). Nutrient extracts at 0.008s per page with strong reading order, heading, and table preservation — less cleanup, fewer wasted tokens, and more reliable downstream results.
Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (markitdown, pymupdf4llm) or accurate but slow (docling). Nutrient extracts at 0.007s per page with strong reading order, heading, and table preservation — less cleanup, fewer wasted tokens, and more reliable downstream results.

### Do my documents leave my machine?

Binary file added docs/assets/extraction-accuracy.png
Binary file added docs/assets/extraction-speed.png
Binary file added docs/assets/faster-with-nutrient.png
Binary file added docs/assets/heading-level.png
Binary file added docs/assets/reading-order.png
Binary file added docs/assets/table-structure.png
50 changes: 25 additions & 25 deletions docs/benchmarks.md
@@ -1,41 +1,41 @@
# Benchmarks

These values mirror the benchmark figures currently published on Nutrient's PDF-to-Markdown product page:
Evaluated on 200 PDF documents with hand-annotated Markdown ground truth from the DP-Bench corpus.

- Source: <https://www.nutrient.io/ai/skills/pdf-to-markdown/>
- Snapshot date: `2026-04-01`
- Hardware note on page: `Benchmark data recorded on AMD EPYC 9454`
- Benchmark date: `2026-04-02`
- Corpus: 200 documents with ground-truth Markdown annotations
- Metrics: NID (reading order), TEDS (table structure), MHS (heading hierarchy)
- All scores normalized to [0, 1] — higher is better

## Accuracy Metrics

| Solution | Extraction accuracy | Reading order (NID) | Table structure (TEDS) | Heading level (MHS) |
| --- | ---: | ---: | ---: | ---: |
| Nutrient | 0.88 | 0.92 | 0.66 | 0.81 |
| docling | 0.89 | 0.91 | 0.93 | 0.83 |
| opendataloader | 0.84 | 0.91 | 0.49 | 0.74 |
| opendataloader-hybrid | 0.83 | 0.91 | 0.43 | 0.73 |
| pymupdf4llm | 0.74 | 0.89 | 0.40 | 0.43 |
| markitdown | 0.58 | 0.88 | 0.00 | 0.00 |
| docling | 0.88 | 0.90 | **0.89** | **0.82** |
| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |

## Speed

| Solution | Seconds per page |
| --- | ---: |
| Nutrient | 0.008 |
| opendataloader | 0.056 |
| markitdown | 0.058 |
| pymupdf4llm | 0.083 |
| opendataloader-hybrid | 1.412 |
| docling | 1.473 |
| **Nutrient** | **0.007** |
| opendataloader | 0.014 |
| pypdf | 0.019 |
| markitdown | 0.106 |
| liteparse | 0.233 |
| pymupdf4llm | 0.252 |
| docling | 0.618 |

## Relative Speed Callouts

- Nutrient is `176x` faster than `docling`
- Nutrient is `172x` faster than `opendataloader-hybrid`
- Nutrient is `10x` faster than `opendataloader`
- Nutrient is `7x` faster than `pymupdf4llm`
- Nutrient is `7x` faster than `markitdown`

## Note

This file reflects the currently published benchmark table. A public reproducibility harness is planned as a future addition.
- Nutrient is `90x` faster than `docling`
- Nutrient is `37x` faster than `pymupdf4llm`
- Nutrient is `34x` faster than `liteparse`
- Nutrient is `15x` faster than `markitdown`
- Nutrient is `3x` faster than `pypdf`
- Nutrient is `2x` faster than `opendataloader`
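
The relative speed callouts can be sanity-checked against the updated Speed table. The sketch below is not part of the change set; it is a minimal illustration that divides each tool's seconds-per-page figure by Nutrient's. The dictionary values are copied from the new table (already rounded to three decimals), and the `speedups` helper is a hypothetical name introduced here for illustration only.

```python
# Seconds-per-page values copied from the updated Speed table.
# These are the rounded, published figures, not raw timings.
SECONDS_PER_PAGE = {
    "Nutrient": 0.007,
    "opendataloader": 0.014,
    "pypdf": 0.019,
    "markitdown": 0.106,
    "liteparse": 0.233,
    "pymupdf4llm": 0.252,
    "docling": 0.618,
}


def speedups(times, baseline="Nutrient"):
    """Return how many times faster `baseline` is than every other entry.

    Hypothetical helper for illustration: ratio = other_time / baseline_time.
    """
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


if __name__ == "__main__":
    ratios = speedups(SECONDS_PER_PAGE)
    # Print the largest speedup first, matching the order of the callout list.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {ratio:.1f}x")
```

With the rounded table values this gives roughly 88x for `docling`, 36x for `pymupdf4llm`, and 33x for `liteparse`, slightly below the `90x`, `37x`, and `34x` callouts. The likely explanation is that the callouts were computed from unrounded timings, but that is an assumption, since the raw measurements are not included in this diff.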