diff --git a/README.md b/README.md
index b61961d..415653e 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@ Fast, accurate Markdown from PDFs — locally, with no cleanup required.
 
 Built for Claude, Codex, RAG pipelines, and document-heavy automation where noisy extraction burns tokens and makes downstream results less reliable.
 
-- **How fast is it?** — 0.008s per page. 176x faster than docling, 10x faster than opendataloader. ([benchmarks](#benchmarks))
+- **How fast is it?** — 0.007s per page. 90x faster than docling, 37x faster than pymupdf4llm. ([benchmarks](#benchmarks))
 - **How accurate is it?** — 0.92 reading order (best in class), 0.88 overall extraction accuracy, 0.81 heading detection. ([benchmarks](#benchmarks))
 - **Where do my PDFs go?** — Nowhere. The CLI runs locally. Your documents are not uploaded to Nutrient. ([trust & licensing](#trust-and-licensing))
 - **What does it cost?** — Free for up to 1,000 documents per calendar month. No license key, no signup, no API token. ([license](LICENSE.md))
@@ -86,45 +86,54 @@ When both arguments are directories, the CLI converts every PDF in the input dir
 
 ## Benchmarks
 
-Published benchmark values from [Nutrient's PDF-to-Markdown page](https://www.nutrient.io/ai/skills/pdf-to-markdown/), recorded on `AMD EPYC 9454`.
+Benchmark results from 200 PDF documents with hand-annotated Markdown ground truth, evaluated using NID (reading order), TEDS (table structure), and MHS (heading hierarchy) metrics. Benchmarked on `2026-04-02`.
 
 ### Visual Snapshot
 
-![Extraction accuracy benchmark](docs/assets/extraction-accuracy.svg)
+![Extraction accuracy](docs/assets/extraction-accuracy.png)
 
-![Extraction speed benchmark](docs/assets/extraction-speed.svg)
+![Reading order](docs/assets/reading-order.png)
 
-![Structure quality benchmark](docs/assets/structure-quality.svg)
+![Table structure](docs/assets/table-structure.png)
 
-![Relative speedup benchmark](docs/assets/faster-with-nutrient.svg)
+![Heading level](docs/assets/heading-level.png)
+
+![Extraction speed](docs/assets/extraction-speed.png)
+
+![Faster with Nutrient](docs/assets/faster-with-nutrient.png)
 
 ### Accuracy
 
-| Metric | Nutrient | Best competitor | MarkItDown |
-| --- | ---: | ---: | ---: |
-| Extraction accuracy | 0.88 | 0.89 (docling) | 0.58 |
-| Reading order (NID) | 0.92 | 0.91 | 0.88 |
-| Table structure (TEDS) | 0.66 | 0.93 (docling) | 0.00 |
-| Heading level (MHS) | 0.81 | 0.83 (docling) | 0.00 |
+| Solution | Overall | Reading Order (NID) | Table Structure (TEDS) | Heading Level (MHS) |
+| --- | ---: | ---: | ---: | ---: |
+| docling | **0.88** | 0.90 | **0.89** | **0.82** |
+| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
+| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
+| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
+| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
+| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
+| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |
 
 ### Speed
 
 | Solution | Seconds per page |
 | --- | ---: |
-| Nutrient | 0.008 |
-| opendataloader | 0.056 |
-| markitdown | 0.058 |
-| pymupdf4llm | 0.083 |
-| opendataloader-hybrid | 1.412 |
-| docling | 1.473 |
+| **Nutrient** | **0.007** |
+| opendataloader | 0.014 |
+| pypdf | 0.019 |
+| markitdown | 0.106 |
+| liteparse | 0.233 |
+| pymupdf4llm | 0.252 |
+| docling | 0.618 |
 
 ### Faster with Nutrient
 
-- `176x` faster than `docling`
-- `172x` faster than `opendataloader-hybrid`
-- `10x` faster than `opendataloader`
-- `7x` faster than `pymupdf4llm`
-- `7x` faster than `markitdown`
+- `90x` faster than `docling`
+- `37x` faster than `pymupdf4llm`
+- `34x` faster than `liteparse`
+- `15x` faster than `markitdown`
+- `3x` faster than `pypdf`
+- `2x` faster than `opendataloader`
 
 For the full comparison table, see [docs/benchmarks.md](docs/benchmarks.md).
 
@@ -142,7 +151,7 @@ See [LICENSE.md](LICENSE.md) for the full terms and [docs/distribution-model.md]
 
 ### What makes this different from other PDF extractors?
 
-Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (markitdown, pymupdf4llm) or accurate but slow (docling). Nutrient extracts at 0.008s per page with strong reading order, heading, and table preservation — less cleanup, fewer wasted tokens, and more reliable downstream results.
+Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (markitdown, pymupdf4llm) or accurate but slow (docling). Nutrient extracts at 0.007s per page with strong reading order, heading, and table preservation — less cleanup, fewer wasted tokens, and more reliable downstream results.
 
 ### Do my documents leave my machine?
 
diff --git a/docs/assets/extraction-accuracy.png b/docs/assets/extraction-accuracy.png
new file mode 100644
index 0000000..a6b64de
Binary files /dev/null and b/docs/assets/extraction-accuracy.png differ
diff --git a/docs/assets/extraction-speed.png b/docs/assets/extraction-speed.png
new file mode 100644
index 0000000..3c862da
Binary files /dev/null and b/docs/assets/extraction-speed.png differ
diff --git a/docs/assets/faster-with-nutrient.png b/docs/assets/faster-with-nutrient.png
new file mode 100644
index 0000000..2518563
Binary files /dev/null and b/docs/assets/faster-with-nutrient.png differ
diff --git a/docs/assets/heading-level.png b/docs/assets/heading-level.png
new file mode 100644
index 0000000..d7dace9
Binary files /dev/null and b/docs/assets/heading-level.png differ
diff --git a/docs/assets/reading-order.png b/docs/assets/reading-order.png
new file mode 100644
index 0000000..e499796
Binary files /dev/null and b/docs/assets/reading-order.png differ
diff --git a/docs/assets/table-structure.png b/docs/assets/table-structure.png
new file mode 100644
index 0000000..eeb8b64
Binary files /dev/null and b/docs/assets/table-structure.png differ
diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index ad4acc8..a4b0281 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -1,41 +1,41 @@
 # Benchmarks
 
-These values mirror the benchmark figures currently published on Nutrient's PDF-to-Markdown product page:
+Evaluated on 200 PDF documents with hand-annotated Markdown ground truth from the DP-Bench corpus.
 
-- Source:
-- Snapshot date: `2026-04-01`
-- Hardware note on page: `Benchmark data recorded on AMD EPYC 9454`
+- Benchmark date: `2026-04-02`
+- Corpus: 200 documents with ground-truth Markdown annotations
+- Metrics: NID (reading order), TEDS (table structure), MHS (heading hierarchy)
+- All scores normalized to [0, 1] — higher is better
 
 ## Accuracy Metrics
 
 | Solution | Extraction accuracy | Reading order (NID) | Table structure (TEDS) | Heading level (MHS) |
 | --- | ---: | ---: | ---: | ---: |
-| Nutrient | 0.88 | 0.92 | 0.66 | 0.81 |
-| docling | 0.89 | 0.91 | 0.93 | 0.83 |
-| opendataloader | 0.84 | 0.91 | 0.49 | 0.74 |
-| opendataloader-hybrid | 0.83 | 0.91 | 0.43 | 0.73 |
-| pymupdf4llm | 0.74 | 0.89 | 0.40 | 0.43 |
-| markitdown | 0.58 | 0.88 | 0.00 | 0.00 |
+| docling | 0.88 | 0.90 | **0.89** | **0.82** |
+| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
+| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
+| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
+| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
+| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
+| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |
 
 ## Speed
 
 | Solution | Seconds per page |
 | --- | ---: |
-| Nutrient | 0.008 |
-| opendataloader | 0.056 |
-| markitdown | 0.058 |
-| pymupdf4llm | 0.083 |
-| opendataloader-hybrid | 1.412 |
-| docling | 1.473 |
+| **Nutrient** | **0.007** |
+| opendataloader | 0.014 |
+| pypdf | 0.019 |
+| markitdown | 0.106 |
+| liteparse | 0.233 |
+| pymupdf4llm | 0.252 |
+| docling | 0.618 |
 
 ## Relative Speed Callouts
 
-- Nutrient is `176x` faster than `docling`
-- Nutrient is `172x` faster than `opendataloader-hybrid`
-- Nutrient is `10x` faster than `opendataloader`
-- Nutrient is `7x` faster than `pymupdf4llm`
-- Nutrient is `7x` faster than `markitdown`
-
-## Note
-
-This file reflects the currently published benchmark table. A public reproducibility harness is planned as a future addition.
+- Nutrient is `90x` faster than `docling`
+- Nutrient is `37x` faster than `pymupdf4llm`
+- Nutrient is `34x` faster than `liteparse`
+- Nutrient is `15x` faster than `markitdown`
+- Nutrient is `3x` faster than `pypdf`
+- Nutrient is `2x` faster than `opendataloader`
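
Review note on the relative speed callouts: the `Nx` figures can be cross-checked against the new "Seconds per page" table by dividing each solution's time by Nutrient's. The sketch below is only a sanity check, not part of the benchmark harness; the names `seconds_per_page` and `speedups` are hypothetical, and the inputs are the rounded table values from this diff. From those rounded values the ratios come out slightly lower than some callouts (for example roughly 88x rather than `90x` for `docling`). That would be consistent with the callouts having been computed from unrounded timings, but the diff does not say so, so treat that as an assumption.

```python
# Seconds-per-page values copied from the updated speed table in this diff.
# These are rounded to three decimals; the unrounded timings are not in the diff.
seconds_per_page = {
    "Nutrient": 0.007,
    "opendataloader": 0.014,
    "pypdf": 0.019,
    "markitdown": 0.106,
    "liteparse": 0.233,
    "pymupdf4llm": 0.252,
    "docling": 0.618,
}


def speedups(timings, baseline="Nutrient"):
    """Return how many times slower each solution is than the baseline."""
    base = timings[baseline]
    return {
        name: seconds / base
        for name, seconds in timings.items()
        if name != baseline
    }


ratios = speedups(seconds_per_page)

# Print the ratios from largest to smallest, mirroring the callout order.
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: {ratio:.1f}x")
```

If the unrounded per-page timings are available, feeding them into the same function would show whether the published callouts reproduce exactly.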