Add side-by-side parser comparison on Anthropic Mythos PDF by jdrhyne · Pull Request #4 · PSPDFKit/pdf-to-markdown

jdrhyne · 2026-04-11T23:08:45Z

Summary

Responds to Karpathy's ask: "Post the converted Mythos pdf, figures, tables and all."

Adds docs/examples/comparison/ with the same 245-page Claude Mythos System Card run through seven PDF-to-markdown converters, plus a qualitative spot-check of one table and one figure.

What's in the PR

| File | What |
| --- | --- |
| `docs/examples/comparison/README.md` | Side-by-side analysis with TL;DR, spot-check, and raw counts |
| `docs/examples/comparison/anthropic-claude-mythos-*.md` | One full markdown file per competitor parser |
| `docs/examples/comparison/figures/` | 110 figures extracted by the Nutrient Python SDK (the only parser that got them) |

The canonical free-CLI output already lives one level up at docs/examples/anthropic-claude-mythos-system-card.md.

Key findings

1. The free CLI in this repo is on par with docling and pymupdf4llm on structural extraction (265 headings and 398 table rows, versus 258–287 headings and 420–422 table rows) and is 15–75× faster.
2. Half the parsers produced essentially no markdown structure. markitdown, pypdf, markit-ai, and liteparse all collapsed 245 pages into flat prose with 0 tables and at most 2 headings.
3. The Nutrient Python SDK is the only parser that extracted figures. It recovered 110/110 figures in the Mythos PDF as actual image files, in place, adjacent to their captions. Every other parser (including docling, pymupdf4llm, and our own free CLI) produced zero images.

Speed on this 245-page PDF

| Parser | Runtime | vs nutrient CLI |
| --- | --- | --- |
| nutrient CLI | 1.19s | 1.0× |
| markit-ai | 1.44s | 1.2× slower |
| liteparse | 2.05s | 1.7× slower |
| pypdf | 8.10s | 6.8× slower |
| pymupdf4llm | 17.7s | 14.9× slower |
| markitdown | 40.9s | 34.4× slower |
| docling | 89.0s | 74.8× slower |
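
The "vs nutrient CLI" column is each runtime divided by the nutrient CLI runtime, rounded to one decimal place. A minimal sketch of that arithmetic follows; the runtimes are copied from the table, while the `slowdown` helper and the `RUNTIMES` dictionary are hypothetical names used only for illustration, not part of this repo.

```python
# Runtimes in seconds, copied from the speed table in this PR.
RUNTIMES = {
    "nutrient CLI": 1.19,
    "markit-ai": 1.44,
    "liteparse": 2.05,
    "pypdf": 8.10,
    "pymupdf4llm": 17.7,
    "markitdown": 40.9,
    "docling": 89.0,
}


def slowdown(runtimes: dict[str, float], baseline: str = "nutrient CLI") -> dict[str, float]:
    """Return each parser's runtime as a multiple of the baseline runtime."""
    base = runtimes[baseline]
    return {name: round(seconds / base, 1) for name, seconds in runtimes.items()}


if __name__ == "__main__":
    # Print parsers from fastest to slowest with their ratio to the baseline.
    for name, ratio in sorted(slowdown(RUNTIMES).items(), key=lambda kv: kv[1]):
        print(f"{name:<14} {ratio:>5.1f}x")
```

The "15–75× faster" claim in the key findings corresponds to the pymupdf4llm and docling rows of this calculation.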

Structural extraction

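| Parser | Headings | Table rows | Lists | Figures |
| --- | --- | --- | --- | --- |
| nutrient CLI | 265 | 398 | 231 | 0 |
| nutrient SDK (Python) | 265 | 398 | 231 | 110 |
| docling | 287 | 420 | 284 | 0 |
| pymupdf4llm | 258 | 422 | 273 | 0 |
| markit-ai | 2 | 0 | 13 | 0 |
| markitdown | 0 | 0 | 13 | 0 |
| liteparse | 0 | 0 | 13 | 0 |
| pypdf | 0 | 0 | 0 | 0 |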
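
The PR does not say how these counts were produced. One plausible way to tally structure in a markdown file is a line-by-line scan like the sketch below. The counting rules are assumptions for illustration: ATX headings only, pipe-delimited table rows with separator rows excluded, one count per list item, and one figure per inline image link. Different rules would give different totals, so this is not expected to reproduce the numbers above exactly.

```python
import re

# Assumed counting rules; the PR's actual method is not documented here.
HEADING = re.compile(r"^#{1,6}\s+\S")                # ATX headings: "# Title"
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")            # pipe-delimited rows
TABLE_SEP = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S")  # bullets and numbered items
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]+\)")          # inline image links
FENCE = "`" * 3                                      # code fence marker


def count_structure(markdown: str) -> dict[str, int]:
    """Count headings, table rows, list items, and images outside code fences."""
    counts = {"headings": 0, "table_rows": 0, "list_items": 0, "figures": 0}
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        counts["figures"] += len(IMAGE.findall(line))
        if HEADING.match(line):
            counts["headings"] += 1
        elif TABLE_SEP.match(line):
            continue  # separator rows (and horizontal rules) are not counted
        elif TABLE_ROW.match(line):
            counts["table_rows"] += 1
        elif LIST_ITEM.match(line):
            counts["list_items"] += 1
    return counts
```

Running `count_structure` over each `anthropic-claude-mythos-*.md` file in `docs/examples/comparison/` would give a comparable table, with the caveat that the totals depend on the rules chosen.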

Test plan

- Every parser output in this PR was generated by running the parser against the same PDF on the same machine on the same day
- Reproducer commands documented in comparison/README.md
- Spot-check includes full table transcripts and inline figure preview for both Nutrient tiers

Responds to Karpathy's ask: "Post the converted Mythos pdf, figures, tables and all."

Adds docs/examples/comparison/ with the same 245-page PDF run through seven PDF-to-markdown converters, plus a qualitative spot-check of one table (Table 6.3.A benchmark results) and one figure (Figure 2.2.5.2.A Virology Uplift Trial).

Parsers compared:

- nutrient CLI (this repo)
- nutrient-sdk (Python, v1.0.4, premium)
- docling 2.86.0
- pymupdf4llm 1.27.2.2
- markitdown[pdf] 0.1.5
- markit-ai 0.2.0
- @llamaindex/liteparse
- pypdf 6.9.2

Key findings:

1. The free CLI in this repo matches docling and pymupdf4llm on structural extraction and is 15-75x faster.
2. Half the parsers (markitdown, pypdf, markit-ai, liteparse) produced zero markdown structure - all 245 pages collapsed into flat prose.
3. The Python SDK is the only parser that extracted figures. 110/110 figures in the Mythos PDF recovered as actual image files, in place, adjacent to their captions.

The comparison directory includes:

- README.md with the qualitative spot-check
- One markdown file per parser (all 245 pages each)
- figures/ with 110 images extracted by the Python SDK

jdrhyne merged commit a244e19 into main Apr 11, 2026
2 checks passed

jdrhyne merged 1 commit into main from add-mythos-comparison-outputs
