Add side-by-side parser comparison on Anthropic Mythos PDF #4

Merged
jdrhyne merged 1 commit into main from add-mythos-comparison-outputs on Apr 11, 2026
@jdrhyne jdrhyne commented Apr 11, 2026

Summary

Responds to Karpathy's ask: "Post the converted Mythos pdf, figures, tables and all."

Adds docs/examples/comparison/ with the same 245-page Claude Mythos System Card run through seven PDF-to-markdown converters, plus a qualitative spot-check of one table and one figure.

What's in the PR

| File | What |
|---|---|
| `docs/examples/comparison/README.md` | Side-by-side analysis with TL;DR, spot-check, and raw counts |
| `docs/examples/comparison/anthropic-claude-mythos-*.md` | One full markdown file per competitor parser |
| `docs/examples/comparison/figures/` | 110 figures extracted by the Nutrient Python SDK (the only parser that got them) |

The canonical free-CLI output already lives one level up at docs/examples/anthropic-claude-mythos-system-card.md.

Key findings

  1. The free CLI in this repo roughly matches docling and pymupdf4llm on structural extraction (265 headings vs. 287 and 258; 398 table rows vs. 420 and 422) and is 15–75× faster.
  2. Half the parsers produced almost no markdown structure. markitdown, pypdf, markit-ai, and liteparse all collapsed 245 pages into flat prose with 0 table rows and at most 2 headings (markit-ai).
  3. The Nutrient Python SDK is the only parser that extracted figures. 110/110 figures in the Mythos PDF recovered as actual image files, in place, adjacent to their captions. Every other parser (including docling, pymupdf4llm, and our own free CLI) produced zero images.

Speed on this 245-page PDF

| Parser | Runtime | vs nutrient CLI |
|---|---|---|
| nutrient CLI | 1.19s | 1.0× |
| markit-ai | 1.44s | 1.2× slower |
| liteparse | 2.05s | 1.7× slower |
| pypdf | 8.10s | 6.8× slower |
| pymupdf4llm | 17.7s | 14.9× slower |
| markitdown | 40.9s | 34.4× slower |
| docling | 89.0s | 74.8× slower |
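
For anyone re-deriving the right-hand column, the slowdown factors are each runtime divided by the nutrient CLI runtime, rounded to one decimal. A minimal sketch of that arithmetic, using only the runtimes listed above (the `slowdowns` helper is a name introduced here for illustration, not something in this repo):

```python
# Runtimes in seconds, copied from the speed table in this PR.
runtimes = {
    "nutrient CLI": 1.19,
    "markit-ai": 1.44,
    "liteparse": 2.05,
    "pypdf": 8.10,
    "pymupdf4llm": 17.7,
    "markitdown": 40.9,
    "docling": 89.0,
}


def slowdowns(times, baseline="nutrient CLI"):
    """Return each parser's runtime as a multiple of the baseline runtime."""
    base = times[baseline]
    return {name: round(seconds / base, 1) for name, seconds in times.items()}


ratios = slowdowns(runtimes)
for name, ratio in ratios.items():
    print(f"{name:<14}{runtimes[name]:>7.2f}s {ratio:>6.1f}x")
```

The "15–75× faster" claim in the key findings corresponds to the pymupdf4llm and docling rows.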

Structural extraction

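| Parser | Headings | Table rows | Lists | Figures |
|---|---|---|---|---|
| nutrient CLI | 265 | 398 | 231 | 0 |
| nutrient SDK (Python) | 265 | 398 | 231 | 110 |
| docling | 287 | 420 | 284 | 0 |
| pymupdf4llm | 258 | 422 | 273 | 0 |
| markit-ai | 2 | 0 | 13 | 0 |
| markitdown | 0 | 0 | 13 | 0 |
| liteparse | 0 | 0 | 13 | 0 |
| pypdf | 0 | 0 | 0 | 0 |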
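
The PR does not say how these counts were produced, so the following is only a rough sketch of one way to tally markdown structure in a converter's output. The regexes, the `count_structure` name, and the choice to count list items (rather than whole lists) and to skip table delimiter rows are all assumptions; numbers from this sketch would not necessarily match the table above.

```python
import re

# Assumed definitions of "structure"; the PR's own counting rules are not documented here.
HEADING = re.compile(r"^#{1,6}\s+\S")                  # ATX headings: "# Title"
TABLE_SEP = re.compile(r"^\|[\s:|-]+\|\s*$")           # delimiter row: "|---|---|"
TABLE_ROW = re.compile(r"^\|.*\|\s*$")                 # pipe-delimited row
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S")  # bullet or numbered item
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]+\)")            # inline image reference
FENCE = "`" * 3                                        # code-fence marker


def count_structure(markdown: str) -> dict:
    """Count headings, table rows, list items, and image references, line by line."""
    counts = {"headings": 0, "table_rows": 0, "lists": 0, "figures": 0}
    in_fence = False
    for line in markdown.splitlines():
        # Ignore anything inside fenced code blocks.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        counts["figures"] += len(IMAGE.findall(line))
        if HEADING.match(line):
            counts["headings"] += 1
        elif TABLE_SEP.match(line):
            continue  # the delimiter row carries no data
        elif TABLE_ROW.match(line):
            counts["table_rows"] += 1
        elif LIST_ITEM.match(line):
            counts["lists"] += 1
    return counts
```

Running it over each per-parser file would look like `count_structure(open(path, encoding="utf-8").read())`, where `path` is one of the markdown files under docs/examples/comparison/.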

Test plan

  • Every parser output in this PR was generated by running the parser against the same PDF on the same machine on the same day
  • Reproducer commands documented in comparison/README.md
  • Spot-check includes full table transcripts and inline figure preview for both Nutrient tiers
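
The actual reproducer commands live in comparison/README.md and are not repeated here. As a hedged illustration of how wall-clock runtimes like those above could be collected on one machine, the sketch below times an arbitrary command with `time.perf_counter`. The `time_command` helper and the best-of-three policy are assumptions, not the method used for this PR, and the command shown is a placeholder.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Return the best wall-clock time in seconds over `runs` executions of cmd."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is never timed as a success.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder command so the sketch runs anywhere; substitute a real
# reproducer command from comparison/README.md.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.2f}s")
```

Taking the minimum over a few runs reduces noise from disk caching and background load; a single cold run, as a user would experience it, is an equally defensible choice.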

Responds to Karpathy's ask: "Post the converted Mythos pdf, figures,
tables and all."

Adds docs/examples/comparison/ with the same 245-page PDF run through
seven PDF-to-markdown converters, plus a qualitative spot-check of one
table (Table 6.3.A benchmark results) and one figure (Figure 2.2.5.2.A
Virology Uplift Trial).

Parsers compared:
- nutrient CLI (this repo)
- nutrient-sdk (Python, v1.0.4, premium)
- docling 2.86.0
- pymupdf4llm 1.27.2.2
- markitdown[pdf] 0.1.5
- markit-ai 0.2.0
- @llamaindex/liteparse
- pypdf 6.9.2

Key findings:
1. The free CLI in this repo matches docling and pymupdf4llm on
   structural extraction and is 15-75x faster.
2. Half the parsers (markitdown, pypdf, markit-ai, liteparse) produced
   zero markdown structure - all 245 pages collapsed into flat prose.
3. The Python SDK is the only parser that extracted figures. 110/110
   figures in the Mythos PDF recovered as actual image files, in place,
   adjacent to their captions.

The comparison directory includes:
- README.md with the qualitative spot-check
- One markdown file per parser (all 245 pages each)
- figures/ with 110 images extracted by the Python SDK
@jdrhyne jdrhyne merged commit a244e19 into main Apr 11, 2026
2 checks passed