Two Python scripts for working with PDF files using open-source tools only — no Adobe Acrobat needed.
| Tool | What it does |
|---|---|
| `add_toc.py` | Detects chapter headings and inserts a Table of Contents page into each PDF |
| `pdf_to_docx.py` | Converts PDFs to Word documents with high fidelity, preserving fonts, images, tables, equations, and layout |
# Install dependencies
pip install PyMuPDF python-docx Pillow lxml
# Place your PDFs in a folder, then:
# 1. Add Table of Contents to all PDFs
python3 add_toc.py # Output: with_toc/ subfolder
# 2. Convert all PDFs to Word
python3 pdf_to_docx.py # Output: docx_output/ subfolder
Note: Both scripts process all PDFs in the input folder. Edit the `work_dir` variable at the bottom of each script to change the input folder.
- Python 3.10+
- PyMuPDF (fitz) — PDF reading and manipulation
- python-docx — Word document generation
- Pillow — Image processing
- lxml — XML manipulation (installed with python-docx)
pip install -r requirements.txt
Scans each PDF for chapter headings using font metrics (size, bold weight) and inserts a formatted TOC page at the beginning.
- Body text detection — The most frequently used font size (by character count) is identified as body text
- Heading extraction — Bold text at font sizes larger than body text is collected as heading candidates
- Multi-line merge — Adjacent heading fragments on the same page are merged (handles line-wrapped titles)
- H1/H2 selection — If the largest heading level has more than 3 entries, it's used for chapters. If 3 or fewer, it's likely a book title, so the next level down is used instead
- Over-extraction guard — If more than 50 headings are detected, the script narrows to only "Chapter N"-prefixed entries
- TOC generation — A formatted TOC page with dot leaders and page numbers is created and inserted before page 1
- Bookmarks — PDF bookmark metadata is also set for navigation in PDF viewers
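The detection steps above can be sketched in plain Python. This is an illustrative sketch, not the code in `add_toc.py`: it assumes spans shaped like the dictionaries PyMuPDF returns from `page.get_text("dict")` (keys `text`, `size`, `flags`, with bit 4 of `flags` marking bold). The function names, the `min_entries` and `max_headings` parameters, and the choice to apply the 50-heading guard to the selected level only are assumptions made for the example.

```python
import re
from collections import Counter

BOLD_FLAG = 1 << 4  # PyMuPDF marks bold spans with bit 4 of "flags"
CHAPTER_RE = re.compile(r"^chapter\s+\d+", re.IGNORECASE)


def body_font_size(spans):
    """Return the font size that covers the most characters."""
    counts = Counter()
    for span in spans:
        counts[round(span["size"], 1)] += len(span["text"].strip())
    return counts.most_common(1)[0][0]


def heading_candidates(spans):
    """Keep non-empty bold spans set larger than the body text."""
    body = body_font_size(spans)
    return [
        s for s in spans
        if round(s["size"], 1) > body
        and s["flags"] & BOLD_FLAG
        and s["text"].strip()
    ]


def chapter_titles(candidates, min_entries=4, max_headings=50):
    """Pick the chapter level, then apply the over-extraction guard."""
    by_size = {}
    for s in candidates:
        by_size.setdefault(round(s["size"], 1), []).append(s["text"].strip())
    if not by_size:
        return []
    sizes = sorted(by_size, reverse=True)
    level = sizes[0]
    # Three or fewer entries at the largest size: probably the book title,
    # so fall back to the next level down when one exists.
    if len(by_size[level]) < min_entries and len(sizes) > 1:
        level = sizes[1]
    titles = by_size[level]
    # Assumed reading of the guard: narrow to "Chapter N" entries only.
    if len(titles) > max_headings:
        titles = [t for t in titles if CHAPTER_RE.match(t)]
    return titles
```

Counting characters rather than spans keeps a document with many short bold fragments from skewing the body-size estimate.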
$ python3 add_toc.py
Processing: Cisco ASA Handbook.pdf
Body text: 12.0pt, Heading sizes: [13.5, 18.0]
Found 12 H1 headings, 1 H2 heading
TOC: 1 page
Saved: with_toc/Cisco ASA Handbook.pdf
SUMMARY: 12/12 succeeded
- Handles TeX/LaTeX PDFs with Computer Modern fonts
- Detects running headers (text repeated on >25% of pages) and excludes them
- Keyword-based fallback for PDFs where font metrics alone aren't enough ("Chapter N" pattern matching)
- Deduplication of headings appearing on adjacent pages
- Hierarchy normalization to ensure valid PDF bookmark structure
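As a rough illustration of the running-header rule, the sketch below flags any line that appears on more than 25% of pages. It works on plain lists of strings, one list per page, and is not the script's actual implementation. Masking digits so that "Page 12" and "Page 13" count as the same line is an assumption made for the example, as are the function names.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse whitespace and mask digits so 'Page 12' matches 'Page 13'."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def running_headers(pages, threshold=0.25):
    """Return normalized lines found on more than `threshold` of the pages."""
    page_counts = Counter()
    for lines in pages:
        # A set counts each distinct line at most once per page.
        page_counts.update({normalize(l) for l in lines if l.strip()})
    limit = threshold * len(pages)
    return {text for text, n in page_counts.items() if n > limit}


def strip_running_headers(pages, threshold=0.25):
    """Drop the repeated lines from every page, keeping everything else."""
    repeated = running_headers(pages, threshold)
    return [[l for l in lines if normalize(l) not in repeated] for lines in pages]
```

A stricter version would probably compare only lines near the top and bottom margins, which avoids discarding body text that happens to repeat.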
Converts each PDF to a .docx file, preserving the original layout as closely as possible.
| Element | How it's handled |
|---|---|
| Text | Extracted with font name, size, bold/italic, color per span. PDF's broken lines are rejoined into proper paragraphs |
| Fonts | Mapped to closest Word-compatible equivalent (Times→Times New Roman, Helvetica→Arial, Courier→Courier New, etc.) |
| Page layout | Original page size and margins are preserved |
| Images | Extracted at original resolution, positioned as block or inline based on width |
| Tables | Detected from vector line drawings (horizontal/vertical line intersections), reconstructed with cell text and borders |
| Equations | TeX math (detected via CM/AMS font names) is rendered as 200 DPI images |
| Headers/Footers | Running text patterns are detected, excluded from body, and placed in Word header/footer zones |
| Headings | Detected via font-size analysis and styled with Word heading styles |
| Hyphenation | Line-end hyphens are removed when joining lines |
| Colors | RGB, grayscale, and CMYK colors are mapped to Word RGB values |
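The font and color rows can be illustrated with a small sketch. It is not taken from `pdf_to_docx.py`: the mapping holds only the three pairs named in the table, the `DEFAULT_FONT` fallback is hypothetical, and the CMYK conversion is the naive formula without ICC color profiles, so results can differ slightly from a color-managed viewer.

```python
# Prefix of the PDF font name -> Word font (only the pairs named in the table).
FONT_MAP = {
    "times": "Times New Roman",
    "helvetica": "Arial",
    "courier": "Courier New",
}
DEFAULT_FONT = "Times New Roman"  # hypothetical fallback for unknown fonts


def word_font(pdf_font_name):
    """Map a PDF font name to a Word-compatible font name."""
    # Embedded subsets look like "ABCDEF+Times-Roman"; drop the subset tag.
    base = pdf_font_name.split("+", 1)[-1].lower()
    for prefix, target in FONT_MAP.items():
        if base.startswith(prefix):
            return target
    return DEFAULT_FONT


def _hex(r, g, b):
    """Format three floats in 0..1 as an 'RRGGBB' string."""
    channels = (int(max(0.0, min(1.0, v)) * 255 + 0.5) for v in (r, g, b))
    return "{:02X}{:02X}{:02X}".format(*channels)


def color_to_hex(components):
    """Convert a gray (1), RGB (3) or CMYK (4) color to 'RRGGBB'."""
    if len(components) == 1:
        gray = components[0]
        return _hex(gray, gray, gray)
    if len(components) == 3:
        return _hex(*components)
    if len(components) == 4:
        c, m, y, k = components
        # Naive CMYK -> RGB, no color management.
        return _hex((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    raise ValueError("expected 1, 3 or 4 color components")
```

The six-digit hex string is the form that python-docx's `RGBColor.from_string` accepts, so the result can be passed straight to a run's font color.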
$ python3 pdf_to_docx.py
Processing: Cisco ASA Handbook.pdf
Pages: 884
Page: 612x792pt, margins: L=72 R=71 T=72 B=85
Body: 12.0pt, Heading sizes: [13.5, 18.0]
Total elements rendered: 8706
Saved: docx_output/Cisco ASA Handbook.docx (6.5 MB)
SUMMARY: 12/12 succeeded
PDF breaks text into individual lines. The converter reassembles them into paragraphs using these heuristics:
- Join lines when: vertical gap is small (< 1.8x font height), same font size, no list bullet or number at start
- Don't join when: font size changes (heading boundary), large vertical gap, line ends with `.`, `!`, or `?` and next starts uppercase with a gap, significant indentation change
- Hyphen handling: if a line ends with `-` and the next starts lowercase, the hyphen is removed and words are joined
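The heuristics above might look roughly like the following. This is a sketch under stated assumptions rather than the converter's code: the `Line` record is hypothetical, the vertical gap is measured baseline to baseline in units of font size, and the `sentence_gap` and `indent_tol` thresholds are invented for illustration. Only the 1.8 factor comes from the description above.

```python
import re
from dataclasses import dataclass

# A bullet character or "1." / "1)" at the start of a line.
LIST_START = re.compile(r"^\s*(?:[-•*]|\d+[.)])\s+")


@dataclass
class Line:
    text: str
    size: float      # font size in points
    baseline: float  # y of the baseline, increasing down the page
    x0: float        # left edge of the line


def should_join(prev, nxt, max_gap=1.8, sentence_gap=1.3, indent_tol=10.0):
    """Decide whether `nxt` continues the paragraph that `prev` belongs to."""
    if abs(prev.size - nxt.size) > 0.1:
        return False  # font size change: likely a heading boundary
    gap = (nxt.baseline - prev.baseline) / prev.size
    if gap <= 0 or gap >= max_gap:
        return False  # new column or page, or a large vertical gap
    if LIST_START.match(nxt.text):
        return False  # list bullet or number at the start
    if abs(nxt.x0 - prev.x0) > indent_tol:
        return False  # significant indentation change
    ends_sentence = prev.text.rstrip().endswith((".", "!", "?"))
    if ends_sentence and nxt.text.lstrip()[:1].isupper() and gap > sentence_gap:
        return False  # sentence end followed by extra spacing
    return True


def join_text(left, right):
    """Join two fragments, removing a line-end hyphen where appropriate."""
    left, right = left.rstrip(), right.lstrip()
    if left.endswith("-") and right[:1].islower():
        return left[:-1] + right
    return left + " " + right


def build_paragraphs(lines):
    """Merge consecutive lines into paragraph strings."""
    paragraphs = []
    prev = None
    for line in lines:
        if prev is not None and should_join(prev, line):
            paragraphs[-1] = join_text(paragraphs[-1], line.text)
        else:
            paragraphs.append(line.text.strip())
        prev = line
    return paragraphs
```

Removing every line-end hyphen before a lowercase letter also rewrites genuine compounds such as "open-source" when they happen to break across lines, which is a known trade-off of this rule.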
- Equations are images, not editable Office Math — this preserves visual fidelity for complex TeX math
- Vector diagrams (line art, flowcharts) are not currently extracted as images
- Multi-column layouts are flowed into single-column output
- Type 3 fonts with broken encoding produce garbled text (this is a PDF-level issue)
- Exact line breaks differ from the original because Word re-flows text using its own line-breaking algorithm
Successfully processed 12 diverse PDFs covering:
| Type | Examples | Pages |
|---|---|---|
| Math/Science textbooks | Algorithmic Algebra, Noncommutative Geometry | 338–538 |
| Scientific protocols | Chromatin Protocols | 508 |
| Networking/IT handbooks | Cisco ASA, Communications Network | 784–884 |
| Security books | Hack Attacks, Hackers Beware, Hacking Firewalls | 354–792 |
| Encyclopedia | Encyclopedia of Public Health | 384 |
| Materials science | Creep in Metals and Alloys | 279 |
| Physics (Type 3 fonts) | Solid State Physics | 406 |
.
├── add_toc.py # TOC generation script (951 lines)
├── pdf_to_docx.py # PDF → DOCX converter (1819 lines)
├── requirements.txt # Python dependencies
├── CLAUDE.md # Claude Code project context
├── README.md # This file
├── with_toc/ # Output: PDFs with TOC (generated, gitignored)
└── docx_output/ # Output: DOCX files (generated, gitignored)
Copyright (c) 2025-2026 Thothica Private Limited, Delhi, India. All rights reserved.
This software is proprietary and confidential. Unauthorized copying, distribution, modification, or use of this software, in whole or in part, is strictly prohibited without prior written authorization from Thothica Private Limited.
See LICENSE for full terms.