refactor: extract shared text walker and add encoding chain #62
Open
ajroetker wants to merge 10 commits into ledongthuc:master
The PDF specification states that the ToUnicode CMap is the authoritative source for character-to-Unicode mapping. Previously, the library only checked ToUnicode for fonts with "Identity-H" encoding or null encoding, causing incorrect text extraction for many PDFs.

This change:
- Checks ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were being incorrectly decoded (e.g., '0' appearing as 'M') due to ToUnicode being ignored when an Encoding entry was present.
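The lookup order described above can be sketched as follows. This is a minimal illustration, not the library's code: the `decoder` interface, `tableDecoder`, and `pickDecoder` are hypothetical names introduced here, and the toy table stands in for a parsed ToUnicode CMap.

```go
package main

// decoder maps raw character codes from a content stream to Unicode text.
// Hypothetical interface for illustration; not the library's actual type.
type decoder interface {
	Decode(raw string) string
}

// tableDecoder is a toy decoder backed by a byte-to-rune map.
// Bytes without an entry pass through unchanged.
type tableDecoder map[byte]rune

func (t tableDecoder) Decode(raw string) string {
	out := make([]rune, 0, len(raw))
	for i := 0; i < len(raw); i++ {
		if r, ok := t[raw[i]]; ok {
			out = append(out, r)
		} else {
			out = append(out, rune(raw[i]))
		}
	}
	return string(out)
}

// pickDecoder applies the priority from the commit message:
// ToUnicode first, then the font's Encoding, then a fallback
// (pdfDocEncoding in the commit). Pass an untyped nil for a missing entry.
func pickDecoder(toUnicode, encoding, fallback decoder) decoder {
	if toUnicode != nil {
		return toUnicode
	}
	if encoding != nil {
		return encoding
	}
	return fallback
}
```

With a ToUnicode table that maps the byte `M` to `0`, `pickDecoder` returns the ToUnicode decoder even when an Encoding decoder is also supplied, which is the case the commit says was previously mishandled.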
The PDF spec (section 9.6.6) requires that when an Encoding dictionary is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding) should be applied first, then the Differences array overlays specific character code mappings on top. Previously, dictEncoder only looked at the Differences array and matched character codes one by one, which was both slow and incorrect for fonts that rely on BaseEncoding for most characters.

This fix:
- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding

Fixes font encoding corruption in PDFs where fonts use custom Encoding dictionaries with BaseEncoding + Differences (common in legal documents).
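A rough sketch of the table construction, under stated assumptions: `difference`, `buildTable`, and `decode` are hypothetical names, glyph names in the Differences array are assumed to be resolved to runes already, and an identity table stands in for PDFDocEncoding when no base table is given.

```go
package main

// difference is one already-resolved entry of a Differences array:
// a character code and the rune its glyph name maps to.
// Hypothetical type; glyph-name resolution is assumed to happen elsewhere.
type difference struct {
	code byte
	r    rune
}

// buildTable copies the base encoding first, then overlays the
// Differences entries, yielding a complete 256-entry lookup table.
func buildTable(base *[256]rune, diffs []difference) [256]rune {
	var table [256]rune
	if base != nil {
		table = *base
	} else {
		// Stand-in default: identity mapping (the commit defaults to
		// PDFDocEncoding, which is not reproduced here).
		for i := range table {
			table[i] = rune(i)
		}
	}
	for _, d := range diffs {
		table[d.code] = d.r
	}
	return table
}

// decode performs one O(1) table lookup per input byte.
func decode(table *[256]rune, raw []byte) string {
	out := make([]rune, len(raw))
	for i, b := range raw {
		out[i] = table[b]
	}
	return string(out)
}
```

Building the table once moves the cost of scanning Differences out of the per-character decoding path.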
Builds on fix/tounicode-priority branch.

Changes:
- Add FontEncodingChain for multi-layer font decoding:
  1. ToUnicode CMap (authoritative per PDF spec)
  2. CIDFont chain for Type0 fonts
  3. BaseEncoding + Differences
  4. Glyph name resolution (Adobe Glyph List patterns)
  5. Fallback heuristics
- Add walkTextContent() to consolidate PDF content stream parsing
- Refactor Content() and ContentWithMarks() to use shared walker
- Add CharInfo and ContentWalkOptions types
- Remove dead dictEncoder code (now handled by FontEncodingChain)
- Use math package instead of custom sqrt/atan implementations
- Add comprehensive tests for encoding chain functionality
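The layered lookup can be pictured as an ordered list of resolvers that is tried until one succeeds. The sketch below is an assumption about the general shape only: `layer` and `chain` are hypothetical names, and the real FontEncodingChain may be structured quite differently.

```go
package main

// layer tries to resolve one character code and reports whether it could.
// Hypothetical type; one layer per step (ToUnicode, CIDFont, and so on).
type layer func(code uint16) (rune, bool)

// chain holds layers in priority order, highest priority first.
type chain []layer

// decode returns the result of the first layer that resolves the code.
// If no layer succeeds, this sketch returns U+FFFD.
func (c chain) decode(code uint16) rune {
	for _, l := range c {
		if r, ok := l(code); ok {
			return r
		}
	}
	return '\uFFFD'
}
```

Ordering the slice as ToUnicode, CIDFont, BaseEncoding + Differences, glyph names, then heuristics would mirror the priority listed in the commit message.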
Instead of replacing unmapped character codes with U+FFFD (which loses information), encode them in the Private Use Area (U+E000-U+E0FF). This allows post-processing to recover the original byte value and apply custom decodings (e.g., shifted encodings). Also raises the replacement threshold from 20% to 50% to allow more text through for post-processing.
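The mapping into and out of the Private Use Area is simple arithmetic. The following is a minimal sketch: `toPUA`, `fromPUA`, and `recoverShifted` are hypothetical helpers, and the shift-based recovery is only one example of the kind of post-processing the commit alludes to.

```go
package main

// puaBase is the start of the range U+E000-U+E0FF used to carry raw bytes.
const puaBase = 0xE000

// toPUA encodes an unmapped byte as a Private Use Area rune.
func toPUA(b byte) rune {
	return puaBase + rune(b)
}

// fromPUA recovers the original byte, or reports false if r is outside
// the U+E000-U+E0FF range.
func fromPUA(r rune) (byte, bool) {
	if r >= puaBase && r <= puaBase+0xFF {
		return byte(r - puaBase), true
	}
	return 0, false
}

// recoverShifted is an illustrative post-processing step: it replaces each
// PUA rune with the original byte value plus a caller-supplied shift.
func recoverShifted(s string, shift int) string {
	out := []rune(s)
	for i, r := range out {
		if b, ok := fromPUA(r); ok {
			out[i] = rune(int(b) + shift)
		}
	}
	return string(out)
}
```

Unlike U+FFFD, each PUA rune still identifies the byte it came from, so a later pass can reinterpret it.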
The previous PUA preservation fix only addressed encoding_chain.go (layers 3-5). Pages with partial ToUnicode coverage (e.g., 70% valid, 30% unmapped) would pass validation and never reach the PUA-preserving layers.

This fix adds PUA preservation directly to cmap.Decode() in page.go, converting the 3 locations that used noRune (U+FFFD) to use PUA instead:
- Line 311: bfrange with unknown destination type
- Line 315: codespace match but no bfchar/bfrange mapping
- Line 323: no codespace range matches
Rewrite GetPlainText to use walkTextContent with position-based word boundary detection, similar to MuPDF's approach:
- Use character width × 0.2 as gap threshold for detecting word breaks
- Use font size × 0.5 as threshold for detecting line breaks
- Insert spaces when gap between characters exceeds threshold

This fixes PDFs where text runs together without explicit space characters in the content stream. The previous implementation just concatenated all text without considering character positions.
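The two thresholds can be illustrated with a small sketch. The `charInfo` struct and `plainText` function here are hypothetical (the PR's CharInfo fields are not shown in this conversation), and measuring the gap against the previous character's width and font size is an assumption.

```go
package main

import (
	"math"
	"strings"
)

// charInfo is a hypothetical positioned glyph: its rune, origin,
// horizontal advance, and font size, all in the same units.
type charInfo struct {
	r     rune
	x, y  float64
	width float64
	size  float64
}

// plainText joins characters, inserting a newline when the vertical jump
// exceeds 0.5 × font size and a space when the horizontal gap exceeds
// 0.2 × character width.
func plainText(chars []charInfo) string {
	var sb strings.Builder
	for i, c := range chars {
		if i > 0 {
			prev := chars[i-1]
			gap := c.x - (prev.x + prev.width)
			switch {
			case math.Abs(c.y-prev.y) > prev.size*0.5:
				sb.WriteByte('\n')
			case gap > prev.width*0.2:
				sb.WriteByte(' ')
			}
		}
		sb.WriteRune(c.r)
	}
	return sb.String()
}
```

For glyphs 5 units wide at font size 10, a gap of 2 units exceeds the 1-unit word threshold and yields a space, while a 12-unit vertical jump exceeds the 5-unit line threshold and yields a newline.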
For CID fonts with 2-byte character codes where the high byte is 0x00, only convert the low byte to PUA. This avoids interleaved null PUA chars (\ue000) that break shift detection in post-processing. Matches PyMuPDF's handling of CID fonts.
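A minimal sketch of the high-byte rule, with hypothetical names. The commit only describes the case where the high byte is 0x00; emitting both bytes otherwise is an assumption made here so the example is complete.

```go
package main

// puaBase is the start of the Private Use Area range used for raw bytes.
const puaBase = 0xE000

// preserveCID converts an unmapped 2-byte CID code to PUA runes.
// When the high byte is 0x00 only the low byte is emitted, so the output
// contains no interleaved U+E000 runes.
func preserveCID(hi, lo byte) []rune {
	if hi == 0x00 {
		return []rune{puaBase + rune(lo)}
	}
	// Assumption: codes with a non-zero high byte keep both bytes.
	return []rune{puaBase + rune(hi), puaBase + rune(lo)}
}
```

For the code 0x0041 this produces the single rune U+E041 rather than the pair U+E000, U+E041.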
Return an empty reader for zero-length streams before applying filters. This prevents "unexpected EOF" errors from zlib when processing empty FlateDecode streams in PDFs.
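The guard can be sketched with the standard library alone. `flateReader` is a hypothetical helper, not the library's function; it relies on `zlib.NewReader` from `compress/zlib`, which reads the two-byte zlib header eagerly and so fails on empty input.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"io"
)

// flateReader returns a reader over the decompressed stream data.
// A zero-length stream short-circuits to an empty reader, because
// zlib.NewReader would otherwise fail while reading the missing header.
func flateReader(raw []byte) (io.Reader, error) {
	if len(raw) == 0 {
		return bytes.NewReader(nil), nil
	}
	return zlib.NewReader(bytes.NewReader(raw))
}
```

Reading from the result of `flateReader(nil)` yields zero bytes and end-of-file, so callers see an empty stream instead of a decompression error.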
- Change module path from ledongthuc/pdf to ajroetker/pdf
- Add render/ submodule for PDF page rasterization (from antflydb/antfly)
- render/ depends on ajroetker/go-jpeg2000 for embedded JPEG2000 images
This PR builds on #61 and adds:
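- FontEncodingChain: Multi-layer font decoding that prioritizes, in order: ToUnicode CMap, the CIDFont chain for Type0 fonts, BaseEncoding + Differences, glyph name resolution, and fallback heuristics
- walkTextContent(): Shared method to consolidate PDF content stream parsing, reducing duplication between Content() and the new ContentWithMarks()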