
refactor: extract shared text walker and add encoding chain#62

Open
ajroetker wants to merge 10 commits into ledongthuc:master from ajroetker:refactor/text-walker-extraction

Conversation


@ajroetker ajroetker commented Jan 3, 2026

This PR builds on #61 and adds:

  • FontEncodingChain: Multi-layer font decoding that prioritizes:

    1. ToUnicode CMap (authoritative per PDF spec)
    2. CIDFont chain for Type0 fonts
    3. BaseEncoding + Differences
    4. Glyph name resolution (Adobe Glyph List patterns)
    5. Fallback heuristics
  • walkTextContent(): Shared method to consolidate PDF content stream parsing, reducing duplication between Content() and the new ContentWithMarks()
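The five-layer priority above can be sketched as an ordered lookup. This is a minimal illustration in Go; `encodingLayer`, `fontEncodingChain`, and the closure below are hypothetical stand-ins for the PR's actual types, not its API:

```go
package main

import "fmt"

// encodingLayer is one step in the hypothetical decoding chain; it reports
// whether it can map the character code.
type encodingLayer func(code uint32) (rune, bool)

// fontEncodingChain tries each layer in priority order (ToUnicode first,
// then CIDFont, BaseEncoding+Differences, glyph names) before falling back
// to a heuristic.
type fontEncodingChain struct {
	layers []encodingLayer
}

func (c *fontEncodingChain) Decode(code uint32) rune {
	for _, layer := range c.layers {
		if r, ok := layer(code); ok {
			return r
		}
	}
	// Layer 5: fallback heuristic — here, pass single-byte codes through.
	if code < 0x100 {
		return rune(code)
	}
	return '\uFFFD'
}

func main() {
	// A ToUnicode-style layer that only knows one mapping.
	toUnicode := func(code uint32) (rune, bool) {
		if code == 0x41 {
			return 'A', true // authoritative ToUnicode mapping
		}
		return 0, false
	}
	chain := &fontEncodingChain{layers: []encodingLayer{toUnicode}}
	fmt.Printf("%c %c\n", chain.Decode(0x41), chain.Decode(0x42))
}
```

Modeling each layer as a function that can decline keeps the priority order explicit and makes it cheap to skip layers a particular font does not have.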

The PDF specification states that ToUnicode CMap is the authoritative
source for character-to-Unicode mapping. Previously, the library only
checked ToUnicode for fonts with "Identity-H" encoding or null encoding,
causing incorrect text extraction for many PDFs.

This change:
- Checks ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better
  compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were being
incorrectly decoded (e.g., '0' appearing as 'M') due to ToUnicode
being ignored when an Encoding entry was present.

The PDF spec (section 9.6.6) requires that when an Encoding dictionary
is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding)
should be applied first, then the Differences array overlays specific
character code mappings on top.

Previously, dictEncoder only looked at the Differences array and matched
character codes one by one, which was both slow and incorrect for fonts
that rely on BaseEncoding for most characters.

This fix:
- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding
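The table-building step can be sketched as follows. This is an illustrative sketch, not the PR's code: `buildEncodingTable` is hypothetical, and the Differences array (which in a real PDF alternates start codes with glyph-name runs) is pre-resolved to runes here for brevity:

```go
package main

import "fmt"

// buildEncodingTable copies the BaseEncoding table first, then overlays the
// Differences entries on top, yielding a complete 256-entry lookup table.
func buildEncodingTable(base [256]rune, differences map[byte]rune) [256]rune {
	table := base // arrays copy by value: BaseEncoding first
	for code, r := range differences {
		table[code] = r // Differences overlay specific codes
	}
	return table
}

func main() {
	// Stand-in base table (a real one would be WinAnsi, MacRoman, or
	// PDFDocEncoding).
	var base [256]rune
	for i := range base {
		base[i] = rune(i)
	}
	// One Differences entry: remap 0x27 to a right single quote.
	table := buildEncodingTable(base, map[byte]rune{0x27: '\u2019'})
	fmt.Printf("%c %c\n", table[0x41], table[0x27]) // O(1) lookups at decode time
}
```

Paying the table cost once at initialization turns each decode into a single array index, which is where the O(n)-scan-to-O(1) improvement comes from.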

Fixes font encoding corruption in PDFs where fonts use custom Encoding
dictionaries with BaseEncoding + Differences (common in legal documents).
Builds on fix/tounicode-priority branch.

Changes:
- Add FontEncodingChain for multi-layer font decoding:
  1. ToUnicode CMap (authoritative per PDF spec)
  2. CIDFont chain for Type0 fonts
  3. BaseEncoding + Differences
  4. Glyph name resolution (Adobe Glyph List patterns)
  5. Fallback heuristics
- Add walkTextContent() to consolidate PDF content stream parsing
- Refactor Content() and ContentWithMarks() to use shared walker
- Add CharInfo and ContentWalkOptions types
- Remove dead dictEncoder code (now handled by FontEncodingChain)
- Use math package instead of custom sqrt/atan implementations
- Add comprehensive tests for encoding chain functionality

Instead of replacing unmapped character codes with U+FFFD (which loses
information), encode them in the Private Use Area (U+E000-U+E0FF). This
allows post-processing to recover the original byte value and apply
custom decodings (e.g., shifted encodings).

Also raises the replacement threshold from 20% to 50% to allow more
text through for post-processing.
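The round trip through the Private Use Area can be sketched like this; `decodeWithPUA` and `recoverByte` are illustrative names, not the library's:

```go
package main

import "fmt"

// decodeWithPUA maps an unmapped byte into the Private Use Area instead of
// U+FFFD, preserving the original code as U+E000 + byte.
func decodeWithPUA(b byte, mapped map[byte]rune) rune {
	if r, ok := mapped[b]; ok {
		return r
	}
	return 0xE000 + rune(b)
}

// recoverByte inverts the PUA mapping during post-processing, so a custom
// decoding (e.g., a shifted encoding) can be applied to the raw byte.
func recoverByte(r rune) (byte, bool) {
	if r >= 0xE000 && r <= 0xE0FF {
		return byte(r - 0xE000), true
	}
	return 0, false
}

func main() {
	r := decodeWithPUA(0x4D, nil) // no mapping: preserved as U+E04D
	b, ok := recoverByte(r)
	fmt.Printf("%U %#x %v\n", r, b, ok)
}
```

Unlike U+FFFD, which collapses every unmapped code to the same rune, the PUA rune is a lossless placeholder: the original byte is always recoverable.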

The previous PUA preservation fix only addressed encoding_chain.go
(layers 3-5). Pages with partial ToUnicode coverage (e.g., 70% valid,
30% unmapped) would pass validation and never reach the PUA-preserving
layers.

This fix adds PUA preservation directly to cmap.Decode() in page.go,
converting the 3 locations that used noRune (U+FFFD) to use PUA instead:
- Line 311: bfrange with unknown destination type
- Line 315: codespace match but no bfchar/bfrange mapping
- Line 323: no codespace range matches

Rewrite GetPlainText to use walkTextContent with position-based word
boundary detection, similar to MuPDF's approach:

- Use character width × 0.2 as gap threshold for detecting word breaks
- Use font size × 0.5 as threshold for detecting line breaks
- Insert spaces when gap between characters exceeds threshold

This fixes PDFs where text runs together without explicit space
characters in the content stream. The previous implementation just
concatenated all text without considering character positions.
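The threshold logic can be sketched as below. `charInfo` here is a hypothetical stand-in for the PR's CharInfo (rune plus position, advance width, and font size), and the assembly loop is a simplification of the real walker:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// charInfo is a stand-in for a positioned character on the page.
type charInfo struct {
	R       rune
	X, Y    float64
	W, Size float64
}

// assemble inserts spaces and newlines from positions alone, using the
// thresholds described above: width*0.2 for word gaps, size*0.5 for lines.
func assemble(chars []charInfo) string {
	var b strings.Builder
	for i, c := range chars {
		if i > 0 {
			prev := chars[i-1]
			if math.Abs(c.Y-prev.Y) > prev.Size*0.5 {
				b.WriteByte('\n') // vertical jump: line break
			} else if c.X-(prev.X+prev.W) > prev.W*0.2 {
				b.WriteByte(' ') // horizontal gap: word break
			}
		}
		b.WriteRune(c.R)
	}
	return b.String()
}

func main() {
	chars := []charInfo{
		{'H', 0, 0, 5, 10}, {'i', 5, 0, 5, 10},
		{'P', 13, 0, 5, 10}, // gap of 3 > 5*0.2, so a space is inserted
	}
	fmt.Println(assemble(chars))
}
```

Because the decision uses only geometry, it recovers word boundaries even when the content stream never emits an explicit space character.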

For CID fonts with 2-byte character codes where the high byte is 0x00,
only convert the low byte to PUA. This avoids interleaved null PUA chars
(\ue000) that break shift detection in post-processing.

Matches PyMuPDF's handling of CID fonts.
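A sketch of the special case, with `cidToPUA` a hypothetical helper; the branch for codes with a non-zero high byte is an assumption for illustration, not taken from the PR:

```go
package main

import "fmt"

// cidToPUA converts an unmapped 2-byte CID code to PUA runes. When the high
// byte is 0x00, only the low byte is converted, so no interleaved \ue000
// appears to break shift detection in post-processing.
func cidToPUA(code uint16) []rune {
	if code>>8 == 0 {
		return []rune{0xE000 + rune(code&0xFF)} // low byte only
	}
	// Assumed handling for illustration: preserve both bytes.
	return []rune{0xE000 + rune(code>>8), 0xE000 + rune(code&0xFF)}
}

func main() {
	fmt.Printf("%U\n", cidToPUA(0x0041)) // one PUA rune for the low byte
}
```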

Return an empty reader for zero-length streams before applying filters.
This prevents "unexpected EOF" errors from zlib when processing empty
FlateDecode streams in PDFs.

- Change module path from ledongthuc/pdf to ajroetker/pdf
- Add render/ submodule for PDF page rasterization (from antflydb/antfly)
- render/ depends on ajroetker/go-jpeg2000 for embedded JPEG2000 images
