Avoid panic bad TL on malformed Tf/TL; skip operator instead #8

Open
BrennenWright wants to merge 2 commits into dslipak:master from BrennenWright:master

@BrennenWright

This PR prevents panics when a content stream uses Tf (set text font and size)
or TL (set text leading) with an unexpected number of arguments.

Today, page.go does:

case "Tf":
    if len(args) != 2 {
        panic("bad TL")
    }
    ...

case "TL":
    if len(args) != 1 {
        panic("bad TL")
    }

Some real-world PDFs contain non-standard or malformed Tf/TL operators.
These trigger a panic ("bad TL") that terminates extraction, even
though the rest of the page is readable.

This change makes those cases non-fatal:

- If Tf/TL have the wrong arg count, we simply return from the handler
  and continue interpreting the rest of the stream.
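
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the actual diff: `textState` and `applyOp` are hypothetical names that do not come from page.go, and the operand type checks are an extra precaution that this PR does not claim to add.

```go
package main

import "fmt"

// textState is a hypothetical stand-in for the interpreter's text state.
type textState struct {
	fontName string
	fontSize float64
	leading  float64
}

// applyOp applies one content-stream operator to g. It returns false when
// the operator is skipped because its operands are malformed, and true
// otherwise. It never panics on a bad operand count.
func applyOp(g *textState, op string, args []interface{}) bool {
	switch op {
	case "Tf": // set text font and size: expects a name and a number
		if len(args) != 2 {
			return false // previously: panic("bad TL")
		}
		name, okName := args[0].(string)
		size, okSize := args[1].(float64)
		if !okName || !okSize {
			return false // assumption: wrong operand types are also skipped
		}
		g.fontName, g.fontSize = name, size
	case "TL": // set text leading: expects a single number
		if len(args) != 1 {
			return false // previously: panic("bad TL")
		}
		leading, ok := args[0].(float64)
		if !ok {
			return false
		}
		g.leading = leading
	}
	// Operators not handled here are ignored, as the PR describes for
	// unknown or unsupported operators.
	return true
}

func main() {
	g := &textState{}
	ops := []struct {
		op   string
		args []interface{}
	}{
		{"Tf", []interface{}{"F1"}},       // malformed: size operand missing
		{"TL", []interface{}{14.0}},       // well formed
		{"Tf", []interface{}{"F1", 12.0}}, // well formed
	}
	for _, o := range ops {
		if !applyOp(g, o.op, o.args) {
			fmt.Printf("skipped malformed %s with %d operand(s)\n", o.op, len(o.args))
		}
	}
	fmt.Printf("%+v\n", *g)
}
```

In this sketch the malformed Tf leaves the text state untouched, and the later well-formed operators still take effect, which is what allows the rest of the page to be extracted.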

This matches the library’s general behavior of ignoring unknown or
unsupported operators instead of panicking, and allows callers to
still extract partial text from PDFs with slightly malformed content
streams.

I’ve tested this against a PDF that previously panicked ("bad TL"); with
this change the panic is gone and text extraction now succeeds.

A malformed Tf or TL should be skipped rather than trigger a panic, because a panic makes the library unusable for general parsing.