fix: render filled PDF form field values (init_forms) by badGarnet · Pull Request #515 · Unstructured-IO/unstructured-inference

badGarnet · 2026-06-11T21:54:01Z

Summary

Values typed into fillable PDF form fields live in widget annotation appearance streams, and pdfium only paints those appearances after the form-fill environment is initialized. convert_pdf_to_image opened the document but never called init_forms(), so filled form-field values were silently dropped from the rendered page image — and therefore from downstream OCR / hi_res partitioning.

Change

unstructured_inference/inference/pdf_image.py — call pdf.init_forms() immediately after opening the PdfDocument, inside the existing _pdfium_lock. The form environment is torn down automatically on pdf.close().

with _pdfium_lock:
    pdf = pdfium.PdfDocument(filename or file, password=password)
    pdf.init_forms()
    n_pages = len(pdf)
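
For readers unfamiliar with pypdfium2, the snippet above can be expanded into a minimal end-to-end sketch. This is not the body of convert_pdf_to_image: the function name render_pages, the dpi parameter, and the choice to hold the lock for the whole render are assumptions made for illustration. Only PdfDocument, init_forms(), len(pdf), and close() come from the PR text; page.render(scale=...) and to_pil() are the usual pypdfium2 rendering calls.

```python
import threading

# Module-level lock, mirroring the `_pdfium_lock` named in the PR (pdfium is not thread-safe).
_pdfium_lock = threading.Lock()


def render_pages(path, password=None, dpi=200):
    """Render each page of a PDF to a PIL image, with form fields painted (hypothetical helper)."""
    import pypdfium2 as pdfium  # third-party; imported lazily so this sketch loads without it

    scale = dpi / 72  # PDF user space is 72 units per inch
    with _pdfium_lock:
        pdf = pdfium.PdfDocument(path, password=password)
        try:
            # Initialize the form-fill environment before any page is loaded,
            # so widget appearance streams are drawn during rendering.
            pdf.init_forms()
            images = []
            for index in range(len(pdf)):
                page = pdf[index]
                # Copy so the image does not depend on the bitmap's buffer.
                images.append(page.render(scale=scale).to_pil().copy())
            return images
        finally:
            # Closing the document also tears down the form environment.
            pdf.close()
```

Calling render_pages("sample-docs/form-field.pdf") would be the natural usage, assuming pypdfium2 and Pillow are installed and the fixture path is relative to the working directory.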

Test

New test_pdf_image_forms.py builds a synthetic 1-page PDF in-test (pypdf): an empty content stream plus one /Tx widget whose value is drawn only by its /AP /N appearance stream.

test_convert_pdf_to_image_renders_acroform_field_value — renders via convert_pdf_to_image and asserts the field value's pixels appear.
test_convert_pdf_to_image_drops_form_field_without_init_forms (control) — patches init_forms to a no-op, reproducing pre-fix behavior, and asserts the field region is blank.
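
The positive/control pairing described above can be sketched with a stand-in class, so the pattern is visible without pdfium. FakeDocument, render_value, and both test names are hypothetical; in the real test file the patch target would presumably be init_forms on pypdfium2's document class, which this sketch does not touch.

```python
from unittest import mock


class FakeDocument:
    """Hypothetical stand-in: field values are visible only after init_forms()."""

    def __init__(self):
        self._forms_ready = False

    def init_forms(self):
        self._forms_ready = True

    def render_text(self):
        # Mimics pdfium: widget appearances are skipped without a form environment.
        return "FORMVALUE777" if self._forms_ready else ""


def render_value():
    """Code under test: open the document, initialize forms, render."""
    doc = FakeDocument()
    doc.init_forms()
    return doc.render_text()


def test_renders_field_value():
    assert render_value() == "FORMVALUE777"


def test_control_without_init_forms():
    # Patch init_forms to a no-op to reproduce the pre-fix behavior.
    with mock.patch.object(FakeDocument, "init_forms", lambda self: None):
        assert render_value() == ""
```

The control test is what ties the rendered value to the init_forms() call: if the value appeared even with the no-op patch, it would have to come from somewhere else.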

The empty content stream + control test prove the rendered value comes specifically from init_forms(), not from page content (verified: ~0 dark px without init, ~3000 with).
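
The dark-pixel check can be illustrated with plain Python lists in place of a rendered image. This is a sketch under stated assumptions: the helper names, the threshold of 128, and the widget rectangle (100, 700, 300, 720) are invented for illustration. Only the idea of counting dark pixels inside the widget region comes from the PR.

```python
def pdf_rect_to_pixel_box(rect, page_height, scale):
    """Map a PDF user-space rectangle (origin bottom-left) to pixel coords (origin top-left)."""
    x1, y1, x2, y2 = rect
    left = int(x1 * scale)
    right = int(x2 * scale)
    # The y axis flips: the rectangle's upper edge (y2) becomes the smaller pixel row.
    top = int((page_height - y2) * scale)
    bottom = int((page_height - y1) * scale)
    return left, top, right, bottom


def count_dark_pixels(gray_rows, box, threshold=128):
    """Count grayscale values below `threshold` inside `box` (left, top, right, bottom)."""
    left, top, right, bottom = box
    return sum(
        1
        for row in gray_rows[top:bottom]
        for value in row[left:right]
        if value < threshold
    )


# Hypothetical widget rectangle on a 612 x 792 page rendered at scale 1.
box = pdf_rect_to_pixel_box((100, 700, 300, 720), page_height=792, scale=1)

# A blank white page: nothing dark inside the field region.
blank = [[255] * 612 for _ in range(792)]
blank_count = count_dark_pixels(blank, box)

# Simulate painted field text as a 10 x 40 block of black pixels inside the box.
painted = [row[:] for row in blank]
for y in range(75, 85):
    for x in range(110, 150):
        painted[y][x] = 0
painted_count = count_dark_pixels(painted, box)
```

With a real render, gray_rows would come from the page image converted to grayscale, and scale would be the render DPI divided by 72.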

Notes

Behavior change: every fillable PDF now renders previously-missing field text into the page image.
Pairs with the matching unstructured PR that recovers the same values in the pdfminer extracted-text layer.

🤖 Generated with Claude Code

convert_pdf_to_image now calls init_forms() so AcroForm/XFA field values (text typed into fillable fields) are painted into the rendered page image. pdfium only draws widget annotation appearances after the form-fill environment is initialized; without it, filled field values were silently dropped from the rendered image (and thus from downstream OCR/hi_res).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

pypdf is not a declared dependency of unstructured-inference; save the synthetic AcroForm PDF as a sample-docs asset and load it like the other rendering test fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="test_unstructured_inference/inference/test_pdf_image_forms.py">

<violation number="1" location="test_unstructured_inference/inference/test_pdf_image_forms.py:12">
P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising `init_forms()`.</violation>
</file>

Shadow auto-approve: would not auto-approve because issues were found.

cubic-dev-ai · 2026-06-11T22:08:34Z

+# filled text form field. The field value ("FORMVALUE777") is drawn only by the widget
+# annotation's appearance stream, so it renders only when the form-fill environment is
+# initialized (init_forms). Geometry below mirrors the fixture's widget rectangle.
+FORM_PDF = "sample-docs/form-field.pdf"


P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising init_forms().

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At test_unstructured_inference/inference/test_pdf_image_forms.py, line 12:

<comment>The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising `init_forms()`.</comment>

<file context>
@@ -2,84 +2,19 @@
+# filled text form field. The field value ("FORMVALUE777") is drawn only by the widget
+# annotation's appearance stream, so it renders only when the form-fill environment is
+# initialized (init_forms). Geometry below mirrors the fixture's widget rectangle.
+FORM_PDF = "sample-docs/form-field.pdf"
 PAGE_WIDTH, PAGE_HEIGHT = 612, 792
-# Widget rectangle in PDF user space (origin bottom-left): x1, y1, x2, y2.
</file context>

The fixture is a real PDF file, committed in 3e82abe.

Got it, thanks for the correction.

cubic-dev-ai

1 issue found across 5 files

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

cubic-dev-ai

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would require human review. This change modifies core rendering logic in the PDF processing pipeline, and while it fixes a bug, the init_forms() call alters behavior for all fillable PDFs, which could introduce regressions in edge cases like corrupted form fields or XFA forms; additionally, the code suppresses potential...

badGarnet and others added 2 commits June 11, 2026 16:53

cubic-dev-ai Bot reviewed Jun 11, 2026

Comment thread unstructured_inference/inference/pdf_image.py Outdated

badGarnet and others added 2 commits June 11, 2026 17:15

Update unstructured_inference/inference/pdf_image.py

0352b79

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

lint

6ecb102

cubic-dev-ai Bot reviewed Jun 11, 2026

aadland6 approved these changes Jun 11, 2026

badGarnet merged commit fc64017 into main Jun 11, 2026
16 checks passed

badGarnet deleted the fix/render-acroform-fields branch June 11, 2026 22:26

badGarnet merged 4 commits into main from fix/render-acroform-fields
