fix: render filled PDF form field values (init_forms) #515

Merged
badGarnet merged 4 commits into main from fix/render-acroform-fields
Jun 11, 2026

Conversation

@badGarnet
Collaborator

Summary

Values typed into fillable PDF form fields live in widget annotation appearance streams, and pdfium only paints those appearances after the form-fill environment is initialized. convert_pdf_to_image opened the document but never called init_forms(), so filled form-field values were silently dropped from the rendered page image — and therefore from downstream OCR / hi_res partitioning.

Change

unstructured_inference/inference/pdf_image.py — call pdf.init_forms() immediately after opening the PdfDocument, inside the existing _pdfium_lock. The form environment is torn down automatically on pdf.close().

with _pdfium_lock:
    pdf = pdfium.PdfDocument(filename or file, password=password)
    pdf.init_forms()
    n_pages = len(pdf)

Test

New test_pdf_image_forms.py builds a synthetic 1-page PDF in-test (pypdf): an empty content stream plus one /Tx widget whose value is drawn only by its /AP /N appearance stream.

  • test_convert_pdf_to_image_renders_acroform_field_value — renders via convert_pdf_to_image and asserts the field value's pixels appear.
  • test_convert_pdf_to_image_drops_form_field_without_init_forms (control) — patches init_forms to a no-op, reproducing pre-fix behavior, and asserts the field region is blank.
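
The control test's patching technique can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's test code: `FakeDocument` and `render` are hypothetical stand-ins that model "field text appears only after `init_forms()`", and `mock.patch.object` swaps `init_forms` for a no-op to reproduce the pre-fix behavior.

```python
from unittest import mock


class FakeDocument:
    """Hypothetical stand-in for a PDF document with one filled form field."""

    def __init__(self):
        self.forms_ready = False

    def init_forms(self):
        self.forms_ready = True

    def render_text(self):
        # Field values are painted only once the form environment exists.
        return "FORMVALUE777" if self.forms_ready else ""


def render(doc_cls):
    """Open, initialize forms, and render, mirroring the fixed code path."""
    doc = doc_cls()
    doc.init_forms()
    return doc.render_text()


# Normal path: the field value is rendered.
with_forms = render(FakeDocument)

# Control: replace init_forms with a no-op to reproduce pre-fix behavior.
with mock.patch.object(FakeDocument, "init_forms", lambda self: None):
    without_forms = render(FakeDocument)

print(repr(with_forms), repr(without_forms))
```

The patch is undone when the `with` block exits, so other tests in the same module still see the real method.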

The empty content stream + control test prove the rendered value comes specifically from init_forms(), not from page content (verified: ~0 dark px without init, ~3000 with).
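
A dark-pixel count of the kind quoted above can be sketched in pure Python. The helper name `count_dark_pixels`, the threshold of 128, and the tiny sample images are assumptions for illustration; the actual test presumably works on the rendered page image rather than nested lists.

```python
def count_dark_pixels(gray, box, threshold=128):
    """Count pixels darker than `threshold` inside box = (left, top, right, bottom)."""
    left, top, right, bottom = box
    return sum(
        1
        for row in gray[top:bottom]
        for value in row[left:right]
        if value < threshold
    )


# Hypothetical 6x4 grayscale image: white page (255) with a small dark mark.
blank = [[255] * 6 for _ in range(4)]
filled = [row[:] for row in blank]
filled[1][2] = filled[1][3] = filled[2][2] = 0

box = (1, 1, 5, 3)  # region where the field value is expected
print(count_dark_pixels(blank, box), count_dark_pixels(filled, box))  # 0 3
```

Restricting the count to the field's region keeps the assertion insensitive to anything drawn elsewhere on the page.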

Notes

  • Behavior change: for every fillable PDF, filled field values that were previously missing now render into the page image, so downstream OCR output can change.
  • Pairs with the matching unstructured PR that recovers the same values in the pdfminer extracted-text layer.

🤖 Generated with Claude Code

badGarnet and others added 2 commits June 11, 2026 16:53
convert_pdf_to_image now calls init_forms() so AcroForm/XFA field values
(text typed into fillable fields) are painted into the rendered page image.
pdfium only draws widget annotation appearances after the form-fill
environment is initialized; without it, filled field values were silently
dropped from the rendered image (and thus from downstream OCR/hi_res).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pypdf is not a declared dependency of unstructured-inference; save the
synthetic AcroForm PDF as a sample-docs asset and load it like the other
rendering test fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="test_unstructured_inference/inference/test_pdf_image_forms.py">

<violation number="1" location="test_unstructured_inference/inference/test_pdf_image_forms.py:12">
P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising `init_forms()`.</violation>
</file>

Shadow auto-approve: would not auto-approve because issues were found.

# filled text form field. The field value ("FORMVALUE777") is drawn only by the widget
# annotation's appearance stream, so it renders only when the form-fill environment is
# initialized (init_forms). Geometry below mirrors the fixture's widget rectangle.
FORM_PDF = "sample-docs/form-field.pdf"
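
Since the widget rectangle is given in PDF user space (origin bottom-left) while image pixels are indexed from the top-left, checking the field region requires a coordinate flip. A minimal sketch, assuming a hypothetical helper `pdf_rect_to_pixel_box`, a made-up rectangle, and a render scale of 2 pixels per point (none of these come from the test module; only the 612 x 792 page size does):

```python
def pdf_rect_to_pixel_box(rect, page_height, scale):
    """Map a PDF user-space rect (origin bottom-left) to a pixel box (origin top-left)."""
    x1, y1, x2, y2 = rect
    left, right = sorted((x1 * scale, x2 * scale))
    # Flip the y axis: PDF y grows upward, image rows grow downward.
    top, bottom = sorted(((page_height - y1) * scale, (page_height - y2) * scale))
    return round(left), round(top), round(right), round(bottom)


PAGE_WIDTH, PAGE_HEIGHT = 612, 792  # page size in points, as in the test module
rect = (100, 700, 300, 720)  # hypothetical widget rectangle, not the fixture's
scale = 2.0  # assumed render scale in pixels per point

print(pdf_rect_to_pixel_box(rect, PAGE_HEIGHT, scale))  # (200, 144, 600, 184)
```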

@cubic-dev-ai cubic-dev-ai Bot Jun 11, 2026

Contributor

P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising init_forms().


Collaborator Author

The fixture is a real PDF file, committed in 3e82abe.

Contributor

Got it, thanks for the correction.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 5 files

Comment thread unstructured_inference/inference/pdf_image.py Outdated
badGarnet and others added 2 commits June 11, 2026 17:15
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would require human review. This change modifies core rendering logic in the PDF processing pipeline, and while it fixes a bug, the init_forms() call alters behavior for all fillable PDFs, which could introduce regressions in edge cases like corrupted form fields or XFA forms; additionally, the code suppresses potential...

@badGarnet badGarnet merged commit fc64017 into main Jun 11, 2026
16 checks passed
@badGarnet badGarnet deleted the fix/render-acroform-fields branch June 11, 2026 22:26