fix: render filled PDF form field values (init_forms) #515
convert_pdf_to_image now calls init_forms() so AcroForm/XFA field values (text typed into fillable fields) are painted into the rendered page image. pdfium only draws widget annotation appearances after the form-fill environment is initialized; without it, filled field values were silently dropped from the rendered image (and thus from downstream OCR/hi_res). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
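The ordering described above (open, initialize forms, render, close, all under one lock) can be sketched roughly as follows. This is only an illustration, not the code in `pdf_image.py`: `render_with_forms`, `_render_lock`, and the `open_document` parameter are hypothetical names, and the document object is assumed to be shaped like `pypdfium2.PdfDocument` (`init_forms()`, `len()`, indexing, `close()`).

```python
import threading
from typing import Any, Callable, List

# Hypothetical stand-in for the module-level lock the PR refers to as `_pdfium_lock`.
_render_lock = threading.Lock()


def render_with_forms(
    open_document: Callable[[str], Any], path: str, scale: float = 1.0
) -> List[Any]:
    """Open a document, initialize forms, render every page, then close.

    `open_document` is assumed to return an object shaped like
    pypdfium2.PdfDocument; nothing here imports pypdfium2 itself.
    """
    with _render_lock:
        pdf = open_document(path)
        try:
            # Initialize the form-fill environment before touching any page,
            # so widget appearances (filled field values) can be painted.
            pdf.init_forms()
            return [pdf[i].render(scale=scale) for i in range(len(pdf))]
        finally:
            # Closing the document is what tears the form environment down.
            pdf.close()
```

With pypdfium2 installed, passing `pypdfium2.PdfDocument` as `open_document` should exercise the real library. The key point of the sketch is only the call order: `init_forms()` comes after opening and before the first page is loaded.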
pypdf is not a declared dependency of unstructured-inference; save the synthetic AcroForm PDF as a sample-docs asset and load it like the other rendering test fixtures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="test_unstructured_inference/inference/test_pdf_image_forms.py">
<violation number="1" location="test_unstructured_inference/inference/test_pdf_image_forms.py:12">
P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising `init_forms()`.</violation>
</file>
Shadow auto-approve: would not auto-approve because issues were found.
# filled text form field. The field value ("FORMVALUE777") is drawn only by the widget
# annotation's appearance stream, so it renders only when the form-fill environment is
# initialized (init_forms). Geometry below mirrors the fixture's widget rectangle.
FORM_PDF = "sample-docs/form-field.pdf"
P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising init_forms().
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At test_unstructured_inference/inference/test_pdf_image_forms.py, line 12:
<comment>The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising `init_forms()`.</comment>
<file context>
@@ -2,84 +2,19 @@
+# filled text form field. The field value ("FORMVALUE777") is drawn only by the widget
+# annotation's appearance stream, so it renders only when the form-fill environment is
+# initialized (init_forms). Geometry below mirrors the fixture's widget rectangle.
+FORM_PDF = "sample-docs/form-field.pdf"
PAGE_WIDTH, PAGE_HEIGHT = 612, 792
-# Widget rectangle in PDF user space (origin bottom-left): x1, y1, x2, y2.
</file context>
The fixture is a real PDF file, committed in 3e82abe.
Got it, thanks for the correction.
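A quick way to settle this kind of question locally is to check that the fixture starts with a PDF header rather than placeholder text. The sketch below is an assumption-laden illustration, not part of the PR: `looks_like_pdf` is a hypothetical helper, and it accepts the `%PDF-` marker anywhere in the first 1024 bytes, which is more lenient than requiring it at byte zero.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the `%PDF-` marker appears in the first `probe_bytes` bytes."""
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes)
    return b"%PDF-" in head


# Demonstration with two throwaway files: one with a PDF header, one without.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "real.pdf"
    real.write_bytes(b"%PDF-1.7\n%%EOF\n")
    stub = Path(tmp) / "stub.pdf"
    stub.write_bytes(b"placeholder, not a PDF\n")
    real_ok = looks_like_pdf(real)
    stub_ok = looks_like_pdf(stub)
```

Running the same check against `sample-docs/form-field.pdf` in a full checkout would distinguish a committed binary from a text stub before any rendering is attempted.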
1 issue found across 5 files
<file name="test_unstructured_inference/inference/test_pdf_image_forms.py">
<violation number="1" location="test_unstructured_inference/inference/test_pdf_image_forms.py:12">
P1: The new tests point at a sparse-checkout placeholder instead of a real PDF fixture, so rendering will fail before exercising `init_forms()`.</violation>
</file>
Shadow auto-approve: would not auto-approve because issues were found.
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
0 issues found across 1 file (changes from recent commits).
Shadow auto-approve: would require human review. This change modifies core rendering logic in the PDF processing pipeline, and while it fixes a bug, the init_forms() call alters behavior for all fillable PDFs, which could introduce regressions in edge cases like corrupted form fields or XFA forms; additionally, the code suppresses potential...
Summary
Values typed into fillable PDF form fields live in widget annotation appearance streams, and pdfium only paints those appearances after the form-fill environment is initialized.
convert_pdf_to_image opened the document but never called init_forms(), so filled form-field values were silently dropped from the rendered page image, and therefore from downstream OCR / hi_res partitioning.
Change
unstructured_inference/inference/pdf_image.py: call pdf.init_forms() immediately after opening the PdfDocument, inside the existing _pdfium_lock. The form environment is torn down automatically on pdf.close().
Test
New test_pdf_image_forms.py builds a synthetic 1-page PDF in-test (pypdf): an empty content stream plus one /Tx widget whose value is drawn only by its /AP /N appearance stream.
test_convert_pdf_to_image_renders_acroform_field_value: renders via convert_pdf_to_image and asserts the field value's pixels appear.
test_convert_pdf_to_image_drops_form_field_without_init_forms (control): patches init_forms to a no-op, reproducing pre-fix behavior, and asserts the field region is blank.
The empty content stream + control test prove the rendered value comes specifically from init_forms(), not from page content (verified: ~0 dark px without init, ~3000 with).
Notes
unstructured PR that recovers the same values in the pdfminer extracted-text layer.
🤖 Generated with Claude Code
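The control-test idea from the Test section (patch init_forms to a no-op, then compare dark pixels inside the widget rectangle) can be sketched without pdfium at all. Everything below is a hedged stand-in: `FakeDocument` is a hypothetical renderer that paints the widget region only after `init_forms()`, `WIDGET_RECT` is an invented rectangle, and the threshold of 128 is an arbitrary assumption. The real tests render through convert_pdf_to_image instead.

```python
from unittest import mock

PAGE_WIDTH, PAGE_HEIGHT = 612, 792  # page size in points, as in the test module
WIDGET_RECT = (100.0, 700.0, 300.0, 720.0)  # hypothetical x1, y1, x2, y2 (PDF user space)


def rect_to_pixels(rect, page_height, scale):
    """Map a PDF rectangle (origin bottom-left) to an image box (origin top-left)."""
    x1, y1, x2, y2 = rect
    left, right = int(x1 * scale), int(x2 * scale)
    top, bottom = int((page_height - y2) * scale), int((page_height - y1) * scale)
    return left, top, right, bottom


def count_dark(image, box, threshold=128):
    """Count grayscale pixels below `threshold` inside `box`."""
    left, top, right, bottom = box
    return sum(1 for row in image[top:bottom] for px in row[left:right] if px < threshold)


class FakeDocument:
    """Stand-in renderer: the widget region is painted only after init_forms()."""

    def __init__(self):
        self.forms_ready = False

    def init_forms(self):
        self.forms_ready = True

    def render(self, scale=1.0):
        width, height = int(PAGE_WIDTH * scale), int(PAGE_HEIGHT * scale)
        image = [[255] * width for _ in range(height)]  # blank white page
        if self.forms_ready:
            left, top, right, bottom = rect_to_pixels(WIDGET_RECT, PAGE_HEIGHT, scale)
            for y in range(top, bottom):
                for x in range(left, right):
                    image[y][x] = 0  # "field value" pixels
        return image


def dark_pixels_in_widget(scale=1.0):
    """Render one page the way the fixed code path would and measure the widget box."""
    doc = FakeDocument()
    doc.init_forms()
    box = rect_to_pixels(WIDGET_RECT, PAGE_HEIGHT, scale)
    return count_dark(doc.render(scale), box)


# Control: with init_forms patched to a no-op, the pre-fix behavior is reproduced.
with mock.patch.object(FakeDocument, "init_forms", lambda self: None):
    without_forms = dark_pixels_in_widget()

with_forms = dark_pixels_in_widget()
```

Because the fake page has no other content, any dark pixel in the box can only come from the form path, which is the same reasoning the empty content stream provides in the real fixture.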