CochranResearchGroup/pdf-mutation

pdf-mutation

Deterministic glyph-preserving PDF text replacement experiments.

Tool

In its default exact mode, pdf_glyph_replace.py rewrites encoded PDF text without changing fonts, coordinates, drawing operators, or layout spacing. It works by:

  1. converting the input PDF to QDF with qpdf;
  2. reading Type0 font /ToUnicode CMaps;
  3. decoding hexadecimal Tj glyph operands inside text objects;
  4. replacing only the glyph CIDs for matching decoded text;
  5. rebuilding a valid PDF with fix-qdf and qpdf.
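
Steps 3 and 4 can be illustrated with a minimal Python sketch. It assumes 2-byte CIDs and a one-character-per-CID mapping; the CID_TO_CHAR table and both function names are hypothetical stand-ins, not the tool's actual internals.

```python
# Hypothetical toy mapping; a real one is parsed from the font's /ToUnicode CMap.
CID_TO_CHAR = {0x0001: "3", 0x0002: "8", 0x0003: "0", 0x0004: "7", 0x0005: "4"}
CHAR_TO_CID = {char: cid for cid, char in CID_TO_CHAR.items()}


def decode_hex_operand(hex_operand: str) -> list:
    """Split a hexadecimal Tj operand into 2-byte CIDs (assumed width)."""
    if len(hex_operand) % 4 != 0:
        raise ValueError("operand is not a whole number of 2-byte CIDs")
    return [int(hex_operand[i:i + 4], 16) for i in range(0, len(hex_operand), 4)]


def replace_in_operand(hex_operand: str, search: str, replacement: str) -> str:
    """Replace the first decoded match, touching only the matching CIDs."""
    if len(search) != len(replacement):
        raise ValueError("this sketch only handles equal glyph counts")
    cids = decode_hex_operand(hex_operand)
    decoded = "".join(CID_TO_CHAR[cid] for cid in cids)
    start = decoded.find(search)
    if start < 0:
        return hex_operand  # no match: leave the operand untouched
    missing = [char for char in replacement if char not in CHAR_TO_CID]
    if missing:
        raise ValueError("replacement glyphs missing from the font map")
    cids[start:start + len(search)] = [CHAR_TO_CID[char] for char in replacement]
    return "".join(f"{cid:04X}" for cid in cids)


print(replace_in_operand("0001000200030004", "3807", "8304"))  # 0002000100030005
```

Because only the CIDs inside the matched span are rewritten, everything else in the operand, and everything outside it in the content stream, stays byte-for-byte identical.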

Usage

The script can be run directly:

./pdf_glyph_replace.py input.pdf 3807 8304 -o output.pdf

Or installed as a local console command:

python3 -m venv .venv
. .venv/bin/activate
python -m pip install -e .
pdf-glyph-replace --version
pdf-fixture-qdf --version
pdf-inventory --version
pdf-dogfood --version
pdf-dogfood-summary --version
pdf-glyph-replace input.pdf 3807 8304 -o output.pdf

The CLI requires qpdf and fix-qdf on PATH. For validation workflows, pdftotext and pdftotext -bbox from Poppler are also expected.
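
A wrapper script may want to verify these tools before invoking the CLI. The following is a small sketch using only the standard library; the missing_tools helper and the two tuples are illustrative names, not part of this repository.

```python
import shutil

# Tool names taken from the requirement above; the grouping is an assumption.
REQUIRED_TOOLS = ("qpdf", "fix-qdf")
VALIDATION_TOOLS = ("pdftotext",)


def missing_tools(names, which=shutil.which):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        raise SystemExit("missing required tools: " + ", ".join(missing))
    for name in missing_tools(VALIDATION_TOOLS):
        print("optional validation tool not found:", name)
```

The `which` parameter exists only so the lookup can be replaced in tests.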

For length-changing replacements where the right edge should stay fixed, use --align right:

./pdf_glyph_replace.py input.pdf 37.34 138.46 --align right -o output.pdf

For supported length-changing replacements where the left edge and original text matrix should stay fixed, use --align left:

./pdf_glyph_replace.py input.pdf 37.34 138.46 --align left -o output.pdf

To inspect the edited QDF:

./pdf_glyph_replace.py input.pdf 3807 8304 -o output.pdf --keep-qdf work/output.qdf.pdf

To check decoded matches and feasibility without writing a PDF:

./pdf_glyph_replace.py input.pdf 3807 8304 --dry-run
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align right --dry-run --json
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align left --dry-run --json

To audit every decoded text object and mixed-font split match before deciding whether a mutation is structurally patchable:

./pdf_glyph_replace.py input.pdf 3807 8304 --audit
./pdf_glyph_replace.py input.pdf 3807 8304 --audit --json --report work/audit.json

Audit JSON omits full decoded document text and literal search/replacement strings. It records text object indexes, stream objects, font resources, decoded lengths, short decoded-text hashes, match counts, patchability, and split matches across text objects or font resources. The blocker_summary section aggregates unpatchable text-object reasons, split kinds, blocker reasons, and blocker fonts without including decoded document text.
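
The idea of recording a short hash instead of the decoded text can be sketched as follows. The choice of SHA-256, the 12-character truncation, and the field names are assumptions made for illustration; the real audit schema may differ.

```python
import hashlib


def short_hash(text: str, length: int = 12) -> str:
    """Return a truncated SHA-256 hex digest (algorithm and length are assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def audit_entry(index: int, decoded: str, search: str) -> dict:
    """Describe one text object without storing its decoded text."""
    return {
        "text_object_index": index,
        "decoded_length": len(decoded),
        "decoded_hash": short_hash(decoded),
        "match_count": decoded.count(search),
    }


print(audit_entry(0, "Account 3807", "3807"))
```

Two audits of the same text object produce the same hash, so reports can be compared across runs without either one containing document text.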

To write a reviewable mutation plan without editing the PDF:

./pdf_glyph_replace.py input.pdf 3807 8304 --plan work/plan.json
./pdf_glyph_replace.py input.pdf 3807 8304 --plan work/plan.json --json
./pdf_glyph_replace.py input.pdf 3807 8304 --plan work/plan.json --expect-count 1

Plan JSON is non-sensitive by default. It includes input fingerprint metadata, font resources, expected candidate counts, patchable match entries, glyph CID spans, replacement CIDs, and split candidates. Split candidates include ordered segment metadata and font-specific blockers, but remain unpatchable until a separate segmented-plan schema is implemented. The top-level blocker_summary aggregates split kinds and blocker reasons so unsupported plans can be triaged without scanning every candidate.

To apply a reviewed same-glyph-count plan later:

./pdf_glyph_replace.py input.pdf --apply-plan work/plan.json -o output.pdf
./pdf_glyph_replace.py input.pdf --apply-plan work/plan.json -o output.pdf --report work/apply-report.json
./pdf_glyph_replace.py input.pdf --apply-plan work/plan.json -o output.pdf --expect-count 1

Plan application verifies the input PDF fingerprint from the plan and checks each planned glyph span against a freshly regenerated QDF before writing the output PDF. Stale plans, split candidates, missing replacement glyphs, and length-changing plans fail closed.
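
The fail-closed checks described above might look roughly like this. The plan keys (input_fingerprint, matches, span, original_hex), the fingerprint contents, and StalePlanError are all hypothetical; consult a generated plan file for the actual schema.

```python
import hashlib


class StalePlanError(Exception):
    """Raised when a plan no longer matches the PDF it was generated from."""


def fingerprint(data: bytes) -> dict:
    """Assumed fingerprint shape: byte size plus SHA-256 digest."""
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def verify_plan(plan: dict, pdf_bytes: bytes, qdf_bytes: bytes) -> None:
    """Refuse to proceed unless the input and every planned span still match."""
    if plan["input_fingerprint"] != fingerprint(pdf_bytes):
        raise StalePlanError("input PDF differs from the planned input")
    for entry in plan["matches"]:
        start, end = entry["span"]
        # Compare the freshly regenerated QDF bytes against the planned glyphs.
        if qdf_bytes[start:end] != entry["original_hex"].encode("ascii"):
            raise StalePlanError("planned glyph span no longer matches the QDF")
```

Raising before any output is produced is what makes the check fail closed: a stale plan never results in a partially edited PDF.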

Use --expect-count N to require an exact patchable/applied match count. The guard applies to direct writes, dry-runs, audits, plan generation, and reviewed-plan application. In write modes, a mismatch fails before the output PDF is written.
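
The ordering guarantee (check the count first, write second) can be sketched in a few lines. The function and exception names below are hypothetical and only illustrate the guard pattern.

```python
from typing import Callable, Optional


class CountMismatch(Exception):
    """Raised when the number of matches differs from the expected count."""


def check_expected_count(found: int, expected: Optional[int]) -> None:
    """Do nothing when no expectation is set; otherwise require equality."""
    if expected is not None and found != expected:
        raise CountMismatch(f"expected {expected} matches, found {found}")


def guarded_write(found: int, expected: Optional[int], write: Callable[[], None]) -> None:
    """Run the guard before the write so a mismatch leaves no output behind."""
    check_expected_count(found, expected)
    write()
```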

To write a non-sensitive JSON report:

./pdf_glyph_replace.py input.pdf 3807 8304 -o output.pdf --report work/report.json
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align right -o output.pdf --report work/report.json --bbox-dir work/bbox

The report records match counts, font resources, stream object ids, text object ids, alignment policy, and validation hints. It does not include full decoded text or literal search/replacement strings by default.

Use --bbox-dir PATH with write/apply modes and --report to generate optional before/after pdftotext -bbox HTML artifacts. The JSON report records artifact paths, sizes, short hashes, and warnings, but not extracted bbox text. If pdftotext is missing or bbox extraction fails, mutation still succeeds and the report records a layout-evidence warning. For direct writes, exact mode records before/after extraction counts, while --align left and --align right record numeric bbox edge deltas and pass/fail assertions. Failed edge assertions name the checked coordinate (x_min for left alignment, x_max for right alignment) and the measured delta without embedding decoded document text.
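
An edge assertion of this kind can be approximated from two pdftotext -bbox outputs. The sketch below assumes the usual `<word xMin=... yMin=... xMax=... yMax=...>` element form, compares edges across all words rather than only the edited ones, and uses a hypothetical tolerance of 0.5 points; the tool's own comparison may be more selective.

```python
import re

# Matches the coordinate attributes of a pdftotext -bbox <word> element.
WORD_RE = re.compile(
    r'<word xMin="(-?[\d.]+)" yMin="-?[\d.]+" xMax="(-?[\d.]+)" yMax="-?[\d.]+">'
)


def edges(bbox_html: str) -> tuple:
    """Return (x_min, x_max) over every word box in the document."""
    spans = [(float(lo), float(hi)) for lo, hi in WORD_RE.findall(bbox_html)]
    if not spans:
        raise ValueError("no word boxes found")
    return min(lo for lo, _ in spans), max(hi for _, hi in spans)


def edge_assertion(before_html: str, after_html: str, align: str,
                   tolerance: float = 0.5) -> dict:
    """Check x_min for left alignment or x_max for right alignment."""
    index, name = (0, "x_min") if align == "left" else (1, "x_max")
    delta = edges(after_html)[index] - edges(before_html)[index]
    status = "ok" if abs(delta) <= tolerance else "failed"
    return {"coordinate": name, "delta": round(delta, 3), "status": status}
```

The returned dictionary carries only a coordinate name, a number, and a status, so it stays free of decoded document text.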

Synthetic Fixtures

Use pdf-fixture-qdf to create public, non-sensitive QDF or standalone PDF fixtures for issues, tests, and repros:

pdf-fixture-qdf 3807 -o work/fixture.qdf
pdf-fixture-qdf '$37.34' --one-glyph-per-line --x 653.375 --y 1370 -o work/amount.qdf
pdf-fixture-qdf 3734 --pdf --one-glyph-per-line --x 653.375 --y 1370 -o work/public-length.pdf

The fixture helper emits a minimal QDF-like byte stream by default, or a valid standalone PDF with --pdf. Both forms use a synthetic Type0 font, /ToUnicode CMap, and hexadecimal text operands. They are designed for testing pdf_glyph_replace parsing, replacement logic, and public smoke workflows without sharing private PDFs.

Use the standalone PDF mode to smoke-test length-changing layout behavior with public data:

pdf-fixture-qdf 3734 --pdf --one-glyph-per-line --x 653.375 --y 1370 -o work/public-length.pdf
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align left -o work/public-length-left.pdf --report work/public-length-left.json --bbox-dir work/public-length-left-bbox
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align right -o work/public-length-right.pdf --report work/public-length-right.json --bbox-dir work/public-length-right-bbox
qpdf --check work/public-length-left.pdf
qpdf --check work/public-length-right.pdf
pdftotext work/public-length-left.pdf - | rg '13846|3734'
pdftotext work/public-length-right.pdf - | rg '13846|3734'

The public length-changing smoke test should report layout_evidence.status: "ok" and alignment_assertions.status: "ok" for both --align left and --align right. The test suite also exercises the same public fixture shape with the replacement target at the beginning, middle, and end of a one-glyph-per-line text object.
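
Checking those two statuses can be scripted. This sketch assumes layout_evidence and alignment_assertions are top-level keys of the report JSON, which is inferred from the dotted names above rather than confirmed; the function names are illustrative.

```python
import json

# Section names as written above; their top-level placement is an assumption.
REQUIRED_SECTIONS = ("layout_evidence", "alignment_assertions")


def smoke_report_ok(report: dict) -> bool:
    """Return True only when every required section reports status "ok"."""
    return all(
        isinstance(report.get(section), dict)
        and report[section].get("status") == "ok"
        for section in REQUIRED_SECTIONS
    )


def load_report(path: str) -> dict:
    """Read a JSON report written with --report."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `smoke_report_ok(load_report("work/public-length-left.json"))` would be expected to return True after the left-aligned smoke command succeeds.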

The fixture helper is also available from Python:

import pdf_fixture

qdf = pdf_fixture.synthetic_qdf("3807")
amount_qdf = pdf_fixture.synthetic_qdf(
    "$37.34",
    one_glyph_per_line=True,
    x="653.375",
    y="1370",
)
public_pdf = pdf_fixture.synthetic_pdf(
    "3734",
    one_glyph_per_line=True,
    x="653.375",
    y="1370",
)

The synthetic font intentionally contains only a small glyph set used by the tests and examples. If a repro needs more characters, extend the synthetic map in code rather than attaching a real private document.

PDF Inventory

Use pdf-inventory to classify PDFs without mutating files or extracting document text:

pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --json work/dogfood-pdfs/inventory/inventory.json \
  --tsv work/dogfood-pdfs/inventory/inventory.tsv

The command reports structural support signals: qpdf validity, QDF conversion, object and stream counts, Type0 font count, /ToUnicode references, decoded font resource count, and text-object count. Unsupported-but-valid PDFs are reported with status: "unsupported" and exit code 0; only hard errors such as missing files or failed qpdf --check make the command fail.

Add --probe SEARCH REPLACEMENT to include non-mutating match feasibility signals in the inventory:

pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --probe 3807 8304 \
  --summary \
  --json work/dogfood-pdfs/inventory/probed.json

Probe output includes search/replacement lengths, short hashes, match counts, feasibility status, and infeasible reasons. It does not include literal probe strings or decoded document text.

Use --summary to include aggregate counts by inventory status and probe status. The JSON output becomes an object with rows and summary; TSV remains row-oriented.

Use --max-input-bytes to skip QDF conversion for very large PDFs during broad corpus scans:

pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --max-input-bytes 50000000 \
  --summary

Skipped PDFs are reported with status: "skipped" and still contribute to the summary counts. This avoids expanding large inputs into much larger QDF files unless explicitly needed.

Use --fail-on to turn inventory reports into deterministic corpus gates:

pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --probe 3807 8304 \
  --max-input-bytes 50000000 \
  --fail-on error qdf-conversion-failed probe-feasible

The command exits 2 when any selected rule matches and prints the matching rows to stderr. Available rules are error, unsupported, skipped, qpdf-check-failed, qdf-conversion-failed, probe-unsupported, probe-no-match, probe-infeasible, probe-feasible, and probe-match.
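
The gate semantics can be modelled as a set of predicates over inventory rows. The row fields status and probe_status, and the subset of rules shown, are assumptions for illustration; only the rule names and the exit code 2 come from the description above.

```python
import sys

# Hypothetical mapping from rule name to a predicate over one inventory row.
RULES = {
    "error": lambda row: row.get("status") == "error",
    "unsupported": lambda row: row.get("status") == "unsupported",
    "skipped": lambda row: row.get("status") == "skipped",
    "probe-no-match": lambda row: row.get("probe_status") == "no-match",
    "probe-feasible": lambda row: row.get("probe_status") == "feasible",
}


def gate_exit_code(rows, selected_rules):
    """Return 2 if any row matches a selected rule, otherwise 0."""
    matching = [
        row for row in rows
        if any(RULES[name](row) for name in selected_rules)
    ]
    for row in matching:
        print(row, file=sys.stderr)  # matching rows go to stderr
    return 2 if matching else 0
```

With no rules selected the function always returns 0, mirroring the default behavior where unsupported-but-valid PDFs do not fail the command.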

For repeatable local corpus checks, see docs/dev/dogfood-runbook.md.

The same routine dogfood gate is available as a wrapper command:

pdf-dogfood --probe 3807 8304

By default, pdf-dogfood scans work/dogfood-pdfs/sample-*.pdf, writes work/dogfood-pdfs/inventory/dogfood.json and .tsv, applies --max-input-bytes 50000000, and fails on error, qpdf-check-failed, qdf-conversion-failed, or probe-feasible. Use --policy complete to fail on skipped files as well, or --policy readiness --probe SEARCH REPLACEMENT to require a clean supported probe match.

Dogfood JSON reports include a policy block with the wrapper version, selected policy, effective fail-on rules, size guard, input glob, report paths, and hashed probe metadata.

Use --manifest to append a compact JSONL run history record; with no path argument it writes to work/dogfood-pdfs/inventory/dogfood-manifest.jsonl. Use pdf-dogfood-summary to print recent manifest records as a TSV table:

pdf-dogfood-summary --limit 10
pdf-dogfood-summary --latest-by-policy
pdf-dogfood-summary --latest-by-policy --markdown
pdf-dogfood-summary --latest-by-policy --markdown --output work/dogfood-pdfs/inventory/latest.md
pdf-dogfood-summary --health
pdf-dogfood-summary --fail-only --policy readiness
pdf-dogfood-summary --exit-code 2 --json
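
For ad hoc analysis outside the summary command, a JSONL manifest can be read line by line. The sketch below keeps the last record seen per policy; the policy field name and the assumption that records are appended in chronological order are not confirmed by the source.

```python
import json


def latest_by_policy(jsonl_text: str) -> dict:
    """Map each policy to its most recently appended manifest record."""
    latest = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Later lines overwrite earlier ones (assumes append-only history).
        latest[record["policy"]] = record
    return latest
```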

For an Actions-side check, run the CI workflow manually from GitHub Actions and use its dogfood_manifest input. The manual Dogfood manifest health job defaults to tests/fixtures/dogfood-manifest.jsonl, reports the --health exit status as a notice, and does not require private PDFs in the repository.

Validation

Run the source-level tests:

python3 -m py_compile pdf_glyph_replace.py pdf_fixture.py pdf_inventory.py
python3 -m unittest discover -s tests -v
pdf-glyph-replace --version
pdf-fixture-qdf --version
pdf-inventory --version
pdf-dogfood --version
pdf-dogfood-summary --version

Run public PDF smoke tests:

pdf-fixture-qdf 3734 --pdf --one-glyph-per-line --x 653.375 --y 1370 -o work/public-length.pdf
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align left -o work/public-length-left.pdf --report work/public-length-left.json --bbox-dir work/public-length-left-bbox
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align right -o work/public-length-right.pdf --report work/public-length-right.json --bbox-dir work/public-length-right-bbox
qpdf --check work/public-length-left.pdf
qpdf --check work/public-length-right.pdf
pdftotext work/public-length-left.pdf - | rg '13846|3734'
pdftotext work/public-length-right.pdf - | rg '13846|3734'

Run private PDF smoke tests only when local fixture PDFs are available:

./pdf_glyph_replace.py tmp.before-travel.pdf 3807 8304 --dry-run
./pdf_glyph_replace.py tmp.before-travel.pdf 3807 8304 -o work/smoke.8304.pdf --report work/smoke.8304.report.json
qpdf --check work/smoke.8304.pdf
pdftotext work/smoke.8304.pdf - | rg '8304|3807'

./pdf_glyph_replace.py tmp.before-travel.pdf 37.34 138.46 --align right --dry-run
./pdf_glyph_replace.py tmp.before-travel.pdf 37.34 138.46 --align right --plan work/smoke.amount.plan.json --json

Do not create amount-mutated PDFs from private financial fixtures as smoke or release artifacts. Use plan-only output for private amount-like examples unless a non-sensitive synthetic fixture is available.

Versioning And Release

The CLI uses semantic versioning. The package version in pyproject.toml and pdf_glyph_replace.__version__ must stay in sync. See CHANGELOG.md for release notes and docs/dev/RELEASE_CHECKLIST.md for the full release checklist.

Release gate:

python3 -m py_compile pdf_glyph_replace.py pdf_fixture.py pdf_inventory.py pdf_dogfood.py pdf_dogfood_summary.py
python3 -m unittest discover -s tests -v
python3 -m venv work/release-venv
work/release-venv/bin/python -m pip install -e .
work/release-venv/bin/python -c "from pdf_mutation.engine import plan_qdf; from pdf_mutation import cli; print(plan_qdf.__name__, cli.main.__name__)"
work/release-venv/bin/pdf-glyph-replace --version
work/release-venv/bin/pdf-fixture-qdf --version
work/release-venv/bin/pdf-inventory --version
work/release-venv/bin/pdf-dogfood --version
work/release-venv/bin/pdf-dogfood-summary --version
work/release-venv/bin/python -m pip wheel . -w work/dist

For behavior-changing releases, also run the local PDF smoke tests above when fixture PDFs are available. Release notes should describe newly supported PDF text structures and known limits.

Python API

The compatibility CLI module remains importable as pdf_glyph_replace. New Python integrations should import through the package boundary:

from pdf_mutation.engine import apply_plan_to_qdf, plan_qdf, replace_qdf
from pdf_mutation.reports import bbox_alignment_assertions, report_payload

The reusable implementation lives in the pdf_mutation package, and the console command is implemented by pdf_mutation.cli:main. Lower-level package modules such as pdf_mutation.cmap, pdf_mutation.layout, and pdf_mutation.adapters are internal implementation seams; prefer pdf_mutation.engine and pdf_mutation.reports for integration code.

Current Scope

This first version is intentionally strict by default:

  • default --align exact requires search and replacement to have the same decoded glyph count;
  • --align right supports length-changing replacements only for simple one-glyph-per-line text objects and shifts the text matrix to preserve the right edge;
  • --align left supports the same simple one-glyph-per-line text objects and preserves the original text matrix;
  • dry-run reports the active alignment contract and estimated text-matrix x-shift for length-changing modes;
  • --report writes a non-sensitive JSON report with match locations, font resources, and validation hints;
  • --bbox-dir writes optional before/after bbox HTML artifacts and records non-sensitive artifact metadata and alignment assertions in the report;
  • --audit inventories every decoded text object and reports split mixed-font matches without including full decoded document text;
  • --plan writes a non-sensitive JSON mutation plan for same-glyph-count patchable matches and split/unpatchable candidates;
  • --apply-plan applies only reviewed same-glyph-count patchable plan entries after verifying the input fingerprint and planned QDF byte spans;
  • --expect-count N fails unless the operation finds exactly N patchable or applied matches, and write modes fail before producing an output PDF;
  • split candidates record per-segment font resources and blockers but are audit-only under the current plan schema;
  • replacement characters must already exist in the active PDF font CMap;
  • matches must fit inside one BT ... ET text object;
  • supported exact-mode text drawing forms are hexadecimal <...> Tj and simple [...] TJ arrays with hexadecimal string entries;
  • exact mode supports multiple CIDs inside one hexadecimal string operand;
  • matches split across text objects or font changes are reported as infeasible by dry-run instead of being patched.

That covers deterministic token changes such as account suffixes and IDs, plus simple right-aligned amount edits in PDFs that emit each glyph as a separate Tj line with Td advances.
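
The arithmetic behind the two length-changing alignment modes can be shown with a short sketch. It assumes each glyph contributes a fixed horizontal advance and that the right edge is the starting x plus the sum of advances; the advance values and function name are invented for the example.

```python
def x_shift(old_advances, new_advances, align):
    """Return the text-matrix x-shift that keeps the chosen edge fixed."""
    if align == "left":
        return 0.0  # left alignment keeps the original text matrix
    if align == "right":
        # Moving the start by the width difference pins the right edge.
        return float(sum(old_advances) - sum(new_advances))
    raise ValueError("align must be 'left' or 'right'")


# Hypothetical advances: 10 units per digit, 5 for the period.
old = [10, 10, 5, 10, 10]        # "37.34"
new = [10, 10, 10, 5, 10, 10]    # "138.46"
start_x = 653.375

shift = x_shift(old, new, "right")
print(shift)                                          # -10.0
print(start_x + sum(old), start_x + shift + sum(new))  # 698.375 698.375
```

The replacement is one digit wider, so the starting x moves 10 units to the left and the right edge stays at the same coordinate.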
