Deterministic glyph-preserving PDF text replacement experiments.
pdf_glyph_replace.py rewrites encoded PDF text without changing fonts,
coordinates, drawing operators, or layout spacing. It works by:
- converting the input PDF to QDF with qpdf;
- reading Type0 font /ToUnicode CMaps;
- decoding hexadecimal Tj glyph operands inside text objects;
- replacing only the glyph CIDs for matching decoded text;
- rebuilding a valid PDF with fix-qdf and qpdf.
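The decode-and-replace steps can be illustrated with a minimal sketch. It assumes 2-byte CIDs (four hex digits per glyph) and uses a hand-written CID-to-character dictionary in place of a parsed /ToUnicode CMap. The names TO_UNICODE, decode_hex_operand, and replace_in_operand are hypothetical and are not the script's actual API.

```python
# Hypothetical CID -> character map standing in for a parsed /ToUnicode CMap.
TO_UNICODE = {0x0001: "3", 0x0002: "8", 0x0003: "0", 0x0004: "7", 0x0005: "4"}
FROM_UNICODE = {ch: cid for cid, ch in TO_UNICODE.items()}


def decode_hex_operand(hex_operand: str) -> str:
    """Decode a hex string operand (assumes 2-byte CIDs, 4 hex digits each)."""
    cids = [int(hex_operand[i:i + 4], 16) for i in range(0, len(hex_operand), 4)]
    return "".join(TO_UNICODE[cid] for cid in cids)


def replace_in_operand(hex_operand: str, search: str, replacement: str) -> str:
    """Swap only the CIDs of the first matching decoded span."""
    if len(search) != len(replacement):
        raise ValueError("exact mode needs equal glyph counts")
    missing = [ch for ch in replacement if ch not in FROM_UNICODE]
    if missing:
        raise ValueError(f"replacement glyphs not in font CMap: {missing}")
    decoded = decode_hex_operand(hex_operand)
    start = decoded.find(search)
    if start < 0:
        return hex_operand  # no match: leave the operand untouched
    new_cids = "".join(f"{FROM_UNICODE[ch]:04X}" for ch in replacement)
    end = start + len(search)
    # Each decoded character occupies 4 hex digits in the operand.
    return hex_operand[:start * 4] + new_cids + hex_operand[end * 4:]


original = "0001000200030004"  # decodes to "3807" under the map above
patched = replace_in_operand(original, "3807", "8304")
print(decode_hex_operand(original), "->", decode_hex_operand(patched))
```

Because only the hex digits inside the operand change, the surrounding operators, coordinates, and font selection stay byte-for-byte identical in this sketch.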
The script can be run directly:
./pdf_glyph_replace.py input.pdf 3807 8304 -o output.pdf
Or installed as a local console command:
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -e .
pdf-glyph-replace --version
pdf-fixture-qdf --version
pdf-inventory --version
pdf-dogfood --version
pdf-dogfood-summary --version
pdf-glyph-replace input.pdf 3807 8304 -o output.pdf
The CLI requires qpdf and fix-qdf on PATH. For validation workflows,
pdftotext and pdftotext -bbox from Poppler are also expected.
For length-changing replacements where the right edge should stay fixed, use
--align right:
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align right -o output.pdf
For supported length-changing replacements where the left edge and original
text matrix should stay fixed, use --align left:
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align left -o output.pdf
To inspect the edited QDF:
./pdf_glyph_replace.py input.pdf 3807 8304 -o output.pdf --keep-qdf work/output.qdf.pdf
To check decoded matches and feasibility without writing a PDF:
./pdf_glyph_replace.py input.pdf 3807 8304 --dry-run
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align right --dry-run --json
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align left --dry-run --json
To audit every decoded text object and mixed-font split match before deciding whether a mutation is structurally patchable:
./pdf_glyph_replace.py input.pdf 3807 8304 --audit
./pdf_glyph_replace.py input.pdf 3807 8304 --audit --json --report work/audit.json
Audit JSON omits full decoded document text and literal search/replacement
strings. It records text object indexes, stream objects, font resources,
decoded lengths, short decoded-text hashes, match counts, patchability, and
split matches across text objects or font resources. The blocker_summary
section aggregates unpatchable text-object reasons, split kinds, blocker
reasons, and blocker fonts without including decoded document text.
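One way to picture a text-free audit entry is sketched below. The field names, the 12-character truncation, and the choice of SHA-256 are assumptions made for illustration; the actual audit schema and hash function may differ.

```python
import hashlib


def short_hash(text: str, length: int = 12) -> str:
    """Truncated SHA-256 hex digest (assumed scheme, not the tool's own)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def audit_entry(index: int, decoded: str, search: str) -> dict:
    """Describe one text object without storing its decoded text."""
    return {
        "text_object_index": index,           # hypothetical field name
        "decoded_length": len(decoded),       # hypothetical field name
        "decoded_hash": short_hash(decoded),  # hypothetical field name
        "match_count": decoded.count(search),
    }


entry = audit_entry(0, "Account 3807", "3807")
print(entry)
```

The entry lets two audit runs be compared for equality of decoded content while keeping the content itself out of the report.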
To write a reviewable mutation plan without editing the PDF:
./pdf_glyph_replace.py input.pdf 3807 8304 --plan work/plan.json
./pdf_glyph_replace.py input.pdf 3807 8304 --plan work/plan.json --json
./pdf_glyph_replace.py input.pdf 3807 8304 --plan work/plan.json --expect-count 1
Plan JSON is non-sensitive by default. It includes input fingerprint metadata,
font resources, expected candidate counts, patchable match entries, glyph CID
spans, replacement CIDs, and split candidates. Split candidates include ordered
segment metadata and font-specific blockers, but remain unpatchable until a
separate segmented-plan schema is implemented. The top-level blocker_summary
aggregates split kinds and blocker reasons so unsupported plans can be triaged
without scanning every candidate.
To apply a reviewed same-glyph-count plan later:
./pdf_glyph_replace.py input.pdf --apply-plan work/plan.json -o output.pdf
./pdf_glyph_replace.py input.pdf --apply-plan work/plan.json -o output.pdf --report work/apply-report.json
./pdf_glyph_replace.py input.pdf --apply-plan work/plan.json -o output.pdf --expect-count 1
Plan application verifies the input PDF fingerprint from the plan and checks each planned glyph span against a freshly regenerated QDF before writing the output PDF. Stale plans, split candidates, missing replacement glyphs, and length-changing plans fail closed.
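The fail-closed checks described for plan application can be sketched roughly as follows. The plan field names (input_sha256, matches, qdf_span, old_cids_hex, new_cids_hex), the use of SHA-256, and the PlanRejected exception are all hypothetical; the real plan schema is not reproduced here.

```python
import hashlib


class PlanRejected(Exception):
    """Raised when a plan must not be applied (fail closed)."""


def verify_plan(plan: dict, pdf_bytes: bytes, qdf_bytes: bytes) -> int:
    """Return the number of verified entries, or raise PlanRejected."""
    # Hypothetical schema: the real plan JSON field names may differ.
    if plan["input_sha256"] != hashlib.sha256(pdf_bytes).hexdigest():
        raise PlanRejected("stale plan: input fingerprint mismatch")
    for entry in plan["matches"]:
        start, end = entry["qdf_span"]
        old_hex, new_hex = entry["old_cids_hex"], entry["new_cids_hex"]
        if len(old_hex) != len(new_hex):
            raise PlanRejected("length-changing plan entries are not applied")
        if qdf_bytes[start:end] != old_hex.encode("ascii"):
            raise PlanRejected("planned glyph span no longer matches the QDF")
    return len(plan["matches"])


pdf_bytes = b"%PDF-1.7 example input"
qdf_bytes = b"BT <0001000200030004> Tj ET"
plan = {
    "input_sha256": hashlib.sha256(pdf_bytes).hexdigest(),
    "matches": [
        {
            "qdf_span": [4, 20],
            "old_cids_hex": "0001000200030004",
            "new_cids_hex": "0002000100030005",
        },
    ],
}
print(verify_plan(plan, pdf_bytes, qdf_bytes))
```

Every check runs before any byte is written, so a single failed entry rejects the whole plan rather than producing a partially patched file.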
Use --expect-count N to require an exact patchable/applied match count. The
guard applies to direct writes, dry-runs, audits, plan generation, and
reviewed-plan application. In write modes, a mismatch fails before the output
PDF is written.
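The ordering that matters for write modes (check the count first, write second) can be shown with a small sketch. The guarded_write helper is hypothetical and only mirrors the documented behavior; it is not the CLI's implementation.

```python
import os
import tempfile


def guarded_write(path, payload, match_count, expect_count=None):
    """Write payload only if the match count satisfies the expected count."""
    if expect_count is not None and match_count != expect_count:
        # Raise before opening the file so no partial output is left behind.
        raise ValueError(f"expected {expect_count} matches, found {match_count}")
    with open(path, "wb") as handle:
        handle.write(payload)


out = os.path.join(tempfile.mkdtemp(), "output.pdf")
try:
    guarded_write(out, b"%PDF-", match_count=2, expect_count=1)
except ValueError as exc:
    print("refused:", exc)
print("output exists:", os.path.exists(out))
```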
To write a non-sensitive JSON report:
./pdf_glyph_replace.py input.pdf 3807 8304 -o output.pdf --report work/report.json
./pdf_glyph_replace.py input.pdf 37.34 138.46 --align right -o output.pdf --report work/report.json --bbox-dir work/bbox
The report records match counts, font resources, stream object ids, text object ids, alignment policy, and validation hints. It does not include full decoded text or literal search/replacement strings by default.
Use --bbox-dir PATH with write/apply modes and --report to generate
optional before/after pdftotext -bbox HTML artifacts. The JSON report records
artifact paths, sizes, short hashes, and warnings, but not extracted bbox text.
If pdftotext is missing or bbox extraction fails, mutation still succeeds and
the report records a layout-evidence warning. For direct writes, exact mode
records before/after extraction counts, while --align left and --align right
record numeric bbox edge deltas and pass/fail assertions. Failed edge
assertions name the checked coordinate (x_min for left alignment, x_max for
right alignment) and the measured delta without embedding decoded document text.
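As a rough illustration of an edge-delta assertion, the sketch below reads xMin and xMax attributes from word elements of the kind pdftotext -bbox emits and compares the extreme edges before and after. The 0.5 tolerance, the result field names, and the whole-page comparison are simplifying assumptions; the actual report presumably scopes the check to the edited text.

```python
import re

# Matches the opening tag of a <word> element in pdftotext -bbox output.
WORD_RE = re.compile(
    r'<word xMin="([\d.]+)" yMin="[\d.]+" xMax="([\d.]+)" yMax="[\d.]+">'
)


def edges(bbox_html: str) -> tuple:
    """Return (x_min, x_max) over all words found in the bbox HTML."""
    pairs = [(float(lo), float(hi)) for lo, hi in WORD_RE.findall(bbox_html)]
    if not pairs:
        raise ValueError("no <word> elements found")
    return min(p[0] for p in pairs), max(p[1] for p in pairs)


def edge_assertion(before_html, after_html, align, tolerance=0.5):
    """Check the edge that the alignment mode promises to keep fixed."""
    index, name = (0, "x_min") if align == "left" else (1, "x_max")
    delta = edges(after_html)[index] - edges(before_html)[index]
    status = "ok" if abs(delta) <= tolerance else "failed"
    return {"coordinate": name, "delta": round(delta, 3), "status": status}


before = '<word xMin="653.375" yMin="10.0" xMax="680.000" yMax="20.0">a</word>'
after = '<word xMin="646.700" yMin="10.0" xMax="680.000" yMax="20.0">b</word>'
print(edge_assertion(before, after, "right"))
print(edge_assertion(before, after, "left"))
```

Note that the result carries only a coordinate name, a number, and a status, which matches the stated goal of keeping extracted text out of the report.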
Use pdf-fixture-qdf to create public, non-sensitive QDF or standalone PDF
fixtures for issues, tests, and repros:
pdf-fixture-qdf 3807 -o work/fixture.qdf
pdf-fixture-qdf '$37.34' --one-glyph-per-line --x 653.375 --y 1370 -o work/amount.qdf
pdf-fixture-qdf 3734 --pdf --one-glyph-per-line --x 653.375 --y 1370 -o work/public-length.pdf
The fixture helper emits a minimal QDF-like byte stream by default, or a valid
standalone PDF with --pdf. Both forms use a synthetic Type0 font,
/ToUnicode CMap, and hexadecimal text operands. They are designed for testing
pdf_glyph_replace parsing, replacement logic, and public smoke workflows
without sharing private PDFs.
Use the standalone PDF mode to smoke length-changing layout behavior with public data:
pdf-fixture-qdf 3734 --pdf --one-glyph-per-line --x 653.375 --y 1370 -o work/public-length.pdf
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align left -o work/public-length-left.pdf --report work/public-length-left.json --bbox-dir work/public-length-left-bbox
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align right -o work/public-length-right.pdf --report work/public-length-right.json --bbox-dir work/public-length-right-bbox
qpdf --check work/public-length-left.pdf
qpdf --check work/public-length-right.pdf
pdftotext work/public-length-left.pdf - | rg '13846|3734'
pdftotext work/public-length-right.pdf - | rg '13846|3734'
The public length-changing smoke should report layout_evidence.status: "ok"
and alignment_assertions.status: "ok" for both --align left and
--align right. The test suite also exercises the same public fixture shape
with the replacement target at the beginning, middle, and end of a
one-glyph-per-line text object.
The same helper is available from Python:
import pdf_fixture
qdf = pdf_fixture.synthetic_qdf("3807")
amount_qdf = pdf_fixture.synthetic_qdf(
    "$37.34",
    one_glyph_per_line=True,
    x="653.375",
    y="1370",
)
public_pdf = pdf_fixture.synthetic_pdf(
    "3734",
    one_glyph_per_line=True,
    x="653.375",
    y="1370",
)
The synthetic font intentionally contains only a small glyph set used by the tests and examples. If a repro needs more characters, extend the synthetic map in code rather than attaching a real private document.
Use pdf-inventory to classify PDFs without mutating files or extracting
document text:
pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --json work/dogfood-pdfs/inventory/inventory.json \
  --tsv work/dogfood-pdfs/inventory/inventory.tsv
The command reports structural support signals: qpdf validity, QDF conversion,
object and stream counts, Type0 font count, /ToUnicode references, decoded
font resource count, and text-object count. Unsupported-but-valid PDFs are
reported with status: "unsupported" and exit code 0; only hard errors such as
missing files or failed qpdf --check make the command fail.
Add --probe SEARCH REPLACEMENT to include non-mutating match feasibility
signals in the inventory:
pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --probe 3807 8304 \
  --summary \
  --json work/dogfood-pdfs/inventory/probed.json
Probe output includes search/replacement lengths, short hashes, match counts, feasibility status, and infeasible reasons. It does not include literal probe strings or decoded document text.
Use --summary to include aggregate counts by inventory status and probe
status. The JSON output becomes an object with rows and summary; TSV remains
row-oriented.
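A consumer of the inventory JSON might tally rows by status as in the sketch below. The rows and summary keys and the status values come from the description above; the path field, the empty summary object, and the assumption that output without --summary is a bare list of rows are hypothetical.

```python
import json
from collections import Counter


def status_counts(inventory_json: str) -> Counter:
    """Count inventory rows by status for either assumed JSON shape."""
    data = json.loads(inventory_json)
    # With --summary the rows sit under "rows"; otherwise a list is assumed.
    rows = data["rows"] if isinstance(data, dict) else data
    return Counter(row["status"] for row in rows)


sample = json.dumps({
    "rows": [
        {"path": "a.pdf", "status": "unsupported"},  # "path" is hypothetical
        {"path": "b.pdf", "status": "skipped"},
        {"path": "c.pdf", "status": "unsupported"},
    ],
    "summary": {},  # contents intentionally left empty in this sketch
})
counts = status_counts(sample)
print(dict(counts))
```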
Use --max-input-bytes to skip QDF conversion for very large PDFs during broad
corpus scans:
pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --max-input-bytes 50000000 \
  --summary
Skipped PDFs are reported with status: "skipped" and still contribute to the
summary counts. This avoids expanding large inputs into much larger QDF files
unless explicitly needed.
Use --fail-on to turn inventory reports into deterministic corpus gates:
pdf-inventory work/dogfood-pdfs/sample-*.pdf \
  --probe 3807 8304 \
  --max-input-bytes 50000000 \
  --fail-on error qdf-conversion-failed probe-feasible
The command exits 2 when any selected rule matches and prints the matching rows
to stderr. Available rules are error, unsupported, skipped,
qpdf-check-failed, qdf-conversion-failed, probe-unsupported,
probe-no-match, probe-infeasible, probe-feasible, and probe-match.
For repeatable local corpus checks, see
docs/dev/dogfood-runbook.md.
The same routine dogfood gate is available as a wrapper command:
pdf-dogfood --probe 3807 8304
By default, pdf-dogfood scans work/dogfood-pdfs/sample-*.pdf, writes
work/dogfood-pdfs/inventory/dogfood.json and .tsv, applies
--max-input-bytes 50000000, and fails on error, qpdf-check-failed,
qdf-conversion-failed, or probe-feasible.
Use --policy complete to fail on skipped files as well, or
--policy readiness --probe SEARCH REPLACEMENT to require a clean supported
probe match.
Dogfood JSON reports include a policy block with the wrapper version,
selected policy, effective fail-on rules, size guard, input glob, report paths,
and hashed probe metadata.
Use --manifest to append a compact JSONL run history record; with no path
argument it writes to work/dogfood-pdfs/inventory/dogfood-manifest.jsonl.
Use pdf-dogfood-summary to print recent manifest records as a TSV table:
pdf-dogfood-summary --limit 10
pdf-dogfood-summary --latest-by-policy
pdf-dogfood-summary --latest-by-policy --markdown
pdf-dogfood-summary --latest-by-policy --markdown --output work/dogfood-pdfs/inventory/latest.md
pdf-dogfood-summary --health
pdf-dogfood-summary --fail-only --policy readiness
pdf-dogfood-summary --exit-code 2 --json
For an Actions-side check, run the CI workflow manually from GitHub Actions
and use its dogfood_manifest input. The manual Dogfood manifest health job
defaults to tests/fixtures/dogfood-manifest.jsonl, reports the --health
exit status as a notice, and does not require private PDFs in the repository.
Run the source-level tests:
python3 -m py_compile pdf_glyph_replace.py pdf_fixture.py pdf_inventory.py
python3 -m unittest discover -s tests -v
pdf-glyph-replace --version
pdf-fixture-qdf --version
pdf-inventory --version
pdf-dogfood --version
pdf-dogfood-summary --version
Run public PDF smoke tests:
pdf-fixture-qdf 3734 --pdf --one-glyph-per-line --x 653.375 --y 1370 -o work/public-length.pdf
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align left -o work/public-length-left.pdf --report work/public-length-left.json --bbox-dir work/public-length-left-bbox
./pdf_glyph_replace.py work/public-length.pdf 3734 13846 --align right -o work/public-length-right.pdf --report work/public-length-right.json --bbox-dir work/public-length-right-bbox
qpdf --check work/public-length-left.pdf
qpdf --check work/public-length-right.pdf
pdftotext work/public-length-left.pdf - | rg '13846|3734'
pdftotext work/public-length-right.pdf - | rg '13846|3734'
Run private PDF smoke tests only when local fixture PDFs are available:
./pdf_glyph_replace.py tmp.before-travel.pdf 3807 8304 --dry-run
./pdf_glyph_replace.py tmp.before-travel.pdf 3807 8304 -o work/smoke.8304.pdf --report work/smoke.8304.report.json
qpdf --check work/smoke.8304.pdf
pdftotext work/smoke.8304.pdf - | rg '8304|3807'
./pdf_glyph_replace.py tmp.before-travel.pdf 37.34 138.46 --align right --dry-run
./pdf_glyph_replace.py tmp.before-travel.pdf 37.34 138.46 --align right --plan work/smoke.amount.plan.json --json
Do not create amount-mutated PDFs from private financial fixtures as smoke or release artifacts. Use plan-only output for private amount-like examples unless a non-sensitive synthetic fixture is available.
The CLI uses semantic versioning. The package version in pyproject.toml and
pdf_glyph_replace.__version__ must stay in sync.
See CHANGELOG.md for release notes and docs/dev/RELEASE_CHECKLIST.md for
the full release checklist.
Release gate:
python3 -m py_compile pdf_glyph_replace.py pdf_fixture.py pdf_inventory.py pdf_dogfood.py pdf_dogfood_summary.py
python3 -m unittest discover -s tests -v
python3 -m venv work/release-venv
work/release-venv/bin/python -m pip install -e .
work/release-venv/bin/python -c "from pdf_mutation.engine import plan_qdf; from pdf_mutation import cli; print(plan_qdf.__name__, cli.main.__name__)"
work/release-venv/bin/pdf-glyph-replace --version
work/release-venv/bin/pdf-fixture-qdf --version
work/release-venv/bin/pdf-inventory --version
work/release-venv/bin/pdf-dogfood --version
work/release-venv/bin/pdf-dogfood-summary --version
work/release-venv/bin/python -m pip wheel . -w work/dist
For behavior-changing releases, also run the local PDF smoke tests above when fixture PDFs are available. Release notes should describe newly supported PDF text structures and known limits.
The compatibility CLI module remains importable as pdf_glyph_replace. New
Python integrations should import through the package boundary:
from pdf_mutation.engine import apply_plan_to_qdf, plan_qdf, replace_qdf
from pdf_mutation.reports import bbox_alignment_assertions, report_payload
The historic pdf_glyph_replace module remains as a compatibility wrapper.
The reusable implementation lives under the pdf_mutation package boundary.
The console command is implemented by pdf_mutation.cli:main.
Lower-level package modules such as pdf_mutation.cmap,
pdf_mutation.layout, and pdf_mutation.adapters are internal implementation
seams; prefer pdf_mutation.engine and pdf_mutation.reports for integration
code.
This first version is intentionally strict by default:
- default --align exact requires search and replacement to have the same decoded glyph count;
- --align right supports length-changing replacements only for simple one-glyph-per-line text objects and shifts the text matrix to preserve the right edge;
- --align left supports the same simple one-glyph-per-line text objects and preserves the original text matrix;
- dry-run reports the active alignment contract and estimated text-matrix x-shift for length-changing modes;
- --report writes a non-sensitive JSON report with match locations, font resources, and validation hints;
- --bbox-dir writes optional before/after bbox HTML artifacts and records non-sensitive artifact metadata and alignment assertions in the report;
- --audit inventories every decoded text object and reports split mixed-font matches without including full decoded document text;
- --plan writes a non-sensitive JSON mutation plan for same-glyph-count patchable matches and split/unpatchable candidates;
- --apply-plan applies only reviewed same-glyph-count patchable plan entries after verifying the input fingerprint and planned QDF byte spans;
- --expect-count N fails unless the operation finds exactly N patchable or applied matches, and write modes fail before producing an output PDF;
- split candidates record per-segment font resources and blockers but are audit-only under the current plan schema;
- replacement characters must already exist in the active PDF font CMap;
- matches must fit inside one BT ... ET text object;
- supported exact-mode text drawing forms are hexadecimal <...> Tj and simple [...] TJ arrays with hexadecimal string entries;
- exact mode supports multiple CIDs inside one hexadecimal string operand;
- matches split across text objects or font changes are reported as infeasible by dry-run instead of being patched.
That covers deterministic token changes such as account suffixes and IDs, plus
simple right-aligned amount edits in PDFs that emit each glyph as a separate
Tj line with Td advances.
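The difference between the two length-changing alignment policies comes down to simple arithmetic on glyph advances, sketched below. The advance values (6.0 units per digit, 3.0 for the period) and the estimated_x_shift helper are hypothetical; real advances depend on the font and the Td operands in the PDF.

```python
def estimated_x_shift(old_advances, new_advances, align):
    """Estimate the text-matrix x-shift for a length-changing replacement."""
    if align == "left":
        return 0.0  # left edge and original text matrix stay fixed
    if align == "right":
        # Move the start so the right edge lands where it was before.
        return sum(old_advances) - sum(new_advances)
    raise ValueError(f"unsupported alignment: {align}")


DIGIT, PERIOD = 6.0, 3.0  # hypothetical advances in text-space units
old = [DIGIT, DIGIT, PERIOD, DIGIT, DIGIT]          # "37.34"
new = [DIGIT, DIGIT, DIGIT, PERIOD, DIGIT, DIGIT]   # "138.46"

shift = estimated_x_shift(old, new, "right")
print(shift, 653.375 + shift)
```

Under these assumed advances the longer replacement is one digit wider, so right alignment moves the starting x coordinate left by one digit advance while left alignment leaves it unchanged.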