Skip to content

QrCommunication/gigapdf-lib

Repository files navigation

gigapdf-lib

A zero-dependency PDF engine, written from scratch in Rust and compiled to WebAssembly — read, edit, render, secure, and convert PDFs with no third-party crates and no native libraries.

The TypeScript SDK is published as @qrcommunication/gigapdf-lib (see sdk/); the self-contained .wasm ships inside it.

Copyright 2025 Rony Licha / QR Communication. Licensed under the PolyForm Noncommercial License 1.0.0 — see LICENSE. Required Notice: Copyright 2025 Rony Licha / QR Communication.

Why it exists

The previous editor used a Fabric.js overlay + cosmetic mask, which cannot reconstruct a complex background (gradient, image, pattern) under edited text. This engine edits the real PDF content stream: it physically removes/edits/adds the page operators, so the background is preserved by construction and the original glyphs never leak. It then grew into a self-contained PDF toolkit so the product depends on no external PDF/Office/font library (no MuPDF, no LibreOffice, no fontkit) for its core flows.

Zero dependencies

None. Everything is pure std and compiles straight to wasm32:

  • Lexer, object parser, xref-streams, object-streams.
  • FlateDecode/zlib inflate and deflate (RFC 1950/1951) from scratch.
  • Content-stream interpreter + editor; renumbering serializer.
  • Crypto from scratch: MD5, RC4, AES-128/256, SHA-256/384/512, big-integer modular arithmetic (Montgomery), RSA, ASN.1 DER, X.509, CMS/PKCS#7.
  • Rasterizer: scanline fill (AA), PNG encoder, TrueType glyf + CFF Type2 glyph outlines, image XObject blit.
  • ZIP reader/writer, OOXML/ODF builders, a from-scratch PDF page builder.

The WebAssembly sandbox has no network and no entropy — those come from the host through a tiny port (the host supplies crypto.getRandomValues bytes and performs Google-Fonts downloads). Everything else is in the engine.

Feature matrix

Area Capabilities
Read PDF 1.7, xref + object streams, FlateDecode, encrypted (RC4/AESV2/AESV3)
Write Renumbering serializer, save, save_compressed (Flate streams)
Edit content Text edit/remove (with underline / strikethrough decorations), elements (text/image/shape) list/remove/move/duplicate/add; draw text/rect/line/ellipse/polygon/SVG-path/image (opacity + PNG alpha); hit-test
Text extraction Font-aware, zero-tofu via WinAnsi + /ToUnicode CMap (CID/Type0); per-run colour/size/rotation/direction; document language detection
Headers / footers Bake a running header/footer onto an existing PDF ({{page}}/{{pages}} tokens) and read back what's baked; per-page margins read/write
Annotations Highlight, underline, strike-out, squiggly, free-text, square, line, ink, sticky note, stamp, link; rich read-back metadata; flatten
Forms (AcroForm) Text/checkbox/radio/combo/list/signature fields — read · fill · create (build widgets from scratch with appearance streams + NeedAppearances)
Pages Rotate, delete, move, extract, merge, resize, insert, copy; bookmarks/outline; metadata; embedded-file attachments
Security Encrypt/permissions, self-signed digital signature (RSA/X.509/CMS), PKCS#12 signing (import a user .p12/.pfx natively — PBES2 AES + PBES1 3DES/RC2, MAC-verified — no node-forge/@signpdf), true redaction (delete from stream) + redactPii (v0.52.4) — irreversible redaction that also erases image pixels (safe on scans/OCR) under an opaque mark
Render Rasterize a page to PNG (vector + TrueType/CFF glyphs + images); native image codecs — encode/decode PNG · JPEG · lossless WebP, decode GIF + AVIF (AV1 intra); alpha-correct resize
Text intelligence Font-aware extraction, structured text (reading-order lines + boxes), full-text search with highlight boxes
OCR Built-in recognizer — Otsu → connected components → line/word segmentation → CNN trained on EMNIST handwriting + synthetic font glyphs (Latin + accents); opt-in line-level CRNN+CTC models per script (Latin/Cyrillic/Greek, Arabic/Hebrew, Devanagari, Bengali, Tamil). No Tesseract, no model download at runtime
Convert → PDF → TXT, HTML, DOCX, PPTX, ODP, ODT, XLSX, ODS, RTF (real editable elements, not a page image)
Convert ← TXT, HTML, RTF, DOCX, ODT, ODP, PPTX, XLSX, ODS → PDF (ODF .odt/.ods/.odp are fully bidirectional)
Unified editable model Format-neutral document tree (sections → pages → blocks → runs): lower any format in (toModel/officeToModel/htmlToModel), edit with structured ops (applyModelOps), raise to any format (modelTo{Docx,Xlsx,Pptx,Odt,Ods,Odp,Pdf,Html,Rtf}) — edit every format the same way
HTML rendering Native HTML + CSS → PDF engine (parser, selector cascade, block / inline / table / flex (direction · justify-content · grow) / grid layout, pagination, page-break-* + <pagebreak>, running header/footer in the page margins) — no headless browser. Text set in embedded Google fonts (real glyphs + metrics, identical or nearest match)
JavaScript Built-in zero-dependency JS engine that runs a document's inline <script>s before layout — no Chromium/Playwright. Lexer → parser → tree-walking interpreter with classes + super, closures, destructuring, generators (function*/yield), async/await + Promise (microtask queue + setTimeout), and built-ins: Object/Array/String/Number/Math/JSON/console/Map/Set/RegExp + a backtracking regex engine. DOM bindings: getElementById, querySelector(All) (#id/.class/tag/>/+/~/[attr]), textContent, innerHTML, createElement/appendChild, classList, style, …
Archival PDF/A-2b metadata (XMP + sRGB OutputIntent + ID)
Fonts Draw and edit real text in every font source & any font file — built-in base-14 standard fonts (no embedding), any family / Google Font (1951-family catalog + URL builder + TrueType and OpenType-CFF embedding: glyf→Type0/CIDFontType2+FontFile2, .otf/OTTO→Type0/CIDFontType0+FontFile3, Identity-H + full widths + ToUnicode), and the document's own embedded faces (embeddedFonts + extractFont → re-embed). addText and font-aware replaceText resolve any face's char→glyph map (FontFile2/FontFile3); needed-font detection

All of it is exercised by cargo test (284 tests, incl. a 100-test pure-Rust JavaScript engine: lexer, parser, interpreter, built-ins, regex, DOM, and a suspendable bytecode VM with lazy generators, spec-ordered async, and full control-flow — try/catch/finally, switch, labels, destructuring, spread), a Node WASM smoke test (end-to-end, all green), and validated externally: generated Office files (DOCX/PPTX/XLSX and ODT/ODS/ODP) open and round-trip in LibreOffice; embedded fonts verify as emb=yes under poppler's pdffonts.

Honest scope

Conversions are content-and-layout faithful, not pixel-perfect re-typesetting. PDF→Office reconstructs real, editable objects (positioned text boxes, re-embedded images, table cells) the way an office suite's PDF import does — not a rendered page image. Office→PDF is text-faithful (all content, reading order, pagination) using the standard-14 fonts; pixel-perfect re-layout of an arbitrary, richly-styled document stays the job of a full layout engine. Full PDF/A conformance additionally requires every font embedded (the engine can do that).

The JavaScript engine targets the language used by templating/report scripts: classes/super, closures, destructuring/spread, RegExp, Map/Set, Symbol (real, with the iterator protocol), eval/Function, tagged templates, and import/export (parsed transparently). function*/async bodies compile to a suspendable bytecode VM, so generators are truly lazy (infinite while (true) { yield … } works, .next(v) is bidirectional, yield* delegates lazily) and await yields to the event loop with spec microtask ordering. The VM covers the full statement/expression language used by templates — try/catch/finally, for…of/for…in, switch, labelled break/ continue, destructuring, compound assignment, and ...spread — all able to span a yield/await. A handful of corner cases (a return/break through a finally, a logical &&=/||=/??= with an awaited right-hand side, sparse array holes) transparently fall back to the eager generator / synchronous-await model — same results, just not lazy. By design the sandbox has no network and no real timers (setTimeout resolves on the microtask queue). CSS flex supports flex-direction, justify-content and flex-grow; grid lays out grid-template-columns; float maps to inline-block.

OCR & text intelligence

Text already in a PDF is extracted font-aware (zero tofu) with reading-order lines and bounding boxes, and is searchable with highlight boxes. For scanned, image-only pages the engine has a built-in OCR following the classic Tesseract pipeline — Otsu binarization → connected-component blobs → line/word segmentation → per-glyph classification — but with a from-scratch, dependency-free classifier:

  • The classifier is a compact CNN trained offline on two public sources: EMNIST (NIST handwritten digits + letters, public domain) for handwriting, and synthetic glyphs rendered from thousands of fonts (system + Google Fonts, the Tesseract text2image approach) for printed text, punctuation and accented Latin.
  • Training is build-time only (tools/train_ocr_cnn.py); the engine ships the int8-quantized weights and runs a pure-std forward pass — no ML library, no model download at runtime.
  • Scripts/languages (mono-glyph engine): Latin — 0-9 A-Z a-z, common punctuation, and accented Latin (é è à ç ñ ü …) for French, Spanish, German, Portuguese, etc. Both printed and handwritten Latin are recognized.
  • Honest accuracy: strong on clean machine print, decent on tidy handwriting (EMNIST-grade); noisy scans and dense layouts are harder.

Line-level CRNN+CTC engine (opt-in, multi-script). A second recognizer removes the per-glyph segmentation that caps the classic pipeline (touching glyphs, cursive scripts, noisy scans). It reads a whole text line as a sequence — Otsu or Sauvola binarization → projection-profile line bands → CNN → bidirectional GRU → CTC — still a pure-std int8 forward pass (crates/core/src/raster/ocr_crnn.rs), no ML dependency. Models are per script group, trained offline (tools/train_ocr_crnn.py) and enabled via Cargo features (ocr-alpha, …); ocr() uses the CRNN when a model is embedded and falls back to the mono-glyph classifier otherwise.

  • Trained today: group alphaLatin-extended + Cyrillic + Greek printed (Polish, Czech, Turkish, Vietnamese, Russian, Ukrainian, Greek, …). On a synthetic multi-script clean-print benchmark it comfortably beats Tesseract 5.3.4 — CER 0.119 vs 0.258 (~2.2×), WER 0.41 vs 0.62 (larger 24/48/96 backbone; see docs/OCR_TRAINING_LOG.md) — with homoglyph disambiguation snapping Latin/Greek/Cyrillic lookalikes (A/Α/А). Caveat: synthetic clean print on the four trained languages; real degraded scans and untrained scripts still favour Tesseract's breadth.
  • Also trained (non-Latin): Tamil (taml) — beats Tesseract (0.077 vs 0.101); Arabic + Hebrew (arabic, RTL) — beats Tesseract on synthetic (0.071 vs 0.349), output verified non-mirrored; Devanagari (deva, larger 24/48/96 backbone) — now beats Tesseract (0.078 vs 0.089); Bengali (beng) — competitive (0.104 vs 0.073), larger-backbone retrain pending. Backbone is env-tunable (GIGA_OCR_C1/C2/HID); PIL raqm shaping handles Indic/Arabic forms.
  • Handwriting: a handwriting variant ocr_alpha_hw.gpocr (32/64/128 backbone, trained on ~108k real handwriting lines — IAM/RIMES/NorHand/NewsEye/Belfort/POPP/Esposalles/Cyrillic via the HF datasets-server — plus synthetic Handwriting fonts) beats Tesseract on real cursive: CER 0.309 vs 0.353 (WER 0.737 vs 0.775) on the IAM test set. The printed champion stays primary for clean scans; load the HW variant via gp_ocr_load_model for handwriting-heavy input — see docs/OCR_TRAINING_DATA.md.
  • Deliberately out of scope: cjk (Chinese/Japanese/Korean) — not trained by design. A usable model needs the full frequency charset, many CJK fonts, and a much larger backbone for 3 000+ classes (a 152-char proof would be a toy); the infra is in place if revisited.
  • Design: docs/OCR_ARCHITECTURE.md · data catalogue: docs/OCR_TRAINING_DATA.md · training log: docs/OCR_TRAINING_LOG.md.

Layout

crates/core   gigapdf-core  — the whole engine (parse, inflate, edit, render, crypto, convert)
crates/wasm   gigapdf-wasm  — extern "C" WebAssembly bindings (zero-dep ABI)
fixtures/     test PDFs
test/         wasm-smoke.mjs — end-to-end Node harness
tools/        catalog/ICC generators + snapshots
docs/         SDK.md · COOKBOOK.md · USAGE.md · API.md · HTML-CSS.md · INSTALL.md · OCR_ARCHITECTURE.md · OCR_TRAINING_DATA.md · OCR_TRAINING_LOG.md

Quickstart

Rust

use gigapdf_core::Document;

let mut doc = Document::open(&bytes)?;
let docx = doc.to_docx();            // PDF → editable Word
let pdf  = gigapdf_core::convert::reverse::txt_to_pdf("Hello\nWorld"); // text → PDF
doc.embed_truetype_font("Roboto", &ttf)?; // host-downloaded font
let signed = doc.sign(&signer, "Me", "Approval", "D:20260614120000Z")?;
let out = doc.save();

Browser / Node (WebAssembly)

const { instance } = await WebAssembly.instantiate(wasmBytes, {});
const ex = instance.exports;
const handle = ex.gp_open(ptr, len);     // returns an opaque handle
const docx = callBuffer(() => ex.gp_to_docx(handle, lenPtr)); // → Uint8Array
ex.gp_close(handle);

Documentation

Doc What's in it
docs/SDK.md Complete TypeScript SDK reference — every GigaPdfEngine/GigaPdfDoc method, grouped by domain, with parameters, returns and notes.
docs/COOKBOOK.md Task-oriented recipes — redaction, styled text, headers/footers, conversions, OCR, forms, annotations, signing, encryption, and the editable model, each as a short runnable snippet.
docs/USAGE.md Host integration: the raw extern "C" buffer ABI plus a worked example for every feature area.
docs/API.md The Rust ↔ WASM ABI mapping (every gp_* export and its Rust method).
docs/HTML-CSS.md The exhaustive list of supported HTML elements, CSS properties, units, colours, selectors and JS in the HTML→PDF renderer.
docs/INSTALL.md Install, build-from-source, and Next.js (output: "standalone") wiring.

Build

cargo test -p gigapdf-core   # native tests (real fixtures)
cargo wasm                   # build the WASM engine (alias, see .cargo/config.toml)
node test/wasm-smoke.mjs     # end-to-end WASM smoke test

cargo wasm is a repo alias for the full target build, so you never type the target triple by hand (cargo wasm-dev for a debug build).

The release .wasm is ~540 KB — zero dependencies, versus ~14 MB for MuPDF.

License & provenance

PolyForm Noncommercial 1.0.0. Built clean-room from the ISO 32000 specification; no AGPL code (e.g. MuPDF) was ever read or copied. See LICENSE.

Packages

 
 
 

Contributors