
Parse and render first-page headers with VML/EMF images #1

Open: samcorcos wants to merge 2 commits into main from feat/open-headers

Summary

Word header logos and OLE preview pictures are commonly EMF metafiles, which browsers can't decode, so they rendered as a broken <img> box. This PR extracts the PNG/JPEG bitmap that's almost always embedded inside the metafile and uses it as the display URL, so the logo renders. The original EMF bytes stay on MediaFile.data, so round-trip is unaffected.
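
To make the extraction idea concrete, here is a minimal sketch, assuming a plain byte-signature scan. The names `extractEmbeddedRaster`, `indexOfSeq`, and `lastIndexOfSeq` are hypothetical and are not the PR's `extractMetafileRaster`, whose implementation is not shown here; a production version would more likely walk the EMF records instead of scanning raw bytes.

```typescript
// Hypothetical sketch: find a PNG or JPEG embedded in a metafile by signature.
type Raster = { mime: "image/png" | "image/jpeg"; bytes: Uint8Array };

// 8-byte PNG file signature.
const PNG_SIG = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
// "IEND" chunk type followed by its fixed CRC; this ends every PNG stream.
const PNG_END = [0x49, 0x45, 0x4e, 0x44, 0xae, 0x42, 0x60, 0x82];
// JPEG start-of-image marker plus the first byte of the next marker.
const JPEG_SOI = [0xff, 0xd8, 0xff];
// JPEG end-of-image marker.
const JPEG_EOI = [0xff, 0xd9];

function matchesAt(buf: Uint8Array, seq: number[], at: number): boolean {
  for (let j = 0; j < seq.length; j++) {
    if (buf[at + j] !== seq[j]) return false;
  }
  return true;
}

// First offset >= from where seq occurs, or -1.
function indexOfSeq(buf: Uint8Array, seq: number[], from = 0): number {
  for (let i = from; i <= buf.length - seq.length; i++) {
    if (matchesAt(buf, seq, i)) return i;
  }
  return -1;
}

// Last offset >= from where seq occurs, or -1.
function lastIndexOfSeq(buf: Uint8Array, seq: number[], from = 0): number {
  for (let i = buf.length - seq.length; i >= from; i--) {
    if (matchesAt(buf, seq, i)) return i;
  }
  return -1;
}

function extractEmbeddedRaster(buf: Uint8Array): Raster | null {
  const pngStart = indexOfSeq(buf, PNG_SIG);
  if (pngStart >= 0) {
    const end = indexOfSeq(buf, PNG_END, pngStart + PNG_SIG.length);
    if (end >= 0) {
      return { mime: "image/png", bytes: buf.slice(pngStart, end + PNG_END.length) };
    }
  }
  const jpgStart = indexOfSeq(buf, JPEG_SOI);
  if (jpgStart >= 0) {
    // Use the LAST end-of-image marker so an EXIF thumbnail's own EOI
    // does not truncate the outer image.
    const end = lastIndexOfSeq(buf, JPEG_EOI, jpgStart + JPEG_SOI.length);
    if (end >= 0) {
      return { mime: "image/jpeg", bytes: buf.slice(jpgStart, end + JPEG_EOI.length) };
    }
  }
  return null; // vector-only metafile: nothing to extract
}
```

A `null` result corresponds to the vector-only residual case. One limitation of this sketch: taking the last `FF D9` in the whole buffer can overshoot if trailing metafile records happen to contain those bytes, which is one reason a record-aware parser is preferable.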

Approach shipped: embedded-raster extraction (not full EMF rasterization). For the residual case (a vector-only metafile with no embedded bitmap), there's a new parseDocx(buf, { mediaResolver }) hook so a host can rasterize server-side and hand back a data:/blob: URL.

Also in this PR:

  • w:smartTag is a transparent wrapper; recurse into it instead of dropping its runs (the Treasury template lost "WASHINGTON" without this).
  • Unedited headers/footers re-emit their original XML on save (HeaderFooter.verbatimXml), so w:object/OLE/VML the model can't fully represent round-trips byte-identically. The HF inline editor and every model-mutation site clear it on first edit.
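
The `w:smartTag` fix in the list above amounts to treating the element as a transparent wrapper while collecting runs. A minimal sketch, assuming a simplified node shape: `XmlNode`, `TRANSPARENT_WRAPPERS`, and `collectRuns` are hypothetical names, and the parser's real XML types are not shown in this PR.

```typescript
// Hypothetical minimal node shape for illustration only.
interface XmlNode {
  name: string;
  text?: string;
  children?: XmlNode[];
}

// Elements whose children should be treated as if they were direct children.
const TRANSPARENT_WRAPPERS = new Set<string>(["w:smartTag"]);

// Collect w:r runs in document order, descending through transparent
// wrappers (which may nest) instead of dropping their contents.
function collectRuns(parent: XmlNode): XmlNode[] {
  const runs: XmlNode[] = [];
  for (const child of parent.children ?? []) {
    if (child.name === "w:r") {
      runs.push(child);
    } else if (TRANSPARENT_WRAPPERS.has(child.name)) {
      runs.push(...collectRuns(child));
    }
    // Any other element is ignored by this sketch.
  }
  return runs;
}
```

With this shape, a run nested inside one or more `w:smartTag` elements (the "WASHINGTON" case) comes back alongside its sibling runs, in order.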

Visual verification

Treasury information-memo template, page 1:

[Screenshots: page 1 before and after]

Header close-up (after):

[Screenshot: header]

Tests

  • packages/core/src/docx/__tests__/header-vml-emf.test.ts: extractMetafileRaster on a real EMF; parseDocx populates headers, image src is data:image/png, smartTag runs survive, mediaResolver overrides; round-trip leaves header1.xml / footer1.xml / image1.emf byte-identical; default vs first+titlePg both populate.
  • e2e/tests/header-vml-emf.spec.ts: seal <img> decodes (naturalWidth > 0) inside .layout-page-header on page 1; body text starts below the header band; clicking body still places the caret; header renders under externalContent mode.
  • 1785/1785 unit tests pass; targeted header/footer/image e2e suite green (3 pre-existing failures on main confirmed unrelated).

Notes for upstream

Needs a bun changeset (minor — additive public API: MediaResolver, extractMetafileRaster, HeaderFooter.verbatimXml). Left for the upstream PR per the changeset workflow.

Word header logos and OLE preview pictures are commonly EMF metafiles
that browsers can't decode, so they rendered as broken `<img>` boxes.
The metafile almost always wraps a single PNG/JPEG bitmap record;
extract it at media-load time and use that as the display URL while
keeping the original bytes for round-trip. Adds an optional
`mediaResolver` parse hook for hosts that want to rasterize the
vector-only residual server-side.

Also: recurse into `w:smartTag` (its runs were silently dropped), and
capture each header/footer's original XML so unedited parts re-emit
byte-identically on save instead of being rebuilt from the model.
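
The verbatim re-emit behaviour described above can be sketched as follows. Only the `verbatimXml` field name comes from this PR; the `HeaderFooter` shape, `serializeFromModel`, `serializeHeaderFooter`, and `editParagraph` are hypothetical stand-ins, and XML escaping is left out for brevity.

```typescript
// Hypothetical, heavily simplified model.
interface HeaderFooter {
  paragraphs: string[];
  verbatimXml?: string; // original part XML, present only while unedited
}

// Rebuild the part from the model (lossy for w:object/OLE/VML content).
function serializeFromModel(hf: HeaderFooter): string {
  const body = hf.paragraphs
    .map((text) => `<w:p><w:r><w:t>${text}</w:t></w:r></w:p>`)
    .join("");
  return `<w:hdr>${body}</w:hdr>`;
}

// On save, prefer the captured XML so unedited parts are byte-identical.
function serializeHeaderFooter(hf: HeaderFooter): string {
  return hf.verbatimXml ?? serializeFromModel(hf);
}

// Every mutation site must drop the captured XML, otherwise the edit
// would be silently discarded on export.
function editParagraph(hf: HeaderFooter, index: number, text: string): void {
  hf.paragraphs[index] = text;
  delete hf.verbatimXml;
}
```

The design trade-off is that correctness depends on every edit path clearing the field, which is why a missed mutation site shows up as edits lost on export.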
- clear HeaderFooter.verbatimXml in the React/Vue overlay save paths
  (useHeaderFooterEditing / usePagesPointer) so edits aren't lost on export
- mediaResolver: resolve files in parallel and swallow per-file errors so
  one bad conversion doesn't abort the whole parse
- reuse mediaToDataUrl from unzip instead of a local duplicate
- metafileRaster: drop the brittle GIF extractor; scan JPEG for the last
  EOI so an EXIF thumbnail doesn't truncate the outer image
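
The "resolve in parallel and swallow per-file errors" behaviour from the list above could look roughly like this. It is a sketch under assumptions: the PR names a `MediaResolver` type but its signature is not shown here, so the one below is hypothetical, as is `resolveAllMedia`.

```typescript
// Hypothetical resolver signature: return a display URL, or undefined to
// fall back to the default handling for that file.
type MediaResolver = (path: string, data: Uint8Array) => Promise<string | undefined>;

async function resolveAllMedia(
  files: Map<string, Uint8Array>,
  resolver: MediaResolver,
): Promise<Map<string, string>> {
  // Start every conversion at once. The async wrapper also turns a
  // synchronous throw inside the resolver into a rejected promise.
  const settled = await Promise.allSettled(
    [...files.entries()].map(
      async ([path, data]) => [path, await resolver(path, data)] as const,
    ),
  );

  const urls = new Map<string, string>();
  for (const result of settled) {
    // Rejected entries are skipped, so one bad file cannot abort the parse.
    if (result.status !== "fulfilled") continue;
    const [path, url] = result.value;
    if (url) urls.set(path, url);
  }
  return urls;
}
```

`Promise.allSettled` fits better than `Promise.all` here because it waits for every conversion and never rejects, so partial results survive a single failure.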