WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes by joshday · Pull Request #54 · JuliaComputing/XML.jl

joshday · 2026-03-06T21:54:41Z

Summary of Changes

I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!

Major rewrite of XML.jl's internals that addresses many open issues
Self-contained src/XMLTokenizer.jl module for speedy tokenization
Node{T} now parameterized by the string storage type, enabling quick reads via SubString or StringViews.jl
StringViews extension — XML.mmap("file.xml", LazyNode) for memory-mapped parsing of very large files
XPath support — xpath(node, path) with a practical subset of XPath 1.0
Greatly expanded test suite — 243 libxml2 test cases, pugixml and libexpat compatibility tests, W3C conformance tests

Downstream

@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in XLSX.jl tests related to Raw no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Closes #17 "XML character references are not unescaped/escaped": XML character references are now unescaped/escaped
Closes #30 "XPath support": XPath support
Closes #33 "Inconsistent type for attributes where nodes have no attributes"
Closes #35 "Simple XML.write followed by XML.parse fails": simple XML.write followed by XML.parse no longer fails
Closes #50 "get not defined to match getindex": get defined to match getindex
Closes #52 "Question: Why the choice not to escape & to &amp; ?": escape now unconditionally escapes '&'
Closes #53 "Incorrect unescape result.": incorrect unescape result (double-unescaping)
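
The escaping behaviour behind #52 and #53 can be illustrated with a small self-contained sketch. `sketch_escape` and `sketch_unescape` are hypothetical stand-ins, not XML.jl's implementation, and the entity list is an assumption. The sketch assumes Julia 1.7 or later, where `replace` accepts several pattern pairs and applies them in a single left-to-right pass; that single pass is what keeps `&amp;lt;` from being unescaped twice.

```julia
# Hypothetical stand-ins for illustration only; XML.jl's own escape/unescape may differ.
# Requires Julia >= 1.7: multi-pair `replace` scans the string once, so the output
# of one replacement is never rescanned by another.

# Escape the five predefined XML entities. '&' is always escaped,
# even when the input already looks like an entity.
sketch_escape(s::AbstractString) = replace(String(s),
    "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", "\"" => "&quot;", "'" => "&apos;")

# Reverse the mapping in one pass, so "&amp;lt;" becomes "&lt;" and not "<".
sketch_unescape(s::AbstractString) = replace(String(s),
    "&amp;" => "&", "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"", "&apos;" => "'")

escaped   = sketch_escape("a&b<c>")       # "a&amp;b&lt;c&gt;"
roundtrip = sketch_unescape(escaped)      # "a&b<c>"
once      = sketch_unescape("&amp;lt;")   # "&lt;"
```

A sequential implementation that replaced `&amp;` first and `&lt;` afterwards would turn `&amp;lt;` into `<`, which is the double-unescaping symptom named in #53.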

Benchmarks: See `benchmarks/compare.jl`

Here (SS) refers to using SubString{String} as storage type.

julia --project=. benchmarks/compare.jl
============================================================
  XML.jl Benchmark Comparison
  Current (dev) vs v0.3.8
============================================================

Running dev benchmarks... done
Setting up v0.3.8 worktree... done
Running v0.3.8 benchmarks... done

------------------------------------------------------------

  Parse (small)
          v0.3.8      0.114 ms
             dev     0.0335 ms  (70.6% faster)

  Parse (small, SS)
          v0.3.8           n/a
             dev     0.0285 ms

  Parse (medium)
          v0.3.8   634.7153 ms
             dev   161.0888 ms  (74.6% faster)

  Parse (medium, SS)
          v0.3.8           n/a
             dev   151.3025 ms

  Write (small)
          v0.3.8     0.0227 ms
             dev     0.0176 ms  (22.4% faster)

  Write (medium)
          v0.3.8   118.1504 ms
             dev     77.619 ms  (34.3% faster)

  Read file (medium)
          v0.3.8   645.5785 ms
             dev   170.8398 ms  (73.5% faster)

  Collect tags (small)
          v0.3.8     0.0005 ms
             dev     0.0006 ms  (10.3% slower)

  Collect tags (medium)
          v0.3.8    21.0988 ms
             dev    11.1532 ms  (47.1% faster)

============================================================

TimG1964 · 2026-03-08T12:25:20Z

Hey @joshday. I've only had a very superficial look so far but it looks great. Thanks!

In terms of impact on XLSX.jl, I think it looks significant. It isn't just Raw. Since @nhz2 first suggested using Raw, I've known it was internal and therefore subject to change. On first inspection, I think the rework involved should be manageable.

More of a challenge will be the removal of prev and next, which are currently exported functions. I rely on these for fundamental elements of XLSX.jl like the sheetrow and tablerow iterators, and for reading and writing the XML files from/to the zip archive .xlsx file.

These obviously aren't insuperable, but will likely need a bit of time while I get to grips with xpath and the tokenizer. Optimistic me thinks the new functionality will simplify the code of XLSX.jl, but I usually find things are considerably harder than I first imagine! I'll feed back more when I've had a bit more of a go at getting XLSX.jl working.

Thanks,

Tim

TimG1964 · 2026-04-02T10:31:26Z

> I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade.

Thanks!

Drops the underscore prefixes from internal names (module is unexported, the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table. Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
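
The two changes in this commit message can be sketched as follows. All names here (`is_name_byte_slow`, `NAME_BYTE_TABLE`, `is_name_byte`, `preceded_by_lt_bang`) and the exact set of accepted bytes are assumptions made for illustration; the tokenizer's real predicate and `read_doctype_body` are not shown in this thread.

```julia
# Hypothetical reconstruction for illustration; names and the byte set are assumptions.

# Branchy predicate: ASCII letters, digits, '_', ':', '-', '.', and any byte >= 0x80
# (so multi-byte UTF-8 sequences are accepted permissively).
function is_name_byte_slow(b::UInt8)
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    b == UInt8('_') || b == UInt8(':') || b == UInt8('-') || b == UInt8('.') ||
    b >= 0x80
end

# 256-entry table built once at load time; byte value b lives at index b + 1.
const NAME_BYTE_TABLE = Bool[is_name_byte_slow(b) for b in 0x00:0xff]

# Table lookup replaces the chain of comparisons. A UInt8 can only index 1..256,
# so skipping the bounds check is safe here.
is_name_byte(b::UInt8) = @inbounds NAME_BYTE_TABLE[Int(b) + 1]

# Off-by-one guard: reading codeunit(data, pos - 2) needs pos - 2 >= 1, i.e. pos >= 3.
# With the weaker guard `pos >= 2`, pos == 2 would read index 0 and throw a BoundsError.
function preceded_by_lt_bang(data::String, pos::Int)
    pos >= 3 || return false
    codeunit(data, pos - 2) == UInt8('<') && codeunit(data, pos - 1) == UInt8('!')
end
```

The `+ 1` in the table lookup and the `pos >= 3` guard come from the same fact: Julia indices start at 1, so index 0 is never valid.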

tag, value, keys, and attributes on LazyNode now return SubString{String} views into the source rather than allocating fresh Strings, so traversing a large document lazily does not duplicate its text data. Introduces a small _as_substring helper to promote the String that `unescape` can return into a SubString so Attributes stays homogeneous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
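
A minimal sketch of the promotion idea described above, assuming a helper shaped like the one named in the commit message. `as_substring` is a hypothetical name used here, and the attribute container is modelled as a plain vector of pairs, which may not match the actual Attributes type.

```julia
# Hypothetical sketch; `as_substring` mirrors the idea of the helper, not its actual code.

# Promote either string type to SubString{String} so containers stay homogeneous.
as_substring(s::SubString{String}) = s      # already a view: returned as-is
as_substring(s::String) = SubString(s)      # wrap the whole String in a view

source = "<a key=\"v1\" other=\"x &amp; y\"/>"

key = SubString(source, 4, 6)               # "key", a view into `source`
val = SubString(source, 9, 10)              # "v1", a view into `source`

# A value that needed unescaping comes back as a fresh String ...
unescaped = "x & y"

# ... and is promoted so every pair has the same concrete type.
attrs = Pair{SubString{String},SubString{String}}[]
push!(attrs, key => val)
push!(attrs, as_substring("other") => as_substring(unescaped))
```

Keeping a single concrete element type lets the container stay type-stable whether or not a given value contained entities.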

_write_xml now inspects children before reformatting: if any Text child has non-whitespace content (or any CData child exists), the element is treated as mixed content and its whitespace is preserved verbatim. Otherwise the writer drops the whitespace-only Text nodes the parser emits for round-tripping source formatting and generates fresh indentation. Same filter is applied at the Document level. Also adds an unescape(::SubString{String}) specialization that returns the input unchanged when it contains no '&', avoiding an allocation on the lazy scanning path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
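
The mixed-content rule and the no-'&' fast path described above can be sketched roughly as below. `SketchNode`, `is_mixed`, `children_to_write`, and `sketch_unescape` are hypothetical names, the node model is deliberately simplified, and the entity list is partial; none of this is the code of `_write_xml` or `unescape` itself. The multi-pair `replace` call assumes Julia 1.7 or later.

```julia
# Hypothetical node model for illustration; XML.jl's Node type and _write_xml differ.
struct SketchNode
    kind::Symbol     # :element, :text, or :cdata
    content::String  # text for :text/:cdata, tag name for :element
end

is_whitespace_text(n::SketchNode) = n.kind === :text && all(isspace, n.content)

# Mixed content: any CData child, or any Text child with non-whitespace characters.
is_mixed(children) =
    any(n -> n.kind === :cdata || (n.kind === :text && !all(isspace, n.content)), children)

# Children the writer would emit: verbatim for mixed content, otherwise with
# whitespace-only Text nodes dropped so fresh indentation can be generated.
children_to_write(children) =
    is_mixed(children) ? children : filter(!is_whitespace_text, children)

# Fast path: without '&' there is nothing to unescape, so the view is returned as-is.
function sketch_unescape(s::SubString{String})
    occursin('&', s) || return s
    SubString(replace(String(s), "&amp;" => "&", "&lt;" => "<", "&gt;" => ">"))
end

pretty = [SketchNode(:text, "\n  "), SketchNode(:element, "b"), SketchNode(:text, "\n")]
mixed  = [SketchNode(:text, "hello "), SketchNode(:element, "b"), SketchNode(:text, "\n")]
```

Here `children_to_write(pretty)` keeps only the element, while `children_to_write(mixed)` keeps all three children, including the trailing newline.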

The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker. Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a; see follow-up issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mathieu17g · 2026-05-20T22:23:34Z

Hi @joshday, the v0.4 rewrite looks good; the tokenizer architecture reads faster in eager mode.

I'm currently evaluating the v0.4 upgrade on FastKML.jl following your comment on my PR #58. The eager-path improvements are substantial on real-world KML: ×2–2.6 speedup and 37–69% memory reduction across four reference files (5k to 163k Placemarks), versus v0.3.8 + #58 + #59 eager.

There is a trade-off I wanted to surface before #54 is merged: v0.4 removed the linear-traversal API on LazyNode — next / prev (and the next! / prev! zero-alloc variants from PR #59) — replaced by eachchildnode / children, which allocate per child. That leaves no API equivalent for the zero-alloc lazy walk class that PR #59 provided under v0.3.8.

On real-world KML files with non-trivial structure, the v0.4 lazy path regresses by ×1.4 to ×2.6 vs v0.3.8 + #58 + #59 lazy across the full 4-file reference set. Two concrete cases:

USGS WRS-2 tiles — 28k Polygon Placemarks (each a 5-vertex LinearRing) in a single flat top-level layer. Regresses ×2.6.
EPA Facility Registry — 163k Point Placemarks across 19k nested folders. Regresses ×2.3.

On those same files, the v0.3.8+PRs lazy path was actually faster than v0.4 eager too, so a strict migration loses the previously optimal path. Full per-file profile and methodology in the linked results doc below.

I've written up the decomposition (synthetic bench + FastKML real workloads + cost attribution + a SOTA-informed design space) as a separate design issue #61 rather than clutter this PR thread.

Full data: benchmark/results_eager_vs_lazy_3way_2026-05-11.md on the FastKML wip-xml-v0.4 branch.

Happy to refine the benchmark or prototype any direction if useful.

mathieu17g · 2026-06-03T06:21:45Z

Hi, @joshday. I've posted some results on #61 — a Cursor streaming primitive plus an isbits Token that takes the iterate tuple allocation-free. The cursor is additive, but making Token isbits touches this PR's core Token, so flagging it here too: I'd value your read on whether that change fits v0.4.
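
To make the isbits point concrete, here is a minimal sketch contrasting two token layouts. `ViewToken`, `OffsetToken`, and `token_text` are hypothetical; the fields of the PR's actual Token are not shown in this thread. The sketch only demonstrates the type property, which is what gives the compiler the easiest route to returning the `(item, state)` tuple from `iterate` without a heap allocation.

```julia
# Hypothetical token layouts for illustration; not the PR's actual Token definition.

# Carries a SubString, which holds a reference to its parent String: not isbits.
struct ViewToken
    kind::UInt8
    text::SubString{String}
end

# Carries only byte offsets into the source: every field is a plain bits type.
struct OffsetToken
    kind::UInt8
    start::Int32
    stop::Int32
end

# Materialize the text on demand from the source the offsets refer to.
token_text(src::String, t::OffsetToken) = SubString(src, Int(t.start), Int(t.stop))

src = "<a>hi</a>"
tok = OffsetToken(0x01, 4, 5)   # offsets of "hi" within `src`

isbitstype(ViewToken)                  # false
isbitstype(OffsetToken)                # true
isbitstype(Tuple{OffsetToken,Int})     # true: the shape of an iterate return value
```

The cost of the offset layout is that the token no longer carries its own text, so callers need the source string at hand to read it.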

joshday · 2026-06-09T11:55:44Z

Hey all, sincere apologies for my lack of communication here. I need to step back from finishing this PR and maintaining XML for the foreseeable future. I wish I had wrapped up this PR before getting into a busy season of life and I'm more than happy to share my thoughts and general vision for v0.4, but I should no longer be in the critical path of development and/or decision making.

I'm self-employed and (for better or worse) have been successful enough that I no longer have time to allocate for things that aren't (1) paying the bills or (2) spending time with my family. I'd love to get back to this someday! There's a sick twisted part of me that genuinely likes working on awful XML edge cases 😅.

@mathieu17g @TimG1964 My recommendation would be to transfer XML.jl to JuliaData. This will need to be initiated by someone at JuliaHub, but I think they'll be on board with it.

TimG1964 · 2026-06-09T15:12:55Z

Thanks for the update and very sorry to hear you are "moving on". I wish you success in your business.

Will you be facilitating the transfer to JuliaData? I really hope they are able to take XML.jl on and can find a worthy successor.

Before you go, would you have time to review and possibly merge any of the pending PRs into a final v0.3.9? I fear it may be a while before a v0.4 can be finalized following the transfer of ownership.

Thanks!

mathieu17g · 2026-06-09T19:52:53Z

+1 on @TimG1964's v0.3.9. The five open PRs that fit a 0.3.x patch are all CI-green on current main and mutually conflict-free. Suggested merge order, #64 last since it carries the version bump: #60 → #56 → #58 → #59 → #64, then register on the final merge commit.

Optionally, two small regression tests, verified against the PR branches and ready to fold in:

using Test
using XML

# #56: prev must cross a CDATA section (the prev call itself crashes on v0.3.8)
doc = parse("<r><a>x</a><![CDATA[hello]]><b>y</b></r>", LazyNode)
b = children(children(doc)[1])[3]              # the <b> element
p = nothing
@test (p = XML.prev(b)) isa LazyNode           # asserts the call does not throw
@test XML.nodetype(p) == XML.CData
@test XML.value(p) == "hello"

# #60: escape on a SubString (MethodError on v0.3.8)
@test XML.escape(SubString("a&b<c>", 1)) == "a&amp;b&lt;c&gt;"

joshday · 2026-06-10T13:36:54Z

I don't have merge permissions here.

joshday added 14 commits March 5, 2026 09:34

Rewrite XML parser with tokenizer and XPath

6dacef3

remove dead code

97384c3

more test files

1844b16

Add validation tests and remove legacy DTD/raw code

b6f4d47

Update CI actions and add validation tests

21f647d

update ci

c673427

Add XMark benchmark generator and expand benchmarks

46c5a31

Add LazyNode type and StringViews extension

33bcf35

Refactor simple_value checks and use direct attrs iteration

d011424

Refactor tokenizer into XMLTokenizer and add LazyNode

754f8fa

Add benchmarks, StringViews tests, simplify XML module

8483fed

Add GC.gc before tmpfile cleanup for Windows

eb5caeb

Bump version to v0.4.0

b914bfe

Use mktempdir for temp file cleanup in StringViews tests

d76c484

nhz2 reviewed Mar 8, 2026

Comment thread ext/XMLStringViewsExt.jl Outdated

joshday added 3 commits March 8, 2026 15:05

Remove StringViews extension and simplify tokenizer

41836ae

Replace printstyled with print in show methods

b670267

Revamp benchmarks and expand test suite

4a728ee

joshday and others added 6 commits April 2, 2026 16:49

Add Attributes type and performance optimizations

2f71f9a

Add sourcetext, write, eachchildnode for LazyNode

6c4e8f3

joshday mentioned this pull request Apr 30, 2026

perf: avoid per-call ctx allocation in next_no_xml_space #58

Open

Namespace token kinds and document API

60725db

joshday mentioned this pull request May 15, 2026

escape should work with AbstractString. #60

Open

Add LazyNode perf APIs and XLSX-pattern benchmarks

9d129b8

joshday added 8 commits May 15, 2026 13:03

Refresh XLSX-pattern benchmark snapshot

fb583c4

Add AbstractTrees package extension

cfc1f81

Use byte-level Base.write in XML serializer

895e994

Skip unescape scan when tokenizer saw no entities

ff84960

Use findnext for tokenizer text/attr scans

18d88b1

Refresh benchmark snapshot and README bars

b790e85

Wire Token.has_entities into LazyNode read path

a93b9a0

Add end-to-end XLSX hot-loop benchmarks

e532a28

mathieu17g mentioned this pull request May 20, 2026

A StAX-style streaming primitive for v0.4 — recovering FastKML's lazy walk class without the LazyNode-as-cursor hack #61

Open

mathieu17g mentioned this pull request May 20, 2026

feat: add next! and prev! for in-place LazyNode traversal #59

Open

mathieu17g mentioned this pull request Jun 3, 2026

Event-level StAX cursor + isbits Token — v0.4 streaming layer (#61) joshday/XML.jl#1

Open

joshday wants to merge 33 commits into JuliaComputing:main from joshday:main

4 participants
