WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54

Open
joshday wants to merge 33 commits into JuliaComputing:main from joshday:main

Conversation

joshday commented Mar 6, 2026

Contributor

Summary of Changes

I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!

  • Major rewrite of XML.jl's internals that addresses many open issues
  • Self-contained src/XMLTokenizer.jl module for speedy tokenization
  • Node{T} now parameterized by the string storage type, enabling quick reads via SubString or StringViews.jl
  • StringViews extension — XML.mmap("file.xml", LazyNode) for memory-mapped parsing of very large files
  • XPath support — xpath(node, path) with a practical subset of XPath 1.0
  • Greatly expanded test suite — 243 libxml2 test cases, pugixml and libexpat compatibility tests, W3C conformance tests

Downstream

@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in XLSX.jl tests related to Raw no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Benchmarks: See benchmarks/compare.jl

Here (SS) refers to using SubString{String} as storage type.

julia --project=. benchmarks/compare.jl
============================================================
  XML.jl Benchmark Comparison
  Current (dev) vs v0.3.8
============================================================

Running dev benchmarks... done
Setting up v0.3.8 worktree... done
Running v0.3.8 benchmarks... done

------------------------------------------------------------

  Parse (small)
          v0.3.8      0.114 ms
             dev     0.0335 ms  (70.6% faster)

  Parse (small, SS)
          v0.3.8           n/a
             dev     0.0285 ms

  Parse (medium)
          v0.3.8   634.7153 ms
             dev   161.0888 ms  (74.6% faster)

  Parse (medium, SS)
          v0.3.8           n/a
             dev   151.3025 ms

  Write (small)
          v0.3.8     0.0227 ms
             dev     0.0176 ms  (22.4% faster)

  Write (medium)
          v0.3.8   118.1504 ms
             dev     77.619 ms  (34.3% faster)

  Read file (medium)
          v0.3.8   645.5785 ms
             dev   170.8398 ms  (73.5% faster)

  Collect tags (small)
          v0.3.8     0.0005 ms
             dev     0.0006 ms  (10.3% slower)

  Collect tags (medium)
          v0.3.8    21.0988 ms
             dev    11.1532 ms  (47.1% faster)

============================================================

TimG1964 commented Mar 8, 2026

Contributor

Hey @joshday. I've only had a very superficial look so far but it looks great. Thanks!

In terms of impact on XLSX.jl, I think it looks significant. It isn't just Raw. Since @nhz2 first suggested using Raw, I've known it was internal and therefore subject to change. On first inspection, I think the rework involved should be manageable.

More of a challenge will be the removal of prev and next, which are currently exported functions. I rely on these for fundamental elements of XLSX.jl like the sheetrow and tablerow iterators, and for reading and writing the XML files from/to the zip archive .xlsx file.

These obviously aren't insuperable, but will likely need a bit of time while I get to grips with xpath and the tokenizer. Optimistic me thinks the new functionality will simplify the code of XLSX.jl, but I usually find things are considerably harder than I first imagine! I'll give more feedback when I've had a bit more of a go at getting XLSX.jl working.

Thanks,

Tim

Comment thread ext/XMLStringViewsExt.jl Outdated
TimG1964 commented Apr 2, 2026

Contributor

> I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade.

Thanks!

joshday and others added 6 commits April 2, 2026 16:49
Drops the underscore prefixes from internal names (module is unexported,
the clutter was only needed back when these names leaked into XML.jl).
Replaces the name-byte predicate with a 256-entry const lookup table.
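
For readers unfamiliar with the technique, a 256-entry lookup table for a byte predicate can be sketched roughly as below. This is not the code from the commit: the names `is_name_byte_slow`, `NAME_BYTE_TABLE`, and `is_name_byte`, and the exact rule for what counts as a name byte, are assumptions made for illustration.

```julia
# Hypothetical predicate (assumed rule): ASCII letters, digits, and a few
# punctuation bytes; every non-ASCII byte is treated as part of a name.
is_name_byte_slow(b::UInt8) =
    (UInt8('a') <= b <= UInt8('z')) || (UInt8('A') <= b <= UInt8('Z')) ||
    (UInt8('0') <= b <= UInt8('9')) ||
    b == UInt8('_') || b == UInt8(':') || b == UInt8('-') || b == UInt8('.') ||
    b >= 0x80

# Evaluate the predicate once for each of the 256 possible byte values.
const NAME_BYTE_TABLE = Bool[is_name_byte_slow(UInt8(b)) for b in 0:255]

# Table lookup: a single indexed load instead of a chain of comparisons.
# Bytes are 0-based, Julia arrays are 1-based, hence the +1.
is_name_byte(b::UInt8) = @inbounds NAME_BYTE_TABLE[Int(b) + 1]
```

The `+ 1` is always in bounds because a `UInt8` can only hold 0 through 255, which is what makes the `@inbounds` safe here.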

Also fixes a 1-based indexing off-by-one in read_doctype_body: the
'<!--' detection guarded with `pos >= 2` while reading
`codeunit(data, pos - 2)`, which is codeunit 0 when pos == 2.
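
The off-by-one can be illustrated with a minimal sketch. The function below is hypothetical and is not `read_doctype_body`; it only shows why a look-behind of two code units needs a `pos >= 3` guard under 1-based indexing.

```julia
# Hypothetical helper: are the two code units just before `pos` equal to "<!"?
function preceded_by_bang_open(data::String, pos::Int)
    # `pos - 2` must be a valid 1-based index, so the guard is `pos >= 3`.
    # With `pos >= 2`, pos == 2 would read codeunit(data, 0) and throw.
    pos >= 3 || return false
    pos - 1 <= ncodeunits(data) || return false
    return codeunit(data, pos - 2) == UInt8('<') &&
           codeunit(data, pos - 1) == UInt8('!')
end
```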

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.

Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
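
The mixed-content rule described above can be sketched with toy types. Everything here (`ToyNode`, `ToyText`, `ToyCData`, `ToyElement`, and the helper names) is hypothetical and stands in for XML.jl's real node types and `_write_xml` internals, which may differ.

```julia
# Toy node types for illustration only; these are not XML.jl's types.
abstract type ToyNode end
struct ToyText <: ToyNode
    value::String
end
struct ToyCData <: ToyNode
    value::String
end
struct ToyElement <: ToyNode
    tag::String
    children::Vector{ToyNode}
end

# A Text child is significant when it has any non-whitespace character.
is_significant_text(n::ToyNode) = n isa ToyText && !all(isspace, n.value)

# Mixed content: any significant Text child, or any CData child.
is_mixed_content(el::ToyElement) =
    any(c -> c isa ToyCData || is_significant_text(c), el.children)

# Children the writer keeps: all of them for mixed content (whitespace is
# preserved verbatim); otherwise the whitespace-only Text nodes are dropped
# so that fresh indentation can be generated.
function children_to_write(el::ToyElement)
    is_mixed_content(el) && return el.children
    return filter(c -> !(c isa ToyText), el.children)
end
```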

Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.
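
A minimal sketch of that fast path, assuming only the five predefined XML entities need handling. `unescape_sketch` is a hypothetical stand-in, not XML.jl's `unescape`, and it relies on the multi-pattern `replace` available since Julia 1.7.

```julia
# Hypothetical stand-in for `unescape`; handles only the predefined entities.
function unescape_sketch(s::SubString{String})
    # Fast path: without '&' there is nothing to unescape, so return the
    # view itself and allocate nothing.
    occursin('&', s) || return s
    # Slow path: a single left-to-right pass that allocates a new String.
    return replace(s,
        "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"",
        "&apos;" => "'", "&amp;" => "&")
end
```

The return type is therefore either the original `SubString{String}` or a fresh `String`, which is the situation the `_as_substring` helper mentioned in the earlier commit message appears to be designed for.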

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The medium-file workloads show a ~10–25% regression vs the numbers
captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains
a 70–80% improvement, so this is a post-release follow-up, not a
release blocker. Suspected culprit is the eager Pair{S,S}[] alloc
per TOKEN_OPEN_TAG introduced in 2f71f9a — see follow-up issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@mathieu17g

Hi @joshday, the v0.4 rewrite looks good; the new tokenizer architecture is noticeably faster in eager mode.

I'm currently evaluating the v0.4 upgrade on FastKML.jl following your comment on my PR #58. The eager-path improvements are substantial on real-world KML: ×2–2.6 speedup and 37–69% memory reduction across four reference files (5k to 163k Placemarks), versus v0.3.8 + #58 + #59 eager.

There is a trade-off I wanted to surface before #54 is merged: v0.4 removed the linear-traversal API on LazyNode, next / prev (and the next! / prev! zero-alloc variants from PR #59), and replaced it with eachchildnode / children, which allocate per child. That leaves no equivalent API for the zero-alloc lazy walk that PR #59 provided under v0.3.8.

On real-world KML files with non-trivial structure, the v0.4 lazy path regresses by ×1.4 to ×2.6 vs v0.3.8 + #58 + #59 lazy across the full 4-file reference set. Two concrete cases:

  • USGS WRS-2 tiles — 28k Polygon Placemarks (each a 5-vertex LinearRing) in a single flat top-level layer. Regresses ×2.6.
  • EPA Facility Registry — 163k Point Placemarks across 19k nested folders. Regresses ×2.3.

On those same files, the v0.3.8+PRs lazy path was actually faster than v0.4 eager too, so a strict migration loses the previously optimal path. Full per-file profile and methodology in the linked results doc below.

I've written up the decomposition (synthetic bench + FastKML real workloads + cost attribution + a SOTA-informed design space) as a separate design issue #61 rather than clutter this PR thread.

Full data: benchmark/results_eager_vs_lazy_3way_2026-05-11.md on the FastKML wip-xml-v0.4 branch.

Happy to refine the benchmark or prototype any direction if useful.

@mathieu17g

Hi, @joshday. I've posted some results on #61: a Cursor streaming primitive plus an isbits Token that makes the iterate tuple allocation-free. The cursor is additive, but making Token isbits touches this PR's core Token, so I'm flagging it here too: I'd value your read on whether that change fits v0.4.

joshday commented Jun 9, 2026

Contributor Author

Hey all, sincere apologies for my lack of communication here. I need to step back from finishing this PR and maintaining XML in the foreseeable future. I wish I had wrapped up this PR before getting into a busy season of life and I'm more than happy to share my thoughts and general vision for v0.4, but I should no longer be in the critical path of development and/or decision making.

I'm self-employed and (for better or worse) have been successful enough that I no longer have time to allocate for things that aren't (1) paying the bills or (2) spending time with my family. I'd love to get back to this someday! There's a sick twisted part of me that genuinely likes working on awful XML edge cases 😅.

@mathieu17g @TimG1964 My recommendation would be to transfer XML.jl to JuliaData. This will need to be initiated by someone at JuliaHub, but I think they'll be on board with it.

TimG1964 commented Jun 9, 2026

Contributor

Thanks for the update and very sorry to hear you are "moving on". I wish you success in your business.

Will you be facilitating the transfer to JuliaData? I really hope they are able to take XML.jl on and can find a worthy successor.

Before you go, would you have time to review and possibly merge any of the pending PRs into a final v0.3.9? I fear it may be a while before a v0.4 can be finalized following the transfer of ownership.

Thanks!

@mathieu17g

+1 on @TimG1964's v0.3.9. The five open PRs that fit a 0.3.x patch are all CI-green on current main and mutually conflict-free. Suggested merge order, #64 last since it carries the version bump: #60 → #56 → #58 → #59 → #64, then register on the final merge commit.

Optionally, two small regression tests, verified against the PR branches and ready to fold in:

# #56 — prev must cross a CDATA section (the prev call itself crashes on v0.3.8)
doc = parse("<r><a>x</a><![CDATA[hello]]><b>y</b></r>", LazyNode)
b = children(children(doc)[1])[3]              # the <b> element
p = nothing
@test (p = XML.prev(b)) isa LazyNode           # asserts the call does not throw
@test XML.nodetype(p) == XML.CData
@test XML.value(p) == "hello"

# #60 — escape on a SubString (MethodError on v0.3.8)
@test XML.escape(SubString("a&b<c>", 1)) == "a&amp;b&lt;c&gt;"

joshday commented Jun 10, 2026

Contributor Author

I don't have merge permissions here

4 participants