WIP XML.jl v0.4: Rewrite of internals, streaming tokenizer, XPath support, and bug fixes #54
joshday wants to merge 33 commits
Hey @joshday. I've only had a very superficial look so far but it looks great. Thanks! In terms of impact on XLSX.jl, I think it looks significant. It isn't just […]. More of a challenge will be the removal of […]. These obviously aren't insuperable, but will likely need a bit of time while I get to grips with […]. Thanks, Tim
Hi @joshday, I've been a bit distracted recently by transferring XLSX.jl to JuliaData and subsequently making a v0.11 release, but my attention will be back on this again after the Easter break. I have to say I'd welcome any PR you could make on XLSX.jl to help facilitate this upgrade. Thanks!
Drops the underscore prefixes from internal names (the module is unexported; the clutter was only needed back when these names leaked into XML.jl). Replaces the name-byte predicate with a 256-entry const lookup table.
Also fixes a 1-based indexing off-by-one in read_doctype_body: the '<!--' detection was guarded with `pos >= 2` while reading `codeunit(data, pos - 2)`, which is the out-of-bounds index 0 when pos == 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
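The two changes in this commit message can be illustrated with a minimal sketch. The names `NAME_BYTE`, `slow_is_name_byte`, `is_name_byte`, and `two_back_is` are hypothetical and are not taken from XML.jl, and the byte classification is only an ASCII-level approximation of XML name characters (every byte >= 0x80 is accepted as part of a multi-byte UTF-8 sequence).

```julia
# Hypothetical names; this is not XML.jl's actual internal code.

# Branchy predicate, used once to fill the table.
function slow_is_name_byte(b::UInt8)
    b >= 0x80 && return true  # assumption: accept all non-ASCII bytes
    return UInt8('a') <= b <= UInt8('z') ||
           UInt8('A') <= b <= UInt8('Z') ||
           UInt8('0') <= b <= UInt8('9') ||
           b in (UInt8('_'), UInt8(':'), UInt8('-'), UInt8('.'))
end

# 256-entry lookup table, one Bool per possible byte value.
const NAME_BYTE = Bool[slow_is_name_byte(UInt8(i)) for i in 0:255]

# Julia arrays are 1-based, so byte value b lives at index b + 1.
is_name_byte(b::UInt8) = @inbounds NAME_BYTE[Int(b) + 1]

# The guard must cover the smallest index actually read:
# codeunit(data, pos - 2) needs pos - 2 >= 1, i.e. pos >= 3 (not pos >= 2).
two_back_is(data::String, pos::Int, b::UInt8) =
    pos >= 3 && codeunit(data, pos - 2) == b
```

With the corrected guard, `two_back_is("<!--", 2, UInt8('<'))` returns `false` instead of attempting to read code unit 0.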
tag, value, keys, and attributes on LazyNode now return
SubString{String} views into the source rather than allocating
fresh Strings, so traversing a large document lazily does not
duplicate its text data.
Introduces a small _as_substring helper to promote the String that
`unescape` can return into a SubString so Attributes stays homogeneous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
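A helper in the spirit of `_as_substring` might look like the sketch below. The name `as_substring` and the sample data are hypothetical; the point is only that wrapping an owned `String` in a `SubString` lets keys and values share one element type.

```julia
# Hypothetical stand-in for the `_as_substring` helper named above.
as_substring(s::SubString{String}) = s                        # already a view
as_substring(s::String) = SubString(s, 1, lastindex(s))       # view over the whole String

src = "<a k=\"x\"/>"
key = SubString(src, 4, 4)      # "k", a view into the source text
val = as_substring("a&b")       # e.g. an unescaped value that had to be allocated

# Both sides are SubString{String}, so the vector stays homogeneous.
attrs = [key => val]
```

Here `eltype(attrs)` is `Pair{SubString{String}, SubString{String}}` rather than a pair with an abstract or `Union` value type.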
_write_xml now inspects children before reformatting: if any Text
child has non-whitespace content (or any CData child exists), the
element is treated as mixed content and its whitespace is preserved
verbatim. Otherwise the writer drops the whitespace-only Text nodes
the parser emits for round-tripping source formatting and generates
fresh indentation. Same filter is applied at the Document level.
Also adds an unescape(::SubString{String}) specialization that
returns the input unchanged when it contains no '&', avoiding an
allocation on the lazy scanning path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
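The allocation-free fast path described for `unescape` can be sketched as follows. `unescape_sketch` is a hypothetical name, the entity list covers only the five predefined XML entities, and the multi-pattern `replace` call assumes Julia 1.7 or later; the actual implementation in the PR may differ.

```julia
# Illustrative only; not the method added in the PR.
function unescape_sketch(s::SubString{String})
    # Fast path: no '&' means no entity references, so return the view unchanged.
    '&' in s || return s
    # Slow path: a single left-to-right pass over the five predefined entities.
    out = replace(s, "&lt;" => "<", "&gt;" => ">", "&quot;" => "\"",
                     "&apos;" => "'", "&amp;" => "&")
    # Wrap the new String so the return type is always SubString{String}.
    return SubString(out, 1, lastindex(out))
end

plain   = SubString("hello world", 1, 5)
escaped = SubString("a&amp;b&lt;c")
```

Because the replacement is a single pass, input such as `&amp;lt;` becomes `&lt;` rather than being unescaped twice.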
The medium-file workloads show a ~10–25% regression vs the numbers captured at 4a728ee ("Revamp benchmarks"). v0.4-vs-v0.3.8 remains a 70–80% improvement, so this is a post-release follow-up, not a release blocker.
Suspected culprit is the eager Pair{S,S}[] alloc per TOKEN_OPEN_TAG introduced in 2f71f9a; see follow-up issue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi @joshday, the v0.4 rewrite looks good; the new tokenizer architecture reads faster in eager mode. I'm currently evaluating the v0.4 upgrade on FastKML.jl following your comment on my PR #58. The eager-path improvements are substantial on real-world KML: a 2–2.6× speedup and a 37–69% memory reduction across four reference files (5k to 163k Placemarks), versus […]. There is a trade-off I wanted to surface before #54 is merged: v0.4 removed the linear-traversal API on […]. On real-world KML files with non-trivial structure, the […]
On those same files, the […]. I've written up the decomposition (synthetic bench + FastKML real workloads + cost attribution + a SOTA-informed design space) as a separate design issue, #61, rather than clutter this PR thread. Full data: […]. Happy to refine the benchmark or prototype any direction if useful.
Hi, @joshday. I've posted some results on #61: a […]
Hey all, sincere apologies for my lack of communication here. I need to step back from finishing this PR and from maintaining XML for the foreseeable future. I wish I had wrapped up this PR before getting into a busy season of life. I'm more than happy to share my thoughts and general vision for v0.4, but I should no longer be in the critical path of development and/or decision making.
I'm self-employed and (for better or worse) have been successful enough that I no longer have time to allocate to things that aren't (1) paying the bills or (2) spending time with my family. I'd love to get back to this someday! There's a sick, twisted part of me that genuinely likes working on awful XML edge cases 😅.
@mathieu17g @TimG1964 My recommendation would be to transfer XML.jl to JuliaData. This will need to be initiated by someone at JuliaHub, but I think they'll be on board with it.
Thanks for the update, and very sorry to hear you are "moving on". I wish you success in your business. Will you be facilitating the transfer to JuliaData? I really hope they are able to take XML.jl on and can find a worthy successor. Before you go, would you have time to review and possibly merge any of the pending PRs into a final v0.3.9? I fear it may be a while before a v0.4 can be finalized following the transfer of ownership. Thanks!
+1 on @TimG1964's v0.3.9. The five open PRs that fit a 0.3.x patch are all CI-green on current […]. Optionally, here are two small regression tests, verified against the PR branches and ready to fold in:

```julia
using Test, XML

# #56: prev must cross a CDATA section (the prev call itself crashes on v0.3.8)
doc = parse("<r><a>x</a><![CDATA[hello]]><b>y</b></r>", LazyNode)
b = children(children(doc)[1])[3]  # the <b> element
p = nothing
@test (p = XML.prev(b)) isa LazyNode  # asserts the call does not throw
@test XML.nodetype(p) == XML.CData
@test XML.value(p) == "hello"

# #60: escape on a SubString (MethodError on v0.3.8)
@test XML.escape(SubString("a&b<c>", 1)) == "a&amp;b&lt;c&gt;"
```
I don't have merge permissions here |
Summary of Changes
I revived an old rewrite I had halfway finished with the help of Claude Code. It produced some good results!
- `src/XMLTokenizer.jl` module for speedy tokenization
- `Node{T}` now parameterized by the string storage type, enabling quick reads via `SubString` or StringViews.jl
- `XML.mmap("file.xml", LazyNode)` for memory-mapped parsing of very large files
- `xpath(node, path)` with a practical subset of XPath 1.0

Downstream
@TimG1964 you are likely the most impacted by these changes. The Downstream.yml action does indicate a failure in XLSX.jl tests related to `Raw` no longer existing. I'd appreciate your review here! I'm happy to submit a PR for a fix in XLSX.jl so that it's ready to go before this gets merged.

Addressed Issues

Benchmarks: See `benchmarks/compare.jl`.
Here `(SS)` refers to using `SubString{String}` as the storage type.