Symptom
XML.Node(XML.Raw(bytes)) crashes with BoundsError: attempt to access N-element Vector{UInt8} at index [N+1] when given a UTF-16 (LE or BE) encoded XML document — even though the document is well-formed and declares its encoding correctly in the XML prologue.
Minimal reproduction
using XML # XML.jl v0.3.8
# A perfectly valid UTF-16 LE XML document with BOM
xml_text = """<?xml version="1.0" encoding="utf-16"?><root>hello</root>"""
utf16_units = transcode(UInt16, xml_text) # Base stdlib
utf16_bytes = vcat(UInt8[0xFF, 0xFE], reinterpret(UInt8, utf16_units)) # BOM + UTF-16 LE
XML.Node(XML.Raw(utf16_bytes))
# ERROR: BoundsError: attempt to access 116-element Vector{UInt8} at index [117]
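The XML 1.0 spec (§4.3.3) requires every XML processor to read both UTF-8 and UTF-16, using the BOM (FF FE for UTF-16 LE, FE FF for UTF-16 BE, EF BB BF for UTF-8) and the encoding="..." attribute in the XML declaration to determine the encoding. XML.jl honors neither.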
Real-world impact
Microsoft Excel's PowerQuery feature saves its DataMashup definitions to customXml/*.xml in UTF-16 LE with BOM by default; any .xlsx with a saved PowerQuery embeds such files. Downloadable canonical example (~72 KB, reproducible in a few lines via XLSX.jl):
using XLSX, Downloads
url = "https://github.com/Apress/data-mash-w-microsoft-excel-using-power-query-and-m/raw/master/Chapter02Sample1.xlsx"
path = Downloads.download(url)
XLSX.openxlsx(path) do _ end
# ERROR: TaskFailedException — nested: BoundsError in XML.jl (the bug filed here)
Source: Apress/data-mash-w-microsoft-excel-using-power-query-and-m (Adam Aspin, Apress).
The bug isn't restricted to customXml: any internal XML file in a .xlsx zip in UTF-16 (including critical ones like xl/workbook.xml, xl/sharedStrings.xml) triggers the same crash. So this issue is the root cause for an entire family of OOXML compatibility problems, not just PowerQuery files.
Suggested fix
A small preprocessing step in XML.Raw(bytes::Vector{UInt8}) (or wherever raw bytes enter XML.jl), using only Julia's stdlib (see Strings — Unicode and UTF-8 for the underlying primitives).
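For reference, the round trip that those primitives provide can be sketched as follows. This is only an illustration of `transcode` and `reinterpret`, assuming a little-endian host; the sample string and variable names are made up for the example, and nothing here is XML.jl API.

```julia
# Illustration only: encode a string as UTF-16 LE bytes and decode it back.
# Assumes a little-endian host, so UInt16 code units are already in LE byte order.
text = "<root>héllo 😀</root>"        # includes a non-BMP character (surrogate pair)

units = transcode(UInt16, text)                     # UTF-8 String -> UTF-16 code units
utf16_bytes = collect(reinterpret(UInt8, units))    # code units -> raw bytes

# Decoding: raw bytes -> code units -> UTF-8 String
decoded = transcode(String, reinterpret(UInt16, utf16_bytes))

# The byte form a UTF-8-only parser expects
utf8_bytes = Vector{UInt8}(decoded)
```

Note that `reinterpret(UInt16, ...)` throws if the byte count is odd, so a truncated UTF-16 buffer would still need its own error handling.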
Minimum viable — BOM detection only (~10 lines):
function _normalize_xml_encoding(bytes::Vector{UInt8})::Vector{UInt8}
    if length(bytes) >= 2 && bytes[1:2] == UInt8[0xFF, 0xFE]
        # UTF-16 LE BOM: drop the BOM, then transcode the code units to UTF-8
        return Vector{UInt8}(transcode(String, reinterpret(UInt16, @view bytes[3:end])))
    elseif length(bytes) >= 2 && bytes[1:2] == UInt8[0xFE, 0xFF]
        # UTF-16 BE BOM: byte-swap first (assumes a little-endian host)
        return Vector{UInt8}(transcode(String, bswap.(reinterpret(UInt16, @view bytes[3:end]))))
    elseif length(bytes) >= 3 && bytes[1:3] == UInt8[0xEF, 0xBB, 0xBF]
        # UTF-8 BOM: strip it (a plain slice already matches the declared return type)
        return bytes[4:end]
    end
    # No BOM: assume UTF-8 and pass the buffer through unchanged
    return bytes
end
More complete: adds BOM-less UTF-16 detection from the byte signature of the XML declaration (the XML 1.0 Appendix F auto-detection case, where the document declares encoding="utf-16" in its prologue but has no BOM). A Base.isvalid(String, bytes) check cannot serve as the trigger here, because ASCII-only UTF-16 text is itself valid UTF-8 (0x00 is a valid code unit), so the signature is matched directly:
function _normalize_xml_encoding(bytes::Vector{UInt8})::Vector{UInt8}
    # BOM-based detection (fast path)
    if length(bytes) >= 2 && bytes[1:2] == UInt8[0xFF, 0xFE]
        return Vector{UInt8}(transcode(String, reinterpret(UInt16, @view bytes[3:end])))
    elseif length(bytes) >= 2 && bytes[1:2] == UInt8[0xFE, 0xFF]
        return Vector{UInt8}(transcode(String, bswap.(reinterpret(UInt16, @view bytes[3:end]))))
    elseif length(bytes) >= 3 && bytes[1:3] == UInt8[0xEF, 0xBB, 0xBF]
        return bytes[4:end]
    end
    # No BOM: match the first four bytes against the UTF-16 signatures of
    # `<?` listed in XML 1.0 Appendix F
    if length(bytes) >= 4
        if bytes[1:4] == UInt8[0x00, 0x3C, 0x00, 0x3F]      # 00 3C 00 3F → UTF-16 BE
            return Vector{UInt8}(transcode(String, bswap.(reinterpret(UInt16, bytes))))
        elseif bytes[1:4] == UInt8[0x3C, 0x00, 0x3F, 0x00]  # 3C 00 3F 00 → UTF-16 LE
            return Vector{UInt8}(transcode(String, reinterpret(UInt16, bytes)))
        end
    end
    return bytes
end
Everything used here (transcode, bswap, reinterpret, @view) ships with Julia's Base, so no new dependency is needed.
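The byte-swap in the UTF-16 BE branch assumes a little-endian host. A host-independent variant could use Base's `ltoh` and `ntoh` instead of an unconditional `bswap`. The helper names below are hypothetical and only sketch the idea; they expect the BOM to have been stripped already.

```julia
# Hypothetical helpers, not part of XML.jl.
# Decode BOM-less UTF-16 bytes into a UTF-8 String, independent of host endianness.

# Little-endian input: ltoh converts each code unit from LE to host order.
utf16le_to_string(bytes::AbstractVector{UInt8}) =
    transcode(String, ltoh.(reinterpret(UInt16, bytes)))

# Big-endian input: ntoh converts each code unit from BE (network order) to host order.
utf16be_to_string(bytes::AbstractVector{UInt8}) =
    transcode(String, ntoh.(reinterpret(UInt16, bytes)))
```

On a little-endian machine `ltoh` is a no-op and `ntoh` is equivalent to `bswap`, so the behavior matches the functions above; on a big-endian machine the roles flip without any code change.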
If full UTF-16 support is out of scope, a minimum acceptable behavior would be to throw a clear XMLEncodingError("Unsupported encoding 'utf-16'") instead of BoundsError, so callers can diagnose the problem from the error message alone.
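A minimal sketch of that fallback, assuming a new exception type is acceptable. XMLEncodingError does not exist in XML.jl today, and `check_supported_encoding` is a hypothetical name for wherever the guard would live:

```julia
# Hypothetical exception type; XML.jl does not define XMLEncodingError today.
struct XMLEncodingError <: Exception
    msg::String
end

Base.showerror(io::IO, e::XMLEncodingError) = print(io, "XMLEncodingError: ", e.msg)

# Hypothetical guard: reject input that starts with a UTF-16 BOM,
# otherwise return the buffer untouched.
function check_supported_encoding(bytes::AbstractVector{UInt8})
    if length(bytes) >= 2 &&
       ((bytes[1] == 0xFF && bytes[2] == 0xFE) || (bytes[1] == 0xFE && bytes[2] == 0xFF))
        throw(XMLEncodingError("Unsupported encoding 'utf-16'"))
    end
    return bytes
end
```

Calling the guard before tokenizing would turn the BoundsError into a message that names the actual problem.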
Environment
- Julia 1.12.6 (Windows 11)
- XML.jl v0.3.8