UTF-16 XML input crashes parser with BoundsError (no encoding detection) #62

@mathieu17g

Symptom

XML.Node(XML.Raw(bytes)) crashes with BoundsError: attempt to access N-element Vector{UInt8} at index [N+1] when given a UTF-16 (LE or BE) encoded XML document — even though the document is well-formed and declares its encoding correctly in the XML prologue.

Minimal reproduction

using XML  # XML.jl v0.3.8

# A perfectly valid UTF-16 LE XML document with BOM
xml_text = """<?xml version="1.0" encoding="utf-16"?><root>hello</root>"""
utf16_units = transcode(UInt16, xml_text)      # Base stdlib
utf16_bytes = vcat(UInt8[0xFF, 0xFE], reinterpret(UInt8, utf16_units))  # BOM + UTF-16 LE

XML.Node(XML.Raw(utf16_bytes))
# ERROR: BoundsError: attempt to access 116-element Vector{UInt8} at index [117]

The XML 1.0 spec (§4.3.3) mandates honoring the byte order mark (FF FE for UTF-16 LE, FE FF for UTF-16 BE, EF BB BF for UTF-8) and the encoding="..." attribute in the XML declaration. XML.jl honors neither.
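
As a rough illustration of the second half of that requirement, the sketch below reads the declared encoding name from the XML declaration of an already-decoded string. The function name declared_encoding and the regular expression are assumptions made for this issue, not part of XML.jl, and the pattern is deliberately looser than the spec's grammar.

```julia
# Hypothetical helper (not part of XML.jl): return the encoding name declared
# in the XML declaration, or `nothing` when no encoding is declared.
function declared_encoding(text::AbstractString)
    # Matches e.g. <?xml version="1.0" encoding="utf-16"?> at the very start.
    # [^>]*? cannot cross the closing "?>", so attributes of later elements
    # are never mistaken for the declaration's encoding.
    pattern = r"^<\?xml\s[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
    m = match(pattern, text)
    return m === nothing ? nothing : String(m.captures[1])
end

declared_encoding("""<?xml version="1.0" encoding="utf-16"?><root>hello</root>""")
declared_encoding("<root>hello</root>")
```

A parser that honors the declaration would compare this name against the encoding it inferred from the leading bytes.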

Real-world impact

Microsoft Excel's PowerQuery feature saves its DataMashup definitions to customXml/*.xml in UTF-16 LE with BOM by default; any .xlsx with a saved PowerQuery embeds such files. Downloadable canonical example (~72 KB, 3 lines to reproduce via XLSX.jl):

using XLSX, Downloads
url  = "https://github.com/Apress/data-mash-w-microsoft-excel-using-power-query-and-m/raw/master/Chapter02Sample1.xlsx"
path = Downloads.download(url)
XLSX.openxlsx(path) do _ end
# ERROR: TaskFailedException — nested: BoundsError in XML.jl (the bug filed here)

Source: Apress/data-mash-w-microsoft-excel-using-power-query-and-m (Adam Aspin, Apress).

The bug isn't restricted to customXml: any internal XML file in a .xlsx zip in UTF-16 (including critical ones like xl/workbook.xml, xl/sharedStrings.xml) triggers the same crash. So this issue is the root cause for an entire family of OOXML compatibility problems, not just PowerQuery files.

Suggested fix

Add a small preprocessing step in XML.Raw(bytes::Vector{UInt8}) (or wherever raw bytes enter XML.jl) that uses only functions shipped with Julia; the underlying primitives are described in the "Unicode and UTF-8" section of the Julia manual's Strings chapter.

Minimum viable version, BOM detection only (about 15 lines):

function _normalize_xml_encoding(bytes::Vector{UInt8})::Vector{UInt8}
    n = length(bytes)
    if n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        # UTF-16 LE BOM: ltoh converts little-endian code units to host order
        units = ltoh.(reinterpret(UInt16, @view bytes[3:end]))
        return Vector{UInt8}(transcode(String, units))
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        # UTF-16 BE BOM: ntoh converts big-endian code units to host order
        units = ntoh.(reinterpret(UInt16, @view bytes[3:end]))
        return Vector{UInt8}(transcode(String, units))
    elseif n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]   # strip UTF-8 BOM (copy, so the result is a Vector{UInt8})
    end
    return bytes   # no BOM: pass through unchanged
end
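
The transcoding step can be checked in isolation. The sketch below builds UTF-16 LE and BE byte streams from a known string, decodes each one the way the BOM branches do, and compares the result with the original text. The helper names encode_utf16 and decode_utf16 are made up for this check; htol, hton, ltoh and ntoh are Base's byte-order conversion functions.

```julia
# Hypothetical test helpers, Base only.

# Encode `text` as UTF-16 bytes with a BOM, in the requested byte order.
function encode_utf16(text::AbstractString; bigendian::Bool=false)
    units = transcode(UInt16, String(text))
    units = bigendian ? hton.(units) : htol.(units)   # host order -> wire order
    bom = bigendian ? UInt8[0xFE, 0xFF] : UInt8[0xFF, 0xFE]
    return vcat(bom, reinterpret(UInt8, units))
end

# Decode BOM-prefixed UTF-16 bytes back to a String.
function decode_utf16(bytes::AbstractVector{UInt8})
    bigendian = bytes[1] == 0xFE && bytes[2] == 0xFF
    raw = reinterpret(UInt16, @view bytes[3:end])
    units = bigendian ? ntoh.(raw) : ltoh.(raw)       # wire order -> host order
    return transcode(String, units)
end

# Includes a non-ASCII letter and a character outside the BMP (surrogate pair).
xml_text = """<?xml version="1.0" encoding="utf-16"?><root>héllo 𝄞</root>"""
le = encode_utf16(xml_text)
be = encode_utf16(xml_text; bigendian=true)
@assert decode_utf16(le) == xml_text
@assert decode_utf16(be) == xml_text
```

Using ltoh and ntoh rather than an unconditional bswap keeps the decoding correct regardless of the host's byte order.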

More complete: adds BOM-less UTF-16 detection by matching the byte signature of the XML declaration (the XML 1.0 Appendix F auto-detection case, where the document declares encoding="utf-16" in its prologue but has no BOM). A UTF-8 validity check such as Base.isvalid(String, bytes) cannot gate this step, because the NUL-interleaved bytes of ASCII-only UTF-16 text are themselves valid UTF-8:

function _normalize_xml_encoding(bytes::Vector{UInt8})::Vector{UInt8}
    n = length(bytes)
    # BOM-based detection (fast path)
    if n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return Vector{UInt8}(transcode(String, ltoh.(reinterpret(UInt16, @view bytes[3:end]))))
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return Vector{UInt8}(transcode(String, ntoh.(reinterpret(UInt16, @view bytes[3:end]))))
    elseif n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]
    end
    # No BOM: match the first 4 bytes against the UTF-16 signatures of `<?`
    # (XML 1.0 Appendix F). The length must be even for reinterpret to succeed.
    if n >= 4 && iseven(n)
        if bytes[1:4] == UInt8[0x00, 0x3C, 0x00, 0x3F]        # 00 3C 00 3F → UTF-16 BE
            return Vector{UInt8}(transcode(String, ntoh.(reinterpret(UInt16, bytes))))
        elseif bytes[1:4] == UInt8[0x3C, 0x00, 0x3F, 0x00]    # 3C 00 3F 00 → UTF-16 LE
            return Vector{UInt8}(transcode(String, ltoh.(reinterpret(UInt16, bytes))))
        end
    end
    return bytes
end

Everything used in these sketches is exported from Julia's Base, so the fix adds no new dependency.

If full UTF-16 support is out of scope, a minimum acceptable behavior would be to throw a clear XMLEncodingError("Unsupported encoding 'utf-16'") instead of BoundsError, so callers can diagnose the problem from the error message alone.
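
A minimal sketch of that fallback, assuming a new exception type: XMLEncodingError does not exist in XML.jl today, and the helper name _check_supported_encoding and the exact message are placeholders.

```julia
# Hypothetical exception type proposed in this issue (not in XML.jl v0.3.8).
struct XMLEncodingError <: Exception
    msg::String
end
Base.showerror(io::IO, e::XMLEncodingError) = print(io, "XMLEncodingError: ", e.msg)

# Hypothetical guard: reject input that looks like UTF-16, otherwise return it unchanged.
function _check_supported_encoding(bytes::AbstractVector{UInt8})
    n = length(bytes)
    has_utf16_bom = n >= 2 && ((bytes[1] == 0xFF && bytes[2] == 0xFE) ||
                               (bytes[1] == 0xFE && bytes[2] == 0xFF))
    # BOM-less UTF-16 starts with `<?` as 3C 00 3F 00 (LE) or 00 3C 00 3F (BE).
    looks_utf16 = n >= 4 && (bytes[1:4] == UInt8[0x3C, 0x00, 0x3F, 0x00] ||
                             bytes[1:4] == UInt8[0x00, 0x3C, 0x00, 0x3F])
    if has_utf16_bom || looks_utf16
        throw(XMLEncodingError("Unsupported encoding 'utf-16'"))
    end
    return bytes
end
```

Callers such as XLSX.jl could then catch XMLEncodingError and report which archive member is affected, instead of surfacing a BoundsError from deep inside the tokenizer.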

Environment

  • Julia 1.12.6 (Windows 11)
  • XML.jl v0.3.8
