UTF-16 XML input crashes parser with BoundsError (no encoding detection)

## Symptom

`XML.Node(XML.Raw(bytes))` crashes with `BoundsError: attempt to access N-element Vector{UInt8} at index [N+1]` when given a UTF-16 (LE or BE) encoded XML document — even though the document is well-formed and declares its encoding correctly in the XML prologue.

## Minimal reproduction

```julia
using XML  # XML.jl v0.3.8

# A perfectly valid UTF-16 LE XML document with BOM
xml_text = """<?xml version="1.0" encoding="utf-16"?><root>hello</root>"""
utf16_units = transcode(UInt16, xml_text)      # Base stdlib
utf16_bytes = vcat(UInt8[0xFF, 0xFE], reinterpret(UInt8, utf16_units))  # BOM + UTF-16 LE

XML.Node(XML.Raw(utf16_bytes))
# ERROR: BoundsError: attempt to access 116-element Vector{UInt8} at index [117]
```

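The XML 1.0 spec ([§4.3.3](https://www.w3.org/TR/xml/#charencoding)) requires every XML processor to read both UTF-8 and UTF-16, and to use the BOM (`FF FE` for UTF-16 LE, `FE FF` for UTF-16 BE, `EF BB BF` for UTF-8) together with the `encoding="..."` attribute in the XML declaration to tell encodings apart. XML.jl honors neither.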

## Real-world impact

Microsoft Excel's **PowerQuery** feature saves its `DataMashup` definitions to `customXml/*.xml` in UTF-16 LE with BOM by default; any `.xlsx` with a saved PowerQuery embeds such files. **Downloadable canonical example** (~72 KB, 3 lines to reproduce via XLSX.jl):

```julia
using XLSX, Downloads
url  = "https://github.com/Apress/data-mash-w-microsoft-excel-using-power-query-and-m/raw/master/Chapter02Sample1.xlsx"
path = Downloads.download(url)
XLSX.openxlsx(path) do _ end
# ERROR: TaskFailedException — nested: BoundsError in XML.jl (the bug filed here)
```

Source: [`Apress/data-mash-w-microsoft-excel-using-power-query-and-m`](https://github.com/Apress/data-mash-w-microsoft-excel-using-power-query-and-m) (Adam Aspin, Apress).

The bug isn't restricted to customXml: any UTF-16-encoded XML file inside a `.xlsx` zip (including critical ones like `xl/workbook.xml` and `xl/sharedStrings.xml`) triggers the same crash. This issue is therefore the root cause of an entire family of OOXML compatibility problems, not just PowerQuery files.

## Suggested fix

Add a small preprocessing step in `XML.Raw(bytes::Vector{UInt8})` (or wherever raw bytes enter XML.jl) that uses only functions from Julia's `Base` (see [Strings — Unicode and UTF-8](https://docs.julialang.org/en/v1/manual/strings/#Unicode-and-UTF-8) for the underlying primitives).

**Minimum viable — BOM detection only** (~10 lines):

```julia
function _normalize_xml_encoding(bytes::Vector{UInt8})::Vector{UInt8}
    n = length(bytes)
    if n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        # UTF-16 LE BOM: skip it, then convert little-endian code units to host order
        return transcode(UInt8, ltoh.(reinterpret(UInt16, @view bytes[3:end])))
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        # UTF-16 BE BOM: skip it, then convert big-endian code units to host order
        return transcode(UInt8, ntoh.(reinterpret(UInt16, @view bytes[3:end])))
    elseif n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]   # strip UTF-8 BOM (copies, so the declared return type holds)
    end
    # Note: `reinterpret` throws an ArgumentError if the UTF-16 body has an odd byte count.
    return bytes
end
```
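As a quick sanity check of the decoding step, the sketch below rebuilds the UTF-16 LE bytes from the minimal reproduction and confirms that skipping the BOM and transcoding recovers the original text. The variable names are illustrative only; `htol` and `ltoh` are used so the check does not depend on the host byte order.

```julia
# Build UTF-16 LE bytes with a BOM, as in the minimal reproduction.
xml_text = """<?xml version="1.0" encoding="utf-16"?><root>hello</root>"""
units = htol.(transcode(UInt16, xml_text))            # little-endian code units
utf16_bytes = vcat(UInt8[0xFF, 0xFE], reinterpret(UInt8, units))

# Decode: skip the 2-byte BOM, view the rest as UInt16,
# convert to host order, and transcode to UTF-8 bytes.
body = @view utf16_bytes[3:end]
utf8_bytes = transcode(UInt8, ltoh.(reinterpret(UInt16, body)))

# The round trip should give back the original document text.
@assert String(copy(utf8_bytes)) == xml_text
```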

**More complete: adds BOM-less UTF-16 detection** (covers the XML 1.0 Appendix F auto-detection case, where the document declares `encoding="utf-16"` in its prologue but has no BOM). A UTF-8 validity check such as `Base.isvalid` cannot gate this path, because ASCII-range text encoded as UTF-16 consists only of ASCII and `0x00` bytes and is therefore also valid UTF-8. The version below matches the byte signature of `<?` instead:

```julia
function _normalize_xml_encoding(bytes::Vector{UInt8})::Vector{UInt8}
    n = length(bytes)
    # BOM-based detection (fast path)
    if n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return transcode(UInt8, ltoh.(reinterpret(UInt16, @view bytes[3:end])))
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return transcode(UInt8, ntoh.(reinterpret(UInt16, @view bytes[3:end])))
    elseif n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]
    end
    # No BOM: look for the byte signature of `<?` at the start of the buffer
    # (XML 1.0 Appendix F). Only documents that begin with an XML declaration match.
    if n >= 4
        if bytes[1:4] == UInt8[0x00, 0x3C, 0x00, 0x3F]        # 00 3C 00 3F → UTF-16 BE
            return transcode(UInt8, ntoh.(reinterpret(UInt16, bytes)))
        elseif bytes[1:4] == UInt8[0x3C, 0x00, 0x3F, 0x00]    # 3C 00 3F 00 → UTF-16 LE
            return transcode(UInt8, ltoh.(reinterpret(UInt16, bytes)))
        end
    end
    return bytes
end
```

Both versions rely only on functions from Julia's `Base` (such as `transcode` and `reinterpret`), so they add no new dependency.
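One caveat for BOM-less detection: a UTF-8 validity check cannot tell ASCII-range UTF-16 apart from UTF-8, since every byte is either ASCII or `0x00`. The sketch below illustrates this with a small BOM-less UTF-16 LE buffer and shows the Appendix F signature for `<?` as the more reliable signal. The names here are illustrative, not part of XML.jl.

```julia
# ASCII-only XML encoded as UTF-16 LE, without a BOM.
text = """<?xml version="1.0" encoding="utf-16"?><r/>"""
raw = Vector{UInt8}(reinterpret(UInt8, htol.(transcode(UInt16, text))))

# Every byte is ASCII or 0x00, so the buffer also passes a UTF-8 validity check.
looks_like_utf8 = isvalid(String, raw)

# The Appendix F signature for "<?" in UTF-16 LE is 3C 00 3F 00.
has_le_signature = length(raw) >= 4 && raw[1:4] == UInt8[0x3C, 0x00, 0x3F, 0x00]

println("valid UTF-8: ", looks_like_utf8, ", UTF-16 LE signature: ", has_le_signature)
```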

If full UTF-16 support is out of scope, a minimum acceptable behavior would be to throw a clear `XMLEncodingError("Unsupported encoding 'utf-16'")` instead of `BoundsError`, so callers can diagnose the problem from the error message alone.
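A minimal sketch of that fallback, assuming a new exception type is acceptable. `XMLEncodingError` and `_reject_utf16` are hypothetical names used for illustration; neither exists in XML.jl today, and the guard only recognizes the two UTF-16 BOMs.

```julia
# Hypothetical exception type; the name comes from the suggestion above, not from XML.jl.
struct XMLEncodingError <: Exception
    msg::String
end
Base.showerror(io::IO, e::XMLEncodingError) = print(io, "XMLEncodingError: ", e.msg)

# Hypothetical guard: throw a descriptive error if the input starts with a UTF-16 BOM,
# otherwise return the bytes unchanged.
function _reject_utf16(bytes::AbstractVector{UInt8})
    if length(bytes) >= 2
        bom = (bytes[1], bytes[2])
        if bom == (0xFF, 0xFE) || bom == (0xFE, 0xFF)
            throw(XMLEncodingError("Unsupported encoding 'utf-16'"))
        end
    end
    return bytes
end
```

Calling such a guard at the point where raw bytes enter the parser would turn the `BoundsError` into a message that names the actual problem.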

## Environment

- Julia 1.12.6 (Windows 11)
- XML.jl v0.3.8
