ansi: OSC/DCS/SOS/PM/APC strings truncated by UTF-8 continuation byte 0x9C #848

Description

@wblech

Summary

The x/ansi parser truncates OSC, DCS, SOS, PM, and APC strings whose
payload contains certain UTF-8 multi-byte codepoints, because byte
0x9C is both a valid UTF-8 continuation byte and the 8-bit C1 form
of the String Terminator (ST).

When a codepoint whose UTF-8 encoding contains 0x9C appears inside
one of these string-collecting states, the parser exits the state at
0x9C and interprets the remaining payload bytes as ground-state
input. For OSC/DCS/SOS/PM/APC terminated via ESC \ (7-bit ST), the
payload bytes after 0x9C end up written into the terminal buffer as
regular characters.

Every codepoint in U+2700–U+273F, the first 64 codepoints of the
Dingbats block, is affected (second UTF-8 byte is 0x9C). A concrete
example: U+2733 ✳ encodes as E2 9C B3, and is emitted by real-world
tools in OSC 0 window-title updates.
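
The byte overlap can be checked without the parser. The snippet below is a minimal sketch using only the Go standard library; `contains9C` is a hypothetical helper name introduced for illustration and is not part of x/ansi. It also shows that codepoints outside the Dingbats range, such as U+00DC (C3 9C), contain the same byte.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// contains9C reports whether the UTF-8 encoding of r includes byte 0x9C.
// Hypothetical helper for illustration only.
func contains9C(r rune) bool {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return bytes.IndexByte(buf[:n], 0x9C) >= 0
}

func main() {
	// Sample codepoints: two Dingbats from the issue, both ends of the
	// U+2700–U+273F range, the first codepoint past it, U+00DC, and ASCII.
	for _, r := range []rune{0x2733, 0x2736, 0x2700, 0x273F, 0x2740, 0x00DC, 'A'} {
		fmt.Printf("U+%04X  % X  contains 0x9C: %v\n", r, []byte(string(r)), contains9C(r))
	}
}
```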

Minimal reproduction

Feed this byte stream through the parser with an OSC 0 handler that
logs the payload:

ESC ] 0 ; E2 9C B3 ESC \

Expected: handler receives E2 9C B3 (one ✳) as the payload; parser
returns to ground after the final ESC \.

Actual: parser exits OSC state on 0x9C; OSC dispatches with payload
E2 only; bytes B3 ESC \ are processed as ground input, with B3
written to the terminal grid.
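
The split described under "Actual" can be modelled with a toy collector that treats every 0x9C as ST. This is only a sketch of the failure mode, not the x/ansi parser; `naiveOSC` is a hypothetical function, and its input is assumed to be the bytes that follow the ESC ] introducer.

```go
package main

import "fmt"

// naiveOSC models a string collector that treats every 0x9C as ST,
// with no awareness of UTF-8. Hypothetical; not the x/ansi parser.
// in holds the bytes that follow the ESC ] introducer.
func naiveOSC(in []byte) (payload, rest []byte) {
	for i, b := range in {
		if b == 0x9C { // 8-bit ST, or the middle byte of U+2733
			return in[:i], in[i+1:]
		}
		if b == 0x1B && i+1 < len(in) && in[i+1] == '\\' { // 7-bit ST
			return in[:i], in[i+2:]
		}
	}
	return in, nil
}

func main() {
	// "0;" followed by U+2733 (E2 9C B3) and ESC \.
	payload, rest := naiveOSC([]byte("0;\xe2\x9c\xb3\x1b\\"))
	fmt.Printf("payload: % X\n", payload) // payload: 30 3B E2
	fmt.Printf("rest:    % X\n", rest)    // rest:    B3 1B 5C
}
```

The leftover bytes B3 ESC \ are what a real parser would then process as ground-state input.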

Reproducing tests

The following cases can be dropped into ansi/parser_osc_test.go and
ansi/parser_dcs_test.go alongside the existing TestOscSequence /
TestDcsSequence table tests. They fail on current main and express
the fixed behaviour: the parser must not let a UTF-8 continuation byte
act as a state-transition byte.

Add to the cases slice in TestOscSequence
(ansi/parser_osc_test.go):

{
    // U+2733 ✳ (E2 9C B3). Real-world OSC 0 window-title input.
    name:  "osc_dingbat_u2733_preserved",
    input: "\x1b]0;✳ hello\x07",
    expected: []any{
        []byte("0;✳ hello"),
    },
},
{
    // U+2736 ✶ (E2 9C B6) — another Dingbat in the 0x9C-second-byte range.
    name:  "osc_dingbat_u2736_preserved",
    input: "\x1b]0;✶ Run\x07",
    expected: []any{
        []byte("0;✶ Run"),
    },
},
{
    // Bare 0x9C outside a UTF-8 sequence must still terminate OSC (C1 ST).
    // The following 'B' prints in ground state.
    name:  "osc_bare_9c_still_terminates",
    input: "\x1b]0;a\x9cB",
    expected: []any{
        []byte("0;a"),
        'B',
    },
},
{
    // Invalid UTF-8: E2 arms a 2-byte-remaining counter, then 0x05 is
    // outside 0x80-0xBF so the counter resets; the subsequent 0x9C
    // must therefore terminate OSC.
    name:  "osc_invalid_utf8_resets_counter",
    input: "\x1b]0;\xe2\x05\x9cZ",
    expected: []any{
        []byte("0;\xe2"),
        'Z',
    },
},
{
    // 8-bit OSC introducer path: OSC opened with raw 0x9D (not ESC ])
    // carrying U+2733 ✳ as payload.
    name:  "osc_8bit_introducer_with_dingbat",
    input: "\x9d0;✳ hi\x1b\\",
    expected: []any{
        []byte("0;✳ hi"),
        Cmd('\\'),
    },
},

Add to the cases slice in TestDcsSequence
(ansi/parser_dcs_test.go):

{
    // DCS payload carries a UTF-8 Dingbat (E2 9C B3 = U+2733 ✳).
    // The 0x9C inside it must NOT terminate the passthrough; ESC\ does.
    name:  "dcs_payload_with_dingbat_utf8",
    input: "\x1bPq\xe2\x9c\xb3 data\x1b\\",
    expected: []any{
        dcsSequence{
            Cmd:    'q',
            Params: Params{},
            Data:   []byte("\xe2\x9c\xb3 data"),
        },
        Cmd('\\'),
    },
},
{
    // Bare 0x9C (ST in C1) still terminates DCS when not in a UTF-8 sequence.
    name:  "dcs_bare_9c_terminates",
    input: "\x1bPqhello\x9cZ",
    expected: []any{
        dcsSequence{
            Cmd:    'q',
            Params: Params{},
            Data:   []byte("hello"),
        },
        'Z',
    },
},

Run:

go test ./ansi/... -run 'TestOscSequence|TestDcsSequence' -count=1

On current main the Dingbat/8-bit/DCS-UTF8 cases fail; the bare-0x9C
cases continue to pass. With the proposed fix below, all cases pass.

Proposed fix

Track UTF-8 continuation state inside the string-collecting states.
A lead byte (110xxxxx, 1110xxxx, or 11110xxx) arms a counter of
expected continuation bytes. While that counter is non-zero, a byte
in 0x80-0xBF is consumed as continuation data and skips the
state-transition table; any other byte resets the counter. The parser
stays byte-oriented; no decoding is performed.
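
A minimal sketch of that idea, written as a standalone toy rather than as a patch to x/ansi: `utf8Tracker` and `collectOSC` are hypothetical names, the input is assumed to start after the OSC introducer, and other C0 controls are simply dropped, which simplifies the real state table.

```go
package main

import "fmt"

// utf8Tracker counts the continuation bytes still expected after a lead
// byte. Hypothetical helper; no decoding or validation is performed.
type utf8Tracker struct{ remaining int }

// consume reports whether b is a continuation byte of a sequence in
// progress and must therefore bypass the state-transition table.
func (t *utf8Tracker) consume(b byte) bool {
	switch {
	case t.remaining > 0 && b >= 0x80 && b <= 0xBF:
		t.remaining--
		return true
	case b&0xE0 == 0xC0: // 110xxxxx: one continuation byte follows
		t.remaining = 1
	case b&0xF0 == 0xE0: // 1110xxxx: two follow
		t.remaining = 2
	case b&0xF8 == 0xF0: // 11110xxx: three follow
		t.remaining = 3
	default: // anything else resets the counter
		t.remaining = 0
	}
	return false
}

// collectOSC scans the bytes after an OSC introducer and returns the
// payload and whatever follows the terminator (0x9C, BEL, or ESC \).
func collectOSC(in []byte) (payload, rest []byte) {
	var t utf8Tracker
	for i := 0; i < len(in); i++ {
		b := in[i]
		if t.consume(b) {
			payload = append(payload, b)
			continue
		}
		switch {
		case b == 0x9C || b == 0x07:
			return payload, in[i+1:]
		case b == 0x1B && i+1 < len(in) && in[i+1] == '\\':
			return payload, in[i+2:]
		case b < 0x20:
			// Simplification: drop other C0 controls.
		default:
			payload = append(payload, b)
		}
	}
	return payload, nil
}

func main() {
	for _, in := range []string{"0;✳ hello\x07", "0;a\x9cB", "0;\xe2\x05\x9cZ"} {
		p, r := collectOSC([]byte(in))
		fmt.Printf("payload=%q rest=%q\n", p, r)
	}
}
```

The three inputs mirror the OSC test cases above: the Dingbat payload survives intact, a bare 0x9C still terminates, and an invalid byte after E2 resets the counter so that the following 0x9C terminates.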

This is consistent with the byte-pattern rules of RFC 3629 UTF-8 and
the Williams VT500 state machine as extended by UTF-8-capable
terminal emulators: xterm, kitty, wezterm, and alacritty all treat
0x9C inside an active multi-byte sequence as data, and only treat
it as ST outside one. PR incoming.
