Summary
The x/ansi parser truncates OSC, DCS, SOS, PM, and APC strings whose
payload contains certain UTF-8 multi-byte codepoints, because byte
0x9C is both a valid UTF-8 continuation byte and the 8-bit C1 form
of the String Terminator (ST).
When a codepoint whose UTF-8 encoding contains 0x9C appears inside
one of these string-collecting states, the parser exits the state at
0x9C and interprets the remaining payload bytes as ground-state
input. For OSC/DCS/SOS/PM/APC terminated via ESC \ (7-bit ST), the
payload bytes after 0x9C end up written into the terminal buffer as
regular characters.
Every codepoint in U+2700–U+273F, the first 64 codepoints of the
Dingbats block, is affected (the second UTF-8 byte is 0x9C). A concrete
example: U+2733 ✳ encodes as E2 9C B3 and is emitted by real-world
tools in OSC 0 window-title updates.
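The encoding claim above can be checked with the standard library alone. The following is a minimal sketch; the helper name `containsC1ST` is made up for illustration and is not part of x/ansi.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// containsC1ST (hypothetical helper) reports whether the UTF-8 encoding
// of r contains byte 0x9C, the 8-bit C1 String Terminator.
func containsC1ST(r rune) bool {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return bytes.IndexByte(buf[:n], 0x9C) >= 0
}

func main() {
	// U+2733 and U+2736 are the examples from this report; U+2740 is the
	// first Dingbat whose second byte is 0x9D rather than 0x9C.
	for _, r := range []rune{0x2700, 0x2733, 0x2736, 0x273F, 0x2740, 'A'} {
		fmt.Printf("U+%04X  % X  contains 0x9C: %v\n", r, []byte(string(r)), containsC1ST(r))
	}
}
```

Any codepoint for which this helper returns true would be expected to trigger the truncation when it appears in a string payload.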
Minimal reproduction
Feed this byte stream through the parser with an OSC 0 handler that
logs the payload:
Expected: handler receives E2 9C B3 (one ✳) as the payload; parser
returns to ground after the final ESC \.
Actual: parser exits OSC state on 0x9C; OSC dispatches with payload
E2 only; bytes B3 ESC \ are processed as ground input, with B3
written to the terminal grid.
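To make the failure mode concrete, here is a toy model of the reported behaviour. It is not the x/ansi parser: `naiveOSC` is a hypothetical function that treats every 0x9C as ST, and the input is the byte stream implied by the expected/actual description above (ESC ] 0 ; E2 9C B3 ESC \), which is an assumption on my part.

```go
package main

import "fmt"

// naiveOSC is a toy model, NOT the x/ansi parser. It collects the bytes
// that follow "ESC ]" and, like the behaviour reported here, treats any
// 0x9C as ST. It returns the collected payload and the bytes that would
// be left over for ground-state processing.
func naiveOSC(in []byte) (payload, rest []byte) {
	if len(in) < 2 || in[0] != 0x1B || in[1] != ']' {
		return nil, in
	}
	for i := 2; i < len(in); i++ {
		switch {
		case in[i] == 0x9C || in[i] == 0x07: // C1 ST or BEL
			return in[2:i], in[i+1:]
		case in[i] == 0x1B && i+1 < len(in) && in[i+1] == '\\': // ESC \ (7-bit ST)
			return in[2:i], in[i+2:]
		}
	}
	return in[2:], nil
}

func main() {
	// Assumed reproduction stream: OSC 0 carrying U+2733, terminated by ESC \.
	stream := []byte("\x1b]0;\xe2\x9c\xb3\x1b\\")
	payload, rest := naiveOSC(stream)
	fmt.Printf("payload: % X\n", payload)
	fmt.Printf("leaked to ground: % X\n", rest)
}
```

In this model the payload stops at E2 and the bytes B3 ESC \ are left for ground state, which mirrors the "Actual" description.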
Reproducing tests
The following cases can be dropped into ansi/parser_osc_test.go and
ansi/parser_dcs_test.go alongside the existing TestOscSequence /
TestDcsSequence table tests. They fail on current main and express
the fixed behaviour: the parser must not let a UTF-8 continuation byte
act as a state-transition byte.
Add to the cases slice in TestOscSequence
(ansi/parser_osc_test.go):
    {
        // U+2733 ✳ (E2 9C B3). Real-world OSC 0 window-title input.
        name:  "osc_dingbat_u2733_preserved",
        input: "\x1b]0;✳ hello\x07",
        expected: []any{
            []byte("0;✳ hello"),
        },
    },
    {
        // U+2736 ✶ (E2 9C B6): another Dingbat in the 0x9C-second-byte range.
        name:  "osc_dingbat_u2736_preserved",
        input: "\x1b]0;✶ Run\x07",
        expected: []any{
            []byte("0;✶ Run"),
        },
    },
    {
        // Bare 0x9C outside a UTF-8 sequence must still terminate OSC (C1 ST).
        // The following 'B' prints in ground state.
        name:  "osc_bare_9c_still_terminates",
        input: "\x1b]0;a\x9cB",
        expected: []any{
            []byte("0;a"),
            'B',
        },
    },
    {
        // Invalid UTF-8: E2 arms a 2-byte-remaining counter, then 0x05 is
        // outside 0x80-0xBF so the counter resets; the subsequent 0x9C
        // must therefore terminate OSC.
        name:  "osc_invalid_utf8_resets_counter",
        input: "\x1b]0;\xe2\x05\x9cZ",
        expected: []any{
            []byte("0;\xe2"),
            'Z',
        },
    },
    {
        // 8-bit OSC introducer path: OSC opened with raw 0x9D (not ESC ])
        // carrying U+2733 ✳ as payload.
        name:  "osc_8bit_introducer_with_dingbat",
        input: "\x9d0;✳ hi\x1b\\",
        expected: []any{
            []byte("0;✳ hi"),
            Cmd('\\'),
        },
    },
Add to the cases slice in TestDcsSequence
(ansi/parser_dcs_test.go):
    {
        // DCS payload carries a UTF-8 Dingbat (E2 9C B3 = U+2733 ✳).
        // The 0x9C inside it must NOT terminate the passthrough; ESC\ does.
        name:  "dcs_payload_with_dingbat_utf8",
        input: "\x1bPq\xe2\x9c\xb3 data\x1b\\",
        expected: []any{
            dcsSequence{
                Cmd:    'q',
                Params: Params{},
                Data:   []byte("\xe2\x9c\xb3 data"),
            },
            Cmd('\\'),
        },
    },
    {
        // Bare 0x9C (ST in C1) still terminates DCS when not in a UTF-8 sequence.
        name:  "dcs_bare_9c_terminates",
        input: "\x1bPqhello\x9cZ",
        expected: []any{
            dcsSequence{
                Cmd:    'q',
                Params: Params{},
                Data:   []byte("hello"),
            },
            'Z',
        },
    },
Run:
    go test ./ansi/... -run 'TestOscSequence|TestDcsSequence' -count=1
On current main the Dingbat/8-bit/DCS-UTF8 cases fail; the bare-0x9C
cases continue to pass. With the proposed fix below, all cases pass.
Proposed fix
Track UTF-8 continuation state inside the string-collecting states.
When a byte in 0x80-0xBF arrives while a multi-byte sequence is in
progress (a lead byte 110xxxxx, 1110xxxx, or 11110xxx has been seen
and not all of its continuation bytes have arrived yet), consume it
as a continuation byte and skip the state-transition table for that
byte. The parser stays byte-oriented; no decoding is performed.
This is consistent with the byte-pattern rules of RFC 3629 UTF-8 and
the Williams VT500 state machine as extended by UTF-8-capable
terminal emulators: xterm, kitty, wezterm, and alacritty all treat
0x9C inside an active multi-byte sequence as data, and only treat
it as ST outside one. PR incoming.
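The tracking described above can be sketched as follows. This is a standalone illustration under stated assumptions, not the forthcoming patch: `utf8Pending` and `collectString` are hypothetical names, the input is the string body after the introducer, and the handling of other C0 controls is simplified (they are dropped).

```go
package main

import "fmt"

// utf8Pending (hypothetical) returns how many continuation bytes a UTF-8
// lead byte announces, or 0 if b is not a lead byte. Only the bit pattern
// is inspected; nothing is decoded.
func utf8Pending(b byte) int {
	switch {
	case b&0xE0 == 0xC0: // 110xxxxx
		return 1
	case b&0xF0 == 0xE0: // 1110xxxx
		return 2
	case b&0xF8 == 0xF0: // 11110xxx
		return 3
	}
	return 0
}

// collectString (hypothetical) collects a string payload until BEL, ESC,
// or a 0x9C that is not an expected UTF-8 continuation byte. It returns
// the payload and the unconsumed input; ESC is left in rest for the caller.
func collectString(in []byte) (payload, rest []byte) {
	pending := 0
	for i, b := range in {
		if pending > 0 && b >= 0x80 && b <= 0xBF {
			pending-- // expected continuation byte: data, never a terminator
			payload = append(payload, b)
			continue
		}
		pending = 0 // sequence complete, or broken by a non-continuation byte
		switch {
		case b == 0x9C || b == 0x07: // C1 ST or BEL
			return payload, in[i+1:]
		case b == 0x1B: // start of ESC \ or another escape
			return payload, in[i:]
		case b < 0x20: // simplification: drop other C0 controls
			continue
		default:
			pending = utf8Pending(b)
			payload = append(payload, b)
		}
	}
	return payload, nil
}

func main() {
	for _, s := range []string{"0;\xe2\x9c\xb3 hello\x07", "0;a\x9cB", "0;\xe2\x05\x9cZ"} {
		p, r := collectString([]byte(s))
		fmt.Printf("payload=%q rest=%q\n", p, r)
	}
}
```

The counter resets on any byte that is not an expected continuation, so a bare 0x9C, or one that follows a broken sequence, still acts as ST.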