ansi: OSC/DCS/SOS/PM/APC strings truncated by UTF-8 continuation byte 0x9C

## Summary

The `x/ansi` parser truncates OSC, DCS, SOS, PM, and APC strings whose
payload contains certain UTF-8 multi-byte codepoints, because byte
`0x9C` is both a valid UTF-8 continuation byte and the 8-bit C1 form
of the String Terminator (ST).

When a codepoint whose UTF-8 encoding contains `0x9C` appears inside
one of these string-collecting states, the parser exits the state at
`0x9C` and interprets the remaining payload bytes as ground-state
input. For strings terminated via `ESC \` (7-bit ST), the payload
bytes after `0x9C` are then written into the terminal buffer as
regular characters.

Every codepoint in U+2700–U+273F (the first 64 code points of the
Dingbats block) is affected, because its second UTF-8 byte is `0x9C`.
A concrete example: U+2733 `✳` encodes as `E2 9C B3`, and is emitted
by real-world tools in OSC 0 window-title updates.
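
To check whether a given codepoint can trigger the truncation, the
following sketch tests whether its UTF-8 encoding carries `0x9C` in a
continuation position. The helper name `encodingContainsST` is
hypothetical and not part of `x/ansi`; it only uses the standard
`unicode/utf8` package.

```go
package main

import "unicode/utf8"

// stByte is the 8-bit C1 form of the String Terminator.
const stByte = 0x9C

// encodingContainsST reports whether the UTF-8 encoding of r has 0x9C in
// any continuation position. Hypothetical helper for illustration only.
func encodingContainsST(r rune) bool {
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)
	// Skip the lead byte: 0x9C has the form 10xxxxxx, so it can only
	// ever appear as a continuation byte.
	for _, b := range buf[1:n] {
		if b == stByte {
			return true
		}
	}
	return false
}
```

Under this check, `encodingContainsST('✳')` is true (`E2 9C B3`) and
`encodingContainsST('\u2740')` is false (`E2 9D 80`). Codepoints outside
the Dingbats range can match too when `0x9C` is the final byte, for
example U+201C (`E2 80 9C`).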

## Minimal reproduction

Feed this byte stream through the parser with an OSC 0 handler that
logs the payload:

```
ESC ] 0 ; E2 9C B3 ESC \
```

Expected: handler receives `E2 9C B3` (one `✳`) as the payload; parser
returns to ground after the final `ESC \`.

Actual: parser exits OSC state on `0x9C`; OSC dispatches with payload
`E2` only; bytes `B3 ESC \` are processed as ground input, with `B3`
written to the terminal grid.
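
The split described above can be modelled with a toy collector that
treats every `0x9C` as ST. This is only a sketch of the failure mode:
`naiveCollect` is a hypothetical function, not the `x/ansi` parser, and
it ignores every other terminator (`ESC \`, BEL, and so on).

```go
package main

// naiveCollect models a string-collecting state that exits on any 0x9C.
// It returns the bytes collected as payload and the bytes that would be
// handed back to ground state. Hypothetical helper, not the x/ansi API.
func naiveCollect(body []byte) (payload, leftover []byte) {
	for i, b := range body {
		if b == 0x9C {
			// 0x9C is consumed as ST; everything after it leaks out.
			return body[:i], body[i+1:]
		}
	}
	return body, nil
}
```

Given the body bytes `E2 9C B3 1B 5C` (everything after `ESC ] 0 ;`),
this returns payload `E2` and leftover `B3 1B 5C`, matching the actual
behaviour reported above.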

## Reproducing tests

The following cases can be dropped into `ansi/parser_osc_test.go` and
`ansi/parser_dcs_test.go` alongside the existing `TestOscSequence` /
`TestDcsSequence` table tests. They fail on current `main` and express
the fixed behaviour: the parser must not let a UTF-8 continuation byte
act as a state-transition byte.

Add to the `cases` slice in `TestOscSequence`
(`ansi/parser_osc_test.go`):

```go
{
    // U+2733 ✳ (E2 9C B3). Real-world OSC 0 window-title input.
    name:  "osc_dingbat_u2733_preserved",
    input: "\x1b]0;✳ hello\x07",
    expected: []any{
        []byte("0;✳ hello"),
    },
},
{
    // U+2736 ✶ (E2 9C B6) — another Dingbat in the 0x9C-second-byte range.
    name:  "osc_dingbat_u2736_preserved",
    input: "\x1b]0;✶ Run\x07",
    expected: []any{
        []byte("0;✶ Run"),
    },
},
{
    // Bare 0x9C outside a UTF-8 sequence must still terminate OSC (C1 ST).
    // The following 'B' prints in ground state.
    name:  "osc_bare_9c_still_terminates",
    input: "\x1b]0;a\x9cB",
    expected: []any{
        []byte("0;a"),
        'B',
    },
},
{
    // Invalid UTF-8: E2 arms a 2-byte-remaining counter, then 0x05 is
    // outside 0x80-0xBF so the counter resets; the subsequent 0x9C
    // must therefore terminate OSC.
    name:  "osc_invalid_utf8_resets_counter",
    input: "\x1b]0;\xe2\x05\x9cZ",
    expected: []any{
        []byte("0;\xe2"),
        'Z',
    },
},
{
    // 8-bit OSC introducer path: OSC opened with raw 0x9D (not ESC ])
    // carrying U+2733 ✳ as payload.
    name:  "osc_8bit_introducer_with_dingbat",
    input: "\x9d0;✳ hi\x1b\\",
    expected: []any{
        []byte("0;✳ hi"),
        Cmd('\\'),
    },
},
```

Add to the `cases` slice in `TestDcsSequence`
(`ansi/parser_dcs_test.go`):

```go
{
    // DCS payload carries a UTF-8 Dingbat (E2 9C B3 = U+2733 ✳).
    // The 0x9C inside it must NOT terminate the passthrough; ESC\ does.
    name:  "dcs_payload_with_dingbat_utf8",
    input: "\x1bPq\xe2\x9c\xb3 data\x1b\\",
    expected: []any{
        dcsSequence{
            Cmd:    'q',
            Params: Params{},
            Data:   []byte("\xe2\x9c\xb3 data"),
        },
        Cmd('\\'),
    },
},
{
    // Bare 0x9C (ST in C1) still terminates DCS when not in a UTF-8 sequence.
    name:  "dcs_bare_9c_terminates",
    input: "\x1bPqhello\x9cZ",
    expected: []any{
        dcsSequence{
            Cmd:    'q',
            Params: Params{},
            Data:   []byte("hello"),
        },
        'Z',
    },
},
```

Run:

```
go test ./ansi/... -run 'TestOscSequence|TestDcsSequence' -count=1
```

On current `main` the Dingbat/8-bit/DCS-UTF8 cases fail; the bare-0x9C
cases continue to pass. With the proposed fix below, all cases pass.

## Proposed fix

Track UTF-8 continuation state inside the string-collecting states.
When a byte in `0x80-0xBF` arrives while a multi-byte sequence is in
progress (a lead byte `110xxxxx`, `1110xxxx`, or `11110xxx` has been
seen and not all of its continuation bytes have arrived yet), consume
it as a continuation byte and skip the state-transition table for that
byte. Any byte outside `0x80-0xBF` ends the sequence in progress. The
parser stays byte-oriented; no decoding is performed.

This is consistent with the byte-pattern rules of RFC 3629 UTF-8 and
the Williams VT500 state machine as extended by UTF-8-capable
terminal emulators: xterm, kitty, wezterm, and alacritty all treat
`0x9C` inside an active multi-byte sequence as data, and only treat
it as ST outside one. PR incoming.
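
A minimal sketch of the tracking described above, assuming a simple
counter of outstanding continuation bytes. The names `utf8Tracker`,
`isContinuation`, and `collectString` are hypothetical and do not come
from `x/ansi`; the collector only models `0x9C` termination and leaves
out every other transition.

```go
package main

// utf8Tracker counts the continuation bytes still owed by the most
// recent UTF-8 lead byte. Hypothetical type, not part of x/ansi.
type utf8Tracker struct {
	remaining int
}

// isContinuation reports whether b completes part of a multi-byte
// sequence in progress and should therefore bypass the transition table.
// It matches byte patterns only; nothing is decoded or validated.
func (t *utf8Tracker) isContinuation(b byte) bool {
	if t.remaining > 0 && b >= 0x80 && b <= 0xBF {
		t.remaining--
		return true
	}
	// Any other byte ends the sequence in progress; a lead byte starts
	// a new one.
	switch {
	case b&0xE0 == 0xC0: // 110xxxxx
		t.remaining = 1
	case b&0xF0 == 0xE0: // 1110xxxx
		t.remaining = 2
	case b&0xF8 == 0xF0: // 11110xxx
		t.remaining = 3
	default:
		t.remaining = 0
	}
	return false
}

// collectString collects payload bytes until a 0x9C that is not a UTF-8
// continuation byte, then returns the payload and the remaining input.
func collectString(body []byte) (payload, rest []byte) {
	var t utf8Tracker
	for i, b := range body {
		if t.isContinuation(b) {
			continue // data, even when b == 0x9C
		}
		if b == 0x9C {
			return body[:i], body[i+1:]
		}
	}
	return body, nil
}
```

With this sketch, `collectString` on `E2 9C B3 9C 42` keeps `E2 9C B3`
as payload and returns `42` as the rest, while on `E2 05 9C 5A` the
`0x05` resets the counter and the following `0x9C` terminates the
string.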

