Python binary stream writer buffer overflow #289

A self-contained reproduction of an `IndexError: bytearray index out of range` crash in the **Yardl-generated Python binary writer** when a stream of small records is written **one item at a time** (from an iterator/generator).

## TL;DR

| Language | Format | Write mode | Result |
|----------|--------|------------|--------|
| Python | binary | per-item streaming (iterator) | **CRASH** (`IndexError`) |
| Python | binary | batch (single list) | OK |
| Python | ndjson | per-item streaming (iterator) | OK |
| Python | ndjson | batch (single list) | OK |
| C++ | binary | per-item streaming | OK |

The bug is specific to the generated **Python binary writer's per-item streaming path**. The C++ writer handles the identical data fine, and the Python NDJSON and batch-list paths are unaffected.

## How to reproduce

Model:
```yaml
TinyOptionalRecord: !record
  fields:
    a: uint32?
    b: uint32?
    c: uint32?
    d: uint32?

StreamOfTinyRecords: !protocol
  sequence:
    items: !stream
      items: TinyOptionalRecord
```

Python:
```python
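import io

# `m` is the Python package that yardl generated from the model above;
# import it under whatever name your project uses.

def make_records(n):
    # A generator, so the writer takes the per-item path instead of the list path.
    yield from [m.TinyOptionalRecord() for _ in range(n)]

items = make_records(70000)
buffer = io.BytesIO()
with m.BinaryStreamOfTinyRecordsWriter(buffer) as w:
    w.write_items(items)    # per-item streaming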
```

Running this crashes with `IndexError: bytearray index out of range`.

## Root cause

In the generated Python runtime (`_binary.py`), the stream serializer writes a **1-byte item marker before each element** in the per-item path:

```python
# _binary.py — StreamSerializer.write
if isinstance(value, list) and len(value) > 0:
    stream.write_unsigned_varint(len(value))   # batch path: count, then elements
    for element in value:
        self._element_serializer.write(stream, element)
else:
    for element in value:
        stream.write_byte_no_check(1)          # <-- per-item marker, NO capacity check
        self._element_serializer.write(stream, element)
```

`write_byte_no_check` writes directly into the 64 KiB output buffer **without checking remaining capacity or flushing**:

```python
# _binary.py — CodedOutputStream.write_byte_no_check
def write_byte_no_check(self, value: int) -> None:
    assert 0 <= value <= UINT8_MAX
    self._buffer[self._offset] = value    # <-- IndexError when _offset == len(_buffer)
    self._offset += 1
```

Each all-default `TinyOptionalRecord` serializes to exactly 4 bytes (one `has_value = 0` byte per optional field). With the 1-byte per-item marker that is **5 bytes/item**. The element serializer's own writes use capacity checks of the form `if (len(buffer) - offset) < size: flush()`, which do **not** flush when the remaining space equals the write size exactly — so the buffer offset can land on exactly `65536`. The very next per-item `write_byte_no_check(1)` marker then indexes one past the end of the buffer and raises `IndexError`.

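With the default data this happens deterministically while writing item **~65464**. In this reproduction the offset can only reach the end of the buffer exactly when the total number of bytes written so far is a multiple of 65536, and at 5 bytes per item that alignment is first reached after roughly 65,000 items rather than after the ~13,000 items that fill the buffer once. That is why so many items are needed to trigger the crash. The batch-list path avoids the bug because it emits a single `write_unsigned_varint(len)` (which *does* check capacity) instead of a per-item unchecked marker.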
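The arithmetic above can be checked with a small stand-alone simulation. This is a sketch, not the Yardl runtime: `ToyOutputStream` and `first_failing_item` are hypothetical stand-ins that only mimic the capacity check and the unchecked byte write quoted above. The sketch also ignores whatever bytes the real writer emits before the first item, so the failing index it reports is not expected to match the ~65464 observed with the real writer.

```python
BUFFER_SIZE = 65536  # 64 KiB, matching the buffer size described above


class ToyOutputStream:
    """Hypothetical stand-in for the coded output stream; not the real class."""

    def __init__(self, size: int = BUFFER_SIZE) -> None:
        self._buffer = bytearray(size)
        self._offset = 0

    def flush(self) -> None:
        # The real stream would write the buffer out; the toy only resets.
        self._offset = 0

    def ensure_capacity(self, size: int) -> None:
        # Same shape as the check quoted above: an exact fit does not flush.
        if (len(self._buffer) - self._offset) < size:
            self.flush()

    def write_byte_no_check(self, value: int) -> None:
        self._buffer[self._offset] = value  # IndexError once _offset == size
        self._offset += 1


def first_failing_item(n_items: int, optional_fields: int = 4):
    """Return the 0-based index of the item whose marker write fails, or None."""
    stream = ToyOutputStream()
    for index in range(n_items):
        try:
            stream.write_byte_no_check(1)  # unchecked per-item marker
        except IndexError:
            return index
        for _ in range(optional_fields):
            stream.ensure_capacity(1)      # checked has_value byte
            stream.write_byte_no_check(0)
    return None


print(first_failing_item(70000))
```

Under these assumptions the sketch reports index 65536. The offset lands exactly on the end of the buffer only when the total bytes written (5 per item) reach a multiple of 65536, and because 5 and 65536 share no common factor that first happens after 65536 items. Any bytes written ahead of the first item shift this index.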

## Suggested fix (for maintainers)

Give the per-item marker the same capacity guarantee the batch path effectively has. For example, call `ensure_capacity(1)` (or use a capacity-checked byte write) before the marker in `StreamSerializer.write`, or have `write_byte_no_check` flush when the buffer is full. The generated `_binary.py` is produced from `tooling/internal/python/static_files/_binary.py` in the yardl repo, so that is where the change would need to be made.
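A minimal sketch of the first option, assuming a stream class shaped like the snippets quoted in the root-cause section. `SketchOutputStream` and `write_stream_items` are hypothetical names used only for illustration and are not the generated runtime. The one change that matters is the `ensure_capacity(1)` call before the marker.

```python
import io


class SketchOutputStream:
    """Hypothetical buffered writer modeled on the snippets above."""

    def __init__(self, sink, buffer_size: int = 65536) -> None:
        self._sink = sink
        self._buffer = bytearray(buffer_size)
        self._offset = 0

    def flush(self) -> None:
        # Hand the filled part of the buffer to the sink and start over.
        self._sink.write(self._buffer[: self._offset])
        self._offset = 0

    def ensure_capacity(self, size: int) -> None:
        if (len(self._buffer) - self._offset) < size:
            self.flush()

    def write_byte_no_check(self, value: int) -> None:
        self._buffer[self._offset] = value
        self._offset += 1


def write_stream_items(stream, items, optional_fields: int = 4) -> None:
    """Write each item as a marker byte plus one has_value byte per field."""
    for _item in items:
        stream.ensure_capacity(1)       # proposed guard before the marker
        stream.write_byte_no_check(1)
        for _ in range(optional_fields):
            stream.ensure_capacity(1)
            stream.write_byte_no_check(0)


sink = io.BytesIO()
out = SketchOutputStream(sink)
write_stream_items(out, range(70000))
out.flush()
print(len(sink.getvalue()))  # 70000 items * 5 bytes each
```

With the guard in place the toy writer gets through all 70000 items. The guard costs one comparison per item, which is the same check the element writes already perform.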
