
Python binary stream writer buffer overflow #289

@naegelejd

A self-contained reproduction of an IndexError: bytearray index out of range crash in the Yardl-generated Python binary writer when a stream of small records is written one item at a time (from an iterator/generator).

TL;DR

| Format | Write mode | Result |
| --- | --- | --- |
| binary | per-item streaming (iterator) | CRASH (IndexError) |
| binary | batch (single list) | OK |
| ndjson | per-item streaming (iterator) | OK |
| ndjson | batch (single list) | OK |
| C++ binary | per-item streaming | OK |

The bug is specific to the generated Python binary writer's per-item streaming path. The C++ writer handles the identical data fine, and the Python NDJSON and batch-list paths are unaffected.

How to reproduce

Model:

TinyOptionalRecord: !record
  fields:
    a: uint32?
    b: uint32?
    c: uint32?
    d: uint32?

StreamOfTinyRecords: !protocol
  sequence:
    items: !stream
      items: TinyOptionalRecord

Python:

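import io

# `m` is the Yardl-generated Python package for the model above;
# import it under that name (the package name depends on your project).

def make_records(n):
    # A generator, so the writer takes the per-item streaming path.
    yield from [m.TinyOptionalRecord() for _ in range(n)]

items = make_records(70000)
buffer = io.BytesIO()
with m.BinaryStreamOfTinyRecordsWriter(buffer) as w:
    w.write_items(items)    # per-item streaming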

This crashes with IndexError: bytearray index out of range.
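Since the TL;DR table shows the batch (single list) path working, one possible stopgap is to hand the writer non-empty lists instead of a generator. The helper below is only a sketch: `in_batches` is a hypothetical name, and the usage shown in the trailing comment assumes that `write_items` may be called more than once per stream, which should be checked against the generated writer before relying on it.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive non-empty lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            # Never yield an empty list: an empty list would not take
            # the `len(value) > 0` batch branch of the serializer.
            return
        yield batch


# Assumed usage with the generated writer (not verified here):
#
#     with m.BinaryStreamOfTinyRecordsWriter(buffer) as w:
#         for batch in in_batches(make_records(70000), 1024):
#             w.write_items(batch)
```

Each batch is a real `list`, so it should be serialized as a count followed by its elements rather than with per-item markers.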

Root cause

In the generated Python runtime (_binary.py), the stream serializer writes a 1-byte item marker before each element in the per-item path:

# _binary.py — StreamSerializer.write
if isinstance(value, list) and len(value) > 0:
    stream.write_unsigned_varint(len(value))   # batch path: count, then elements
    for element in value:
        self._element_serializer.write(stream, element)
else:
    for element in value:
        stream.write_byte_no_check(1)          # <-- per-item marker, NO capacity check
        self._element_serializer.write(stream, element)

write_byte_no_check writes directly into the 64 KiB output buffer without checking remaining capacity or flushing:

# _binary.py — CodedOutputStream.write_byte_no_check
def write_byte_no_check(self, value: int) -> None:
    assert 0 <= value <= UINT8_MAX
    self._buffer[self._offset] = value    # <-- IndexError when _offset == len(_buffer)
    self._offset += 1

Each all-default TinyOptionalRecord serializes to exactly 4 bytes (one has_value = 0 byte per optional field). With the 1-byte per-item marker, that is 5 bytes per item. The element serializer's own writes use capacity checks of the form if (len(buffer) - offset) < size: flush(). Such a check does not flush when the remaining space exactly equals the write size, so a checked write can leave the buffer offset at exactly 65536, the end of the buffer. If that write was the last byte of an item, the very next per-item write_byte_no_check(1) marker indexes one past the end of the buffer and raises IndexError.

With the default data this happens deterministically while writing item ~65464, which is why a few tens of thousands of items are needed to trigger it. The batch-list path avoids the bug because it emits a single write_unsigned_varint(len) (which does check capacity) instead of a per-item unchecked marker.
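The failure mode described above can be reproduced in a small stand-alone model. This is a sketch, not the generated `_binary.py` code: `TinyCodedOutputStream` and `items_written_before_crash` are hypothetical names, `flush` simply discards the buffered bytes, and anything the real writer emits before the stream is ignored, so the item index at which the model fails will differ from the one reported above.

```python
class TinyCodedOutputStream:
    """Simplified model of a fixed-size output buffer (hypothetical)."""

    def __init__(self, capacity):
        self._buffer = bytearray(capacity)
        self._offset = 0

    def flush(self):
        # The real stream would write the bytes out; the model drops them.
        self._offset = 0

    def write_byte(self, value):
        # Checked write: flushes only when fewer than 1 byte remains,
        # so the offset is allowed to reach len(self._buffer) exactly.
        if len(self._buffer) - self._offset < 1:
            self.flush()
        self.write_byte_no_check(value)

    def write_byte_no_check(self, value):
        # Unchecked write: IndexError when _offset == len(_buffer).
        self._buffer[self._offset] = value
        self._offset += 1


def items_written_before_crash(capacity, n_items, payload_bytes=4):
    """Return how many items were fully written before the unchecked
    marker raised IndexError, or None if all n_items were written."""
    stream = TinyCodedOutputStream(capacity)
    for index in range(n_items):
        try:
            stream.write_byte_no_check(1)   # per-item marker, no check
        except IndexError:
            return index
        for _ in range(payload_bytes):
            stream.write_byte(0)            # one has_value byte per field
    return None


print(items_written_before_crash(16, 100))
print(items_written_before_crash(65536, 70000))
```

In this model the marker only fails when the bytes written so far are an exact multiple of the buffer size at an item boundary. With 5 bytes per item and a power-of-two buffer, that first happens after as many items as the buffer has bytes, which is consistent with the report that tens of thousands of items are needed.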

Suggested fix (for maintainers)

Give the per-item marker the same capacity guarantee the batch path effectively has — for example, call ensure_capacity(1) (or use a capacity-checked byte write) before the marker in StreamSerializer.write, or have write_byte_no_check flush when the buffer is full. This file is generated from tooling/internal/python/static_files/_binary.py in the yardl repo.
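A minimal sketch of the first option, guarding the marker with a capacity check, on a stand-alone model rather than the real `_binary.py`. `CheckedOutputStream`, `write_stream_per_item`, and this `ensure_capacity` are hypothetical stand-ins written for illustration; the actual method names and flush behaviour in the runtime may differ.

```python
import io


class CheckedOutputStream:
    """Simplified fixed-size output buffer that flushes to a sink."""

    def __init__(self, sink, capacity=65536):
        self._sink = sink
        self._buffer = bytearray(capacity)
        self._offset = 0

    def flush(self):
        self._sink.write(self._buffer[:self._offset])
        self._offset = 0

    def ensure_capacity(self, size):
        # Same form of check the issue describes for element writes.
        if len(self._buffer) - self._offset < size:
            self.flush()

    def write_byte_no_check(self, value):
        self._buffer[self._offset] = value
        self._offset += 1


def write_stream_per_item(stream, n_items, payload_bytes=4):
    """Per-item path with the proposed guard before each marker."""
    for _ in range(n_items):
        stream.ensure_capacity(1)        # proposed fix: check first
        stream.write_byte_no_check(1)    # per-item marker
        for _ in range(payload_bytes):
            stream.ensure_capacity(1)
            stream.write_byte_no_check(0)
    stream.flush()


sink = io.BytesIO()
write_stream_per_item(CheckedOutputStream(sink), 70000)
print(len(sink.getvalue()))
```

With the guard in place, a full buffer is flushed before the marker is written, so the unchecked write always has at least one byte of room.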
