A self-contained reproduction of an IndexError: bytearray index out of range crash in the Yardl-generated Python binary writer when a stream of small records is written one item at a time (from an iterator/generator).
TL;DR
| Format | Write mode | Result |
| --- | --- | --- |
| binary | per-item streaming (iterator) | CRASH (IndexError) |
| binary | batch (single list) | OK |
| ndjson | per-item streaming (iterator) | OK |
| ndjson | batch (single list) | OK |
| C++ binary | per-item streaming | OK |
The bug is specific to the generated Python binary writer's per-item streaming path. The C++ writer handles the identical data fine, and the Python NDJSON and batch-list paths are unaffected.
How to reproduce
Model:
TinyOptionalRecord: !record
  fields:
    a: uint32?
    b: uint32?
    c: uint32?
    d: uint32?

StreamOfTinyRecords: !protocol
  sequence:
    items: !stream
      items: TinyOptionalRecord
Python:
import io

# `m` is assumed to be the yardl-generated Python module for the model above.

def make_records(n):
    yield from [m.TinyOptionalRecord() for _ in range(n)]

items = make_records(70000)
buffer = io.BytesIO()
with m.BinaryStreamOfTinyRecordsWriter(buffer) as w:
    w.write_items(items)  # per-item streaming
This crashes with IndexError: bytearray index out of range.
Root cause
In the generated Python runtime (_binary.py), the stream serializer writes a 1-byte item marker before each element in the per-item path:
# _binary.py: StreamSerializer.write
if isinstance(value, list) and len(value) > 0:
    stream.write_unsigned_varint(len(value))  # batch path: count, then elements
    for element in value:
        self._element_serializer.write(stream, element)
else:
    for element in value:
        stream.write_byte_no_check(1)  # <-- per-item marker, NO capacity check
        self._element_serializer.write(stream, element)
write_byte_no_check writes directly into the 64 KiB output buffer without checking remaining capacity or flushing:
# _binary.py: CodedOutputStream.write_byte_no_check
def write_byte_no_check(self, value: int) -> None:
    assert 0 <= value <= UINT8_MAX
    self._buffer[self._offset] = value  # <-- IndexError when _offset == len(_buffer)
    self._offset += 1
Each all-default TinyOptionalRecord serializes to exactly 4 bytes (one has_value = 0 byte per optional field). With the 1-byte per-item marker that is 5 bytes/item. The element serializer's own writes use capacity checks of the form if (len(buffer) - offset) < size: flush(), which do not flush when the remaining space equals the write size exactly — so the buffer offset can land on exactly 65536. The very next per-item write_byte_no_check(1) marker then indexes one past the end of the buffer and raises IndexError.
With the default data this happens deterministically while writing item ~65464, which is why a few tens of thousands of items are needed to trigger it. The batch-list path avoids the bug because it emits a single write_unsigned_varint(len) (which does check capacity) instead of a per-item unchecked marker.
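The off-by-one described above can be sketched with a toy model. This is a simplified stand-in, not the yardl runtime: `ToyOutputStream`, `first_failing_item`, and the `checked_marker` flag are hypothetical names introduced here, and the buffer is shrunk to 16 bytes so the failure shows up after a handful of items. Each item is modeled as one marker byte plus four capacity-checked `has_value = 0` bytes, as in the explanation above.

```python
import io


class ToyOutputStream:
    """Hypothetical, simplified buffered stream; not the yardl class."""

    def __init__(self, sink, buffer_size):
        self._sink = sink
        self._buffer = bytearray(buffer_size)
        self._offset = 0

    def flush(self):
        self._sink.write(self._buffer[: self._offset])
        self._offset = 0

    def ensure_capacity(self, size):
        # Same form of check as described above: no flush happens when the
        # remaining space equals the write size exactly.
        if (len(self._buffer) - self._offset) < size:
            self.flush()

    def write_byte(self, value):
        self.ensure_capacity(1)
        self.write_byte_no_check(value)

    def write_byte_no_check(self, value):
        # Raises IndexError when _offset == len(_buffer).
        self._buffer[self._offset] = value
        self._offset += 1


def first_failing_item(buffer_size, n_items, checked_marker):
    """Return the 0-based index of the item whose marker fails, or None."""
    stream = ToyOutputStream(io.BytesIO(), buffer_size)
    for index in range(n_items):
        try:
            if checked_marker:
                stream.write_byte(1)  # marker with a capacity check
            else:
                stream.write_byte_no_check(1)  # marker as in the per-item path
        except IndexError:
            return index
        for _ in range(4):  # four has_value = 0 bytes, each capacity-checked
            stream.write_byte(0)
    stream.flush()
    return None


print(first_failing_item(16, 100, checked_marker=False))  # prints 16
print(first_failing_item(16, 100, checked_marker=True))   # prints None
```

In this model the unchecked marker fails as soon as an item's last byte fills the buffer exactly, while a capacity-checked marker never does. The toy ignores anything the real writer emits before the stream items, so its failing index is not expected to match the ~65464 reported above.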
Suggested fix (for maintainers)
Give the per-item marker the same capacity guarantee the batch path effectively has. For example, call ensure_capacity(1) (or use a capacity-checked byte write) before the marker in StreamSerializer.write, or have write_byte_no_check flush when the buffer is full. The generated _binary.py comes from tooling/internal/python/static_files/_binary.py in the yardl repo.