fix(oxml): keep generated ids within signed int32 range#36
Merged
Conversation
_paragraph_id, _object_id, and _memo_id used uuid4().int & 0xFFFFFFFF, which yields values >= 2^31 about 50% of the time. Downstream consumers that parse the id attribute as a signed 32-bit integer treat those values as negative, which has caused interop failures. Mask one bit lower so the result stays in [0, 2^31) while keeping the practical collision rate unchanged. Adds a regression test that asserts the new range across all three generators.
This was referenced Apr 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #34.
Root cause
_paragraph_id,_object_id, and_memo_idinsrc/hwpx/oxml/document.pymaskeduuid4().intwith0xFFFFFFFF. Sinceuuid4().intis uniformly distributed over 122 bits of entropy, the masked output is uniform over the full unsigned 32-bit range; about 50% of generated ids have bit 31 set (i.e.>= 2^31).Downstream consumers that parse the
idattribute as a signed 32-bit integer interpret those values as negative, which has caused interop issues.Fix
Mask one bit lower (
& 0x7FFFFFFF) so the result fits in[0, 2^31)and survives any consumer that uses signedint32. Three call sites updated:def _paragraph_id() -> str: """Generate an identifier for a new paragraph element.""" - return str(uuid4().int & 0xFFFFFFFF) + return str(uuid4().int & 0x7FFFFFFF) def _object_id() -> str: """Generate an identifier suitable for table and shape objects.""" - return str(uuid4().int & 0xFFFFFFFF) + return str(uuid4().int & 0x7FFFFFFF) def _memo_id() -> str: """Generate a lightweight identifier for memo elements.""" - return str(uuid4().int & 0xFFFFFFFF) + return str(uuid4().int & 0x7FFFFFFF)Collision risk is essentially unchanged — birthday probability for
N=10_000ids drops from ~2.3e-5(32-bit) to ~4.7e-5(31-bit), which is still negligible for any realistic document size.Regression test
tests/test_id_generator_range.py(new):test_id_generators_stay_within_signed_int32— samples 200 values from each of the three generators and asserts0 <= v < 2**31.test_id_generators_use_full_31_bit_range— samples 4,000 values from each generator and asserts at least one is>= 2^30, guarding against accidental over-restriction (e.g. a future change that masks too many bits).Without the fix,
test_id_generators_stay_within_signed_int32fails immediately:With the fix:
Wider regression check
(2 skips are pre-existing, due to absent sample HWPX fixtures in this checkout.)
Notes
_allocate_*_idhelpers (_allocate_char_property_id,_allocate_border_fill_id,_allocate_bin_item_id) propagatemax(numeric_ids) + 1. They will start producing in-range values once the input document also has in-range ids — that is a follow-up topic and not part of this PR.src/hwpx/data/Skeleton.hwpxships with<hp:p id="3121190098">(>= 2^31). That is filed separately asSkeleton.hwpxships with<hp:p id="3121190098">(out of signed int32 range) #35 and patched separately so the two fixes can land independently.