Fix string offset parameters to use character positions instead of byte offsets#31

Draft
Claude wants to merge 2 commits into main from claude/fix-byte-offset-issue

@Claude Claude AI commented Apr 1, 2026

The Find(string, int) and Captures(string, int) methods treated the offset parameter as a byte position in the UTF-8 encoding of the haystack. Whenever a multi-byte character (emoji, CJK characters, etc.) preceded the offset, the search started earlier than the caller intended, possibly in the middle of a character.

Changes

  • Modified Find(string haystack, int offset) and Captures(string haystack, int offset): the character offset is converted to a byte offset with Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset)) before it is passed to the underlying byte-based methods
  • Updated XML documentation: clarified that string-based methods accept character offsets, while byte-based methods accept byte offsets
  • Added test coverage: 6 new tests covering multi-byte UTF-8 scenarios (4-byte emoji, 3-byte CJK characters)
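
The conversion described in the first bullet can be sketched in isolation. This is a minimal illustration rather than the code in this PR: the class OffsetSketch and the method CharToByteOffset are hypothetical names, and the range check is an assumption about how out-of-range offsets might be handled. Only Encoding.UTF8.GetByteCount and AsSpan come from the change description. Note that a .NET string index counts UTF-16 chars, so an emoji outside the Basic Multilingual Plane occupies two positions.

```csharp
using System;
using System.Text;

static class OffsetSketch
{
    // Hypothetical helper (not part of IronRe2): maps a string index,
    // counted in UTF-16 chars, to the equivalent offset in the
    // UTF-8 encoding of the same string.
    public static int CharToByteOffset(string haystack, int charOffset)
    {
        // Assumed behaviour: reject offsets outside the string.
        if (charOffset < 0 || charOffset > haystack.Length)
            throw new ArgumentOutOfRangeException(nameof(charOffset));

        // Count the UTF-8 bytes needed for everything before the offset.
        return Encoding.UTF8.GetByteCount(haystack.AsSpan(0, charOffset));
    }

    static void Main()
    {
        string haystack = "Hello 🌍 world";
        Console.WriteLine(CharToByteOffset(haystack, 6)); // 6: start of the emoji
        Console.WriteLine(CharToByteOffset(haystack, 8)); // 10: space after the emoji
        Console.WriteLine(CharToByteOffset(haystack, 9)); // 11: 'w'
    }
}
```

The sketch does not address an offset that falls between the two halves of a surrogate pair; a caller would need to decide whether to reject or round such offsets.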

Example

Regex re = new("world");
string haystack = "Hello 🌍 world";  // 🌍 is 4 bytes in UTF-8 and 2 chars (a surrogate pair) in a .NET string

// Character (UTF-16 char) positions: H(0) e(1) l(2) l(3) o(4) ' '(5) 🌍(6-7) ' '(8) w(9)...
// Byte positions (UTF-8):            H(0) e(1) l(2) l(3) o(4) ' '(5) 🌍(6-9) ' '(10) w(11)...

// Offset 8 is now interpreted as a character position: the space before 'w' (byte 10).
// Previously it was taken as a byte offset and landed mid-emoji.
var match = re.Find(haystack, 8);
Assert.Equal("world", match.ExtractedText);

- Modified Find(string, int) to convert character offset to byte offset using Encoding.UTF8.GetByteCount
- Modified Captures(string, int) to convert character offset to byte offset
- Updated XML documentation to clarify that string methods use character offsets, while byte methods use byte offsets
- Added comprehensive tests for multi-byte UTF-8 characters (emojis and Asian characters)
- All 71 tests passing (6 new tests added)

Agent-Logs-Url: https://github.com/crispthinking/IronRe2/sessions/2d26941d-972d-4c97-a3ec-ad34e99c1ff3

Co-authored-by: corpo-iwillspeak <265613520+corpo-iwillspeak@users.noreply.github.com>
@Claude Claude AI changed the title from "[WIP] Fix offset interpretation in string haystack" to "Fix string offset parameters to use character positions instead of byte offsets" Apr 1, 2026
@Claude Claude AI requested a review from corpo-iwillspeak April 1, 2026 10:08

Development

Successfully merging this pull request may close these issues.

Its not clear that offsets in string haystack refer to byte offsets