Lacking API documentation OR bug
When specifying an offset alongside a string, I expected the offset to refer to character positions—but it’s actually interpreted as byte positions.
Expected Behavior
- Offsets supplied for strings should be interpreted as character offsets, i.e. indices into the string's chars.
- The regex engine should correctly map character offsets to the starting boundaries of UTF-8 code points.
Impact
- No issues when all characters are ASCII (single-byte UTF-8).
- For code points encoded as multiple bytes, the byte-based offset may unintentionally fall in the middle of a code point, causing the match to fail.
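A minimal sketch of the mismatch, using only the standard `System.Text.Encoding` API and no IronRe2 calls. The sample string and the `IsContinuationByte` helper are illustrative assumptions, not part of the library:

```csharp
using System;
using System.Text;

// "\u00e9" (e with acute accent) is one char in the string but two bytes
// (0xC3 0xA9) in UTF-8.
string haystack = "h\u00e9llo";
byte[] hayBytes = Encoding.UTF8.GetBytes(haystack);

// Hypothetical helper: UTF-8 continuation bytes have the bit pattern 10xxxxxx.
static bool IsContinuationByte(byte b) => (b & 0xC0) == 0x80;

Console.WriteLine(haystack.Length);  // 5 (chars)
Console.WriteLine(hayBytes.Length);  // 6 (bytes)

// Char offset 2 is the first 'l', but byte offset 2 is the second byte of the
// accented character, i.e. the middle of a code point.
Console.WriteLine(IsContinuationByte(hayBytes[2]));  // True
```

For ASCII-only input the two lengths are equal, which is why the problem only shows up with multi-byte characters.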
Proposed Solution
Mapping character offsets to byte offsets
public Match Find(string haystack, int offset)
{
    var hayBytes = Encoding.UTF8.GetBytes(haystack);
    // Convert the character offset into a byte offset in the UTF-8 encoding.
    var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset));
    return Find(hayBytes, byteOffset);
}
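One edge case the mapping above does not cover is an offset that lands between the two halves of a UTF-16 surrogate pair, which has no matching code point boundary in UTF-8. A possible guard could look like the sketch below. `CharToByteOffset` is a hypothetical helper name, and throwing on invalid offsets is just one option I am assuming here:

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of IronRe2): converts a char offset into the
// matching UTF-8 byte offset, rejecting offsets that cannot be mapped.
static int CharToByteOffset(string haystack, int charOffset)
{
    if (charOffset < 0 || charOffset > haystack.Length)
        throw new ArgumentOutOfRangeException(nameof(charOffset));

    // An offset between the two halves of a surrogate pair does not correspond
    // to a code point boundary in the UTF-8 encoding.
    if (charOffset > 0 && charOffset < haystack.Length
        && char.IsHighSurrogate(haystack[charOffset - 1])
        && char.IsLowSurrogate(haystack[charOffset]))
    {
        throw new ArgumentException("Offset splits a surrogate pair.", nameof(charOffset));
    }

    return Encoding.UTF8.GetByteCount(haystack.AsSpan(0, charOffset));
}

// One 2-byte character before the offset: char offset 2 maps to byte offset 3.
Console.WriteLine(CharToByteOffset("h\u00e9llo", 2));    // 3

// U+1F600 takes 2 chars in the string and 4 bytes in UTF-8.
Console.WriteLine(CharToByteOffset("a\U0001F600b", 3));  // 5
```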
Improving documentation
Explicitly state in the documentation that the offset refers to bytes:
/// <param name="haystack">The string to search for the pattern</param>
/// <param name="offset">The byte offset to start searching from **in the UTF-8 encoded haystack**</param>
/// <returns>The captures data</returns>
public Match Find(string haystack, int offset)
Removing the unintuitive API
Since the API that accepts a string is just a wrapper around the byte-based API, removing the string overload would cost callers only one extra line: the string-to-byte conversion, Encoding.UTF8.GetBytes(haystack).
Having to encode the string to a byte array themselves would make it clear to callers what the offsets refer to.
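A rough sketch of what calling code might look like under this option. `FindBytes` is a hypothetical stand-in for a byte-based search, not IronRe2's actual `Find`. It only illustrates that once the caller does the encoding, every offset is visibly a byte offset:

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a byte-based search API (not IronRe2's Find).
// Returns the byte index of the first match at or after startByte, or -1.
static int FindBytes(byte[] source, byte[] pattern, int startByte)
{
    ReadOnlySpan<byte> window = source.AsSpan(startByte);
    ReadOnlySpan<byte> patternSpan = pattern;
    int index = window.IndexOf(patternSpan);
    return index < 0 ? -1 : index + startByte;
}

string text = "caf\u00e9 caf\u00e9";

// The caller performs the encoding explicitly.
byte[] hayBytes = Encoding.UTF8.GetBytes(text);
byte[] needle = Encoding.UTF8.GetBytes("caf");

// Skip the first word: it is 4 chars long but 5 bytes in UTF-8.
int byteOffset = Encoding.UTF8.GetByteCount("caf\u00e9");

Console.WriteLine(FindBytes(hayBytes, needle, byteOffset));  // 6
```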
Referring to API at: IronRe2/src/IronRe2/Regex.cs, line 288 in b0281fd