🔧 Fix: Improve Unicode Handling in String Diff Functions #962
📋 Summary
This PR fixes Unicode handling issues in the string diff functions (pfx and sfx) by implementing proper grapheme cluster segmentation using the modern Intl.Segmenter API.
🐛 Problem
The previous implementation had issues with complex Unicode characters, particularly:
Characters that span more than one UTF-16 code unit or code point (emojis, accented characters built from combining marks)
Zero Width Joiner (ZWJ) sequences in emojis (e.g., 👨‍🍳)
Surrogate pairs and combining characters
Incorrect prefix/suffix calculations leading to malformed diffs
✅ Solution
Replaced character-based logic with Intl.Segmenter for proper grapheme cluster handling
Updated pfx() function to work with grapheme clusters instead of individual characters
Updated sfx() function to work with grapheme clusters instead of individual characters
Added comprehensive test case for chef emoji with ZWJ sequences
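A minimal sketch of what grapheme-aware prefix and suffix helpers might look like. The signatures below are assumptions for illustration only (this description does not show what `pfx()` and `sfx()` take or return): here both accept two strings and return the length, in UTF-16 code units, of the common prefix or suffix that ends on a grapheme boundary. The `graphemes` helper is hypothetical as well.

```javascript
// Hypothetical sketch; the actual pfx()/sfx() in this PR may differ.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Split a string into an array of grapheme clusters.
function graphemes(str) {
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// Length (in UTF-16 code units) of the common prefix, counted in whole graphemes.
function pfx(a, b) {
  const ga = graphemes(a);
  const gb = graphemes(b);
  const max = Math.min(ga.length, gb.length);
  let len = 0;
  for (let i = 0; i < max && ga[i] === gb[i]; i++) {
    len += ga[i].length;
  }
  return len;
}

// Length (in UTF-16 code units) of the common suffix, counted in whole graphemes.
function sfx(a, b) {
  const ga = graphemes(a);
  const gb = graphemes(b);
  const max = Math.min(ga.length, gb.length);
  let len = 0;
  for (let i = 1; i <= max && ga[ga.length - i] === gb[gb.length - i]; i++) {
    len += ga[ga.length - i].length;
  }
  return len;
}

// Escapes are used so the ZWJ (U+200D) cannot be lost in copy/paste.
const chef = '\u{1F468}\u200D\u{1F373}';     // man + ZWJ + cooking
const mechanic = '\u{1F468}\u200D\u{1F527}'; // man + ZWJ + wrench
const womanChef = '\u{1F469}\u200D\u{1F373}'; // woman + ZWJ + cooking

console.log(pfx('a' + chef + 'b', 'a' + mechanic + 'c')); // 1 (only "a" is shared)
console.log(sfx('x' + chef, 'y' + womanChef));            // 0 (no whole grapheme is shared)
```

A code-unit comparison would report a shared prefix of 4 and a shared suffix of 3 for these inputs, cutting through the middle of each emoji, which is the kind of malformed diff described above.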
🧪 Testing
Added test case for complex emoji: 👨‍🍳 (chef emoji with ZWJ)
All existing tests continue to pass
Verified correct diff behavior with multi-byte Unicode characters
🔍 Technical Details
The fix uses Intl.Segmenter with granularity: 'grapheme' to segment strings into grapheme clusters, so that a complex Unicode character is treated as a single unit rather than being split between its UTF-16 code units or code points.
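As an illustration of why segmentation matters, the snippet below compares three ways of counting the "characters" in the chef emoji. It is a standalone sketch, not code from this PR; `countGraphemes` is a hypothetical helper name.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Count grapheme clusters by iterating the segments.
const countGraphemes = (s) => [...graphemeSegmenter.segment(s)].length;

// Chef emoji written with escapes: man (U+1F468) + ZWJ (U+200D) + cooking (U+1F373).
const chefEmoji = '\u{1F468}\u200D\u{1F373}';

console.log(chefEmoji.length);            // 5 UTF-16 code units
console.log([...chefEmoji].length);       // 3 code points
console.log(countGraphemes(chefEmoji));   // 1 grapheme cluster

// A combining sequence: "e" followed by U+0301 (combining acute accent).
console.log('e\u0301'.length);            // 2 UTF-16 code units
console.log(countGraphemes('e\u0301'));   // 1 grapheme cluster
```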
🎯 Impact
✅ Correct diff behavior for strings containing multi-code-point grapheme clusters
✅ Better handling of emoji sequences