@olegmingaleev olegmingaleev commented Oct 23, 2025

📋 Summary

This PR fixes Unicode handling issues in the string diff functions (pfx and sfx) by implementing proper grapheme cluster segmentation using the modern Intl.Segmenter API.

🐛 Problem

The previous implementation had issues with complex Unicode characters, particularly:

- Multi-byte Unicode characters (emojis, accented characters)
- Zero Width Joiner (ZWJ) sequences in emojis (e.g., 👨‍🍳)
- Surrogate pairs and combining characters
- Incorrect prefix/suffix calculations leading to malformed diffs
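
To make the failure mode concrete, here is a minimal sketch of how a code-unit-based prefix comparison can cut an emoji in half. `naivePrefixLength` is a hypothetical stand-in written for this illustration; it is an assumption about the general shape of the problem, not the project's previous `pfx()` code.

```typescript
// Hypothetical illustration only: NOT the project's previous pfx() implementation.
const chef = '\u{1F468}\u200D\u{1F373}'; // 👨‍🍳 = man + ZWJ + cooking
const man = '\u{1F468}'; // 👨
const woman = '\u{1F469}'; // 👩

// One user-perceived character, but several units underneath.
console.log(chef.length); // 5 (UTF-16 code units)
console.log(Array.from(chef).length); // 3 (code points)

/** Common prefix length counted in UTF-16 code units, ignoring grapheme boundaries. */
function naivePrefixLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  let i = 0;
  while (i < max && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// 👨‍🍳 vs 👨: the "shared" prefix is the bare man emoji, so the remainder
// of the chef string would start with a dangling ZWJ.
console.log(naivePrefixLength(chef, man)); // 2

// 👨 vs 👩: both start with the same high surrogate (0xD83D), so the
// prefix ends in the middle of a surrogate pair.
console.log(naivePrefixLength(man, woman)); // 1
```

In both cases a diff built from such a prefix would contain fragments that are not valid standalone characters.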

✅ Solution

- Replaced character-based logic with Intl.Segmenter for proper grapheme cluster handling
- Updated pfx() function to work with grapheme clusters instead of individual characters
- Updated sfx() function to work with grapheme clusters instead of individual characters
- Added comprehensive test case for chef emoji with ZWJ sequences

🧪 Testing

- Added test case for complex emoji: 👨‍🍳 (chef emoji with ZWJ)
- All existing tests continue to pass
- Verified correct diff behavior with multi-byte Unicode characters

🔍 Technical Details

The fix uses Intl.Segmenter with granularity: 'grapheme' to segment strings into grapheme clusters, so that a complex Unicode character is treated as a single unit rather than being split into separate code units or code points.
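
As a rough sketch of the idea (not the PR's actual `pfx()`/`sfx()` code), grapheme-aware prefix and suffix lengths could look like the following. The helper names `graphemes`, `commonPrefixLength`, and `commonSuffixLength` are hypothetical, and returning the length in UTF-16 code units is an assumption made for this example. It needs a runtime with `Intl.Segmenter` support and, for type checking, the ES2022 `Intl` typings.

```typescript
// Hypothetical helpers for illustration; names and return convention are assumptions.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

/** Split a string into grapheme clusters (user-perceived characters). */
function graphemes(text: string): string[] {
  return Array.from(graphemeSegmenter.segment(text), (part) => part.segment);
}

/** Length in UTF-16 code units of the common prefix, counted in whole graphemes. */
function commonPrefixLength(a: string, b: string): number {
  const ga = graphemes(a);
  const gb = graphemes(b);
  const max = Math.min(ga.length, gb.length);
  let length = 0;
  for (let i = 0; i < max && ga[i] === gb[i]; i++) length += ga[i].length;
  return length;
}

/** Length in UTF-16 code units of the common suffix, counted in whole graphemes. */
function commonSuffixLength(a: string, b: string): number {
  const ga = graphemes(a);
  const gb = graphemes(b);
  const max = Math.min(ga.length, gb.length);
  let length = 0;
  for (let i = 1; i <= max && ga[ga.length - i] === gb[gb.length - i]; i++) {
    length += ga[ga.length - i].length;
  }
  return length;
}

const chefEmoji = '\u{1F468}\u200D\u{1F373}'; // 👨‍🍳
const manEmoji = '\u{1F468}'; // 👨
const cookingEmoji = '\u{1F373}'; // 🍳

console.log(graphemes(chefEmoji).length); // 1
// Only "a" is shared; 👨‍🍳 and 👨 are different graphemes.
console.log(commonPrefixLength('a' + chefEmoji, 'a' + manEmoji)); // 1
// Only "!" is shared; the trailing 🍳 inside 👨‍🍳 is not split off.
console.log(commonSuffixLength(chefEmoji + '!', cookingEmoji + '!')); // 1
```

Segmenting both inputs up front keeps the sketch short; a production version would more likely walk the segments lazily to avoid allocating arrays for long strings.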

🎯 Impact

✅ Correct diff behavior with all Unicode characters
✅ Better handling of emoji sequences

- Use Intl.Segmenter for proper grapheme cluster handling
- Fix prefix/suffix calculation for complex emoji sequences
- Add test case for chef emoji with ZWJ sequences
- Ensure correct diff behavior with multi-byte Unicode characters
@streamich
Owner

Fixed here #964

@streamich streamich closed this Oct 28, 2025