Description
Describe the bug
The symbol 😗 (U+1F617) is composed of two UTF-16 chars: \ud83d and \ude17. Calling Encoding.UTF8.GetByteCount(new char[] { <one of the two characters> }) with only one of them, instead of an array containing both, yields an incorrect byte count.
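A minimal sketch of the counts involved, assuming the default Encoding.UTF8 instance, which substitutes U+FFFD (3 bytes in UTF-8) for an unpaired surrogate rather than throwing. The class and method names are hypothetical and only there for illustration.

```csharp
using System;
using System.Text;

public static class SurrogateByteCountDemo
{
    // High surrogate alone: unpaired, so it is counted as the replacement character U+FFFD.
    public static int HighAlone() => Encoding.UTF8.GetByteCount(new[] { '\ud83d' });

    // Low surrogate alone: also unpaired, same replacement behaviour.
    public static int LowAlone() => Encoding.UTF8.GetByteCount(new[] { '\ude17' });

    // Both halves together: one code point (U+1F617), encoded as 4 UTF-8 bytes.
    public static int Pair() => Encoding.UTF8.GetByteCount(new[] { '\ud83d', '\ude17' });

    public static void Main()
    {
        Console.WriteLine($"high alone: {HighAlone()}"); // 3
        Console.WriteLine($"low alone:  {LowAlone()}");  // 3
        Console.WriteLine($"pair:       {Pair()}");      // 4
    }
}
```

Counting the two chars separately therefore gives 3 + 3 = 6 bytes, while the symbol actually occupies 4.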
To Reproduce
Create a single-line CSV file or in-memory StringStream that has an emoticon inside it. If you rely on ByteCount to count the bytes and move the stream's position forward by that many bytes, there's a chance you won't be able to process the next line, for example because an opening quote was skipped due to the incorrect ByteCount value.
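The drift can be sketched without any CSV library. The example below assumes the byte count is accumulated one char at a time (an assumption about how the counting happens, not something confirmed here) and compares that with counting the whole line at once. The class name, method names and sample line are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

public static class CsvOffsetDriftDemo
{
    // Sums the byte count of every char in isolation, so each surrogate half
    // is treated as an unpaired surrogate.
    public static int PerCharByteCount(string text) =>
        text.Sum(c => Encoding.UTF8.GetByteCount(new[] { c }));

    // Counts the whole string in one call, so surrogate pairs stay intact.
    public static int WholeStringByteCount(string text) =>
        Encoding.UTF8.GetByteCount(text);

    public static void Main()
    {
        // The line "😗","b" written with explicit escapes for the emoticon.
        string line = "\"\ud83d\ude17\",\"b\"";

        int perChar = PerCharByteCount(line);
        int whole = WholeStringByteCount(line);

        Console.WriteLine($"per-char sum: {perChar}");        // 12
        Console.WriteLine($"whole string: {whole}");          // 10
        Console.WriteLine($"drift:        {perChar - whole}"); // 2
    }
}
```

Seeking forward by the per-char sum would land 2 bytes past the real end of the line, which is enough to skip the opening quote of the next record.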
Expected behavior
ByteCount should correctly count the bytes of such symbols.
Additional context
I'm not deeply familiar with how these emoticons are encoded, so I'm not sure why the incorrect byte count happens. I'm aware that if we instantiate a UTF8Encoding with throwOnInvalidBytes set to true, passing either of the two characters by itself will throw an exception, but passing them together works.
EDIT: It appears those are called surrogate characters. https://www.ibm.com/docs/en/i/7.3?topic=renamed-surrogate-characters
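For reference, a small sketch of the throwOnInvalidBytes behaviour mentioned above. It assumes the exception raised is EncoderFallbackException, which is what the strict encoder's exception fallback throws for an unpaired surrogate. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class StrictUtf8Demo
{
    // UTF-8 encoding without a BOM that throws on invalid input instead of
    // substituting the replacement character.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true if counting the bytes of the given chars throws.
    public static bool Throws(char[] chars)
    {
        try
        {
            Strict.GetByteCount(chars);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Throws(new[] { '\ud83d' }));           // True
        Console.WriteLine(Throws(new[] { '\ude17' }));           // True
        Console.WriteLine(Throws(new[] { '\ud83d', '\ude17' })); // False
    }
}
```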