ByteCount fails to count surrogate characters properly. #2088

@kikaragyozov

Describe the bug
The symbol 😗 is composed of two UTF-16 characters: \ud83d and \ude17. Calling Encoding.UTF8.GetByteCount(new char[] { <one of the two characters> }) with only one of them, instead of an array containing both, yields an incorrect byte count.
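
A minimal sketch of the mismatch described above. The class and method names are made up for illustration. It assumes the default `Encoding.UTF8` instance, which substitutes U+FFFD (3 bytes in UTF-8) for a lone surrogate rather than throwing.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the library discussed in this issue.
public static class SurrogateByteCountDemo
{
    // Both halves of the surrogate pair in one array: encoded as one code point.
    public static int CountTogether() =>
        Encoding.UTF8.GetByteCount(new[] { '\ud83d', '\ude17' });

    // Each half counted on its own: each lone surrogate is treated as invalid
    // and replaced with U+FFFD before counting.
    public static int CountSeparately() =>
        Encoding.UTF8.GetByteCount(new[] { '\ud83d' }) +
        Encoding.UTF8.GetByteCount(new[] { '\ude17' });

    public static void Main()
    {
        Console.WriteLine(CountTogether());   // 4
        Console.WriteLine(CountSeparately()); // 6
    }
}
```

Counting one character at a time therefore overshoots by two bytes for every surrogate pair, which is enough to skip past the next character in the stream.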

To Reproduce
Create a single-line CSV file or in-memory StringStream that contains an emoticon. If you rely on ByteCount to count the bytes correctly and move the stream's position forward by that many bytes, there's a chance you won't be able to process the next line, for example if an opening quote was skipped due to the incorrect ByteCount property.

Expected behavior
ByteCount should count such symbols correctly.
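
One way the expected count could be obtained while still walking the text character by character is to keep a high surrogate together with the low surrogate that follows it. The sketch below is only an illustration under that assumption. `CountUtf8Bytes` is a hypothetical helper, not the library's actual implementation or a proposed patch.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class PairAwareByteCounter
{
    public static int CountUtf8Bytes(string text)
    {
        char[] chars = text.ToCharArray();
        int total = 0;

        for (int i = 0; i < chars.Length; i++)
        {
            // Treat a high surrogate followed by a low surrogate as one unit.
            bool isPair = char.IsHighSurrogate(chars[i])
                          && i + 1 < chars.Length
                          && char.IsLowSurrogate(chars[i + 1]);
            int length = isPair ? 2 : 1;

            total += Encoding.UTF8.GetByteCount(chars, i, length);

            // Skip the low surrogate that was already counted.
            i += length - 1;
        }

        return total;
    }

    public static void Main()
    {
        // 'a' (1 byte) + 😗 (4 bytes) + '"' (1 byte)
        Console.WriteLine(CountUtf8Bytes("a\ud83d\ude17\"")); // 6
    }
}
```

A pair that is split across two read buffers would still need extra handling, which this sketch does not attempt.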

Additional context
I'm not deeply familiar with these emoticons, so I find it really weird that the incorrect byte count happens. I'm aware that if we instantiate a UTF8Encoding with throwOnInvalidBytes set to true, passing either of the two characters by itself will throw an exception, but passing them together works.
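
The strict-encoding behaviour mentioned above can be checked with a short sketch like the following. The class and method names are invented for illustration. It assumes the exception raised for a lone surrogate is `EncoderFallbackException`.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class StrictUtf8Demo
{
    // UTF-8 encoding that throws on invalid input instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true if counting the bytes of the given characters throws.
    public static bool ThrowsOn(char[] chars)
    {
        try
        {
            Strict.GetByteCount(chars);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ThrowsOn(new[] { '\ud83d' }));           // lone high surrogate
        Console.WriteLine(ThrowsOn(new[] { '\ude17' }));           // lone low surrogate
        Console.WriteLine(ThrowsOn(new[] { '\ud83d', '\ude17' })); // complete pair
    }
}
```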

EDIT: It appears those are called surrogate characters. https://www.ibm.com/docs/en/i/7.3?topic=renamed-surrogate-characters
