ByteCount fails to count surrogate characters properly.

**Describe the bug**
The symbol 😗 is composed of two characters: `\ud83d` and `\ude17`. Calling `Encoding.UTF8.GetByteCount(new char[] { <one of the two characters> })` on each character separately, instead of on an array containing both, yields an incorrect byte count.
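A minimal sketch of the difference, using only `Encoding.UTF8.GetByteCount`. The class and method names (`SurrogateByteCountDemo`, `CountSeparately`, `CountTogether`) are made up for illustration and are not part of CsvHelper.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not taken from CsvHelper.
public static class SurrogateByteCountDemo
{
    // Counts each char on its own. A lone surrogate is not valid text,
    // so the default Encoding.UTF8 substitutes U+FFFD, which is 3 bytes in UTF-8.
    public static int CountSeparately(char high, char low) =>
        Encoding.UTF8.GetByteCount(new[] { high })
        + Encoding.UTF8.GetByteCount(new[] { low });

    // Counts both chars in one call, so they are encoded as a single code point.
    public static int CountTogether(char high, char low) =>
        Encoding.UTF8.GetByteCount(new[] { high, low });

    public static void Main()
    {
        char high = '\ud83d'; // high (leading) surrogate
        char low = '\ude17';  // low (trailing) surrogate

        Console.WriteLine($"separate: {CountSeparately(high, low)}");
        Console.WriteLine($"together: {CountTogether(high, low)}");
    }
}
```

With the default `Encoding.UTF8`, this should report 6 bytes when the characters are counted separately and 4 bytes when they are counted together. The 4-byte figure matches what is actually written to a UTF-8 stream.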

**To Reproduce**
Create a single-line CSV file or an in-memory `StringStream` that has an emoticon in it. If you rely on `ByteCount` to count the bytes correctly and move the stream's position forward by that many bytes, there's a chance you won't be able to process the next line, for example because an opening quote was skipped due to the incorrect `ByteCount` value.

**Expected behavior**
`ByteCount` should count such symbols correctly.

**Additional context**
I'm not really in the deep end of things with these emoticons, so I find it really weird that the incorrect byte count happens. I'm aware that if we instantiate a `UTF8Encoding` with `throwOnInvalidBytes` set to `true`, passing either of the two characters by itself will throw an exception, but passing them together works.
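The strict-encoding behaviour can be sketched roughly as follows. `StrictUtf8Demo` and `Throws` are hypothetical names used only for this illustration; the `UTF8Encoding(bool, bool)` constructor and `EncoderFallbackException` are standard .NET types.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not taken from CsvHelper.
public static class StrictUtf8Demo
{
    // Returns true if a strict UTF8Encoding refuses to count the given chars.
    public static bool Throws(char[] chars)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strict.GetByteCount(chars);
            return false;
        }
        catch (EncoderFallbackException)
        {
            // Raised for chars that cannot be encoded, such as a lone surrogate.
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Throws(new[] { '\ud83d' }));           // high surrogate alone
        Console.WriteLine(Throws(new[] { '\ude17' }));           // low surrogate alone
        Console.WriteLine(Throws(new[] { '\ud83d', '\ude17' })); // complete pair
    }
}
```

The two single-character calls are expected to throw, while the complete pair is counted normally. This suggests the two characters only carry meaning as a unit, and counting them one at a time falls back to a replacement character in the non-strict case.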

**EDIT:** It appears those are called surrogate characters. https://www.ibm.com/docs/en/i/7.3?topic=renamed-surrogate-characters

ByteCount fails to count surrogate characters properly. #2088

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

ByteCount fails to count surrogate characters properly. #2088

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions