fix: strip whitespace from <t> elements without xml:space="preserve"#668
Open
dayongxie wants to merge 3 commits into
Open
fix: strip whitespace from <t> elements without xml:space="preserve"#668dayongxie wants to merge 3 commits into
dayongxie wants to merge 3 commits into
Conversation
Per the XML spec, whitespace in element content should be normalized unless xml:space="preserve" is set. calamine's read_string_with_bufs previously preserved all whitespace unconditionally, causing cells with leading/trailing spaces, tabs, or newlines to return those characters. This fix checks for xml:space="preserve" on <t> elements in shared strings and inline strings. When absent, leading/trailing whitespace (\t, \n, space, \r) is trimmed — but only if the string contains non-whitespace content, to preserve cells whose entire value is intentionally whitespace. Covers both shared strings (read_shared_strings) and inline strings (cells_reader.rs), since both call read_string_with_bufs.
Collaborator
|
Thanks. Could you add some tests as part of the PR. |
… elements without xml:space=preserve Use quick-xml's config-level trim_text(true) toggle inside <t> element parsing. This trims ASCII whitespace at raw byte level before character reference expansion, so 
 (which is &, #, x, A, ; bytes) is preserved while physical whitespace bytes are trimmed — matching ECMA-376 spec. Also trim all-whitespace strings to empty string, consistent with Excel behavior.
2070ffd to
888b972
Compare
Author
|
I’ve added unit tests and also fixed an issue in the previous implementation. |
Collaborator
|
Some questions about what is being fixed:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Per the XML spec, whitespace in element content should be normalized unless xml:space="preserve" is set. calamine's read_string_with_bufs previously preserved all whitespace unconditionally, causing cells with leading/trailing spaces, tabs, or newlines to return those characters.
This fix checks for xml:space="preserve" on elements in shared strings and inline strings. When absent, leading/trailing whitespace (\t, \n, space, \r) is trimmed — but only if the string contains non-whitespace content, to preserve cells whose entire value is intentionally whitespace.
Covers both shared strings (read_shared_strings) and inline strings (cells_reader.rs), since both call read_string_with_bufs.