Skip to content

fix: strip whitespace from <t> elements without xml:space="preserve"#668

Open
dayongxie wants to merge 3 commits into
tafia:masterfrom
dayongxie:fix/strip-whitespace-from-text-nodes
Open

fix: strip whitespace from <t> elements without xml:space="preserve"#668
dayongxie wants to merge 3 commits into
tafia:masterfrom
dayongxie:fix/strip-whitespace-from-text-nodes

Conversation

@dayongxie

Copy link
Copy Markdown

Per the XML spec, whitespace in element content should be normalized unless xml:space="preserve" is set. calamine's read_string_with_bufs previously preserved all whitespace unconditionally, causing cells with leading/trailing spaces, tabs, or newlines to return those characters.

This fix checks for xml:space="preserve" on elements in shared strings and inline strings. When absent, leading/trailing whitespace (\t, \n, space, \r) is trimmed — but only if the string contains non-whitespace content, to preserve cells whose entire value is intentionally whitespace.

Covers both shared strings (read_shared_strings) and inline strings (cells_reader.rs), since both call read_string_with_bufs.

Per the XML spec, whitespace in element content should be normalized
unless xml:space="preserve" is set. calamine's read_string_with_bufs
previously preserved all whitespace unconditionally, causing cells with
leading/trailing spaces, tabs, or newlines to return those characters.

This fix checks for xml:space="preserve" on <t> elements in shared
strings and inline strings. When absent, leading/trailing whitespace
(\t, \n, space, \r) is trimmed — but only if the string contains
non-whitespace content, to preserve cells whose entire value is
intentionally whitespace.

Covers both shared strings (read_shared_strings) and inline strings
(cells_reader.rs), since both call read_string_with_bufs.
@jmcnamara

Copy link
Copy Markdown
Collaborator

Thanks. Could you add some tests as part of the PR.

@jmcnamara jmcnamara self-assigned this Jun 26, 2026
@jmcnamara jmcnamara added next_release awaiting user changes Awaiting changes to a PR to fix requested changes or CI issues. labels Jun 26, 2026
… elements without xml:space=preserve

Use quick-xml's config-level trim_text(true) toggle inside &lt;t&gt; element
parsing. This trims ASCII whitespace at raw byte level before character
reference expansion, so &#xA; (which is &, #, x, A, ; bytes) is preserved
while physical whitespace bytes are trimmed — matching ECMA-376 spec.

Also trim all-whitespace strings to empty string, consistent with Excel behavior.
@dayongxie dayongxie force-pushed the fix/strip-whitespace-from-text-nodes branch from 2070ffd to 888b972 Compare June 27, 2026 03:13
@dayongxie

Copy link
Copy Markdown
Author

I’ve added unit tests and also fixed an issue in the previous implementation.
I switched to using trim_text for whitespace normalization, which avoids inconsistencies from handling different paths separately.
Now both shared strings and inline strings follow the same logic, and the behavior matches the expected XML semantics.
If you have a moment, could you please take another look at this revision? 🙏

@jmcnamara

Copy link
Copy Markdown
Collaborator

Some questions about what is being fixed:

  1. What was the old behaviour, with an example.
  2. What is the current behaviour, with the same example.
  3. What commit changed the behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting user changes Awaiting changes to a PR to fix requested changes or CI issues. next_release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants