Fix incorrect results when ascii-control characters are used as Quote and Escape #533

y-wei · 2024-02-29T01:27:27Z

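As the title says, without this fix the parser returns incorrect results when the same ASCII control character is used as both the quote and the quote escape: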

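import java.io.Reader;
import java.io.StringReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;

// Four quoted fields; the last two contain doubled (escaped) quote characters.
final Reader csv = new StringReader("\u00127\u0012," +
        "\u0012EmbeddedDouble\u0012," +
        "\u0012field\u0012\u0012 t\u0012\u0012ext\u0012," +
        "\u0012field\u0012\u0012 t\u0012\u0012ext\u0012");
final CsvParserSettings settings = new CsvParserSettings();
// Use the same control character (DC2, U+0012) as quote and as quote escape.
settings.getFormat().setQuote('\u0012');
settings.getFormat().setQuoteEscape('\u0012');
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
final CsvParser csvParser = new CsvParser(settings);
final String[] row = csvParser.parseAll(csv).get(0);
// actual:   ["7", "EmbeddedDouble", "field\u0012\u0012 t\u0012\u0012ext", "field\u0012\u0012 t\u0012\u0012ext"]
// expected: ["7", "EmbeddedDouble", "field\u0012 t\u0012ext", "field\u0012 t\u0012ext"]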

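The correct result is the one asserted in the unit test added by this PR: each doubled quote character inside a quoted field should be unescaped to a single one.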
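To make the expected behavior concrete, here is a minimal, dependency-free sketch of the unescaping rule for a single line. It is not the univocity-parsers implementation and not the code changed in this PR; the class `QuoteEscapeSketch` and the method `parseLine` are hypothetical names used only for illustration, and the sketch assumes the quote and the quote escape are the same character.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteEscapeSketch {

    /**
     * Splits one line into fields. The same character opens and closes a
     * quoted field and, when doubled inside a quoted field, stands for one
     * literal occurrence of itself. (Hypothetical helper, illustration only.)
     */
    static List<String> parseLine(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        field.append(quote); // doubled quote: keep one literal copy
                        i++;                 // skip the second character of the pair
                    } else {
                        inQuotes = false;    // single quote: closes the field
                    }
                } else {
                    field.append(c);
                }
            } else if (c == quote) {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        char q = '\u0012'; // DC2, used as both quote and quote escape
        String line = q + "7" + q + "," + q + "field" + q + q + " t" + q + q + "ext" + q;
        List<String> row = parseLine(line, ',', q);
        // Compare against the expected fields rather than printing control characters.
        System.out.println(row.equals(List.of("7", "field" + q + " t" + q + "ext")));
    }
}
```

The control character gets no special treatment here: it follows the same doubling rule as `"` would in an ordinary CSV file, which is the behavior the expected row in the description relies on.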

### What changes were proposed in this pull request?

This PR proposes to change signature `CsvParser` to `AbstractParser` (its parent class).

### Why are the changes needed?

- It's better to use higher classes if they fit for better extendibility and maintenance.
- Univocity parser became inactive for the last three years, and we're missing bug fixes such as uniVocity/univocity-parsers#533. We should probably leverage their interface, and implement it in Spark for bug fixes and further performance improvement. This is a basework.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test cases should cover.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45328 from HyukjinKwon/SPARK-47221.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>


fix asciiControlCharAsQuoteAndEscape

2fdbbae

HyukjinKwon mentioned this pull request Feb 29, 2024

[SPARK-47221][SQL] Uses signatures from CsvParser to AbstractParser apache/spark#45328

Closed
