Commit 65a476e

HyukjinKwon authored and ericm-db committed
[SPARK-47221][SQL] Uses signatures from CsvParser to AbstractParser
### What changes were proposed in this pull request?

This PR proposes to change the parameter type in the affected method signatures from `CsvParser` to `AbstractParser` (its parent class).

### Why are the changes needed?

- It is better to use a more general class where it fits, for better extensibility and maintainability.
- The Univocity parser has been inactive for the last three years, and we are missing bug fixes such as uniVocity/univocity-parsers#533. We should probably leverage its interface and implement it in Spark for bug fixes and further performance improvements. This change is groundwork for that.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test cases should cover this change.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45328 from HyukjinKwon/SPARK-47221.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
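The rationale in the commit message (accept the parent type so that another parser implementation can be substituted later) can be illustrated with a small sketch. Everything below is a hypothetical stand-in: `BaseParser`, `CsvSettings`, `DefaultCsvParser`, `PatchedCsvParser`, and `headerOf` only mirror the shape of univocity's `AbstractParser[CsvParserSettings]` and `CsvParser`; they are not the real classes, and the trimming "fix" is invented for illustration.

```scala
// Hypothetical stand-ins that mirror the shape of the univocity hierarchy.
// They are NOT the real com.univocity classes.
object SignatureWideningSketch {
  class CsvSettings(val delimiter: Char = ',')

  // Plays the role of AbstractParser[S]: it owns the record-reading API.
  abstract class BaseParser[S](val settings: S) {
    protected def split(line: String): Array[String]

    private var pending: Iterator[String] = Iterator.empty

    def beginParsing(input: String): Unit = {
      pending = input.split("\n").iterator
    }

    // Returns the next record, or null once the input is exhausted.
    def parseNext(): Array[String] =
      if (pending.hasNext) split(pending.next()) else null
  }

  // Plays the role of CsvParser, the implementation in use today.
  class DefaultCsvParser(s: CsvSettings) extends BaseParser[CsvSettings](s) {
    protected def split(line: String): Array[String] =
      line.split(settings.delimiter)
  }

  // A drop-in replacement carrying an (invented) fix: it trims each token.
  class PatchedCsvParser(s: CsvSettings) extends BaseParser[CsvSettings](s) {
    protected def split(line: String): Array[String] =
      line.split(settings.delimiter).map(_.trim)
  }

  // Typed against the parent class, so either implementation is accepted.
  def headerOf(tokenizer: BaseParser[CsvSettings]): Seq[String] =
    Option(tokenizer.parseNext()).map(_.toSeq).getOrElse(Seq.empty)
}
```

Had `headerOf` been declared with `DefaultCsvParser` as its parameter type, passing a `PatchedCsvParser` would not compile; widening the parameter to the parent type is what leaves room for a Spark-side implementation.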
1 parent 5672ec0 commit 65a476e

1 file changed, +5 -3 lines changed

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala

Lines changed: 5 additions & 3 deletions
@@ -17,7 +17,8 @@

 package org.apache.spark.sql.catalyst.csv

-import com.univocity.parsers.csv.CsvParser
+import com.univocity.parsers.common.AbstractParser
+import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

 import org.apache.spark.SparkIllegalArgumentException
 import org.apache.spark.internal.Logging
@@ -110,7 +111,7 @@ class CSVHeaderChecker(
   }

   // This is currently only used to parse CSV with multiLine mode.
-  private[csv] def checkHeaderColumnNames(tokenizer: CsvParser): Unit = {
+  private[csv] def checkHeaderColumnNames(tokenizer: AbstractParser[CsvParserSettings]): Unit = {
     assert(options.multiLine, "This method should be executed with multiLine.")
     if (options.headerFlag) {
       val firstRecord = tokenizer.parseNext()
@@ -119,7 +120,8 @@ class CSVHeaderChecker(
   }

   // This is currently only used to parse CSV with non-multiLine mode.
-  private[csv] def checkHeaderColumnNames(lines: Iterator[String], tokenizer: CsvParser): Unit = {
+  private[csv] def checkHeaderColumnNames(
+      lines: Iterator[String], tokenizer: AbstractParser[CsvParserSettings]): Unit = {
     assert(!options.multiLine, "This method should not be executed with multiline.")
     // Checking that column names in the header are matched to field names of the schema.
     // The header will be removed from lines.
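
The two overloads in the diff differ in where the header comes from: in multiLine mode the first record is pulled from the tokenizer itself, while in non-multiLine mode the header line is taken off the `lines` iterator and then tokenized. The following is a simplified sketch of that distinction, not Spark's implementation. The `Tokenizer` trait, `CommaTokenizer`, and the positional, case-insensitive `matches` rule are assumptions made for illustration; the real code takes an `AbstractParser[CsvParserSettings]` and consults `options`.

```scala
object HeaderCheckSketch {
  // Hypothetical minimal tokenizer interface. The real code takes
  // AbstractParser[CsvParserSettings] from univocity instead.
  trait Tokenizer {
    def parseNext(): Array[String]             // next record from the stream, null at end
    def parseLine(line: String): Array[String] // tokenize a single, already-split line
  }

  class CommaTokenizer(input: String = "") extends Tokenizer {
    private val records = input.split("\n").iterator.filter(_.nonEmpty)
    def parseNext(): Array[String] =
      if (records.hasNext) parseLine(records.next()) else null
    def parseLine(line: String): Array[String] = line.split(',')
  }

  // Assumed rule: names must match the schema positionally, ignoring case.
  private def matches(header: Array[String], schema: Seq[String]): Boolean =
    header != null && header.length == schema.length &&
      header.zip(schema).forall { case (h, s) => h.equalsIgnoreCase(s) }

  // multiLine-style check: the header is the first record the tokenizer yields.
  def checkFromTokenizer(tokenizer: Tokenizer, schema: Seq[String]): Boolean =
    matches(tokenizer.parseNext(), schema)

  // Non-multiLine-style check: the header is consumed from `lines`,
  // so the iterator afterwards yields only data rows.
  def checkFromLines(
      lines: Iterator[String], tokenizer: Tokenizer, schema: Seq[String]): Boolean =
    if (lines.hasNext) matches(tokenizer.parseLine(lines.next()), schema) else true
}
```

In both variants the tokenizer is only used through the record-reading methods, which is why the parent type is sufficient for these signatures.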
