Feat(duckdb): Transpile INITCAP with custom delimiters #6302
@treysp do we actually need to check whether the delimiter is NULL? If it's NULL, won't the NULL value bubble up eventually?
Pull Request Overview
This PR adds support for transpiling the INITCAP function with custom delimiters to DuckDB, as well as implementing default delimiter handling across multiple SQL dialects (BigQuery, Snowflake, Spark, Hive, Presto).
- Adds parser support to attach default delimiters when not explicitly provided
- Implements DuckDB transpilation using ARRAY_TO_STRING, LIST_TRANSFORM, and REGEXP_EXTRACT_ALL to handle custom delimiters
- Adds generator logic to suppress default delimiters during round-tripping and warn about unsupported custom delimiters
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sqlglot/parser.py | Adds _parse_initcap() method to attach dialect-specific default delimiters |
| sqlglot/generator.py | Adds initcap_sql() to handle delimiter generation and unsupported delimiter warnings |
| sqlglot/dialects/dialect.py | Defines base INITCAP_SUPPORTS_CUSTOM_DELIMITERS and INITCAP_DEFAULT_DELIMITER_CHARS properties |
| sqlglot/dialects/bigquery.py | Sets BigQuery-specific default delimiter characters |
| sqlglot/dialects/snowflake.py | Sets Snowflake-specific default delimiter characters |
| sqlglot/dialects/spark2.py | Sets Spark-specific default delimiter characters |
| sqlglot/dialects/presto.py | Implements Presto transpilation using REGEXP_REPLACE with custom delimiter warning |
| sqlglot/dialects/duckdb.py | Implements complex DuckDB transpilation with regex-based string segmentation and capitalization |
| tests/dialects/test_dialect.py | Adds comprehensive tests for INITCAP with default and custom delimiters across dialects |
| tests/dialects/test_hive.py | Adds test for Hive INITCAP transpilation to DuckDB |
| tests/dialects/test_presto.py | Adds test for Presto INITCAP transpilation |
georgesittas left a comment
Seems legit, feel free to merge when ready
dc1b209 to 40b1988
INITCAP takes a string and a set of delimiters, capitalizing the string segments between delimiters. Dialects may have different default delimiters, and BigQuery and Snowflake accept a custom delimiters argument. This PR adds DuckDB transpilation support for default and custom delimiters.
The implementation in DuckDB is not intuitive - we explain our motivations below.
General problem statement
Consider mutually exclusive sets of characters, "delimiters" and "non-delimiters."
Given a string containing both delimiters and non-delimiters:
The delimiter set may be provided by the user as:
Implementation approach
The regex [{delimiter string}]+|[^{delimiter string}]+ returns a list of alternating segment types
Problem
We must determine whether each segment contains delimiters and should/shouldn't be capitalized.
We could examine each segment as we walk the list, but that doesn't work if the custom delimiters arg is a sub-query. (DuckDB doesn't allow subqueries in lambdas.)
However, we know the segment list alternates between delimiters and non-delimiters. Therefore, we can infer which list indexes need capitalization if we know any list entry's delimiter status.
Instead of examining a list entry directly, it is simpler to just examine the first character of the entire string. If it is not a delimiter, the first list entry should be capitalized along with all odd indexes (first, third, etc.).
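The parity idea above can be sketched in plain Python. This is only an illustration of the reasoning, not the PR's implementation (which generates DuckDB SQL); the function name `initcap_by_parity` is hypothetical, and the handling of empty inputs is an assumption made so the sketch runs on edge cases.

```python
import re


def initcap_by_parity(text: str, delimiters: str) -> str:
    """Capitalize non-delimiter segments without inspecting each segment.

    Hypothetical helper mirroring the approach described above: split the
    string into alternating delimiter / non-delimiter runs, then decide which
    list positions to capitalize from the first character alone.
    """
    if not text:
        return text
    if not delimiters:
        # Assumption: with no delimiters, the whole string is one segment.
        return text.capitalize()

    # Escape the delimiters so characters like ']' or '-' are safe in a class.
    chars = re.escape(delimiters)
    # Runs of delimiters and runs of non-delimiters necessarily alternate.
    segments = re.findall(f"[{chars}]+|[^{chars}]+", text)

    # If the first character is NOT a delimiter, the 1st, 3rd, ... segments
    # (0-based indexes 0, 2, ...) are the ones to capitalize; otherwise the
    # 2nd, 4th, ... segments are.
    word_parity = 1 if text[0] in delimiters else 0

    return "".join(
        # str.capitalize() upper-cases the first character, lower-cases the rest.
        seg.capitalize() if i % 2 == word_parity else seg
        for i, seg in enumerate(segments)
    )
```

Because only `text[0]` is tested against the delimiter set, the per-segment step never needs the delimiters, which is the property that lets the SQL version keep a subquery-valued delimiter argument out of the lambda.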
Example:
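- Input string: 'aB11cD'
- Function call: INITCAP('aB11cD', '1')  # custom delimiter is "1"
- Expected result: 'Ab11Cd'
- First character: 'a' --> NON-delimiter: capitalize odd indexes
- Regex: [1]+|[^1]+
- Segment list: ['aB', '11', 'cD']
- Capitalization: 'aB' --> 'Ab', '11' --> '11', 'cD' --> 'Cd'
- Concatenation: 'Ab' || '11' || 'Cd' --> 'Ab11Cd'

Transpiled DuckDB query corresponding to example: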