@treysp treysp commented Nov 11, 2025

INITCAP takes a string and a set of delimiters, capitalizing the string segments between delimiters.

Dialects may have different default delimiters, and BigQuery and Snowflake accept a custom delimiters argument. This PR adds DuckDB transpilation support for both default and custom delimiters.

The DuckDB implementation is not intuitive, so we explain the motivation below.

General problem statement

Consider mutually exclusive sets of characters, "delimiters" and "non-delimiters."

Given a string containing both delimiters and non-delimiters:

  • Divide the string into segments, where each segment consists of sequential characters from one of the two sets
  • For the segments that contain non-delimiters, convert them to capital case (capitalize first letter, lowercase subsequent letters)
  • Concatenate the transformed segments together, such that the output string has the same composition as the input other than capitalization changes to non-delimiter segments
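The three steps above can be written as a small reference model. This is only a sketch of the intended semantics in Python, not code from the PR; the function name `initcap_reference` is hypothetical, and it treats the delimiter argument as a plain set of characters.

```python
from itertools import groupby


def initcap_reference(text: str, delimiters: str) -> str:
    """Capitalize each run of non-delimiter characters; copy delimiter runs unchanged."""
    delimiter_set = set(delimiters)
    pieces = []
    # groupby yields maximal runs of characters that share the same delimiter status,
    # which is exactly the "segments" described in the problem statement.
    for is_delimiter, run in groupby(text, key=lambda ch: ch in delimiter_set):
        segment = "".join(run)
        if is_delimiter:
            pieces.append(segment)  # delimiter segment: pass through
        else:
            # non-delimiter segment: uppercase first character, lowercase the rest
            pieces.append(segment[:1].upper() + segment[1:].lower())
    return "".join(pieces)
```

Because every segment is copied back in order, the output has the same length and composition as the input apart from case changes.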

The delimiter set may be provided by the user as:

  • String literal
  • SQL expression: column reference, subquery, NULL

Implementation approach

  • Split the string into segments: REGEXP_EXTRACT_ALL with the pattern [{delimiter string}]+|[^{delimiter string}]+ returns a list whose entries alternate between delimiter and non-delimiter segments
  • Walk through the list
    • If segment is delimiter, do nothing
    • If segment is non-delimiter, capitalize
  • Concat processed list
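As a rough illustration of this approach, the sketch below uses Python's `re` module in place of DuckDB's regex functions. The function names are hypothetical, the delimiter string is assumed to be non-empty, and the `re.escape` call is an addition of this sketch (so that characters such as `]` or `^` stay literal inside the character class), not something the PR description states.

```python
import re


def split_segments(text: str, delimiters: str) -> list[str]:
    # Assumption: delimiters is non-empty; an empty character class is not a valid regex.
    cls = re.escape(delimiters)
    # One alternative matches a run of delimiters, the other a run of non-delimiters,
    # so consecutive matches alternate between the two segment types.
    return re.findall(f"[{cls}]+|[^{cls}]+", text)


def initcap_by_segments(text: str, delimiters: str) -> str:
    delimiter_set = set(delimiters)
    out = []
    for seg in split_segments(text, delimiters):
        if seg[0] in delimiter_set:
            out.append(seg)  # delimiter segment: do nothing
        else:
            out.append(seg[:1].upper() + seg[1:].lower())  # capitalize
    return "".join(out)
```

Note that this version inspects each segment while walking the list, which is the step that cannot be expressed inside a DuckDB lambda when the delimiters come from a subquery.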

Problem

We must determine whether each segment contains delimiters and should/shouldn't be capitalized.

We could examine each segment as we walk the list, but that doesn't work if the custom delimiters arg is a subquery. (DuckDB doesn't allow subqueries in lambdas.)

However, we know the segment list alternates between delimiters and non-delimiters. Therefore, we can infer which list indexes need capitalization if we know any list entry's delimiter status.

Instead of examining a list entry directly, it is simpler to examine the first character of the entire string. If it is not a delimiter, the first list entry and every other odd index (first, third, etc.) should be capitalized. If it is a delimiter, the even indexes should be capitalized instead.

Example:

  • Setup
    • Input string: 'aB11cD'
    • Function call: INITCAP('aB11cD', '1') # custom delimiter is "1"
    • Expected output: 'Ab11Cd'
  • Operations
    • First letter of input string is 'a' --> NON-delimiter: capitalize odd indexes
    • Construct regex: [1]+|[^1]+
    • Regex extraction returns ['aB', '11', 'cD']
    • Walk list
      • Index 1, capitalize: 'aB' --> 'Ab'
      • Index 2, pass through: '11' --> '11'
      • Index 3, capitalize: 'cD' --> 'Cd'
    • Aggregate string
      • 'Ab' || '11' || 'Cd' --> 'Ab11Cd'
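The same trace can be reproduced with a short Python model of the parity trick. This is a sketch under assumptions, not the PR's code: `initcap_by_parity` is a hypothetical name, Python's `re` stands in for DuckDB's regex functions, the delimiter string is assumed non-empty, and `re.escape` is added here to keep special characters literal.

```python
import re


def initcap_by_parity(text: str, delimiters: str) -> str:
    # Assumption: delimiters is a non-empty string of single-character delimiters.
    cls = re.escape(delimiters)
    segments = re.findall(f"[{cls}]+|[^{cls}]+", text)

    # The only delimiter check: look at the first character of the whole string.
    # Using a set avoids the pitfall that "" is a substring of every Python string.
    first_is_delimiter = text[:1] in set(delimiters)

    # Segments alternate, so with 1-based indexes the non-delimiter segments sit at
    # even positions if the string starts with a delimiter, and at odd positions otherwise.
    capitalize_parity = 0 if first_is_delimiter else 1

    out = []
    for idx, seg in enumerate(segments, start=1):
        if idx % 2 == capitalize_parity:
            out.append(seg[:1].upper() + seg[1:].lower())
        else:
            out.append(seg)
    return "".join(out)


print(initcap_by_parity("aB11cD", "1"))   # Ab11Cd
print(initcap_by_parity("11aB1cD", "1"))  # 11Ab1Cd
```

The loop body never inspects a segment's contents to decide its type, which is what lets the equivalent DuckDB lambda avoid referencing the delimiter expression.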

Transpiled DuckDB query corresponding to example:

ARRAY_TO_STRING(
  CASE
    -- is first character a delimiter?
    WHEN REGEXP_MATCHES(LEFT('aB11cD', 1), '[1]')
      -- if so, capitalize EVEN indexes: idx % 2 = 0
      THEN LIST_TRANSFORM(
        REGEXP_EXTRACT_ALL('aB11cD', '([1]+|[^1]+)'),
        (seg, idx) -> CASE WHEN idx % 2 = 0 THEN UPPER(LEFT(seg, 1)) || LOWER(SUBSTRING(seg, 2)) ELSE seg END
      )
    -- if not, capitalize ODD indexes: idx % 2 = 1
    ELSE LIST_TRANSFORM(
      REGEXP_EXTRACT_ALL('aB11cD', '([1]+|[^1]+)'),
      (seg, idx) -> CASE WHEN idx % 2 = 1 THEN UPPER(LEFT(seg, 1)) || LOWER(SUBSTRING(seg, 2)) ELSE seg END
    )
  END,
  ''
)

@georgesittas

@treysp do we actually need to check whether the delimiter is NULL?

CASE WHEN 'aB11cD' IS NULL THEN NULL ELSE ...

If it's null, won't the null value bubble up eventually?

@treysp treysp requested a review from Copilot November 14, 2025 19:14
Copilot finished reviewing on behalf of treysp November 14, 2025 19:19

Copilot AI left a comment

Pull Request Overview

This PR adds support for transpiling the INITCAP function with custom delimiters to DuckDB, as well as implementing default delimiter handling across multiple SQL dialects (BigQuery, Snowflake, Spark, Hive, Presto).

  • Adds parser support to attach default delimiters when not explicitly provided
  • Implements DuckDB transpilation using ARRAY_TO_STRING, LIST_TRANSFORM, and REGEXP_EXTRACT_ALL to handle custom delimiters
  • Adds generator logic to suppress default delimiters during round-tripping and warn about unsupported custom delimiters

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sqlglot/parser.py | Adds _parse_initcap() method to attach dialect-specific default delimiters |
| sqlglot/generator.py | Adds initcap_sql() to handle delimiter generation and unsupported delimiter warnings |
| sqlglot/dialects/dialect.py | Defines base INITCAP_SUPPORTS_CUSTOM_DELIMITERS and INITCAP_DEFAULT_DELIMITER_CHARS properties |
| sqlglot/dialects/bigquery.py | Sets BigQuery-specific default delimiter characters |
| sqlglot/dialects/snowflake.py | Sets Snowflake-specific default delimiter characters |
| sqlglot/dialects/spark2.py | Sets Spark-specific default delimiter characters |
| sqlglot/dialects/presto.py | Implements Presto transpilation using REGEXP_REPLACE with custom delimiter warning |
| sqlglot/dialects/duckdb.py | Implements complex DuckDB transpilation with regex-based string segmentation and capitalization |
| tests/dialects/test_dialect.py | Adds comprehensive tests for INITCAP with default and custom delimiters across dialects |
| tests/dialects/test_hive.py | Adds test for Hive INITCAP transpilation to DuckDB |
| tests/dialects/test_presto.py | Adds test for Presto INITCAP transpilation |

@georgesittas georgesittas left a comment

Seems legit, feel free to merge when ready

@treysp treysp force-pushed the trey/initcap branch 2 times, most recently from dc1b209 to 40b1988 Compare November 17, 2025 18:29
@treysp treysp marked this pull request as ready for review November 17, 2025 19:42
@treysp treysp merged commit ca81217 into main Nov 17, 2025
7 checks passed
@treysp treysp deleted the trey/initcap branch November 17, 2025 20:52