Feat(duckdb): Transpile INITCAP with custom delimiters by treysp · Pull Request #6302 · tobymao/sqlglot

treysp · 2025-11-11T01:04:40Z

INITCAP takes a string and a set of delimiters, capitalizing the string segments between delimiters.

Dialects may have different default delimiters, and BigQuery and Snowflake accept a custom delimiters argument. This PR adds DuckDB transpilation support for both default and custom delimiters.

The DuckDB implementation is not intuitive, so we explain the reasoning behind it below.

General problem statement

Consider mutually exclusive sets of characters, "delimiters" and "non-delimiters."

Given a string containing both delimiters and non-delimiters:

1. Divide the string into segments, where each segment consists of sequential characters from one of the two sets
2. For the segments that contain non-delimiters, convert them to capital case (capitalize first letter, lowercase subsequent letters)
3. Concatenate the transformed segments together, such that the output string has the same composition as the input other than capitalization changes to non-delimiter segments
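The three steps above can be sketched in plain Python as a reference for the intended behaviour. This is only an illustration of the semantics, not the DuckDB transpilation or sqlglot code; the name `initcap_reference` is hypothetical, and it classifies characters directly with `itertools.groupby` rather than with a regex.

```python
from itertools import groupby


def initcap_reference(text: str, delimiters: str) -> str:
    """Capitalize each run of non-delimiter characters; copy delimiter runs unchanged."""
    delimiter_set = set(delimiters)
    pieces = []
    # groupby yields maximal runs of consecutive characters that share the same
    # classification (delimiter vs. non-delimiter), i.e. the "segments".
    for is_delimiter, run in groupby(text, key=lambda ch: ch in delimiter_set):
        segment = "".join(run)
        if is_delimiter:
            pieces.append(segment)  # delimiter segments pass through untouched
        else:
            # capital case: first character upper, the rest lower
            pieces.append(segment[:1].upper() + segment[1:].lower())
    return "".join(pieces)


print(initcap_reference("aB11cD", "1"))
```

Because every character lands in exactly one segment and delimiter segments are copied as-is, the output differs from the input only in the casing of non-delimiter segments.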

The delimiter set may be provided by the user as:

- String literal
- SQL expression: column reference, subquery, NULL

Implementation approach

1. Split the string into segments: `regexp_extract_all` on `[{delimiter string}]+|[^{delimiter string}]+` returns a list of alternating segment types
2. Walk through the list
   - If the segment is a delimiter segment, do nothing
   - If the segment is a non-delimiter segment, capitalize it
3. Concatenate the processed list
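The splitting step can be emulated with Python's `re` module to see what the alternating segment list looks like. This is a sketch under assumptions: `split_segments` is a hypothetical helper, `re.escape` stands in for whatever escaping the real implementation applies to the delimiter string, and Python's regex engine is not the one DuckDB uses.

```python
import re


def split_segments(text: str, delimiters: str) -> list[str]:
    """Split text into alternating delimiter / non-delimiter runs."""
    if not delimiters:
        # An empty character class is not a valid pattern, so treat the
        # whole string as a single non-delimiter segment.
        return [text] if text else []
    # Escape characters such as ']', '^', '-' and '\' so they are literal
    # inside the character class.
    escaped = re.escape(delimiters)
    pattern = f"[{escaped}]+|[^{escaped}]+"
    return re.findall(pattern, text)


print(split_segments("aB11cD", "1"))
```

Joining the returned list with an empty separator reproduces the input, which is what allows the later steps to change only the casing.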

Problem

We must determine whether each segment consists of delimiters, and therefore whether it should be capitalized.

We could examine each segment as we walk the list, but that doesn't work if the custom delimiters argument is a subquery, because DuckDB doesn't allow subqueries in lambdas.

However, we know the segment list alternates between delimiters and non-delimiters. Therefore, we can infer which list indexes need capitalization if we know any list entry's delimiter status.

Instead of examining a list entry directly, it is simpler to examine the first character of the entire string. If it is not a delimiter, the first list entry is a non-delimiter segment, so all odd indexes (first, third, etc.) should be capitalized. If it is a delimiter, the even indexes (second, fourth, etc.) should be capitalized instead.

Example:

Setup
- Input string: 'aB11cD'
- Function call: INITCAP('aB11cD', '1') # custom delimiter is "1"
- Expected output: 'Ab11Cd'
Operations
- First letter of input string is 'a' --> NON-delimiter: capitalize odd indexes
- Construct regex: [1]+|[^1]+
- Regex extraction returns ['aB', '11', 'cD']
- Walk list
  - Index 1, capitalize: 'aB' --> 'Ab'
  - Index 2, pass through: '11' --> '11'
  - Index 3, capitalize: 'cD' --> 'Cd'
- Aggregate string
  - 'Ab' || '11' || 'Cd' --> 'Ab11Cd'
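The walk-through above can be reproduced with a small Python emulation of the index-parity idea. It is a sketch, not the transpiled query: `initcap_by_parity` is a hypothetical name, it assumes a non-empty delimiter string, and it numbers segments from 1 to match the indexes used in the example.

```python
import re


def initcap_by_parity(text: str, delimiters: str) -> str:
    """Capitalize alternating segments, choosing the parity from the first character only."""
    escaped = re.escape(delimiters)  # assumes delimiters is non-empty
    segments = re.findall(f"[{escaped}]+|[^{escaped}]+", text)

    # Only the first character of the whole string is inspected; individual
    # segments are never tested against the delimiter set.
    first_is_delimiter = text[:1] in set(delimiters)
    # Delimiter first -> capitalize even indexes; otherwise odd indexes.
    capitalize_remainder = 0 if first_is_delimiter else 1

    out = []
    for idx, seg in enumerate(segments, start=1):  # 1-based, as in the example
        if idx % 2 == capitalize_remainder:
            seg = seg[:1].upper() + seg[1:].lower()
        out.append(seg)
    return "".join(out)


print(initcap_by_parity("aB11cD", "1"))
print(initcap_by_parity("1aB1cD", "1"))
```

The second call starts with a delimiter, so the even-index branch is the one that capitalizes `aB` and `cD`.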

Transpiled DuckDB query corresponding to example:

ARRAY_TO_STRING(
  CASE
    -- is first character a delimiter?
    WHEN REGEXP_MATCHES(LEFT('aB11cD', 1), '[1]')
      -- if so, capitalize EVEN indexes: idx % 2 = 0
      THEN LIST_TRANSFORM(
        REGEXP_EXTRACT_ALL('aB11cD', '([1]+|[^1]+)'),
        (seg, idx) -> CASE WHEN idx % 2 = 0 THEN UPPER(LEFT(seg, 1)) || LOWER(SUBSTRING(seg, 2)) ELSE seg END
      )
    -- if not, capitalize ODD indexes: idx % 2 = 1
    ELSE LIST_TRANSFORM(
      REGEXP_EXTRACT_ALL('aB11cD', '([1]+|[^1]+)'),
      (seg, idx) -> CASE WHEN idx % 2 = 1 THEN UPPER(LEFT(seg, 1)) || LOWER(SUBSTRING(seg, 2)) ELSE seg END
    )
  END,
  ''
)

georgesittas · 2025-11-12T15:26:23Z

@treysp do we actually need to check whether the delimiter is NULL?

CASE WHEN 'aB11cD' IS NULL THEN NULL ELSE ...

If it's null, won't the null value bubble up eventually?

Copilot

Pull Request Overview

This PR adds support for transpiling the INITCAP function with custom delimiters to DuckDB, as well as implementing default delimiter handling across multiple SQL dialects (BigQuery, Snowflake, Spark, Hive, Presto).

- Adds parser support to attach default delimiters when not explicitly provided
- Implements DuckDB transpilation using ARRAY_TO_STRING, LIST_TRANSFORM, and REGEXP_EXTRACT_ALL to handle custom delimiters
- Adds generator logic to suppress default delimiters during round-tripping and warn about unsupported custom delimiters
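The interplay between attaching defaults at parse time and suppressing them at generation time can be illustrated with a toy model. Everything here is hypothetical: `ToyDialect`, `ToyInitcap`, `parse_initcap` and `generate_initcap` are not sqlglot APIs, the delimiter values are placeholders rather than any dialect's real defaults, and dropping the argument after a warning is an assumption about the fallback behaviour. The two dialect fields only play the role of `INITCAP_DEFAULT_DELIMITER_CHARS` and `INITCAP_SUPPORTS_CUSTOM_DELIMITERS`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyDialect:
    name: str
    default_delimiters: str           # stands in for INITCAP_DEFAULT_DELIMITER_CHARS
    supports_custom_delimiters: bool  # stands in for INITCAP_SUPPORTS_CUSTOM_DELIMITERS


@dataclass(frozen=True)
class ToyInitcap:
    arg: str
    delimiters: str


def parse_initcap(arg: str, delimiters: Optional[str], source: ToyDialect) -> ToyInitcap:
    # Attach the source dialect's defaults when no delimiters were written.
    if delimiters is None:
        delimiters = source.default_delimiters
    return ToyInitcap(arg, delimiters)


def generate_initcap(node: ToyInitcap, target: ToyDialect, warnings: List[str]) -> str:
    # Delimiters equal to the target's defaults are redundant, so omit them.
    if node.delimiters == target.default_delimiters:
        return f"INITCAP({node.arg})"
    if not target.supports_custom_delimiters:
        # Assumed fallback: warn and emit the call without the argument.
        warnings.append(f"{target.name}: custom INITCAP delimiters are not supported")
        return f"INITCAP({node.arg})"
    quoted = node.delimiters.replace("'", "''")  # escape quotes for a SQL literal
    return f"INITCAP({node.arg}, '{quoted}')"


a = ToyDialect("a", " -", True)
b = ToyDialect("b", " ", True)
c = ToyDialect("c", " ", False)
warnings: List[str] = []
node = parse_initcap("col", None, a)
print(generate_initcap(node, a, warnings))  # same dialect: defaults suppressed
print(generate_initcap(node, b, warnings))  # different defaults: made explicit
print(generate_initcap(node, c, warnings), warnings)
```

The toy does not model DuckDB, which per this PR receives the regex-based emulation rather than an `INITCAP` call with a delimiter argument.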

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sqlglot/parser.py | Adds `_parse_initcap()` method to attach dialect-specific default delimiters |
| sqlglot/generator.py | Adds `initcap_sql()` to handle delimiter generation and unsupported delimiter warnings |
| sqlglot/dialects/dialect.py | Defines base `INITCAP_SUPPORTS_CUSTOM_DELIMITERS` and `INITCAP_DEFAULT_DELIMITER_CHARS` properties |
| sqlglot/dialects/bigquery.py | Sets BigQuery-specific default delimiter characters |
| sqlglot/dialects/snowflake.py | Sets Snowflake-specific default delimiter characters |
| sqlglot/dialects/spark2.py | Sets Spark-specific default delimiter characters |
| sqlglot/dialects/presto.py | Implements Presto transpilation using `REGEXP_REPLACE` with custom delimiter warning |
| sqlglot/dialects/duckdb.py | Implements complex DuckDB transpilation with regex-based string segmentation and capitalization |
| tests/dialects/test_dialect.py | Adds comprehensive tests for INITCAP with default and custom delimiters across dialects |
| tests/dialects/test_hive.py | Adds test for Hive INITCAP transpilation to DuckDB |
| tests/dialects/test_presto.py | Adds test for Presto INITCAP transpilation |

georgesittas

Seems legit, feel free to merge when ready

treysp force-pushed the trey/initcap branch from 78fd1e8 to c69c96c on November 12, 2025 01:09

georgesittas reviewed Nov 12, 2025

treysp force-pushed the trey/initcap branch from bcb3871 to 7ecae1b on November 14, 2025 19:13

treysp requested a review from Copilot November 14, 2025 19:14

Copilot started reviewing on behalf of treysp on November 14, 2025 19:14

Copilot finished reviewing on behalf of treysp November 14, 2025 19:19

Copilot AI reviewed Nov 14, 2025

georgesittas approved these changes Nov 15, 2025

treysp and others added 5 commits November 17, 2025 12:26

Transpile INITCAP with custom delimiters

a415771

Handle escaping when converting delimiters to regex expr

256b51f

Clean up from review

4300403

Co-authored-by: Jo <46752250+georgesittas@users.noreply.github.com>

PR feedback, clean up

4b3a394

Add presto unsupported warning, tests

23d2ad9

treysp force-pushed the trey/initcap branch 2 times, most recently from dc1b209 to 40b1988, on November 17, 2025 18:29

georgesittas approved these changes Nov 17, 2025

treysp added 3 commits November 17, 2025 12:39

Handle hive special control chars

bd6c021

Condense tests

d877e65

address omitted exp.Initcap delim arg

5512e0f

treysp force-pushed the trey/initcap branch from 40b1988 to 5512e0f on November 17, 2025 19:21

use itertool.groupby for WS control chars

6bde919

treysp marked this pull request as ready for review November 17, 2025 19:42

Fix typing

885099d

treysp force-pushed the trey/initcap branch from f199fc4 to 885099d on November 17, 2025 20:47

treysp merged commit ca81217 into main Nov 17, 2025
7 checks passed

treysp deleted the trey/initcap branch November 17, 2025 20:52

treysp merged 10 commits into main from trey/initcap
