Feat(duckdb): Transpile INITCAP with custom delimiters #6302
Merged
Collaborator
@treysp do we actually need to check whether the delimiter is NULL? If it's null, won't the null value bubble up eventually?
Contributor
Pull Request Overview
This PR adds support for transpiling the INITCAP function with custom delimiters to DuckDB, as well as implementing default delimiter handling across multiple SQL dialects (BigQuery, Snowflake, Spark, Hive, Presto).
- Adds parser support to attach default delimiters when not explicitly provided
- Implements DuckDB transpilation using ARRAY_TO_STRING, LIST_TRANSFORM, and REGEXP_EXTRACT_ALL to handle custom delimiters
- Adds generator logic to suppress default delimiters during round-tripping and warn about unsupported custom delimiters
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sqlglot/parser.py | Adds _parse_initcap() method to attach dialect-specific default delimiters |
| sqlglot/generator.py | Adds initcap_sql() to handle delimiter generation and unsupported delimiter warnings |
| sqlglot/dialects/dialect.py | Defines base INITCAP_SUPPORTS_CUSTOM_DELIMITERS and INITCAP_DEFAULT_DELIMITER_CHARS properties |
| sqlglot/dialects/bigquery.py | Sets BigQuery-specific default delimiter characters |
| sqlglot/dialects/snowflake.py | Sets Snowflake-specific default delimiter characters |
| sqlglot/dialects/spark2.py | Sets Spark-specific default delimiter characters |
| sqlglot/dialects/presto.py | Implements Presto transpilation using REGEXP_REPLACE with custom delimiter warning |
| sqlglot/dialects/duckdb.py | Implements complex DuckDB transpilation with regex-based string segmentation and capitalization |
| tests/dialects/test_dialect.py | Adds comprehensive tests for INITCAP with default and custom delimiters across dialects |
| tests/dialects/test_hive.py | Adds test for Hive INITCAP transpilation to DuckDB |
| tests/dialects/test_presto.py | Adds test for Presto INITCAP transpilation |
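The parser and generator behaviour summarized above can be pictured with a small toy model. This is only a sketch under stated assumptions: `DialectConfig`, `Initcap`, `parse_initcap`, and `generate_initcap` are hypothetical names invented for illustration, the delimiter strings are placeholders rather than any dialect's real defaults, and none of this is sqlglot's actual code.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DialectConfig:
    """Hypothetical stand-in for per-dialect INITCAP settings."""
    name: str
    default_delimiters: str          # placeholder value, not a real dialect default
    supports_custom_delimiters: bool


@dataclass
class Initcap:
    """Hypothetical expression node: the argument plus an explicit delimiter set."""
    this: str
    delimiters: str


def parse_initcap(arg: str, delimiters: Optional[str], source: DialectConfig) -> Initcap:
    # When the query omits the delimiters, attach the source dialect's default
    # so the node always carries an explicit delimiter set.
    return Initcap(arg, source.default_delimiters if delimiters is None else delimiters)


def generate_initcap(node: Initcap, target: DialectConfig) -> str:
    # Suppress the delimiters when they equal the target's default,
    # so a default call round-trips unchanged.
    if node.delimiters == target.default_delimiters:
        return f"INITCAP({node.this})"
    # Otherwise the delimiters are custom: warn if the target cannot express them.
    if not target.supports_custom_delimiters:
        warnings.warn(f"{target.name}: custom INITCAP delimiters are not supported")
        return f"INITCAP({node.this})"
    escaped = node.delimiters.replace("'", "''")
    return f"INITCAP({node.this}, '{escaped}')"
```

In this toy model, a call parsed without delimiters and generated for the same dialect comes back as `INITCAP(col)`, while a custom delimiter set sent to a target without custom-delimiter support triggers the warning.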
georgesittas
approved these changes
Nov 15, 2025
Collaborator
georgesittas
left a comment
Seems legit, feel free to merge when ready
Co-authored-by: Jo <46752250+georgesittas@users.noreply.github.com>
dc1b209 to 40b1988
georgesittas
approved these changes
Nov 17, 2025
INITCAP takes a string and a set of delimiters, capitalizing the string segments between delimiters. Dialects may have different default delimiters, and BigQuery and Snowflake accept a custom delimiters argument. This PR adds DuckDB transpilation support for default and custom delimiters.
The implementation in DuckDB is not intuitive - we explain our motivations below.
General problem statement
Consider mutually exclusive sets of characters, "delimiters" and "non-delimiters."
Given a string containing both delimiters and non-delimiters:
The delimiter set may be provided by the user as:
Implementation approach
[{delimiter string}]+|[^{delimiter string}]+ returns a list of alternating segment types
Problem
We must determine whether each segment contains delimiters and should/shouldn't be capitalized.
We could examine each segment as we walk the list, but that doesn't work if the custom delimiters arg is a sub-query. (DuckDB doesn't allow subqueries in lambdas.)
However, we know the segment list alternates between delimiters and non-delimiters. Therefore, we can infer which list indexes need capitalization if we know any list entry's delimiter status.
Instead of examining a list entry directly, it is simpler to just examine the first character of the entire string. If it is not a delimiter, the first list entry should be capitalized along with all odd indexes (first, third, etc.).
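The reasoning above can be sketched in plain Python. This is an illustrative analogue only, not the DuckDB SQL the PR generates: `initcap_segments` is a hypothetical helper, Python's `re.findall` stands in for REGEXP_EXTRACT_ALL, and it assumes "capitalize" means uppercasing a segment's first character and lowercasing the rest.

```python
import re


def initcap_segments(text: str, delimiters: str) -> str:
    """Capitalize the non-delimiter segments of `text` (illustrative sketch)."""
    if not text:
        return text
    if not delimiters:
        # No delimiters: the whole string is a single segment.
        return text[:1].upper() + text[1:].lower()

    # Escape the delimiters so characters such as ']' or '^' stay literal
    # inside the character class.
    cls = re.escape(delimiters)

    # Runs of delimiters and runs of non-delimiters, in order of appearance.
    # Because each run is maximal, the two segment types strictly alternate.
    segments = re.findall(f"[{cls}]+|[^{cls}]+", text)

    # Inspect only the first character of the whole string: if it is not a
    # delimiter, the 1st, 3rd, ... segments are the ones to capitalize;
    # otherwise the 2nd, 4th, ... segments are.
    first_word_index = 0 if text[0] not in delimiters else 1

    out = []
    for i, seg in enumerate(segments):
        if i % 2 == first_word_index:
            out.append(seg[:1].upper() + seg[1:].lower())
        else:
            out.append(seg)  # delimiter run, left untouched
    return "".join(out)


print(initcap_segments("aB11cD", "1"))  # Ab11Cd
```

The point of the parity trick is that the loop body never has to test a segment against the delimiter set, which mirrors why the SQL version can avoid referencing the delimiters inside a lambda.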
Example:
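- Input string: 'aB11cD'
- Call: INITCAP('aB11cD', '1') # custom delimiter is "1"
- Expected output: 'Ab11Cd'
- First character: 'a' --> NON-delimiter: capitalize odd indexes
- Regex: [1]+|[^1]+
- Segments: ['aB', '11', 'cD']
- Capitalization: 'aB' --> 'Ab', '11' --> '11', 'cD' --> 'Cd'
- Concatenation: 'Ab' || '11' || 'Cd' --> 'Ab11Cd'
Transpiled DuckDB query corresponding to example: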