fix(duckdb)!: Clean up representation of exp.HexString #6045

Merged
VaggelisD merged 1 commit into main from vaggelisd/duckdb_hex_string on Oct 9, 2025

Conversation

@VaggelisD
Collaborator

Fixes #6035

  • Snowflake:
SELECT x'ABCD', system$typeof(x'ABCD')
ABCD  | BINARY
  • Transpiled DuckDB before this PR:
D SELECT FROM_HEX('ABCD');
┌──────────────────┐
│ from_hex('ABCD') │
│       blob       │
├──────────────────┤
│ \xAB\xCD         │
└──────────────────┘
  • Transpiled DuckDB after this PR:
D SELECT CAST(HEX(FROM_HEX('ABCD')) AS VARBINARY);
┌─────────────────────────────────────┐
│ CAST(hex(from_hex('ABCD')) AS BLOB) │
│                blob                 │
├─────────────────────────────────────┤
│ ABCD                                │
└─────────────────────────────────────┘

Note that DuckDB does not support hex string literals; x'AB' evaluates to a VARCHAR instead:

D SELECT x'AB', X'CD';
┌─────────┬─────────┐
│  'xAB'  │  'xCD'  │
│ varchar │ varchar │
├─────────┼─────────┤
│ xAB     │ xCD     │
└─────────┴─────────┘

@georgesittas
Collaborator

So both with and without this PR's changes, the DuckDB SQL returns a BLOB. It seems, though, that the before vs after values don't match:

D SELECT CAST(HEX(FROM_HEX('ABCD')) AS VARBINARY) = FROM_HEX('ABCD');
┌──────────────────────────────────────────────────────────┐
│ (CAST(hex(from_hex('ABCD')) AS BLOB) = from_hex('ABCD')) │
│                         boolean                          │
├──────────────────────────────────────────────────────────┤
│ false                                                    │
└──────────────────────────────────────────────────────────┘

Can you provide some more details on why the new one is correct and matches Snowflake's behavior better? (I'm assuming DuckDB roundtrip isn't affected in any way? Do we test this?)

@VaggelisD
Collaborator Author

VaggelisD commented Oct 8, 2025

> (I'm assuming DuckDB roundtrip isn't affected in any way? Do we test this?)

DuckDB doesn't have a hex string afaik, so exp.HexString will always be an "alien" / transpiled node to it.

> Can you provide some more details on why the new one is correct and matches Snowflake's behavior better

It seemed to me that the DuckDB post-PR results matched Snowflake's better, e.g. I'd expect that fetching them directly would net 'ABCD' in both cases instead of 'ABCD' != '\xAB\xCD'. The underlying hex value might not differ, but a STRING comparison would, so at best this would be a QoL improvement.

However, interestingly enough, this is what their DataFrames actually return (using SQLMesh's engine adapter):

  • Snowflake
>>> ctx.engine_adapter
<sqlmesh.core.engine_adapter.snowflake.SnowflakeEngineAdapter object at ...>
>>> ctx.engine_adapter.fetchdf("SELECT x'ABCD'")
       X'ABCD'
0  b'\xab\xcd'
  • DuckDB
>>> ctx.engine_adapter
<sqlmesh.core.engine_adapter.duckdb.DuckDBEngineAdapter object at ...>
>>> ctx.engine_adapter.fetchdf("SELECT FROM_HEX('ABCD')")
  from_hex('ABCD')
0       [171, 205]
>>> ctx.engine_adapter.fetchdf("SELECT CAST(HEX(FROM_HEX('ABCD')) AS BLOB)")
  CAST(hex(from_hex('ABCD')) AS BLOB)
0       [65, 66, 67, 68]

So, it looks to me like UNHEX(...) decodes each pair of hex digits into one byte value (AB -> 171, CD -> 205), while CAST(HEX(UNHEX(...)) AS BLOB) stores the ASCII code of each hex character ('A' -> 65, 'B' -> 66, 'C' -> 67, 'D' -> 68), hence the mismatch.
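
The byte-level difference described here can be reproduced outside of either engine. The following is a minimal sketch in plain Python (not DuckDB itself), assuming `bytes.fromhex` is a fair stand-in for UNHEX / FROM_HEX and `bytes.hex` for HEX; the variable names are illustrative only.

```python
# Stand-in for FROM_HEX('ABCD'): each pair of hex digits becomes one byte.
decoded = bytes.fromhex("ABCD")

# Stand-in for CAST(HEX(FROM_HEX('ABCD')) AS BLOB): the hex text itself,
# stored as one ASCII byte per character.
hex_text_as_blob = decoded.hex().upper().encode("ascii")

print(list(decoded))                # [171, 205]
print(list(hex_text_as_blob))       # [65, 66, 67, 68]
print(decoded == hex_text_as_blob)  # False
```

Both values are byte sequences, which is why both SQL expressions report a BLOB type, yet they hold different bytes, which is consistent with the `false` comparison shown earlier in the thread.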

In terms of actual DF values, it seems to me that the previous one matches better semantically (?)

@georgesittas
Collaborator

Yeah. I don't really understand the motivation behind the original issue, tbh.

@kyle-cheung, can you shed some light on this?

@kyle-cheung
Contributor

kyle-cheung commented Oct 8, 2025

@georgesittas Sigma (the BI tool) uses hex strings in many of its queries that involve user input. Sigma casts the readable hex strings to text and subsequently filters on them. For example:

WHERE
     CAST_HEX_TO_STRING_124 = '4EDBAFC2AFF94B44A2D50950B3073560'

However, since SQLGlot transpiles this so that the value is created as N\xDB\xAF\xC2\xAF\xF9KD\xA2\xD5\x09P\xB3\x075 :: varchar, the condition fails.

Here's an actual query
[image: screenshot of the query]
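
To illustrate why such a filter stops matching, here is a rough sketch in plain Python. The helper `blob_to_text` is hypothetical: it only approximates how raw bytes might be rendered when cast to text (printable ASCII kept, other bytes shown as \xNN) and is not DuckDB's actual implementation.

```python
def blob_to_text(blob: bytes) -> str:
    # Hypothetical approximation of a blob-to-text cast: keep printable
    # ASCII bytes, render every other byte as a \xNN escape.
    return "".join(chr(b) if 32 <= b < 127 else f"\\x{b:02X}" for b in blob)

expected = "4EDBAFC2AFF94B44A2D50950B3073560"  # the literal used in the filter
raw = bytes.fromhex(expected)                  # what the decoded hex string holds

as_escaped_text = blob_to_text(raw)  # raw bytes rendered as text
as_hex_text = raw.hex().upper()      # readable hex digits as text

print(as_escaped_text == expected)  # False
print(as_hex_text == expected)      # True
```

Only the readable hex form compares equal to the filter literal, which is the behavior the Sigma queries rely on.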

@georgesittas
Collaborator

I see, thank you for clarifying.

@VaggelisD can you take a look at what other dialects do? Hex strings are natively supported in BigQuery, Postgres, etc. I wonder if "fixing" DuckDB's generator to match Snowflake's behavior will break others. In that case, we should probably match the most common semantics (if possible).

@VaggelisD
Collaborator Author

@georgesittas These are the results from a quick pass:

| Dialect        | Query          | Result                              |
|----------------|----------------|-------------------------------------|
| BigQuery       | SELECT 0xABCD  | 43981                               |
| Clickhouse     | SELECT 0xABCD  | 43981                               |
| SQLite         | SELECT 0xABCD  | 43981                               |
| Trino / Presto | SELECT 0xABCD  | 43981                               |
| Snowflake      | SELECT x'ABCD' | ABCD                                |
| Spark          | SELECT x'ABCD' | �� (non-readable)                   |
| MySQL          | SELECT x'ABCD' | �� (non-readable)                   |
| Postgres       | SELECT x'ABCD' | 1010101111001101 (i.e. 43981)       |
| T-SQL          | SELECT 0xABCD  | 0x (it seems empty / non-readable)  |
| Redshift       | SELECT x'ABCD' | false (wat)                         |

So, it looks to me like most dialects convert hex strings to the decimal representation.
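
As a quick sanity check on the numbers in this comment, the sketch below uses plain Python (not any of the listed engines) to show how the integer, bit-string, and byte views of ABCD relate. The variable names are illustrative only.

```python
hex_digits = "ABCD"

as_integer = int(hex_digits, 16)      # integer reading, as in SELECT 0xABCD
as_bits = format(as_integer, "b")     # bit-string reading, as in the Postgres row
as_bytes = bytes.fromhex(hex_digits)  # binary reading, two raw bytes

print(as_integer)  # 43981
print(as_bits)     # 1010101111001101
print(int.from_bytes(as_bytes, "big") == as_integer)  # True
```

The three views encode the same value; the dialects differ in which one they expose as the result type.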

@VaggelisD
Collaborator Author

VaggelisD commented Oct 9, 2025

These are the DataFrame results for most of the dialects referenced, which should be more representative:

| Dialect   | DataFrame (row 0) |
|-----------|-------------------|
| BigQuery  | 43981             |
| Snowflake | b'\xab\xcd'       |
| T-SQL     | b'\xab\xcd'       |
| Spark     | [171, 205]        |
| Postgres  | 1010101111001101  |
| Redshift  | 1010101111001101  |

@georgesittas
Collaborator

Ok, it doesn't seem like there's a standard representation. I don't see a compelling reason to change the existing behavior without controlling the output with a new arg to accommodate other transpilation paths as well.

Collaborator

@georgesittas georgesittas left a comment

Discussed in Slack and concluded that this is fine after all.

For dialects like BigQuery, doing CAST(0xA AS STRING) results in 10, so this should be transpiled to DuckDB correctly today.

There are a couple of exceptions that return non-readable characters when casting these byte sequences to a string, but I think it's fine to not deal with them for now.
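
The two cases mentioned in this review can be sketched in plain Python. This is only an analogy: it assumes that converting an integer to a string mirrors CAST(0xA AS STRING) in integer-literal dialects, and that lossy UTF-8 decoding resembles what a client shows for raw bytes in the non-readable cases.

```python
# Integer-literal semantics: the hex literal is just a number.
int_literal = 0xA
print(str(int_literal))  # 10

# Binary semantics: the hex string is raw bytes, which are not valid UTF-8
# here, so a lossy decode yields replacement characters instead of 'ABCD'.
raw = bytes.fromhex("ABCD")
as_text = raw.decode("utf-8", errors="replace")
print(as_text == "ABCD")  # False
```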

LGTM @VaggelisD 👍

@VaggelisD VaggelisD merged commit 2c7cc29 into main Oct 9, 2025
6 checks passed
@VaggelisD VaggelisD deleted the vaggelisd/duckdb_hex_string branch October 9, 2025 16:06
Development

Successfully merging this pull request may close these issues.

Snowflake to DuckDB Hex Formatting

3 participants