feat: add support for parse_url, url_encode, url_decode #4152

Draft
andygrove wants to merge 8 commits into apache:main from andygrove:url-functions

@andygrove
Member

Summary

  • Add support for parse_url, url_encode, and url_decode Spark expressions using native implementations from the datafusion-spark crate
  • parse_url maps to parse_url/try_parse_url based on ANSI mode (failOnError), supporting both 2-arg and 3-arg forms
  • url_encode/url_decode are RuntimeReplaceable in Spark (rewritten to StaticInvoke(UrlCodec, "encode"/"decode")) — added handlers in CometStaticInvoke to intercept and route to native functions
  • Includes SQL file tests for all three functions covering column refs, literals, NULLs, empty strings, special characters, multibyte UTF-8, and roundtrip encode/decode
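The parse_url mapping in the summary can be sketched roughly as below. This is an illustration only: the real selection happens in Comet's Scala serde, and `native_parse_url_name` is a hypothetical helper, not a function from Comet or datafusion-spark. It assumes `fail_on_error` mirrors Spark's ANSI-mode flag on the expression.

```rust
/// Hypothetical helper: choose the native function name for Spark's parse_url.
/// `fail_on_error` stands in for the ANSI-mode (failOnError) flag.
fn native_parse_url_name(fail_on_error: bool, arg_count: usize) -> Result<&'static str, String> {
    // Only the 2-arg (url, part) and 3-arg (url, part, key) forms are handled.
    if arg_count != 2 && arg_count != 3 {
        return Err(format!("parse_url expects 2 or 3 arguments, got {}", arg_count));
    }
    // ANSI mode raises on invalid URLs; otherwise the try_ variant returns NULL.
    Ok(if fail_on_error { "parse_url" } else { "try_parse_url" })
}
```

With this sketch, `native_parse_url_name(false, 3)` selects `try_parse_url`, and any other argument count is rejected before a native call would be built.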

Partial fix for #4150

Test plan

  • New SQL file tests in spark/src/test/resources/sql-tests/expressions/url/
    • parse_url.sql — HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO extraction; 3-arg QUERY key extraction; NULL handling
    • url_encode.sql — special characters, multibyte UTF-8, empty string, NULL
    • url_decode.sql — percent-encoded strings, plus-as-space, roundtrip, multibyte UTF-8, NULL
  • Rust compiles (cargo check passes)
  • Scala compiles with no errors (mvn compile passes)

🤖 Generated with Claude Code

Wire up Spark's URL functions to the datafusion-spark implementations:

- parse_url: maps to parse_url/try_parse_url based on failOnError
  (ANSI mode). Supports 2-arg (url, part) and 3-arg (url, part, key)
  forms with all Spark URL parts (HOST, PATH, QUERY, REF, etc.)

- url_encode/url_decode: Spark rewrites these as RuntimeReplaceable
  to StaticInvoke(UrlCodec, "encode"/"decode"). Added handlers in
  CometStaticInvoke to intercept these and route to the native
  url_encode/url_decode functions.

Closes apache#4150 (partial)
@andygrove andygrove marked this pull request as draft April 29, 2026 23:49

parse_url diverges from Spark for empty-string URLs (Comet returns NULL
where Spark returns "") and for FILE extraction on URLs without an
explicit path (Comet inserts a leading "/"). Mark CometParseUrl as
Incompatible so it falls back to Spark by default. Split the SQL tests
into a default-mode fallback assertion and an opt-in parse_url_native.sql
that exercises the native implementation under inputs that match Spark.
Reference apache/datafusion#21943 from the
CometParseUrl serde and the two parse_url SQL test files so the source of
the divergence is traceable.

Add boundary, already-encoded, and whitespace-control rows to the
url_encode SQL test, and run it across both Parquet dictionary settings
via ConfigMatrix. Record per-Spark-version audit notes for parse_url and
url_encode in spark_expressions_support.md.

The url_encode test data has no duplicate rows so dictionary encoding is
not exercised. Run the test once instead of doubling the runtime.
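For reference, the url_encode behavior these rows exercise can be approximated with the sketch below. It assumes Spark's url_encode follows the form-encoding rules of `java.net.URLEncoder` with UTF-8 (unreserved set of alphanumerics plus `.`, `-`, `*`, `_`, space as `+`, uppercase hex). It is not the datafusion-spark implementation.

```rust
/// Sketch of form-style URL encoding (assumed to mirror java.net.URLEncoder, UTF-8).
fn url_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    // Work on UTF-8 bytes so multibyte characters become several %XX escapes.
    for &b in input.as_bytes() {
        match b {
            // Characters that pass through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'.' | b'-' | b'*' | b'_' => {
                out.push(b as char)
            }
            // Space is written as '+', not %20.
            b' ' => out.push('+'),
            // Everything else, including '%' itself, is percent-encoded.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}
```

Under these assumptions an already-encoded input such as `a%20b` is encoded again to `a%2520b`, and a tab becomes `%09`, which is the kind of case the added rows cover.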
Add a `query expect_error(%2s)` case locking in the contract that both
Spark and Comet error on malformed percent-encoding with the bad
sequence in the message. Add lowercase-hex inputs to verify case-
insensitive percent-decoding. Record per-version url_decode audit notes
in spark_expressions_support.md, including a pointer to issue apache#4155 for
the Spark 4.0 `try_url_decode` gap.
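The url_decode contract described above (plus-as-space, case-insensitive hex, an error that names the bad sequence) can be sketched as follows. The error text and the lossy handling of invalid UTF-8 are assumptions made for illustration. Neither is taken from Spark or datafusion-spark.

```rust
/// Map one ASCII hex digit (either case) to its numeric value.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Sketch of form-style URL decoding; errors include the offending sequence.
fn url_decode(input: &str) -> Result<String, String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                out.push(b' ');
                i += 1;
            }
            b'%' => {
                // Take up to three bytes: '%' plus two hex digits.
                let end = (i + 3).min(bytes.len());
                let seq = &bytes[i..end];
                let decoded = match seq {
                    [_, hi, lo] => hex_val(*hi).zip(hex_val(*lo)).map(|(h, l)| (h << 4) | l),
                    _ => None, // truncated escape such as a trailing "%" or "%2"
                };
                match decoded {
                    Some(b) => out.push(b),
                    None => {
                        return Err(format!(
                            "malformed percent-encoding: {}",
                            String::from_utf8_lossy(seq)
                        ))
                    }
                }
                i += 3;
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    // Assumption: invalid UTF-8 is replaced rather than rejected.
    Ok(String::from_utf8_lossy(&out).into_owned())
}
```

In this sketch `%c3%a9` and `%C3%A9` decode to the same character, while `%2s` returns an error whose message contains `%2s`.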
- Trim CometParseUrl.incompatibleReason to one sentence linking to the
  upstream issue, since the reason appears in EXPLAIN output.
- parse_url.sql: document the three known divergent shapes and add a
  fallback assertion for trailing-slash PATH (Spark "/", native "").
- parse_url_native.sql: add an ANSI-mode invalid-URL expect_error case
  (locks in agreement on the "The url is invalid" message).
- try_parse_url.sql: new file gated on MinSparkVersion 4.0 covering
  valid input, malformed input (NULL fallback), and NULL input.
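The relationship between the ANSI-mode error and the try_parse_url NULL fallback can be illustrated with a toy example. Both functions are hypothetical: the validity rule (any space makes the URL invalid) is a stand-in rather than real URL parsing, and only the PROTOCOL part is handled.

```rust
/// Toy strict extractor for the PROTOCOL part only (hypothetical, not a real URL parser).
/// Ok(None) models SQL NULL; Err models an ANSI-mode failure.
fn parse_protocol_strict(url: Option<&str>) -> Result<Option<String>, String> {
    let url = match url {
        Some(u) => u,
        None => return Ok(None), // NULL input yields NULL
    };
    // Stand-in validity rule: treat any space as making the URL invalid.
    if url.contains(' ') {
        return Err(format!("The url is invalid: {}", url));
    }
    // No "://" separator means there is no protocol to report.
    Ok(url.split_once("://").map(|(scheme, _)| scheme.to_string()))
}

/// try_ variant: same result on valid input, NULL instead of an error on malformed input.
fn try_parse_protocol(url: Option<&str>) -> Option<String> {
    parse_protocol_strict(url).unwrap_or(None)
}
```

The three cases in try_parse_url.sql correspond to the three branches here: valid input returns a value, malformed input falls back to NULL, and NULL input stays NULL.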