feat: add support for parse_url, url_encode, url_decode #4152
Draft
andygrove wants to merge 8 commits into apache:main
Wire up Spark's URL functions to the datafusion-spark implementations:

- parse_url: maps to parse_url/try_parse_url based on failOnError (ANSI mode). Supports 2-arg (url, part) and 3-arg (url, part, key) forms with all Spark URL parts (HOST, PATH, QUERY, REF, etc.)
- url_encode/url_decode: Spark rewrites these as RuntimeReplaceable to StaticInvoke(UrlCodec, "encode"/"decode"). Added handlers in CometStaticInvoke to intercept these and route to the native url_encode/url_decode functions.

Closes apache#4150 (partial)
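To make the part names and the 2-arg/3-arg forms concrete, here is a minimal Python sketch. It is not Spark's or Comet's code: it uses Python's `urllib.parse` as a stand-in parser, the function name `parse_url_sketch` and the `fail_on_error` flag are hypothetical, and edge cases (including the divergences tracked in this PR) are not reproduced.

```python
import re
from urllib.parse import urlsplit


def parse_url_sketch(url, part, key=None, fail_on_error=False):
    """Approximate parse_url(url, part[, key]); illustrative only."""
    if url is None or part is None:
        return None
    try:
        u = urlsplit(url)
    except ValueError:
        # Assumed mapping: fail_on_error=True plays the role of parse_url
        # under ANSI mode, False the role of try_parse_url (NULL on error).
        if fail_on_error:
            raise
        return None

    query = u.query
    parts = {
        "HOST": u.hostname,
        "PATH": u.path,
        "QUERY": query,
        "REF": u.fragment,
        "PROTOCOL": u.scheme,
        "FILE": u.path + "?" + query if query else u.path,
        "AUTHORITY": u.netloc,
        "USERINFO": u.netloc.rpartition("@")[0],
    }
    if key is not None:
        # In this sketch the 3-arg form is only meaningful for QUERY.
        if part != "QUERY":
            return None
        m = re.search(r"(?:^|&)" + re.escape(key) + r"=([^&]*)", query)
        return m.group(1) if m else None
    # Simplification: empty or unknown components are reported as None.
    return parts.get(part) or None
```

For example, with `url = "https://user:pw@example.com:8080/a/b?x=1&y=2#top"`, `parse_url_sketch(url, "FILE")` returns `"/a/b?x=1&y=2"` and `parse_url_sketch(url, "QUERY", "y")` returns `"2"`.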
parse_url diverges from Spark for empty-string URLs (Comet returns NULL where Spark returns "") and for FILE extraction on URLs without an explicit path (Comet inserts a leading "/"). Mark CometParseUrl as Incompatible so it falls back to Spark by default. Split the SQL tests into a default-mode fallback assertion and an opt-in parse_url_native.sql that exercises the native implementation under inputs that match Spark.
Reference apache/datafusion#21943 from the CometParseUrl serde and the two parse_url SQL test files so the source of the divergence is traceable.
Add boundary, already-encoded, and whitespace-control rows to the url_encode SQL test, and run it across both Parquet dictionary settings via ConfigMatrix. Record per-Spark-version audit notes for parse_url and url_encode in spark_expressions_support.md.
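The input categories named above (boundary, already-encoded, whitespace-control) can be illustrated with a short Python sketch. It uses `urllib.parse.quote_plus` as a stand-in for form-style URL encoding; this is an assumption for illustration, not the encoder Spark or Comet uses, and Python's set of unescaped characters differs from Java's `URLEncoder` (for example for `~` and `*`). The sample rows are hypothetical and are not taken from `url_encode.sql`.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical sample rows, one per category of interest.
SAMPLE_ROWS = [
    "",        # boundary: empty string
    "a b",     # space becomes '+'
    "%20",     # already-encoded input: '%' itself is escaped again
    "\t\n",    # whitespace control characters
    "é",       # multibyte UTF-8
]


def encode_all(rows):
    """Map each input row to its form-encoded representation."""
    return {row: quote_plus(row) for row in rows}


def roundtrips(rows):
    """Check that decoding the encoded form recovers every input."""
    return all(unquote_plus(quote_plus(row)) == row for row in rows)
```

The already-encoded row is the one most likely to surprise: encoding is not idempotent, so `"%20"` encodes to `"%2520"` rather than being passed through.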
The url_encode test data has no duplicate rows so dictionary encoding is not exercised. Run the test once instead of doubling the runtime.
Add a `query expect_error(%2s)` case locking in the contract that both Spark and Comet error on malformed percent-encoding with the bad sequence in the message. Add lowercase-hex inputs to verify case-insensitive percent-decoding. Record per-version url_decode audit notes in spark_expressions_support.md, including a pointer to issue apache#4155 for the Spark 4.0 `try_url_decode` gap.
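The contract described above (error on malformed percent-encoding, include the bad sequence in the message, accept hex digits in either case) can be sketched in Python as follows. This is a hand-written illustration under those assumptions, not the Spark or datafusion-spark decoder; the name `url_decode_strict` and the exact error text are hypothetical.

```python
import string


def url_decode_strict(s):
    """Decode form-style percent-encoding, rejecting malformed sequences."""
    out = bytearray()
    i = 0
    while i < len(s):
        c = s[i]
        if c == "+":
            # Plus-as-space, as in form encoding.
            out.append(0x20)
            i += 1
        elif c == "%":
            pair = s[i + 1:i + 3]
            # Require exactly two hex digits; upper and lower case both pass.
            if len(pair) != 2 or any(ch not in string.hexdigits for ch in pair):
                raise ValueError(f"malformed percent-encoding: {s[i:i + 3]!r}")
            out.append(int(pair, 16))
            i += 3
        else:
            out.extend(c.encode("utf-8"))
            i += 1
    # Decoded bytes are interpreted as UTF-8.
    return out.decode("utf-8")
```

With this sketch, `url_decode_strict("%e2%82%ac")` and `url_decode_strict("%E2%82%AC")` both return `"€"`, while `url_decode_strict("%2s")` raises a `ValueError` whose message contains `%2s`.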
- Trim CometParseUrl.incompatibleReason to one sentence linking to the upstream issue, since the reason appears in EXPLAIN output.
- parse_url.sql: document the three known divergent shapes and add a fallback assertion for trailing-slash PATH (Spark "/", native "").
- parse_url_native.sql: add an ANSI-mode invalid-URL expect_error case (locks in agreement on the "The url is invalid" message).
- try_parse_url.sql: new file gated on MinSparkVersion 4.0 covering valid input, malformed input (NULL fallback), and NULL input.
Summary
- Add support for the `parse_url`, `url_encode`, and `url_decode` Spark expressions using native implementations from the `datafusion-spark` crate
- `parse_url` maps to `parse_url`/`try_parse_url` based on ANSI mode (`failOnError`), supporting both 2-arg and 3-arg forms
- `url_encode`/`url_decode` are `RuntimeReplaceable` in Spark (rewritten to `StaticInvoke(UrlCodec, "encode"/"decode")`), so handlers were added in `CometStaticInvoke` to intercept these calls and route them to the native functions

Partial fix for #4150
Test plan
- `spark/src/test/resources/sql-tests/expressions/url/parse_url.sql`: HOST, PATH, QUERY, REF, PROTOCOL, FILE, AUTHORITY, USERINFO extraction; 3-arg QUERY key extraction; NULL handling
- `url_encode.sql`: special characters, multibyte UTF-8, empty string, NULL
- `url_decode.sql`: percent-encoded strings, plus-as-space, roundtrip, multibyte UTF-8, NULL
- `cargo check` passes
- `mvn compile` passes

🤖 Generated with Claude Code