Skip to content

url_decode: try_url_decode (Spark 4.0) errors instead of returning NULL on malformed input #4155

@andygrove

Description

@andygrove

Describe the bug

On Spark 4.0, try_url_decode(x) should return NULL when x is not a valid percent-encoded string. Comet errors instead.

Root cause

Spark 4.0 rewrites try_url_decode(x) through Catalyst's RuntimeReplaceable chain:

TryUrlDecode(x)
  -> UrlDecode(x, failOnError = false)
    -> StaticInvoke(UrlCodec, "decode", Seq(x, Literal(false)), Seq(StringTypeWithCollation, BooleanType))

CometUrlDecodeStaticInvoke (in spark/src/main/scala/org/apache/comet/serde/statics.scala) only reads expr.children.head and emits scalarFunctionExprToProto("url_decode", child). The Literal(false) carrying the failOnError flag is silently dropped. The native SparkUrlDecode (datafusion-spark) always errors on malformed percent-encoding, so the Comet path errors where Spark would have returned NULL.

This affects only Spark 4.0+ (try_url_decode does not exist in 3.4 / 3.5).

Reproduction (Spark 4.0)

-- Spark returns NULL
SELECT try_url_decode('http%3A%2F%2spark.apache.org');

-- With Comet enabled, this throws "Invalid percent-encoding: ..." instead of returning NULL.

Suggested fix

Either of:

  1. Detect the Spark-4.0 StaticInvoke(UrlCodec, "decode", Seq(child, Literal(false)), ...) shape in CometUrlDecodeStaticInvoke and emit a try_url_decode scalar function. datafusion_spark::function::url::try_url_decode::TryUrlDecode already exists in the datafusion-spark crate but is not registered in native/core/src/execution/jni_api.rs. Registering it (similar to how SparkTryParseUrl is already registered) would make the dispatch trivial.
  2. Mark this StaticInvoke shape as Unsupported / Incompatible so Comet falls back to Spark whenever failOnError=false.

Notes

  • For url_decode(x) (i.e. failOnError=true, the default), Comet's behavior matches Spark: both error on malformed input. The error class differs (CANNOT_DECODE_URL vs DataFusion's Invalid percent-encoding) but both fail.
  • This was discovered while auditing the url_decode expression added in feat: add support for parse_url, url_encode, url_decode #4152. A test for url_decode malformed input was added in that PR. A try_url_decode test should be added once this issue is fixed.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions