Skip to content

feat: support Spark expression slice#4149

Open
andygrove wants to merge 2 commits intoapache:mainfrom
andygrove:feat/array-slice
Open

feat: support Spark expression slice#4149
andygrove wants to merge 2 commits intoapache:mainfrom
andygrove:feat/array-slice

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Add native support for Spark's slice(array, start, length) expression so it runs on Comet instead of falling back to Spark.

The datafusion-spark crate already ships a SparkSlice, but it is not Spark-compatible: when a negative start lies before the beginning of the array (e.g. slice([a], -2, 2)), it returns the first element instead of an empty array. We can upstream the fix later; for now this PR ships a Comet-local implementation.

What changes are included in this PR?

  • native/spark-expr/src/array_funcs/array_slice.rs: new SparkArraySlice UDF (spark_array_slice) implementing Spark's slice semantics, including 1-based indexing, negative-start-from-end, error on start = 0 or length < 0, and clamping length to the array end. Supports both List and LargeList element storage.
  • native/spark-expr/src/comet_scalar_funcs.rs: register the new UDF.
  • spark/src/main/scala/org/apache/comet/serde/arrays.scala: CometSlice serde casts the start/length args to Long and serialises a call to spark_array_slice, promising containsNull = true to match DataFusion's list nullability.
  • spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: register Slice in arrayExpressions.

How are these changes tested?

  • 12 native unit tests in array_slice.rs covering positive / negative / zero / overflowing start, length 0, length past end, null inputs, empty arrays, and the error cases.
  • New SQL test file spark/src/test/resources/sql-tests/expressions/array/slice.sql covering all-literal, column + literal, and column-only argument combinations across boolean, tinyint, smallint, int, bigint, float, double, decimal, date, timestamp, timestamp_ntz, string, and nested array element types, plus the negative-start-overflow case that exposed the upstream bug.

@andygrove andygrove marked this pull request as ready for review April 30, 2026 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant