feat: CreateArray support #793

Merged
andygrove merged 12 commits into apache:main from Kimahriman:create-array
Aug 14, 2024

Conversation

@Kimahriman
Contributor

Which issue does this PR close?

Closes #792

Rationale for this change

What changes are included in this PR?

Adds support for the CreateArray expression. Currently we only support when the element type is nullable, as that is all that DataFusion's make_array supports.

How are these changes tested?

A new unit test. I also had to disable the same test as in #735 in the Spark diff, for SubqueryBroadcastExec support.

Comment thread native/core/src/execution/datafusion/planner.rs Outdated
// datafusion's make_array only supports nullable element types
case array @ CreateArray(children, _) if array.dataType.containsNull =>
val childExprs = children.map(exprToProto(_, inputs, binding))
val dataType = serializeDataType(array.dataType)
Contributor Author

I could "fake" the datatype here when the element isn't nullable and tell DataFusion it is to get it to work, but I wasn't sure if that would have unintended downstream consequences. I can try to update DataFusion at some point to support non-nullable elements if all children are non-nullable.

Contributor

> I can try to update DataFusion at some point to support non-nullable elements if all children are non-nullable.

Does a DataFusion issue exist for this? If not, I think it would be good to create one to track it.

Contributor Author


There probably should be, but I was waiting to see if this was even the right way to use a ScalarUDF. Looking at it some more, I'm not sure how it could even be updated the way it currently works, since ScalarUDFImpl.return_type only takes DataType and not Field, so it can't know whether the elements are nullable. Still learning how DataFusion works. Should I somehow be using the other thing created by make_udf_expr_and_func:

datafusion_functions_nested::make_array
pub fn make_array(arg: Vec<datafusion_expr::Expr>) -> datafusion_expr::Expr

?

Contributor


I am also still learning DataFusion, but my understanding aligns with this comment you made:

> I'm not sure how it could even be updated with the way it currently works, since ScalarUDFImpl.return_type just has DataType and not Field to know whether the elements are nullable or not.

My understanding is that an API breakage would be needed in DataFusion to make it possible to implement make_array with correct nullability for the element.

Contributor Author

@Kimahriman commented on Aug 9, 2024


Hmm, looking a little more, there's a chance I might be wrong. I think return_type is only used to infer the return type when it's not specified. The real issue seems to be that invoke (which uses fn array_array under the hood) returns an Array with an always-nullable element. But there we have the ArrayRefs, which have an is_nullable attribute, so it might be possible. I'll test that out to see if that's the case. The actual error you get is:
Cause: org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0

Contributor Author


OK, it's definitely possible to support. I quickly got a test working by just updating invoke, but there's also a return_type_from_exprs function you can override instead of return_type, which lets you inspect the nullability too. I'll make an issue.

Contributor Author


More digging made me realize this is a somewhat larger issue with ScalarUDFs: they don't support setting nullability at all (every ScalarUDF is assumed to be nullable). So how has this been handled elsewhere, if at all? Is the best approach just to "pretend" the column is nullable for DataFusion, knowing that logically it should not contain any nulls, and to keep it non-nullable on the Spark side? Otherwise, any expression backed by a ScalarUDF can't support non-nullable expressions.

@kazuyukitanimura changed the title from "feature: CreateArray support" to "feat: CreateArray support" Aug 8, 2024
Comment thread dev/diffs/3.4.3.diff
@andygrove
Member

Thanks for the contribution @Kimahriman. I plan on reviewing this in the next day or two.

@Kimahriman
Contributor Author

> Thanks for the contribution @Kimahriman. I plan on reviewing this in the next day or two.

Thanks, definitely interested in your thoughts on the nullability issue.

Contributor

@kazuyukitanimura left a comment


LGTM
@Kimahriman Do you mind resolving the conflict?

@Kimahriman
Contributor Author

> LGTM @Kimahriman Do you mind resolving the conflict?

Fixed

Member

@andygrove left a comment


LGTM. I agree that it seems like a flaw in DataFusion that we cannot define the nullability correctly.

@andygrove merged commit 5e81650 into apache:main Aug 14, 2024
@Kimahriman
Contributor Author

> LGTM. I agree that it seems like a flaw in DataFusion that we cannot define the nullability correctly.

Since this may come up more and more, does it make sense to just "lie" to DataFusion and tell it the type is nullable even when Spark thinks it's non-nullable? For anything that's not a complex type, this likely already happens silently and everything is happy.

The thing that actually complains is https://github.com/apache/arrow-rs/blob/master/arrow-array/src/record_batch.rs#L203 when creating the record batch. It makes sure the data types of the schema match the data types of the columns, but the data type doesn't include nullability for non-complex types, while for complex types that check does include nullability. So top-level column nullability isn't checked, but any nested or complex type will have its nullability verified. Arguably that's just a bug in that check.

@andygrove mentioned this pull request Aug 19, 2024
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* Add CreateArray support

* Update Spark SQL test diffs

* Use scalaExprToProto

* Specify data type

* Only do nullable elements again

* Remove unused import

* Add null to the test and add nullable element datafusion issue

* Rename test

* Update lock