feat: CreateArray support #793

Merged
andygrove merged 12 commits into apache:main from Kimahriman:create-array
Aug 14, 2024

Conversation

@Kimahriman
Contributor

Which issue does this PR close?

Closes #792

Rationale for this change

What changes are included in this PR?

Adds support for the CreateArray expression. Currently we only support when the element type is nullable, as that is all that DataFusion's make_array supports.

How are these changes tested?

A new unit test. I also had to disable the same test as in #735 in the Spark diff, for SubqueryBroadcastExec support.

Comment thread native/core/src/execution/datafusion/planner.rs Outdated
// datafusion's make_array only supports nullable element types
case array @ CreateArray(children, _) if array.dataType.containsNull =>
val childExprs = children.map(exprToProto(_, inputs, binding))
val dataType = serializeDataType(array.dataType)
Contributor Author

I could "fake" the datatype here when the element isn't nullable and tell DataFusion it is to get it to work, but I wasn't sure if that would have unintended downstream consequences. I can try to update DataFusion at some point to support non-nullable elements if all children are non-nullable.

Contributor

> I can try to update DataFusion at some point to support non-nullable elements if all children are non-nullable.

Does a DataFusion issue exist for this? If not, I think it would be good to create one to track it.

Contributor Author


There probably should be, but I was waiting to see if this was even the right way to use a ScalarUDF. Looking at it some more, I'm not sure how it could even be updated the way it currently works, since ScalarUDFImpl.return_type only takes DataType and not Field, so it can't know whether the elements are nullable. Still learning how DataFusion works. Should I somehow be using the other thing created by make_udf_expr_and_func:

datafusion_functions_nested::make_array
pub fn make_array(arg: Vec<datafusion_expr::Expr>) -> datafusion_expr::Expr

?

Contributor


I am also still learning DataFusion, but my understanding aligns with this comment you made:

> I'm not sure how it could even be updated with the way it currently works, since ScalarUDFImpl.return_type just has DataType and not Field to know whether the elements are nullable or not.

My understanding is that an API breakage would be needed in DataFusion to make it possible to implement make_array with correct nullability for the element.

Contributor Author

@Kimahriman commented on Aug 9, 2024


Hmm, looking a little more, there's a chance I might be wrong. I think return_type is only used to infer the return type when it's not specified. The real issue seems to be that invoke (which uses fn array_array under the hood) returns an Array with an always-nullable element. But there we have the ArrayRefs, which have an is_nullable attribute, so it might be possible. I'll test that out to see if that's the case. The actual error you get is:
Cause: org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0

Contributor Author


OK, it's definitely possible to support. I quickly got a test working by just updating invoke, but there's also a return_type_from_exprs function you can override instead of return_type, which lets you inspect the nullability too. I'll make an issue.

Contributor Author


More digging made me realize this is a somewhat larger issue with ScalarUDFs: they don't support setting nullability at all (every ScalarUDF is assumed to be nullable). So how has this been handled elsewhere, if at all? Is the best approach just to "pretend" the column is nullable for DataFusion, knowing that logically it should not contain any nulls, and to keep it non-nullable on the Spark side? Otherwise, any expression backed by a ScalarUDF can't support non-nullable expressions.

@kazuyukitanimura changed the title from "feature: CreateArray support" to "feat: CreateArray support" Aug 8, 2024
Comment thread dev/diffs/3.4.3.diff
@andygrove
Member

Thanks for the contribution @Kimahriman. I plan on reviewing this in the next day or two.

@Kimahriman
Contributor Author

> Thanks for the contribution @Kimahriman. I plan on reviewing this in the next day or two.

Thanks, definitely interested in your thoughts on the nullability issue.

Contributor

@kazuyukitanimura left a comment


LGTM
@Kimahriman Do you mind resolving the conflict?

@Kimahriman
Contributor Author

> LGTM @Kimahriman Do you mind resolving the conflict?

Fixed

Member

@andygrove left a comment


LGTM. I agree that it seems like a flaw in DataFusion that we cannot define the nullability correctly.

@andygrove merged commit 5e81650 into apache:main Aug 14, 2024
@Kimahriman
Contributor Author

> LGTM. I agree that it seems like a flaw in DataFusion that we cannot define the nullability correctly.

Since this may come up more and more, does it make sense to just "lie" to DataFusion and tell it the type is nullable even when Spark thinks it's non-nullable? For anything that's not a complex type, this likely already happens silently and everything is happy.

The thing that actually complains is https://github.com/apache/arrow-rs/blob/master/arrow-array/src/record_batch.rs#L203 when creating the record batch. It makes sure the data types of the schema match the data types of the columns, but the data type doesn't include nullability for non-complex types, while for complex types that check does include nullability. So top-level column nullability isn't checked, but any nested or complex type will have its nullability verified. Arguably that's just a bug in that check.

@andygrove mentioned this pull request Aug 19, 2024
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* Add CreateArray support

* Update Spark SQL test diffs

* Use scalaExprToProto

* Specify data type

* Only do nullable elements again

* Remove unused import

* Add null to the test and add nullable element datafusion issue

* Rename test

* Update lock