Conversation
```scala
// datafusion's make_array only supports nullable element types
case array @ CreateArray(children, _) if array.dataType.containsNull =>
  val childExprs = children.map(exprToProto(_, inputs, binding))
  val dataType = serializeDataType(array.dataType)
```
I could "fake" the datatype here when the element isn't nullable and tell DataFusion it is to get it to work, but I wasn't sure if that would have unintended downstream consequences. I can try to update DataFusion at some point to support non-nullable elements if all children are non-nullable.
> I can try to update DataFusion at some point to support non-nullable elements if all children are non-nullable.
Is there an existing DataFusion issue for this? If not, I think it would be good to create one to track it.
There probably should be, but I was waiting to see if this was even the right way to use a ScalarUDF. Looking at it some more, I'm not sure how it could even be updated with the way it currently works, since `ScalarUDFImpl.return_type` just has `DataType` and not `Field` to know whether the elements are nullable or not. Still learning how DataFusion works. Should I somehow be using the other thing created by `make_udf_expr_and_func`:
```rust
// datafusion_functions_nested::make_array
pub fn make_array(arg: Vec<datafusion_expr::Expr>) -> datafusion_expr::Expr
```
?
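For context, a minimal sketch of how that generated builder gets used. It just wraps its arguments in a logical `Expr` that calls the registered UDF, so using it instead likely wouldn't change the nullability behavior by itself. The `expr_fn` module path is an assumption based on the signature quoted above:

```rust
use datafusion_expr::{col, lit, Expr};
// Path assumed from the signature quoted above; in some DataFusion versions
// this lives in the datafusion_functions_array crate instead.
use datafusion_functions_nested::expr_fn::make_array;

// Builds the logical expression make_array(a, 1) -- an Expr invoking the
// registered ScalarUDF, not the kernel that actually constructs the array.
fn build_array_expr() -> Expr {
    make_array(vec![col("a"), lit(1i64)])
}
```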
I am also still learning DataFusion, but my understanding aligns with this comment you made:

> I'm not sure how it could even be updated with the way it currently works, since `ScalarUDFImpl.return_type` just has `DataType` and not `Field` to know whether the elements are nullable or not.

My understanding is that a breaking API change would be needed in DataFusion to make it possible to implement `make_array` with correct nullability for the element.
Hmm, looking a little more, there's a chance I might be wrong. I think `return_type` is only used to infer the return type if it's not specified. I think the real issue is that `invoke` (which uses `fn array_array` under the hood) returns an `Array` with an always-nullable element. But there we have the `ArrayRef`s, which have an `is_nullable` attribute, so it might be possible. I'll try to test that out to see if that's the case. The actual error you get is:
```
Cause: org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, expected List(Field { name: "item", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }) but found List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) at column index 0
```
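To make the mismatch concrete: the two `List` types in that error differ only in the element `Field`'s `nullable` flag, and in Arrow that `Field` is part of the `DataType` itself, so the type equality check fails. A minimal standalone illustration:

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Field};

fn main() {
    // The "expected" and "found" types from the error above: same element
    // type, different element nullability.
    let expected = DataType::List(Arc::new(Field::new("item", DataType::Int64, false)));
    let found = DataType::List(Arc::new(Field::new("item", DataType::Int64, true)));
    // Field equality includes the nullable flag, so the two List types are
    // unequal -- exactly what "column types must match schema types" trips on.
    assert_ne!(expected, found);
}
```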
OK, it's definitely possible to support. I quickly got a test working by just updating `invoke`, but there's also a `return_type_from_exprs` function you can override instead of just `return_type`, which would let you inspect the nullability too. I'll make an issue.
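For reference, a rough sketch of what that override could look like. This is not Comet's or DataFusion's actual code: it assumes the `ScalarUDFImpl` API of the DataFusion versions discussed here (where `return_type_from_exprs` exists), names the struct hypothetically, and elides the actual array construction:

```rust
use std::any::Any;
use std::sync::Arc;

use arrow_schema::{DataType, Field};
use datafusion_common::{ExprSchema, Result};
use datafusion_expr::{ColumnarValue, Expr, ExprSchemable, ScalarUDFImpl, Signature};

#[derive(Debug)]
struct MakeArraySketch {
    signature: Signature,
}

impl ScalarUDFImpl for MakeArraySketch {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "make_array_sketch"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    // Fallback used when only the argument types are known: with no Fields
    // available, the element has to be assumed nullable.
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        let elem = arg_types.first().cloned().unwrap_or(DataType::Null);
        Ok(DataType::List(Arc::new(Field::new("item", elem, true))))
    }

    // With the argument Exprs and the input schema in hand, the element's
    // nullability can be derived: nullable only if some argument is nullable.
    fn return_type_from_exprs(
        &self,
        args: &[Expr],
        schema: &dyn ExprSchema,
        arg_types: &[DataType],
    ) -> Result<DataType> {
        let mut nullable = false;
        for arg in args {
            nullable |= arg.nullable(schema)?;
        }
        // Real make_array coerces mixed argument types; taking the first
        // argument's type is a simplification for this sketch.
        let elem = arg_types.first().cloned().unwrap_or(DataType::Null);
        Ok(DataType::List(Arc::new(Field::new("item", elem, nullable))))
    }

    fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
        // invoke would also need to build the list with a matching element
        // Field, which is the part array_array currently gets wrong.
        unimplemented!("array construction elided in this sketch")
    }
}
```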
More digging made me realize this is a somewhat larger issue with ScalarUDFs: they don't support setting nullability at all (every ScalarUDF is assumed to be nullable). So how has this been handled elsewhere, if at all? Is the best approach just to "pretend" the column is nullable for DataFusion, knowing that logically it should not contain any nulls, and keep it as non-nullable on the Spark side? Otherwise, any expression backed by a ScalarUDF can't support non-nullable output.
CreateArray support
Thanks for the contribution @Kimahriman. I plan on reviewing this in the next day or two.
Thanks, definitely interested in your thoughts on the nullability issue.
kazuyukitanimura left a comment
LGTM
@Kimahriman Do you mind resolving the conflict?
Fixed
andygrove left a comment
LGTM. I agree that it seems like a flaw in DataFusion that we cannot define the nullability correctly.
Since this may come up more and more, does it make sense to just "lie" to DataFusion and tell it the column is nullable even when Spark thinks it's non-nullable? For anything that's not a complex type, this likely already happens silently and everything is happy. The thing that actually complains is https://github.com/apache/arrow-rs/blob/master/arrow-array/src/record_batch.rs#L203 when creating the record batch: it makes sure the data types of the schema match the data types of the columns, and the data type doesn't include nullability for non-complex types, but for complex types the check does include nullability. So basically top-level column nullability isn't checked, but any nested or complex type will have its nullability verified. Arguably that's just a bug in that check.
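To illustrate that asymmetry, a small sketch, assuming arrow-rs versions where `RecordBatch::try_new` only compares `DataType`s as in the check linked above:

```rust
use std::sync::Arc;

use arrow_array::types::Int64Type;
use arrow_array::{ArrayRef, Int64Array, ListArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Top level: the schema says non-nullable and the column even contains a
    // null, but nullability isn't part of DataType::Int64, so try_new accepts it.
    let flat_schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));
    let flat_col: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), None]));
    assert!(RecordBatch::try_new(flat_schema, vec![flat_col]).is_ok());

    // Nested: the element Field (nullable by default here) is part of
    // DataType::List, so a schema expecting a non-nullable "item" is rejected,
    // producing the error quoted earlier in this thread.
    let list_col: ArrayRef = Arc::new(ListArray::from_iter_primitive::<Int64Type, _, _>(vec![
        Some(vec![Some(1), Some(2)]),
    ]));
    let strict_list = DataType::List(Arc::new(Field::new("item", DataType::Int64, false)));
    let list_schema = Arc::new(Schema::new(vec![Field::new("arr", strict_list, true)]));
    assert!(RecordBatch::try_new(list_schema, vec![list_col]).is_err());
}
```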
* Add CreateArray support
* Update Spark SQL test diffs
* Use scalaExprToProto
* Specify data type
* Only do nullable elements again
* Remove unused import
* Add null to the test and add nullable element datafusion issue
* Rename test
* Update lock
Which issue does this PR close?
Closes #792
Rationale for this change
What changes are included in this PR?
Adds support for the `CreateArray` expression. Currently we only support it when the element type is nullable, as that is all that DataFusion's `make_array` supports.
How are these changes tested?
A new unit test, and I had to disable the same test as #735 in the Spark diff for SubqueryBroadcastExec support.