ARROW-6947: [Rust] [DataFusion] Scalar UDF support #6749
Conversation
|
The code itself looks really good. My only concern is that the API for registering the UDF has a lot of boilerplate code: each test requires about 20 LOC to register a simple scalar function, with an Arc, a downcast, a Builder, and ensuring that the schema matches the type of the downcast and Builder. IMO this is too much for a simple scalar operation. What if we provide a macro to simplify this declaration, e.g. such that it is the macro's responsibility to pick the downcast type and builder? |
|
Yes, I think that would be the next logical step here now that there is a mechanism for executing scalar functions. Many of the math expressions (sqrt, sin, cos, tan, etc) are going to be very similar so it would make sense to use macros. I think this could be a good follow up JIRA & PR. |
|
@paddyhoran @nevi-me PTAL if you have the time. |
|
@jorgecarleitao Thanks for the suggestion, please take another look if you have time. |
|
I went through it. Super simple to add a float64 function for us now. I also like what you brought up about adding stateful functions. From my side, this is a significant improvement and should be merged. I am sorry that I was not clear in my previous comment: my argument is that for the developers using this library, our users, the API to declare and register a new UDF is cumbersome. Compare the 26 LOC with Spark's counterpart: I understand that this is not Scala, but I do believe the boilerplate code can be mitigated by exposing, as part of our (public) API, a macro that helps our users declare UDFs. My hypothesis is that we should be able to write a macro
so that the user writes only the core function and the macro itself figures out what the builder and downcasts should be. A common use case is string manipulation. IMO this would dramatically improve the user experience of adding a UDF, since the user could focus exclusively on the functionality itself. Of course there are optimizations that are possible only when we write the full implementation (e.g. SIMD or native ops), but for the average data engineer wanting to use Rust and DataFusion, reducing the barrier to writing a UDF significantly helps DataFusion's adoption, since UDFs are a feature that makes Spark so much more useful than the alternatives. |
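A minimal sketch of what such a declaration macro could look like. The macro name and the slice-based signature here are illustrative assumptions, not DataFusion's actual registration API; the point is only that the user supplies the core function and the macro hides the boilerplate:

```rust
// Hypothetical macro: generates a named wrapper around a unary f64
// function, hiding the downcast/builder boilerplate the user would
// otherwise write by hand. Names and signatures are illustrative only.
macro_rules! scalar_udf {
    ($name:ident, $func:expr) => {
        // In the real API this would build Arrow arrays; here we map
        // over a plain slice to keep the sketch self-contained.
        fn $name(input: &[f64]) -> Vec<f64> {
            input.iter().map(|v| $func(*v)).collect()
        }
    };
}

scalar_udf!(my_sqrt, f64::sqrt);

fn main() {
    let out = my_sqrt(&[1.0, 4.0, 9.0]);
    assert_eq!(out, vec![1.0, 2.0, 3.0]);
}
```

In a real registration API, the macro body would also pick the matching array downcast and builder for the chosen type, which is exactly the boilerplate being complained about above.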
```rust
ScalarFunction::new(
    $NAME,
    vec![Field::new("n", DataType::Float64, true)],
    DataType::Float64,
```
This looks great, I'll try and do a proper review soon. Are there any considerations we need to make here for the fact that math_unary_function assumes f64? Will this panic on f32 input?
That's a great point. We need the type coercion optimizer rule to take care of this by automatically casting expressions to the required type where possible or failing at that stage if types are not compatible. I will work on this today.
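The coercion idea described above can be sketched in isolation: widen f32 input to f64 before applying a unary f64 function, instead of failing on a type mismatch. The enum and function names are illustrative assumptions, not the actual optimizer-rule API:

```rust
// Sketch of implicit type coercion for a unary f64 math function.
// Every f32 is exactly representable as f64, so the cast is lossless.
enum Column {
    Float32(Vec<f32>),
    Float64(Vec<f64>),
}

fn coerce_to_f64(col: Column) -> Vec<f64> {
    match col {
        Column::Float64(v) => v,
        // automatic widening cast inserted by the coercion rule
        Column::Float32(v) => v.into_iter().map(f64::from).collect(),
    }
}

fn main() {
    // f32 input no longer panics: it is cast to f64, then sqrt applies
    let coerced = coerce_to_f64(Column::Float32(vec![4.0_f32, 9.0_f32]));
    let result: Vec<f64> = coerced.into_iter().map(f64::sqrt).collect();
    assert_eq!(result, vec![2.0, 3.0]);
}
```

In the real optimizer rule, incompatible types that cannot be cast safely would instead produce a planning error rather than a runtime panic.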
|
@jorgecarleitao I see what you mean now. Yes, that's a great point. I will do that either as part of this PR or a follow-up PR. |
|
@paddyhoran I suppose it would make sense to move the implementation of these functions out of DataFusion and into Arrow since they could be used directly there? |
|
Yea, I was thinking the same thing but didn't want to hold this PR up. I also think that |
There are two other similar alternatives.
```rust
// this is what I'm doing in the rust-dataframe library
#[derive(Clone)]
pub struct ChunkedArray {
    chunks: Vec<Arc<dyn Array>>,
    num_rows: usize,
    null_count: usize,
}

pub enum Datum {
    Scalar(some_value), // haven't thought of how this would look like
    Array(Arc<dyn Array>),
    Chunk(ChunkedArray),
    // ...
}
```
This would help DataFusion and other compute users handle literal values and (array ~ scalar) operations better by avoiding creating an array out of scalars. |
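A minimal sketch of why a Datum-style enum helps: a binary kernel can broadcast a scalar against an array without first materializing the scalar into a full array. Simplified here to i64 with a Vec in place of Arrow arrays; the names are illustrative assumptions:

```rust
// Simplified Datum: a value is either a single scalar or an array.
enum Datum {
    Scalar(i64),
    Array(Vec<i64>),
}

// Addition kernel that broadcasts scalars directly, instead of
// allocating a repeated-scalar array and doing array + array.
fn add(left: &Datum, right: &Datum) -> Vec<i64> {
    match (left, right) {
        (Datum::Array(a), Datum::Scalar(s)) | (Datum::Scalar(s), Datum::Array(a)) => {
            a.iter().map(|v| v + s).collect()
        }
        (Datum::Array(a), Datum::Array(b)) => {
            a.iter().zip(b).map(|(x, y)| x + y).collect()
        }
        (Datum::Scalar(a), Datum::Scalar(b)) => vec![a + b],
    }
}

fn main() {
    let arr = Datum::Array(vec![1, 2, 3]);
    let one = Datum::Scalar(1);
    assert_eq!(add(&arr, &one), vec![2, 3, 4]);
}
```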
|
@nevi-me @paddyhoran It's time to release 0.17 ... are you OK with merging this one and we can follow up with another PR to move the math expressions to Arrow itself? The main reason I'd like to merge this PR for the release is that it enables DataFusion users to add their own UDFs. |
paddyhoran left a comment:
Yea, I'm fine with merging this and following up on the other points when we come to a consensus.
|
Thanks @paddyhoran |
Support for Scalar UDFs, allowing custom Rust code to run as an expression. Scalar UDFs are supported both in SQL and in plans built via LogicalPlanBuilder.
This will allow users of DataFusion to add their own expressions and also provides a framework to start adding useful expressions to DataFusion.
The following unary math expressions are implemented as a starting point:
Macros are used to generate convenience methods for creating these expressions in a logical plan, so it is now possible to write something like:
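As a rough illustration of macro-generated convenience methods: one macro invocation per math function stamps out near-identical wrappers. This sketch operates on plain vectors rather than logical-plan expressions, and the macro name is an assumption, not the one in the PR:

```rust
// Illustrative only: a single macro generates one wrapper per unary
// math function, mirroring how similar expressions can be stamped out.
macro_rules! unary_math {
    ($name:ident) => {
        fn $name(input: &[f64]) -> Vec<f64> {
            // relies on f64 having an inherent method of the same name
            input.iter().map(|v| v.$name()).collect()
        }
    };
}

unary_math!(sqrt);
unary_math!(sin);
unary_math!(cos);

fn main() {
    assert_eq!(sqrt(&[4.0, 16.0]), vec![2.0, 4.0]);
    assert_eq!(cos(&[0.0]), vec![1.0]);
}
```

In the PR itself, the generated methods build logical-plan expressions (so they compose with `LogicalPlanBuilder`) rather than evaluating eagerly as above.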