ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types #7974
FYI @andygrove @alamb @houqp: this is a draft because it depends on other PRs being reviewed and accepted.
rust/datafusion/src/logicalplan.rs
Outdated
I had to drop `PartialEq` for expressions, and therefore also dropped this, since it was not strictly needed.
todo
todo
Currently a UDF's argument can only be a single type. This PR adds support for multiple types per argument, thus allowing users to register UDFs that can handle multiple types at once.
This operation is already performed by the optimizer, whose version is more complete.
This allows declaring UDFs that support multiple types. The existing UDFs (math) now support float32 and float64.
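To make "multiple types per argument" concrete, here is a minimal sketch of how such a signature could be checked. It is not the code from this PR: `DataType` is a local stand-in for Arrow's enum, and `Signature`, `accepts`, and `unary_math_signature` are hypothetical names chosen for illustration.

```rust
/// Hypothetical stand-in for Arrow's `DataType`; only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Float32,
    Float64,
    Utf8,
}

/// Hypothetical UDF signature: for each argument position,
/// the list of types that the function accepts at that position.
pub struct Signature {
    pub arg_types: Vec<Vec<DataType>>,
}

impl Signature {
    /// Returns true when `actual` has the right number of arguments and
    /// every argument's type is among those accepted at its position.
    pub fn accepts(&self, actual: &[DataType]) -> bool {
        self.arg_types.len() == actual.len()
            && self
                .arg_types
                .iter()
                .zip(actual)
                .all(|(valid, t)| valid.contains(t))
    }
}

/// Signature of a unary math function (e.g. `sqrt`) that accepts
/// either a float32 or a float64 argument.
pub fn unary_math_signature() -> Signature {
    Signature {
        arg_types: vec![vec![DataType::Float32, DataType::Float64]],
    }
}
```

With this shape, a single registration covers both float widths, and a call with a `Utf8` argument or with the wrong number of arguments is rejected at planning time.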
  #[derive(Debug)]
  struct SumAccumulator {
-     sum: Option<ScalarValue>,
+     pub sum: Option<ScalarValue>,
not needed?
pub mod dataframe;
pub mod datasource;
mod datatyped;
pub so that others can implement a different DataTyped?
  args: Vec<Expr>,
  /// The `DataType` the expression will yield
- return_type: DataType,
+ return_type: ReturnType,
THIS IS A MAJOR CHANGE: it makes `Expr` hold an `Arc<&dyn>`, which makes expressions unserializable. Unfortunately, I was unable to find another way.
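For readers following along, a rough sketch of what storing a return-type function inside an expression can look like. The alias and struct below are assumptions for illustration only (the PR's actual `ReturnType` definition is not shown in this thread), and `DataType` is a local stand-in for Arrow's enum.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float32,
    Float64,
    Utf8,
}

/// Hypothetical alias: a shared function that maps the argument types
/// of a call to the type that the call returns.
pub type ReturnType =
    Arc<dyn Fn(&[DataType]) -> Result<DataType, String> + Send + Sync>;

/// Simplified expression node. `PartialEq`, `Debug`, and serialization
/// cannot be derived here, because `dyn Fn` implements none of them.
#[derive(Clone)]
pub struct ScalarFunction {
    pub name: String,
    pub arg_types: Vec<DataType>,
    pub return_type: ReturnType,
}

impl ScalarFunction {
    /// Resolves the return type by calling the stored function.
    pub fn get_type(&self) -> Result<DataType, String> {
        (self.return_type)(&self.arg_types)
    }
}

/// A return-type function that yields the type of the first argument.
pub fn same_as_first_arg() -> ReturnType {
    Arc::new(|args: &[DataType]| {
        args.first()
            .cloned()
            .ok_or_else(|| "expected at least one argument".to_string())
    })
}
```

The trade-off is visible in the derives: the node stays cheaply cloneable through the `Arc`, but it can no longer be compared or serialized without custom handling of the function field.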
Expr::AggregateFunction { return_type, .. } => Ok(return_type.clone()),
Expr::ScalarFunction {
    args, return_type, ..
} => return_type(&args.iter().map(|x| x.as_datatyped()).collect(), schema),
not very beautiful, but was unable to find a better pattern for this :(
match self.schema_provider.get_function_meta(&name) {
    Some(fm) => {
        let args = if name == "count" {
            // optimization to avoid computing expressions
This should probably be moved to another place (an optimizer?). For now I left it here, as it was.
This PR is built on top of #7971.
Its current runtime consequence is that all our math functions now return float32 or float64 depending on their incoming column (and use float32 for other numeric types). Its API consequence is that it allows registering UDFs with variable return types.
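The runtime rule described above can be written as a small return-type function. This is a sketch under assumptions: `DataType` is a local stand-in for Arrow's enum with only a few variants, `math_return_type` is a hypothetical name, and rejecting non-numeric input with an error is my assumption rather than something stated in the PR.

```rust
/// Hypothetical stand-in for Arrow's `DataType`; a few variants only.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
}

/// Return type of a unary math function given its argument type:
/// float64 stays float64, while float32 and the other numeric types
/// yield float32. Non-numeric types are rejected (an assumption).
pub fn math_return_type(arg: &DataType) -> Result<DataType, String> {
    match arg {
        DataType::Float64 => Ok(DataType::Float64),
        DataType::Float32 | DataType::Int32 | DataType::Int64 => Ok(DataType::Float32),
        other => Err(format!("unsupported argument type {:?}", other)),
    }
}
```

A function of this shape is what a UDF with a variable return type would supply at registration, in place of a single fixed `DataType`.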
This PR touches both physical plans and logical plans, and abstracts the typing of some of our structs a bit. In particular, it introduces a new trait that only cares about the return datatype of an object (PhysicalExpr and LogicalExpr).
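A minimal sketch of the idea behind such a trait, assuming a method that takes a schema and returns a type. The method name, the `Schema` alias, and the two implementing structs are hypothetical; the real trait in this PR may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float32,
    Float64,
}

/// Hypothetical schema: column name to column type.
pub type Schema = HashMap<String, DataType>;

/// Anything that can report the data type it yields for a given schema.
pub trait DataTyped {
    fn get_type(&self, schema: &Schema) -> Result<DataType, String>;
}

/// Stand-in for a logical expression: a column reference.
pub struct LogicalColumn(pub String);

/// Stand-in for a physical expression: a float64 literal.
pub struct PhysicalLiteral(pub f64);

impl DataTyped for LogicalColumn {
    fn get_type(&self, schema: &Schema) -> Result<DataType, String> {
        schema
            .get(&self.0)
            .cloned()
            .ok_or_else(|| format!("unknown column {}", self.0))
    }
}

impl DataTyped for PhysicalLiteral {
    fn get_type(&self, _schema: &Schema) -> Result<DataType, String> {
        Ok(DataType::Float64)
    }
}

/// Collects argument types without knowing whether the arguments
/// come from a logical plan or a physical plan.
pub fn arg_types(args: &[&dyn DataTyped], schema: &Schema) -> Result<Vec<DataType>, String> {
    args.iter().map(|a| a.get_type(schema)).collect()
}
```

Because `arg_types` only depends on the trait, the same return-type function can be evaluated during logical planning and again during physical planning.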