
Conversation

@andygrove (Member) commented Mar 28, 2020

Support for Scalar UDFs, allowing custom Rust code to run as an expression. Scalar UDFs are supported both in SQL and in plans built via LogicalPlanBuilder.

This allows users of DataFusion to add their own expressions, and it also provides a framework for adding useful built-in expressions to DataFusion itself.

The following unary math expressions are implemented as a starting point:

    ctx.register_udf(math_unary_function!("sqrt", sqrt));
    ctx.register_udf(math_unary_function!("sin", sin));
    ctx.register_udf(math_unary_function!("cos", cos));
    ctx.register_udf(math_unary_function!("tan", tan));
    ctx.register_udf(math_unary_function!("asin", asin));
    ctx.register_udf(math_unary_function!("acos", acos));
    ctx.register_udf(math_unary_function!("atan", atan));
    ctx.register_udf(math_unary_function!("floor", floor));
    ctx.register_udf(math_unary_function!("ceil", ceil));
    ctx.register_udf(math_unary_function!("round", round));
    ctx.register_udf(math_unary_function!("trunc", trunc));
    ctx.register_udf(math_unary_function!("abs", abs));
    ctx.register_udf(math_unary_function!("signum", signum));
    ctx.register_udf(math_unary_function!("exp", exp));
    ctx.register_udf(math_unary_function!("log", ln));
    ctx.register_udf(math_unary_function!("log2", log2));
    ctx.register_udf(math_unary_function!("log10", log10));

Macros are used to generate convenience methods for creating these expressions in a logical plan, so it is now possible to write something like:

    let plan = LogicalPlanBuilder::scan("", "", &schema, None)?
        .project(vec![sqrt(col("a")), log(col("b"))])?
        .build()?;
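
The convenience methods themselves could plausibly be generated by a macro along these lines. This is a sketch, not the PR's code: the Expr::ScalarFunction variant and its fields are assumed for illustration.

    // Hypothetical sketch: generate a `fn sqrt(e: Expr) -> Expr` helper
    // per UDF, so plans can be written as in the example above. The
    // Expr::ScalarFunction variant shown here is an assumption.
    macro_rules! unary_math_expr {
        ($FN_NAME:ident, $UDF_NAME:expr) => {
            pub fn $FN_NAME(e: Expr) -> Expr {
                Expr::ScalarFunction {
                    name: $UDF_NAME.to_string(),
                    args: vec![e],
                    return_type: DataType::Float64,
                }
            }
        };
    }

    // Expands to the sqrt() and log() helpers used in the plan above:
    unary_math_expr!(sqrt, "sqrt");
    unary_math_expr!(log, "log");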

@andygrove (Member, Author)

@kyle-mccarthy @jorgecarleitao FYI

@andygrove changed the title from "ARROW-6947: [Rust] [DataFusion] Scalar UDF support" to "[WIP] ARROW-6947: [Rust] [DataFusion] Scalar UDF support" on Mar 28, 2020
@andygrove marked this pull request as ready for review on March 28, 2020 16:26
@jorgecarleitao (Member)

The code itself looks really good. My only concern is that the API for registering a UDF involves a lot of boilerplate. Per the tests, registering a simple sqrt function requires the following amount of code:

    let sqrt: ScalarUdf = |args: &Vec<ArrayRef>| {
        let input = &args[0]
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("cast failed");

        let mut builder = Float64Builder::new(input.len());
        for i in 0..input.len() {
            builder.append_value(input.value(i).sqrt())?;
        }
        Ok(Arc::new(builder.finish()))
    };

    let sqrt_meta = ScalarFunction::new(
        "sqrt",
        vec![Field::new("n", DataType::Float64, true)],
        DataType::Float64,
        sqrt,
    );

IMO this is too much: 20 LOC, with an Arc, a downcast, a Builder, and the need to ensure that the schema matches the types of the downcast and the Builder, just for a simple scalar operation.

What if we provide a macro to simplify this declaration, e.g.

    udf!(Float64, Float64, |x| x.sqrt())

such that it is the macro's responsibility to pick the downcast type and builder?
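
A minimal sketch of such a macro, reusing the ScalarUdf boilerplate above (only the Float64-to-Float64 case is spelled out; a full version would match on each supported type pair):

    // Rough sketch of the proposed udf! macro: given the input and output
    // Arrow types, the macro picks the matching downcast and builder.
    macro_rules! udf {
        (Float64, Float64, $FUNC:expr) => {{
            let f: ScalarUdf = |args: &Vec<ArrayRef>| {
                let input = args[0]
                    .as_any()
                    .downcast_ref::<Float64Array>()
                    .expect("cast failed");
                let mut builder = Float64Builder::new(input.len());
                for i in 0..input.len() {
                    builder.append_value($FUNC(input.value(i)))?;
                }
                Ok(Arc::new(builder.finish()))
            };
            f
        }};
    }

    // usage, as proposed:
    let sqrt = udf!(Float64, Float64, |x: f64| x.sqrt());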

@andygrove (Member, Author)

Yes, I think that would be the next logical step now that there is a mechanism for executing scalar functions. Many of the math expressions (sqrt, sin, cos, tan, etc.) are going to be very similar, so it would make sense to use macros.

I think this could be a good follow-up JIRA & PR.

@andygrove (Member, Author)

@paddyhoran @nevi-me PTAL if you have the time.

@andygrove (Member, Author)

@jorgecarleitao Thanks for the suggestion, please take another look if you have time.

@jorgecarleitao (Member)

I went through it. It is now super simple for us to add a float64 function. I also like what you brought up about adding stateful functions. From my side, this is a significant improvement and should be merged.


I am sorry that I was not clear in my previous comment: my argument is that, for the developers using this library (our users), the API to declare and register a new UDF is cumbersome.

Compare the 26 LOC above with Spark's counterpart:

    // Define a regular Scala function
    val upper: String => String = _.toUpperCase

    // Define a UDF that wraps the upper Scala function defined above
    import org.apache.spark.sql.functions.udf
    val upperUDF = udf(upper)

I understand that this is not Scala, but I do believe that the boilerplate can be mitigated by exposing, as part of our public API, a macro that helps our users declare UDFs.

My hypothesis is that we should be able to write a macro as follows:

    scalarUDF!(&str, &dyn Fn(in_type) -> out_type)

so that the user can write

    let udf = scalarUDF!("sqrt", |x: f32| x.sqrt() as f64);
    ctx.register_udf(udf);

and the macro itself figures out what the builder and downcasts should be. A common use case is string manipulation.

IMO this would dramatically improve the user experience of adding a UDF, since the user could focus exclusively on the functionality itself. Of course, some optimizations are only possible when we write the full implementation (e.g. using SIMD or native ops), but for the average data engineer wanting to use Rust and DataFusion, reducing the barrier to writing a UDF significantly helps DataFusion's adoption, since UDFs are a feature that makes Spark so much more useful than the alternatives.
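
For comparison, here is roughly what the Spark upper example would look like with the current verbose API. This is a sketch following the sqrt boilerplate above; the StringArray/StringBuilder usage is assumed to mirror the Float64 builders.

    // Sketch: an upper-case string UDF in today's verbose style, showing
    // the boilerplate the proposed macro would eliminate.
    let upper: ScalarUdf = |args: &Vec<ArrayRef>| {
        let input = args[0]
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("cast failed");
        let mut builder = StringBuilder::new(input.len());
        for i in 0..input.len() {
            builder.append_value(&input.value(i).to_uppercase())?;
        }
        Ok(Arc::new(builder.finish()))
    };

    let upper_meta = ScalarFunction::new(
        "upper",
        vec![Field::new("s", DataType::Utf8, true)],
        DataType::Utf8,
        upper,
    );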

A Contributor commented on the following snippet:

    ScalarFunction::new(
        $NAME,
        vec![Field::new("n", DataType::Float64, true)],
        DataType::Float64,

This looks great; I'll try to do a proper review soon. Are there any considerations we need to make here given that math_unary_function assumes f64? Will this panic on f32 input?

@andygrove (Member, Author) replied:

That's a great point. We need the type coercion optimizer rule to take care of this by automatically casting expressions to the required type where possible, or failing at that stage if the types are not compatible. I will work on this today.
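
The idea, as a self-contained sketch (the Expr and DataType types below are simplified stand-ins, not DataFusion's actual ones): compare each argument's type against the UDF's declared signature and insert a cast wherever a lossless coercion exists.

    // Minimal self-contained sketch of argument coercion for a UDF call;
    // the types here are illustrative, not DataFusion's.
    #[derive(Debug, Clone, PartialEq)]
    enum DataType { Float32, Float64 }

    #[derive(Debug)]
    enum Expr {
        Column(String, DataType),
        Cast { expr: Box<Expr>, to: DataType },
    }

    impl Expr {
        fn data_type(&self) -> DataType {
            match self {
                Expr::Column(_, t) => t.clone(),
                Expr::Cast { to, .. } => to.clone(),
            }
        }
    }

    /// Insert casts so each argument matches the UDF's declared input
    /// type, or fail if no safe coercion exists.
    fn coerce(args: Vec<Expr>, expected: &[DataType]) -> Result<Vec<Expr>, String> {
        args.into_iter()
            .zip(expected)
            .map(|(arg, want)| {
                let have = arg.data_type();
                if have == *want {
                    Ok(arg)
                } else if have == DataType::Float32 && *want == DataType::Float64 {
                    // widening f32 -> f64 is lossless, so cast implicitly
                    Ok(Expr::Cast { expr: Box::new(arg), to: want.clone() })
                } else {
                    Err(format!("cannot coerce {:?} to {:?}", have, want))
                }
            })
            .collect()
    }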

@andygrove (Member, Author)

@jorgecarleitao I see what you mean now. Yes, that's a great point. I will do that either as part of this PR or a follow-up PR.

@andygrove (Member, Author)

@jorgecarleitao I filed https://issues.apache.org/jira/browse/ARROW-8253

@andygrove (Member, Author)

@paddyhoran I suppose it would make sense to move the implementation of these functions out of DataFusion and into Arrow, since they could be used directly there?

@paddyhoran (Contributor)

Yea, I was thinking the same thing but didn't want to hold this PR up. I also think that arrow could benefit from having a sub-module in compute that deals with RecordBatch objects as a lot of crates that build on top will use that abstraction.

@nevi-me (Contributor) commented Apr 4, 2020

> Yea, I was thinking the same thing but didn't want to hold this PR up. I also think that arrow could benefit from having a sub-module in compute that deals with RecordBatch objects as a lot of crates that build on top will use that abstraction.

There are two other alternatives, which are similar.

  1. We could adopt the CPP/Python implementation's ChunkedArray, which is a Vec<Arc<dyn Array>> like the one below. Compute functions that currently take arrays could take a chunked array, where for convenience an Array could be represented as a ChunkedArray with one chunk.

         // this is what I'm doing in the rust-dataframe library
         #[derive(Clone)]
         pub struct ChunkedArray {
             chunks: Vec<Arc<dyn Array>>,
             num_rows: usize,
             null_count: usize,
         }

  2. We could adopt the Datum from CPP, which IIRC behaves like an enum:

         pub enum Datum {
             Scalar(some_value), // haven't thought about what this would look like
             Array(Arc<dyn Array>),
             Chunk(ChunkedArray),
             // ...
         }

This would help DataFusion and other compute users handle literal values and mixed array/scalar operations better, by avoiding creating an array out of scalars.
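
As a self-contained illustration of that last point (simplified stand-in types, not the arrow crate's), a kernel taking a Datum can handle the scalar case without ever materializing a constant array:

    // Illustrative sketch: an add kernel over a Datum-like enum, where the
    // scalar case avoids allocating a constant array.
    enum Datum {
        Scalar(f64),
        Array(Vec<f64>), // stand-in for Arc<dyn Array>
    }

    fn add(left: &Datum, right: &Datum) -> Datum {
        match (left, right) {
            // array + scalar: no need to expand the scalar into an array
            (Datum::Array(a), Datum::Scalar(s)) | (Datum::Scalar(s), Datum::Array(a)) => {
                Datum::Array(a.iter().map(|v| v + s).collect())
            }
            (Datum::Array(a), Datum::Array(b)) => {
                Datum::Array(a.iter().zip(b).map(|(x, y)| x + y).collect())
            }
            (Datum::Scalar(a), Datum::Scalar(b)) => Datum::Scalar(a + b),
        }
    }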

@andygrove (Member, Author)

@nevi-me @paddyhoran It's time to release 0.17 ... are you OK with merging this one so we can follow up with another PR to move the math expressions into Arrow itself? The main reason I'd like to merge this PR for the release is that it enables DataFusion users to add their own UDFs.

@paddyhoran (Contributor) left a review:

Yea, I'm fine with merging this and following up on the other points when we come to a consensus.

@andygrove (Member, Author)

Thanks @paddyhoran
