ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions. #9376
Conversation
@jorgecarleitao yes this looks good 👍 This is going to take some time to apply, but I like the idea of passing a function to apply to the string (for example). It would be good to get this merged soon so I can apply it to the big Postgres functions PR.
@alamb, when you wrote the
If the latter, do you think it would be ok to require granularity to always be a scalar as a regression / feature of this PR? I am asking because
FWIW, PostgreSQL also doesn't support the non-scalar granularity for
alamb left a comment
I like where this is headed @jorgecarleitao ❤️ I skimmed it quickly but will be happy to review it thoroughly when it is ready.
Another thing this PR's functionality is likely to enable is more general 'constant folding' -- e.g. the kind of thing @houqp is doing for booleans in #9309.
Imagine an expression like A < (5 + 1) -- using this code we could evaluate the 5 + 1 into 6 at PLAN time rather than during execution. A more likely useful example is date_trunc(date_col, 'month') < date_trunc(2021-01-31, 'month') and other similar things.
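The folding described above can be sketched with a toy expression type; the `Expr` enum and `fold` function here are hypothetical stand-ins for illustration, not DataFusion's actual logical-plan API:

```rust
// Toy constant folding: scalar sub-expressions are evaluated once at
// plan time instead of once per row during execution.
#[derive(Debug, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Lt(Box<Expr>, Box<Expr>),
}

fn fold(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold(*l), fold(*r)) {
            // both sides are literals: evaluate now, at plan time
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        Expr::Lt(l, r) => Expr::Lt(Box::new(fold(*l)), Box::new(fold(*r))),
        other => other,
    }
}

fn main() {
    // A < (5 + 1)  folds to  A < 6
    let expr = Expr::Lt(
        Box::new(Expr::Column("A".to_string())),
        Box::new(Expr::Add(
            Box::new(Expr::Literal(5)),
            Box::new(Expr::Literal(1)),
        )),
    );
    let folded = fold(expr);
    assert_eq!(
        folded,
        Expr::Lt(
            Box::new(Expr::Column("A".to_string())),
            Box::new(Expr::Literal(6))
        )
    );
    println!("{:?}", folded);
}
```

The same recursive walk would apply to function calls such as date_trunc once scalar arguments can be represented and evaluated at plan time.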
@jorgecarleitao -- I agree with @Dandandan -- think
This is ready for a first take. Some notes:
Sorry for the long PR, but I did not find an easy way to split it up (besides #9378, which this PR is built on top of).
Having been trying to implement a lot of the Postgres logic recently, they all logically support the use of arrays for each of the parameters (i.e.
This is a good point @seddonm1 - I am just starting to look at this PR, but I wonder if we could do something like have a default implementation for
that way we would only have to supply
@alamb here is a proof-of-concept for the regexp_replace Postgres function which has been built to support the possibility of passing in different function parameters for each row. If it were possible to tell whether the value was a Scalar or Array there would be major optimisation opportunities. I did some basic memoization of the Regex objects but that would not be as necessary if we knew Scalar v Array.
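The optimisation hinted at here can be sketched as follows. The `Arg` type and the `compile` step are toy stand-ins (not the real `regex` crate or DataFusion APIs), used only to show how knowing an argument is a Scalar lets expensive per-row work be hoisted out of the loop:

```rust
// Stand-in for an argument column: either one value for all rows, or
// one value per row.
enum Arg<'a> {
    Scalar(&'a str),
    Array(Vec<&'a str>),
}

// `compiles` counts how many times the expensive step (think
// Regex::new) runs; the "compiled" form here is just the pattern string.
fn apply(values: &[&str], pattern: &Arg, compiles: &mut usize) -> Vec<bool> {
    let mut compile = |p: &str| {
        *compiles += 1;
        p.to_string()
    };
    match pattern {
        Arg::Scalar(p) => {
            let compiled = compile(p); // compiled once for the whole batch
            values.iter().map(|v| v.contains(compiled.as_str())).collect()
        }
        Arg::Array(ps) => values
            .iter()
            .zip(ps.iter())
            .map(|(v, p)| {
                let c = compile(p); // compiled once per row
                v.contains(c.as_str())
            })
            .collect(),
    }
}

fn main() {
    let mut n = 0;
    let out = apply(&["abc", "bcd"], &Arg::Scalar("bc"), &mut n);
    assert_eq!(out, vec![true, true]);
    assert_eq!(n, 1); // one compile for the whole batch

    let mut m = 0;
    apply(&["abc", "bcd"], &Arg::Array(vec!["a", "z"]), &mut m);
    assert_eq!(m, 2); // one compile per row
}
```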
alamb left a comment
So I wrote up a big comment that github seems to have lost :(
Basically I spent 30 minutes studying this PR. I think it is one of the most important PRs for expression evaluation I have seen in DataFusion.
My major concern (other than the annoyance of the backward compatibility) is that it might put the implementation of built in functions even further out of reach for new contributors. The cognitive load of understanding ArrayRef in general and working with them is already large, and having to handle ColumnValues will just make it more so.
Also this PR is too large for me to effectively review -- even with some more time (which I am not sure I can find) I am not sure I could fully grok it.
I wonder if we could do something like the following to both ease new developers as well as allow this PR to be implemented in chunks.
Make a new type that has the old array signature:

```rust
pub type SimpleScalarFunctionImplementation =
    Arc<dyn Fn(&[ArrayRef]) -> Result<ArrayRef> + Send + Sync>;
```

And then make an adapter function that takes a simple signature and implements the more general one:

```rust
fn make_scalar_function(inner: SimpleScalarFunctionImplementation) -> ScalarFunctionImplementation {
    // invoke inner correctly, creating `ArrayRef` to hold a bunch of scalar values
}
```

That way you could keep most signatures the same and then implement the general cases one by one.
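A self-contained sketch of this adapter idea, with toy stand-ins (`Vec<i64>` for `ArrayRef`, a bare `i64` for `ScalarValue`; the names mirror the suggestion but the types are simplified assumptions, not the real DataFusion signatures):

```rust
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Scalar(i64),     // stand-in for ScalarValue
    Array(Vec<i64>), // stand-in for ArrayRef
}

// the "simple" signature: arrays in, array out
type SimpleFn = Arc<dyn Fn(&[Vec<i64>]) -> Vec<i64> + Send + Sync>;
// the "general" signature: scalars or arrays in
type GeneralFn = Arc<dyn Fn(&[ColumnarValue]) -> ColumnarValue + Send + Sync>;

fn make_scalar_function(inner: SimpleFn) -> GeneralFn {
    Arc::new(move |args: &[ColumnarValue]| {
        // number of rows comes from the first array argument, if any
        let len = args
            .iter()
            .find_map(|a| match a {
                ColumnarValue::Array(v) => Some(v.len()),
                _ => None,
            })
            .unwrap_or(1);
        // expand every scalar into an array of `len` repeated values,
        // so `inner` only ever sees arrays
        let arrays: Vec<Vec<i64>> = args
            .iter()
            .map(|a| match a {
                ColumnarValue::Scalar(s) => vec![*s; len],
                ColumnarValue::Array(v) => v.clone(),
            })
            .collect();
        ColumnarValue::Array(inner(&arrays))
    })
}

fn main() {
    // inner fn written only against arrays: element-wise addition
    let add: SimpleFn =
        Arc::new(|args| args[0].iter().zip(&args[1]).map(|(a, b)| a + b).collect());
    let general = make_scalar_function(add);
    let out = general(&[ColumnarValue::Array(vec![1, 2, 3]), ColumnarValue::Scalar(10)]);
    assert_eq!(out, ColumnarValue::Array(vec![11, 12, 13]));
}
```

A function wrapped this way keeps the old array-only code path working while the scalar-aware variants are implemented one by one.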
Thanks again @jorgecarleitao -- this is really awesome work.
```diff
- // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)
+ // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1] and the base < 0 because that panics!)
```
This represents an open interval: 0 <= exponent < 1 because (x)^1 = x for all x (including negative ones).
cool -- sorry, I guess I am used to seeing an open interval using a `)` -- so in this case something like [0, 1) to represent 0 <= exponent < 1 (e.g. here
I think adding an example of using this pattern for implementing UDFs might be really helpful
I may be missing it -- but this function only seems to be invoked once with the same set of type arguments -- does it need to be generic? Or more broadly, can we hoist this seemingly repeated handle pattern somewhere it can be reused (and reduce the cognitive load on people writing new functions?)
The pattern depends on the input and output of the function. I.e. when the input is &str, then Utf8/LargeUtf8. When the output is a Native, then the output is PrimitiveArray<O::Native>. In general this construct depends on what the user is trying to achieve (w.r.t. input and output types).
I placed this here because it allows decoupling the pattern (of handling Scalar and Array) from the implementation of the logic (string_to_timestamp_nanos in this case).
In crypto_expressions we have a similar pattern, but in this case the function is &str -> AsRef<[u8]>, which allowed writing all the crypto sha functions in a succinct manner. However, in that case, the output type is always Binary instead of LargeBinary for LargeStringArray, because the hashes are always smaller than i32::MAX. All of this was already written; I just expanded it for the two variants (scalar and vector).
Note that crypto_expressions::handle and crypto_expressions::md5 are very similar, but their return types are different: handle receives a GenericStringArray, but returns a BinaryArray. This is because MD5's signature is string -> string, while sha is string -> binary.
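A toy version of this `handle`-style decoupling, with simplified types (not the actual `crypto_expressions` code): the scalar-vs-array dispatch is written once, generically, and each concrete function only supplies the per-value `&str` operation:

```rust
// Stand-in for a string column that is either a single (possibly null)
// scalar or an array of (possibly null) values.
#[derive(Debug, PartialEq)]
enum StrValue {
    Scalar(Option<String>),
    Array(Vec<Option<String>>),
}

// The dispatch pattern, written once: `op` is the only function-specific
// part (think md5, a sha digest, or string_to_timestamp_nanos).
fn handle<F: Fn(&str) -> String>(arg: &StrValue, op: F) -> StrValue {
    match arg {
        // scalar in, scalar out: op runs exactly once
        StrValue::Scalar(v) => StrValue::Scalar(v.as_ref().map(|x| op(x))),
        // array in, array out: op runs once per non-null value
        StrValue::Array(vs) => {
            StrValue::Array(vs.iter().map(|v| v.as_ref().map(|x| op(x))).collect())
        }
    }
}

fn main() {
    // `upper` stands in for the real per-value logic
    let upper = |s: &str| s.to_uppercase();
    assert_eq!(
        handle(&StrValue::Scalar(Some("ok".to_string())), upper),
        StrValue::Scalar(Some("OK".to_string()))
    );
    assert_eq!(
        handle(&StrValue::Array(vec![Some("a".to_string()), None]), upper),
        StrValue::Array(vec![Some("A".to_string()), None])
    );
}
```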
makes sense -- thank you for the explanation
this function is effectively applying a str->str function op to all values in args? Some comments might be helpful
Hi @alamb, thanks a lot for your review and comments. I generally agree that this makes it more difficult to write a UDF and a function implementation in general.
I like that idea. I have now pushed a commit with it. I will now address the remaining ones.
Ok, I have addressed the comments. The API change to UDFs is that people need to call
Out of curiosity, did anyone run the benchmarks? I do not have a machine suitable for that, but I am obviously curious :)
alamb left a comment
I think it is looking good now @jorgecarleitao -- thank you 👍
@jorgecarleitao what kind of benchmarks are you interested in? AFAIK, most benchmarks don't depend much on this. I expect it to be mostly impactful in cases where the projection itself takes a lot of the time, but most benchmarks spend most of the time on joins/aggregates.
@andygrove / @Dandandan / @seddonm1 / @maxburke / @jhorstmann -- what are your thoughts on this one? It is a significant enough change that I think someone using DataFusion in their projects, other than myself, should weigh in. I like it a lot -- and I think it could serve as the basis for constant folding (e.g. transform
I think changes here look good, and will be helpful to implement functions which typically operate on
@alamb I am not sure I understand why folding should be done on the physical plan level, or really why it depends on this PR; it should be possible without the changes in this PR (just by having some
```diff
- let result = (inner)(&args);
+ let result = inner(&args);
```
```diff
- let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+ let result = a.as_ref().map(|x| op(x).as_ref().to_vec());
```
```diff
- let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+ let result = a.as_ref().map(|x| op(x).as_ref().to_vec());
```
```diff
- let result = a.as_ref().map(|x| md5_process(x));
+ let result = a.as_ref().map(md5_process);
```
@Dandandan -- yes, I think constant folding looks like
We'll see if that turns out to be possible, but I think it should be
```diff
- Ok(array.iter().map(|x| x.map(|x| op(x))).collect())
+ Ok(array.iter().map(|x| x.map(op)).collect())
```
```diff
- let result = a.as_ref().map(|x| md5_process(x));
+ let result = a.as_ref().map(md5_process);
```
On the surface this looks like it'll be a fantastic change to have; I am curious to see what the measured impact on query times will be.
Rebased. 💦
This is now ready to merge. This will collide with an also large PR, #9243. We have a function to enable compatibility, but it is still some work :(
Unfortunately (for me) this logically does go first, as being able to identify ScalarValue would give a huge performance advantage. I am happy to rework the other one after this is merged.
Sounds good -- @jorgecarleitao let's get it merged! It looks like it needs another rebase and then we'll get it in.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #9376      +/-   ##
==========================================
- Coverage   82.27%   82.11%    -0.17%
==========================================
  Files         234      235        +1
  Lines       54594    54714      +120
==========================================
+ Hits        44919    44929       +10
- Misses       9675     9785      +110
```
Continue to review full report at Codecov.
Commenting since @maxburke pinged me. On the surface I think this is a great change from the performance perspective. I totally agree that being able to deal with scalars instead of just arrays adds huge room for optimization. I have always thought that always needing intermediate arrays slowed down the processing of DataFusion significantly, for certain cases, and is also cache unfriendly. On the other hand, I hear what @alamb and others are saying that it adds complexity to what is already nontrivial. I agree with that. I wonder if it is possible to get the best of both worlds, by extending the
This is definitely one of the functions I would expect to benefit the most from this change, given that with NULLIF the right-hand argument is most of the time likely to be a scalar.
I also have many instances where knowing the
@velvia, @seddonm1 My opinion atm is that it would not really help much from the compute side, as we would still need to write separate logic for the 4 cases. For example, a subtraction of two arrays goes along the lines of:

```rust
let iter = lhs.values().iter().zip(rhs.values()).map(|(l, r)| l - r);
let array = unsafe { PrimitiveArray::<T>::from_trusted_len_iter(iter) };
// .values() is a slice
```

If we had an implementation of
To change the Array trait we basically need a change in the arrow specification so that all implementations agree on how to communicate such information via IPC, ffi, etc, so, that is mailing list material :)
Note that this PR addresses @alamb's concern by introducing an adapter that people can use if they do not want to bother implementing the scalar variants.
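The "4 cases" mentioned above can be sketched with toy stand-ins (`Vec<i64>` for an Arrow array, `i64` for a scalar; not the real kernel code), showing why each case needs its own logic and where the scalar variants win:

```rust
#[derive(Debug, PartialEq)]
enum Col {
    Scalar(i64),
    Array(Vec<i64>),
}

fn subtract(lhs: &Col, rhs: &Col) -> Col {
    match (lhs, rhs) {
        (Col::Array(l), Col::Array(r)) => {
            // array - array: one operation per row
            Col::Array(l.iter().zip(r).map(|(a, b)| a - b).collect())
        }
        (Col::Array(l), Col::Scalar(s)) => {
            // array - scalar: no rhs array is ever materialized
            Col::Array(l.iter().map(|a| a - s).collect())
        }
        (Col::Scalar(s), Col::Array(r)) => {
            Col::Array(r.iter().map(|b| s - b).collect())
        }
        (Col::Scalar(a), Col::Scalar(b)) => {
            // scalar - scalar: a single operation instead of one per row
            Col::Scalar(a - b)
        }
    }
}

fn main() {
    assert_eq!(
        subtract(&Col::Array(vec![5, 7]), &Col::Scalar(2)),
        Col::Array(vec![3, 5])
    );
    assert_eq!(subtract(&Col::Scalar(9), &Col::Scalar(4)), Col::Scalar(5));
}
```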
Yes, the adapter function I think lowers the barrier to implementing initial scalar functions and should be an easier upgrade path. TLDR it is that one can call
I think this PR is ready and is a great step forward. Let's merge it and continue to iterate.
…nctions. This PR adds support for `ScalarValue` to all builtin functions and UDFs from DataFusion. This allows executing builtin functions without having to create arrays and pass them to the kernels. With this change, coding a new kernel and UDFs is more time consuming, as it is necessary to cater for both the scalar and array cases. OTOH, this leads to a major performance improvement, as we only need to perform 1 operation instead of the usual number-of-rows operations. Closes apache#9376 from jorgecarleitao/scalar_fnc Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>