-
Notifications
You must be signed in to change notification settings - Fork 0
Initial base implementation for demean Kernel #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| ARROW_ASSIGN_OR_RAISE(mean_datum, arrow::compute::CallFunction("mean", {input})); | ||
| ARROW_ASSIGN_OR_RAISE(demean_datum, | ||
| arrow::compute::CallFunction("subtract", {input, mean_datum})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure we don't want to call the functions like this (possibly a lot of overhead) but I am unsure if we have any technique to call other kernels from an existing kernel.
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
@pitrou, this is probably pretty bad (that's why is pointing against my own fork). I was treating it as a learning exercise (both for C++ and for Arrow Kernels) and as an initial discussion on what / how should I(?) do it. Let me know if we can go over it and I can keep iterating, possibly opening an issue and an initial "real PR" |
| ("Returns an array where every element has been removed from \n" | ||
| "the mean of the array."), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO, fix desc
a87aadc to
0b6cb7c
Compare
|
@raulcd @icexelloss is there a clear definition somewhere of what the demean function should do? |
I assumed demeaning as subtracting the mean from each slot so that they are mean zero but it is fair that it might be an incorrect assumption. |
where |
| ExecContext* ctx) const override { | ||
| switch (args[0].kind()) { | ||
| case Datum::ARRAY: | ||
| case Datum::CHUNKED_ARRAY: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say support for chunked_array is not needed - more of a "nice to have". The main application of these kernels are to be used with Acero which doesn't use chunk arrays.
| "the mean of the array."), | ||
| {"array"}}; | ||
|
|
||
| class DemeanMetaFunction : public MetaFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me what MetaFunction does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh - is a "MetaFunction" a function that is composed of other functions?
| ASSERT_OK(input_builder.AppendValues(test_values)); | ||
| ASSERT_OK_AND_ASSIGN(auto test_inputs, input_builder.Finish()); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(Datum demean_result, CallFunction("demean", {test_inputs})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this works with Acero - ultimately we want to use the demean function with Acero similar to [this] (The link is an aggregate node but we have something similar but uses vector kernels instead) (https://github.com/apache/arrow/blob/da3c6dde44575d3b68a20a0493bd3e9a1588aa14/cpp/src/arrow/acero/aggregate_internal.h#L229).
Which uses the kernel abstraction instead of the function abstraction in the node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the MetaFunction supports DispatchExact it should work:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/aggregate_internal.cc#L81C57-L81C70
@raulcd Can you test this out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, I think we should probably add a "test Acero vector node" to make this easier - basically an simple Acero node that pass each batch to the vector kernel and output the result - this would be useful to test other vector kernels too.
Hmm, so is there a reason to make a dedicated function for this? It will trivially execute the formula above. |
I think the main reason is to be used within Acero in a Streaming way, Let's say we have streaming input and we want to demean v for each time value, with a demean vector kernel we can do this pretty easily (basically group the input rows by "time" (as time segment) and invoke the demean kernel on the time segment). If we didn't have demean kernel, then we would need to do sth like: which is more complex (both code and computation). Also for functions of the same "shape" (N->N mapping with grouping, e.g. rank, winsorize), we would need to do something different. Although the demean formula is trivial - IMO having it a vector kernel makes it easier to use. I think a future step is to share implementation between kernels, but for this particular task, since the demean kernel is trivial I don't think we need to share implementation (between mean and demean). |
|
Ah, so what you're looking for is really some kind of "grouped demean" for segmented keys, right? |
Yeah - we already have implemented Acero node to do the segmentation part so just need these vector kernels to be implemented |
|
Ok, so you really mean a vector kernel doing the trivial thing? What stops you from implementing it yourself? |
Nothing really. Demean is a simple kernel and to us it doesn't really matter where it lives (Arrow or internally). IMO there are a couple of benefits of implementing demean upstream: (1) A recipe/example to reuse/share implementation between kernels (mean and demean) |
This PR is pointing against my own repo because I am using it as a learning exercise
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?