
Conversation

@raulcd
Owner

@raulcd raulcd commented Dec 12, 2024

This PR is pointing against my own repo because I am using it as a learning exercise

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Comment on lines 62 to 64
ARROW_ASSIGN_OR_RAISE(mean_datum, arrow::compute::CallFunction("mean", {input}));
ARROW_ASSIGN_OR_RAISE(demean_datum,
arrow::compute::CallFunction("subtract", {input, mean_datum}));
Owner Author


I am pretty sure we don't want to call the functions like this (it probably adds a lot of overhead), but I am unsure whether we have any technique for calling other kernels from an existing kernel.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@raulcd
Owner Author

raulcd commented Dec 12, 2024

@pitrou, this is probably pretty bad (that's why it's pointing against my own fork). I was treating it as a learning exercise (both for C++ and for Arrow kernels) and as an initial discussion on what / how I should do it. Let me know if we can go over it and I can keep iterating, possibly opening an issue and an initial "real PR".

Comment on lines +32 to +33
("Returns an array where every element has been removed from \n"
"the mean of the array."),
Owner Author


TODO, fix desc

@pitrou

pitrou commented Dec 16, 2024

@raulcd @icexelloss is there a clear definition somewhere of what the demean function should do?

@raulcd
Owner Author

raulcd commented Dec 16, 2024

@raulcd @icexelloss is there a clear definition somewhere of what the demean function should do?

I assumed demeaning means subtracting the mean from each slot so that the result has mean zero, but it's fair that this might be an incorrect assumption.

@icexelloss

@icexelloss is there a clear definition somewhere of what the demean function should do?

y = x - mean(x) 

where x is the input vector, y is the output vector
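
For a concrete example: with x = [1, 2, 3], mean(x) = 2 and y = [-1, 0, 1]. A minimal sketch of this formula in terms of the existing "mean" and "subtract" kernels (essentially what the diff snippet above does; the helper name DemeanSketch is just for illustration):

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: y = x - mean(x), composed from the existing "mean" and "subtract" kernels.
arrow::Result<arrow::Datum> DemeanSketch(const arrow::Datum& input) {
  // "mean" reduces the input to a scalar.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum mean,
                        arrow::compute::CallFunction("mean", {input}));
  // "subtract" broadcasts the scalar mean over every slot of the input.
  return arrow::compute::CallFunction("subtract", {input, mean});
}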

ExecContext* ctx) const override {
switch (args[0].kind()) {
case Datum::ARRAY:
case Datum::CHUNKED_ARRAY: {


I would say support for chunked_array is not needed - more of a "nice to have". The main application of these kernels is to be used with Acero, which doesn't use chunked arrays.

"the mean of the array."),
{"array"}};

class DemeanMetaFunction : public MetaFunction {


Remind me what MetaFunction does?


Oh - is a "MetaFunction" a function that is composed of other functions?

ASSERT_OK(input_builder.AppendValues(test_values));
ASSERT_OK_AND_ASSIGN(auto test_inputs, input_builder.Finish());

ASSERT_OK_AND_ASSIGN(Datum demean_result, CallFunction("demean", {test_inputs}));

@icexelloss icexelloss Dec 16, 2024


I am not sure if this works with Acero - ultimately we want to use the demean function with Acero similar to [this](https://github.com/apache/arrow/blob/da3c6dde44575d3b68a20a0493bd3e9a1588aa14/cpp/src/arrow/acero/aggregate_internal.h#L229) (the link is to an aggregate node, but we have something similar that uses vector kernels instead), which uses the kernel abstraction instead of the function abstraction in the node.


I think if the MetaFunction supports DispatchExact it should work:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/aggregate_internal.cc#L81C57-L81C70

@raulcd Can you test this out?


In any case, I think we should probably add a "test Acero vector node" to make this easier - basically a simple Acero node that passes each batch to the vector kernel and outputs the result - this would be useful for testing other vector kernels too.

@pitrou

pitrou commented Dec 16, 2024

@icexelloss is there a clear definition somewhere of what the demean function should do?

y = x - mean(x) 

where x is the input vector, y is the output vector

Hmm, so is there a reason to make a dedicated function for this? It will trivially execute the formula above.

@icexelloss

icexelloss commented Dec 16, 2024

@icexelloss is there a clear definition somewhere of what the demean function should do?

y = x - mean(x) 

where x is the input vector, y is the output vector

Hmm, so is there a reason to make a dedicated function for this? It will trivially execute the formula above.

I think the main reason is to be used within Acero in a streaming way.

Let's say we have streaming input

time v
t0    v0
t0    v1
t1    v2
t1    v3

and we want to demean v for each time value. With a demean vector kernel we can do this pretty easily (basically group the input rows by "time" into time segments and invoke the demean kernel on each segment).

If we didn't have a demean kernel, then we would need to do something like:

(1) aggregate by time to get the mean
(2) join the mean back to the original data
(3) subtract the mean

which is more complex (both in code and in computation). Also, for functions of the same "shape" (N->N mapping with grouping, e.g. rank, winsorize), we would need to do something different. Although the demean formula is trivial, IMO having it as a vector kernel makes it easier to use.

I think a future step is to share implementation between kernels, but for this particular task, since the demean kernel is trivial I don't think we need to share implementation (between mean and demean).
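
To make the "segment then demean" idea concrete outside of Acero, here is a rough sketch that assumes the rows are already sorted by time and the segment offsets/lengths are already known (in Acero the segmenter would produce them; the boundaries argument here is purely illustrative):

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: apply the "demean" kernel independently to each time segment of a
// value array that is already sorted by time. Each (offset, length) pair in
// `boundaries` describes one segment; producing them is the segmenter's job.
arrow::Result<std::vector<arrow::Datum>> DemeanPerSegment(
    const std::shared_ptr<arrow::Array>& values,
    const std::vector<std::pair<int64_t, int64_t>>& boundaries) {
  std::vector<arrow::Datum> results;
  results.reserve(boundaries.size());
  for (const auto& [offset, length] : boundaries) {
    std::shared_ptr<arrow::Array> segment = values->Slice(offset, length);
    ARROW_ASSIGN_OR_RAISE(arrow::Datum demeaned,
                          arrow::compute::CallFunction("demean", {segment}));
    results.push_back(std::move(demeaned));
  }
  return results;
}

With the kernel in place, the three-step aggregate/join/subtract pipeline described above collapses into a single per-segment call.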

@pitrou

pitrou commented Dec 16, 2024

Ah, so what you're looking for is really some kind of "grouped demean" for segmented keys, right?

@icexelloss

Ah, so what you're looking for is really some kind of "grouped demean" for segmented keys, right?

Yeah - we have already implemented an Acero node to do the segmentation part, so we just need these vector kernels to be implemented.

@pitrou

pitrou commented Dec 16, 2024

Ok, so you really mean a vector kernel doing the trivial thing? What stops you from implementing it yourself?

@icexelloss

icexelloss commented Dec 17, 2024

Ok, so you really mean a vector kernel doing the trivial thing? What stops you from implementing it yourself?

Nothing really. Demean is a simple kernel and to us it doesn't really matter where it lives (Arrow or internally). IMO there are a couple of benefits of implementing demean upstream:

(1) A recipe/example to reuse/share implementation between kernels (mean and demean)
(2) A simple kernel to test out "exposing kernels as a separate .so file"
(3) Potential optimizations (e.g. SIMD) in the future
