-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15641: [C++][Python] UDF Aggregate Function Implementation #14527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
|
979b9f1 to
cef71af
Compare
|
cc @westonpace appreciate an initial review on the draft PR. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shape of this looks correct. I believe you are on the right track. I think I would like to see at least one example of using a custom UDF in an exec plan. Also, similar to scalar UDFs, I think we will eventually want an example that uses something other than pyarrow (e.g. numpy) to do the computation (at least for the consume function)
| @property | ||
| def non_null(self): | ||
| return self._non_null | ||
|
|
||
| @non_null.setter | ||
| def non_null(self, value): | ||
| self._non_null = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not much of a python expert but getters and setters seem like overkill here. Are they needed?
| state = State(0) | ||
| return state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| state = State(0) | |
| return state | |
| return State(0) |
| def consume(ctx, x): | ||
| if isinstance(x, pa.Array): | ||
| non_null = pc.sum(pc.invert(pc.is_nan(x))).as_py() | ||
| elif isinstance(x, pa.Scalar): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a unary aggregate ever be called with a scalar?
| def finalize(ctx): | ||
| return pa.array([ctx.state.non_null]) | ||
|
|
||
| func_name = "simple_count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe valid_count or non_null_count?
| @property | ||
| def non_null(self): | ||
| return self._non_null | ||
|
|
||
| @non_null.setter | ||
| def non_null(self, value): | ||
| self._non_null = value | ||
|
|
||
| @property | ||
| def null(self): | ||
| return self._null | ||
|
|
||
| @null.setter | ||
| def null(self, value): | ||
| self._null = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on getters and setters
| To define a varargs function, pass a callable that takes | ||
| varargs. The last in_type will be the type of all varargs | ||
| arguments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do varargs work? Do I define a *args?
| must be merged with. The current state can be retrieved from | ||
| the context object which can be acessed by `context.state`. | ||
| The state doesn't need to be set in the Python side and it is | ||
| autonomously handled in the C++ backend. The updated state must |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the sentence that starts with "The state doesn't need to be set..."
| This function returns the updated state after consuming the | ||
| received data. | ||
| merge_func: callable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concept of a "merge function" is not going to be obvious to most users. It is very possible for an engine to be defined that does not have to worry about the concept of a merge. I think we need to describe some background here for why a merge is needed in the first place. Something like:
Aggregates may be calculated across many threads in parallel. Each thread will call the init function once to generate a state for that thread. Once all values have been consumed then the threads from each state will be merged together to get the final result state. The merge function should take two states and combine them.
| The first argument is the context argument of type | ||
| ScalarUdfContext. | ||
| Using the context argument the state can be extracted and return | ||
| type must be an array matching the `out_type`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your C++ example the return type is a scalar?
| A callable implementing the user-defined finalize function. | ||
| The first argument is the context argument of type | ||
| ScalarUdfContext. | ||
| Using the context argument the state can be extracted and return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the state can be extracted" is not very obvious. Maybe something like:
The purpose of the finalize function is to transform the state (which is available in the context argument) into an array. This array will be the final result of the aggregation.
Initial Draft of Aggregate UDFs for Python