ARROW-15641: [C++][Python] UDF Aggregate Function Implementation #14527

Closed

vibhatha wants to merge 13 commits into apache:main from vibhatha:arrow-15641

Contributor

vibhatha commented Oct 27, 2022 (edited)

Initial Draft of Aggregate UDFs for Python

Improve the existing logic with more tests
Find an aggregate function that is currently missing and try to implement it with the current API (to check whether the approach holds up)
Use a custom object for state holding rather than Arrow data structures
Improve function docs and comments
Add binary and other arity tests
Self-review 1
Self-review 2

github-actions bot commented Oct 27, 2022

https://issues.apache.org/jira/browse/ARROW-15641

github-actions bot commented Oct 27, 2022

⚠️ Ticket has no components in JIRA, make sure you assign one.

github-actions bot commented Oct 27, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions bot added Component: C++ Component: Python labels

vibhatha force-pushed the arrow-15641 branch 2 times, most recently from 979b9f1 to cef71af (November 1, 2022 10:13)

vibhatha added 9 commits (November 2, 2022 12:04)

cb128ba feat(initial): custom aggregate udf added
65d40d1 feat(temp-init)
47b8530 fix(init-method)
52f9fba feat(initial-python): wip
f3c8df5 feat(initial): functional vanilla aggregate udfs
f723ad1 feat(status-custom): initial
4f6c537 fix(state): state-based computations testing
19b8551 fix(format)
f945b40 fix(cleanup)

vibhatha force-pushed the arrow-15641 branch from cef71af to f945b40 (November 2, 2022 08:27)

vibhatha added 4 commits (November 2, 2022 14:15)

838adf4 fix(cleanup)
96f8ba1 fix(cleanup-python)
8207f99 fix(minor-style)
a6679cf feat(updated-docs)

Contributor Author

vibhatha commented Nov 2, 2022

cc @westonpace, I would appreciate an initial review of the draft PR.

westonpace requested changes

Member

westonpace left a comment

The shape of this looks correct, and I believe you are on the right track. I would like to see at least one example of using a custom UDF in an exec plan. Also, as with scalar UDFs, I think we will eventually want an example that uses something other than pyarrow (e.g. numpy) to do the computation (at least for the consume function).

python/pyarrow/tests/test_udf.py

Comment on lines +514 to +520

    
    @property
    def non_null(self):
        return self._non_null

    @non_null.setter
    def non_null(self, value):
        self._non_null = value

Member

westonpace Nov 16, 2022

I'm not much of a Python expert, but getters and setters seem like overkill here. Are they needed?
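For comparison, here is a minimal sketch of the same state holder written with a plain attribute instead of a property pair. The `State` name and the `non_null` field follow the test under review; the use of `dataclass` is an assumption made for illustration and is not what the PR does.

```python
from dataclasses import dataclass


@dataclass
class State:
    # Plain public attribute: nothing is validated or computed on access,
    # so a getter/setter pair adds no behavior here.
    non_null: int = 0


state = State(0)
state.non_null += 3  # reads and writes behave the same as with a property
print(state.non_null)
```

If validation were needed later, the attribute could be turned into a property without changing the calling code.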

python/pyarrow/tests/test_udf.py

Comment on lines +529 to +530

    
    state = State(0)
    return state

Member

westonpace Nov 16, 2022

Suggested change

    - state = State(0)
    - return state
    + return State(0)

python/pyarrow/tests/test_udf.py

    
    def consume(ctx, x):
        if isinstance(x, pa.Array):
            non_null = pc.sum(pc.invert(pc.is_nan(x))).as_py()
        elif isinstance(x, pa.Scalar):

Member

westonpace Nov 16, 2022

Can a unary aggregate ever be called with a scalar?

python/pyarrow/tests/test_udf.py

    
    def finalize(ctx):
        return pa.array([ctx.state.non_null])

    func_name = "simple_count"

Member

westonpace Nov 16, 2022

Maybe valid_count or non_null_count?

python/pyarrow/tests/test_udf.py

Comment on lines +571 to +585

    
    @property
    def non_null(self):
        return self._non_null

    @non_null.setter
    def non_null(self, value):
        self._non_null = value

    @property
    def null(self):
        return self._null

    @null.setter
    def null(self, value):
        self._null = value

Member

westonpace Nov 16, 2022

Same comment on getters and setters

python/pyarrow/_compute.pyx

Comment on lines +2748 to +2750

    
    To define a varargs function, pass a callable that takes
    varargs. The last in_type will be the type of all varargs
    arguments.

Member

westonpace Nov 16, 2022

How do varargs work? Do I define a *args?
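The thread does not answer this question. As background only, the sketch below shows what a Python callable "that takes varargs" looks like in general: every positional argument after the first is collected into a tuple. The function name `count_all_values`, the unused `ctx` stand-in, and the plain lists standing in for Arrow arrays are all hypothetical; whether the registration API in this PR accepts such a callable is not confirmed here.

```python
def count_all_values(ctx, *columns):
    # `ctx` stands in for the UDF context and is unused in this sketch.
    # `columns` is a tuple holding every positional argument after `ctx`.
    return sum(len(column) for column in columns)


# The same callable accepts one input column or several.
one_column = count_all_values(None, [1, 2, 3])
three_columns = count_all_values(None, [1, 2], [3], [4, 5, 6])
print(one_column, three_columns)
```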

python/pyarrow/_compute.pyx

Comment on lines +2760 to +2763

    
    must be merged with. The current state can be retrieved from
    the context object which can be accessed by `context.state`.
    The state doesn't need to be set in the Python side and it is
    autonomously handled in the C++ backend. The updated state must

Member

westonpace Nov 16, 2022

I'm not sure I understand the sentence that starts with "The state doesn't need to be set..."

python/pyarrow/_compute.pyx

    
        This function returns the updated state after consuming the
        received data.
    merge_func: callable

Member

westonpace Nov 16, 2022

The concept of a "merge function" is not going to be obvious to most users. It is very possible for an engine to be defined that does not have to worry about the concept of a merge. I think we need to describe some background here for why a merge is needed in the first place. Something like:

Aggregates may be calculated across many threads in parallel. Each thread will call the init function once to generate a state for that thread. Once all values have been consumed, the states from each thread will be merged together to get the final result state. The merge function should take two states and combine them.
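The lifecycle described above can be sketched in plain Python. This is a simplified illustration, not the PR's API: the functions take the state directly rather than a context object, plain lists stand in for Arrow arrays, and `aggregate_partition` and the sample `partitions` are hypothetical names introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class State:
    def __init__(self, non_null=0):
        self.non_null = non_null


def init():
    # Called once per thread to create that thread's private state.
    return State(0)


def consume(state, batch):
    # Fold one batch of values into the thread-local state.
    state.non_null += sum(1 for value in batch if value is not None)
    return state


def merge(left, right):
    # Combine two partial states into one.
    return State(left.non_null + right.non_null)


def finalize(state):
    # Turn the final state into the (one-element) result.
    return [state.non_null]


def aggregate_partition(batches):
    # Work done by a single thread: init once, then consume every batch.
    state = init()
    for batch in batches:
        state = consume(state, batch)
    return state


# Each inner list of batches is processed by its own thread.
partitions = [
    [[1, None, 3], [4]],
    [[None, None]],
    [[5, 6], [7, None]],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    partial_states = list(pool.map(aggregate_partition, partitions))

# Merge the per-thread states pairwise, then finalize once.
result = finalize(reduce(merge, partial_states, init()))
print(result)
```

Because no thread ever touches another thread's state, `consume` needs no locking; only `merge` sees more than one state.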

python/pyarrow/_compute.pyx

    
    The first argument is the context argument of type
    ScalarUdfContext.
    Using the context argument the state can be extracted and return
    type must be an array matching the `out_type`.

Member

westonpace Nov 16, 2022

In your C++ example the return type is a scalar?

python/pyarrow/_compute.pyx

    
    A callable implementing the user-defined finalize function.
    The first argument is the context argument of type
    ScalarUdfContext.
    Using the context argument the state can be extracted and return

Member

westonpace Nov 16, 2022

"the state can be extracted" is not very obvious. Maybe something like:

The purpose of the finalize function is to transform the state (which is available in the context argument) into an array. This array will be the final result of the aggregation.
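A mean aggregate makes the distinction between state and result concrete, because the state (a running total and a count) has a different shape from the output. The sketch below is an assumption-laden illustration: `MeanState` is a hypothetical class, the state is passed in directly rather than read from a context object, and a one-element list stands in for the result array.

```python
from dataclasses import dataclass


@dataclass
class MeanState:
    total: float = 0.0
    count: int = 0


def finalize(state):
    # Transform the accumulated state into the final one-element result.
    if state.count == 0:
        return [None]  # no values were consumed, so the mean is undefined
    return [state.total / state.count]


mean_result = finalize(MeanState(total=10.0, count=4))
empty_result = finalize(MeanState())
print(mean_result, empty_result)
```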

westonpace mentioned this pull request

[Python] Is there a way to call a custom compute function on a table.group_by aggregation? #14860

Open

vibhatha mentioned this pull request

GH-32916: [C++] [Python] User-defined tabular functions #14682

Merged

vibhatha closed this

Reviewers

westonpace: requested changes
AlenkaF: awaiting requested review (code owner; will be requested when the pull request is marked ready for review)
raulcd: awaiting requested review (code owner; will be requested when the pull request is marked ready for review)
rok: awaiting requested review (code owner; will be requested when the pull request is marked ready for review)

Labels

Component: C++ Component: Python