47 changes: 47 additions & 0 deletions docs/source/conf.py
@@ -36,6 +36,7 @@
import sys
import warnings
from unittest import mock
from docutils.parsers.rst import Directive, directives

import pyarrow

@@ -463,3 +464,49 @@ def setup(app):
    # This will also rebuild appropriately when the value changes.
    app.add_config_value('cuda_enabled', cuda_enabled, 'env')
    app.add_config_value('flight_enabled', flight_enabled, 'env')
    app.add_directive('arrow-computefuncs', ComputeFunctionsTableDirective)


class ComputeFunctionsTableDirective(Directive):
    """Generate a table of Arrow compute functions.

    .. arrow-computefuncs::
        :kind: hash_aggregate

    The generated table will include function name,
    description and option class reference.

    The functions listed in the table can be restricted
    with the :kind: option.
    """
    has_content = True
    option_spec = {
        "kind": directives.unchanged
    }

    def run(self):
        from docutils.statemachine import ViewList
        from docutils import nodes
        import pyarrow.compute as pc

        result = ViewList()
        function_kind = self.options.get('kind', None)

        result.append(".. csv-table::", "<computefuncs>")
Member

Out of curiosity, what's the "" for?

Member Author

Not sure I got the question, which "" are you referring to?

Member

Sorry, annoying github rendering that hides things between angle brackets :) I meant "<computefuncs>"

Member Author
@amol- Dec 9, 2021

In practice, it is the name of the source file where the rst code was written. In this case I use <computefuncs> so that, if there is a syntax error in the generated rst code, the message will say "line blahblah in <computefuncs>" and we know it comes from this directive. It mimics Python's own style a bit, which uses things like File "<stdin>", line 1, in <module>.

Member

OK, I see!
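
The naming convention discussed in this thread mirrors what Python itself does with pseudo-filenames. A minimal sketch using only the standard library (not docutils), where `generated_source` is a made-up, deliberately broken snippet: the label passed to `compile` is what the error reports, much as `<computefuncs>` would show up in a warning about the generated rst.

```python
# Hypothetical generated code with a syntax error on its second line.
generated_source = "x = 1\ny = = 2\n"

try:
    # The second argument is a free-form source label, like "<stdin>".
    compile(generated_source, "<computefuncs>", "exec")
except SyntaxError as exc:
    # The label comes back on the exception, pointing at the generator.
    error_filename = exc.filename
    print(f"syntax error reported in {error_filename}")
```

Any string works as the label; the angle brackets are only a convention signalling that no real file is involved.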

        result.append(" :widths: 20, 60, 20", "<computefuncs>")
        result.append(" ", "<computefuncs>")
        for fname in pc.list_functions():
            func = pc.get_function(fname)
            option_class = ""
            if func._doc.options_class:
                option_class = f":class:`{func._doc.options_class}`"
            if not function_kind or func.kind == function_kind:
                result.append(
                    f' "{fname}", "{func._doc.summary}", "{option_class}"',
                    "<computefuncs>"
                )

        node = nodes.section()
        node.document = self.state.document
        self.state.nested_parse(result, 0, node)
        return node.children
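
The loop in `run` interpolates each summary directly into a quoted CSV field, so a summary that itself contains a double quote might not parse cleanly. Below is a minimal sketch of one possible alternative that lets the standard `csv` module do the quoting. `build_rows` and the sample tuples are hypothetical stand-ins for the metadata returned by `pc.get_function`, not part of this PR, and the sketch assumes that `csv-table` accepts doubled quotes as written by `csv.writer`.

```python
import csv
import io


def build_rows(functions, kind=None):
    """Return indented csv-table rows for (name, kind, summary, options) tuples."""
    buffer = io.StringIO()
    # QUOTE_ALL quotes every field and doubles any embedded double quote.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for name, func_kind, summary, options_class in functions:
        # Same filter as the directive: no kind means "list everything".
        if kind and func_kind != kind:
            continue
        option_ref = f":class:`{options_class}`" if options_class else ""
        writer.writerow([name, summary, option_ref])
    # Indent each row so it sits inside the csv-table directive body.
    return ["   " + line for line in buffer.getvalue().splitlines()]


# Made-up sample metadata, for illustration only.
sample = [
    ("hash_sum", "hash_aggregate", 'Sum values in each "group"',
     "ScalarAggregateOptions"),
    ("add", "scalar", "Add the arguments element-wise", None),
]
rows = build_rows(sample, kind="hash_aggregate")
print("\n".join(rows))
```

Each returned string could then be appended to the `ViewList` with the same `"<computefuncs>"` source label used above.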
69 changes: 47 additions & 22 deletions docs/source/python/api/compute.rst
@@ -45,28 +45,6 @@ Aggregations
tdigest
variance

Grouped Aggregations
--------------------

.. autosummary::
   :toctree: ../generated/

   hash_all
   hash_any
   hash_approximate_median
   hash_count
   hash_count_distinct
   hash_distinct
   hash_max
   hash_mean
   hash_min
   hash_min_max
   hash_product
   hash_stddev
   hash_sum
   hash_tdigest
   hash_variance

Arithmetic Functions
--------------------

@@ -498,3 +476,50 @@ Structural Transforms
   make_struct
   replace_with_mask
   struct_field

Compute Options
---------------

.. autosummary::
   :toctree: ../generated/

   ArraySortOptions
   AssumeTimezoneOptions
Member

I am not sure there is much value in listing them here as long as they have no docstring (this will create a lot of new doc pages, which will basically be empty).

Member Author

I list them here because the page provides the signature, and thus which arguments they support. As you said, there won't be any docstring, but given that in many cases you can guess what the arguments do from their names, it's better than nothing.

Member

Yeah, it's true that the signatures show something. It just looks kind of "bad" to have an empty docstring page.

Actually, what happens if you leave out the :toctree: ../generated/ (to only have the table)? Although that will just make Sphinx complain about nonexistent references.

Member Author

Removing it also removes the reference page where you can see the signature

Member

> Removing it also removes the reference page where you can see the signature

The table doesn't show the signature?

Member Author

Nope

   CastOptions
   CountOptions
   DayOfWeekOptions
   DictionaryEncodeOptions
   ElementWiseAggregateOptions
   ExtractRegexOptions
   FilterOptions
   IndexOptions
   JoinOptions
   MakeStructOptions
   MatchSubstringOptions
   ModeOptions
   NullOptions
   PadOptions
   PartitionNthOptions
   QuantileOptions
   ReplaceSliceOptions
   ReplaceSubstringOptions
   RoundOptions
   RoundToMultipleOptions
   ScalarAggregateOptions
   SelectKOptions
   SetLookupOptions
   SliceOptions
   SortOptions
   SplitOptions
   SplitPatternOptions
   StrftimeOptions
   StrptimeOptions
   StructFieldOptions
   TakeOptions
   TDigestOptions
   TrimOptions
   VarianceOptions
   WeekOptions
111 changes: 105 additions & 6 deletions docs/source/python/compute.rst
@@ -23,17 +23,33 @@ Compute Functions
=================

Arrow supports logical compute operations over inputs of possibly
varying types. Many compute functions support both array (chunked or not)
and scalar inputs, but some will mandate either. For example,
``sort_indices`` requires its first and only input to be an array.
varying types.

Below are a few simple examples:
The standard compute operations are provided by the :mod:`pyarrow.compute`
module and can be used directly::

    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
    >>> a = pa.array([1, 1, 2, 3])
    >>> pc.sum(a)
    <pyarrow.Int64Scalar: 7>

The grouped aggregation functions, by contrast, raise an exception when called
directly and must be used through :meth:`pyarrow.Table.group_by`.
See :ref:`py-grouped-aggrs` for more details.

Standard Compute Functions
==========================

Many compute functions support both array (chunked or not)
and scalar inputs, but some will mandate either. For example,
``sort_indices`` requires its first and only input to be an array.

Below are a few simple examples::

    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
    >>> a = pa.array([1, 1, 2, 3])
    >>> b = pa.array([4, 1, 2, 8])
    >>> pc.equal(a, b)
    <pyarrow.lib.BooleanArray object at 0x7f686e4eef30>
@@ -48,7 +64,7 @@ Below are a few simple examples:
    <pyarrow.DoubleScalar: 72.54>

These functions can do more than just element-by-element operations.
Here is an example of sorting a table:
Here is an example of sorting a table::

    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
@@ -62,8 +78,91 @@ Here is an example of sorting a table:
      0
    ]


For a complete list of the compute functions that PyArrow provides,
refer to the :ref:`api.compute` reference.

.. seealso::

   :ref:`Available compute functions (C++ documentation) <compute-function-list>`.

.. _py-grouped-aggrs:

Grouped Aggregations
====================

PyArrow supports grouped aggregations over :class:`pyarrow.Table` through the
:meth:`pyarrow.Table.group_by` method.
The method will return a grouping declaration
to which the hash aggregation functions can be applied::

    >>> import pyarrow as pa
    >>> t = pa.table([
    ... pa.array(["a", "a", "b", "b", "c"]),
    ... pa.array([1, 2, 3, 4, 5]),
    ... ], names=["keys", "values"])
    >>> t.group_by("keys").aggregate([("values", "sum")])
    pyarrow.Table
    values_sum: int64
    keys: string
    ----
    values_sum: [[3,7,5]]
    keys: [["a","b","c"]]

The ``"sum"`` aggregation passed to the ``aggregate`` method in the previous
example is the ``hash_sum`` compute function.

Multiple aggregations can be performed at the same time by providing them
to the ``aggregate`` method::

    >>> import pyarrow as pa
    >>> t = pa.table([
    ... pa.array(["a", "a", "b", "b", "c"]),
    ... pa.array([1, 2, 3, 4, 5]),
    ... ], names=["keys", "values"])
    >>> t.group_by("keys").aggregate([
    ... ("values", "sum"),
    ... ("keys", "count")
    ... ])
    pyarrow.Table
    values_sum: int64
    keys_count: int64
    keys: string
    ----
    values_sum: [[3,7,5]]
    keys_count: [[2,2,1]]
    keys: [["a","b","c"]]
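
To make the semantics concrete, here is a plain-Python sketch of what the two aggregations compute for the same data. It is only a conceptual illustration built on the standard library, not how PyArrow implements ``group_by``, and the variable names are chosen for this example.

```python
from collections import defaultdict

keys = ["a", "a", "b", "b", "c"]
values = [1, 2, 3, 4, 5]

# One accumulator slot per distinct key; dicts keep first-seen key order.
sums = defaultdict(int)
counts = defaultdict(int)
for key, value in zip(keys, values):
    sums[key] += value      # grouped "sum" over the values column
    counts[key] += 1        # grouped "count" over the keys column

result = {
    "values_sum": list(sums.values()),
    "keys_count": list(counts.values()),
    "keys": list(sums.keys()),
}
print(result)
```

Both aggregations are accumulated in a single pass over the rows, which matches the shape of the table shown above: one output row per distinct key.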

Aggregation options can also be provided for each aggregation function.
For example, we can use :class:`CountOptions` to change how null values
are counted::

    >>> import pyarrow as pa
    >>> import pyarrow.compute as pc
    >>> table_with_nulls = pa.table([
    ... pa.array(["a", "a", "a"]),
    ... pa.array([1, None, None])
    ... ], names=["keys", "values"])
    >>> table_with_nulls.group_by(["keys"]).aggregate([
    ... ("values", "count", pc.CountOptions(mode="all"))
    ... ])
    pyarrow.Table
    values_count: int64
    keys: string
    ----
    values_count: [[3]]
    keys: [["a"]]
    >>> table_with_nulls.group_by(["keys"]).aggregate([
    ... ("values", "count", pc.CountOptions(mode="only_valid"))
    ... ])
    pyarrow.Table
    values_count: int64
    keys: string
    ----
    values_count: [[1]]
    keys: [["a"]]
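
The difference between the two modes can be restated in plain Python. This is a conceptual sketch only: `count_values` is a hypothetical helper written for this illustration, with ``None`` standing in for a null value, and it does not call PyArrow.

```python
def count_values(values, mode="only_valid"):
    """Count entries of a list according to a CountOptions-style mode."""
    valid = sum(1 for value in values if value is not None)
    if mode == "all":
        return len(values)          # nulls and non-nulls alike
    if mode == "only_valid":
        return valid                # non-null entries only
    if mode == "only_null":
        return len(values) - valid  # a mode not shown in the example above
    raise ValueError(f"unknown mode: {mode!r}")


values = [1, None, None]
count_all = count_values(values, mode="all")
count_valid = count_values(values, mode="only_valid")
print(count_all, count_valid)
```

The two results correspond to the ``values_count`` columns in the outputs above.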

The following is a list of all supported grouped aggregation functions.
You can use them with or without the ``"hash_"`` prefix.

.. arrow-computefuncs::
   :kind: hash_aggregate
Member

Nice!