ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088

kszucs · 2020-08-31T23:53:36Z

Targets of the refactoring:

PythonToArrow converters based on a common API
PyBytesView to use Result return values and contain is_utf8 flag
PyConversionOptions is now available from all converters so we can honor its flags

Fixes

ARROW-9993 [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
ARROW-9994 [C++][Python] Auto chunking nested array containing binary-like fields result malformed output
ARROW-9996 [C++] Dictionary is unset when calling DictionaryArray.GetScalar for null values
~~ARROW-9997 [Python] StructScalar.as_py() fails if the type has duplicate field names~~
ARROW-9999 [Python] Support constructing dictionary array directly through pa.array()
ARROW-10000 [C++][Python] Support constructing StructArray from list of key-value pairs
ARROW-9593 [Python] Add custom pickle reducers for DictionaryScalar
ARROW-6281 [Python] Produce chunked arrays for nested types in pyarrow.array
ARROW-2367 [Python] ListArray has trouble with sizes greater than kMaximumCapacity
ARROW-9976: [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

Backward incompatibility

~~Since a struct type can contain duplicated field names we cannot return a struct scalar as a mapping, so I had to change the .as_py() representation to return with a list of key-value pairs.~~

TODOs:

ensure that the large memory tests are passing
benchmark and check binary size again

Library size

Before:

 12M Sep 25 15:05 libarrow.200.0.0.dylib
2.7M Sep 25 15:07 libarrow_python.200.0.0.dylib

After:

 12M Sep 25 15:46 libarrow.200.0.0.dylib
2.1M Sep 25 15:50 libarrow_python.200.0.0.dylib

Benchmarks

Executed the following ASV benchmark:

asv continuous --bench convert_builtins master py2ar --no-only-changed --split

After some optimization:

Benchmarks that have improved:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
-     2.78±0.03ms      2.45±0.03ms     0.88  convert_builtins.ConvertPyListToArray.time_convert('bool')
-     3.59±0.01ms      3.12±0.02ms     0.87  convert_builtins.ConvertPyListToArray.time_convert('int32')
-     3.37±0.01ms      2.73±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('uint32')
-     3.74±0.02ms      3.03±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('int64')
-     3.38±0.01ms      2.69±0.01ms     0.80  convert_builtins.ConvertPyListToArray.time_convert('uint64')
-     2.83±0.01ms      2.24±0.01ms     0.79  convert_builtins.ConvertPyListToArray.time_convert('float32')
-     3.92±0.02ms      2.99±0.02ms     0.76  convert_builtins.ConvertPyListToArray.time_convert('binary10')
-     14.1±0.04ms      8.89±0.05ms     0.63  convert_builtins.ConvertPyListToArray.time_convert('unicode')
-     5.60±0.01ms      3.24±0.03ms     0.58  convert_builtins.ConvertPyListToArray.time_convert('ascii')
-     5.37±0.02ms      2.91±0.04ms     0.54  convert_builtins.ConvertPyListToArray.time_convert('binary')

Benchmarks that have stayed the same:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
      14.8±0.02ms       15.5±0.1ms     1.05  convert_builtins.ConvertPyListToArray.time_convert('decimal')
       16.4±0.7ms       15.1±0.6ms     0.92  convert_builtins.ConvertPyListToArray.time_convert('struct from tuples')
       34.4±0.3ms       31.5±0.4ms     0.92  convert_builtins.ConvertPyListToArray.time_convert('int64 list')
       16.7±0.7ms       15.1±0.6ms    ~0.91  convert_builtins.ConvertPyListToArray.time_convert('struct')
      2.42±0.02ms      2.05±0.03ms    ~0.85  convert_builtins.ConvertPyListToArray.time_convert('float64')

wesm · 2020-09-01T12:58:00Z

Are there some related JIRAs that will be fixed by this?

kszucs · 2020-09-01T14:02:27Z

Not directly, but this will help to solve multiple ones. I'm going to create one.

kszucs · 2020-09-01T14:08:38Z

Just executed the ASV benchmarks on this PR and the master branch with the two sample sizes and the results showed slighty better performance in both cases:

==================== ========= =========
        type            PR      master 
-------------------- --------- ---------
       int32          136±0ms   138±0ms
       uint32         129±0ms   133±0ms
       int64          131±0ms   138±0ms
       uint64         131±0ms   137±0ms
      float32         131±0ms   131±0ms
      float64         130±0ms   132±0ms
        bool          128±0ms   131±0ms
      decimal         232±0ms   227±0ms
       binary         149±0ms   180±0ms
      binary10        141±0ms   151±0ms
       ascii          158±0ms   160±0ms
      unicode         300±0ms   319±0ms
     int64 list       310±0ms   307±0ms
       struct         224±0ms   241±0ms
 struct from tuples   231±0ms   230±0ms
==================== ========= =========

Also compared the libarrow_python binary size:
master: 2.7M
this PR: 2.2M

So far there is no performance degradation - if we can trust the ASV benchmarks.

pitrou

Why separate PyArrayConverter and ArrayConverter?

cpp/src/arrow/array/builder_base.h

cpp/src/arrow/ipc/metadata_internal.cc

cpp/src/arrow/util/converter.h

kszucs · 2020-09-01T15:31:35Z

Why separate PyArrayConverter and ArrayConverter?

My hope is that the classes in converter.h can be reused in the future to implement different converter APIs, as an example it could be used in the R bindings as well. It would be even better if we could centralize converter agnostic logic here like auto chunking.

bkietz

This looks like a neat cleanup of the conversion code, and potentially reusable as well. Thanks for doing this! I have a few comments on the overall structure below

cpp/src/arrow/python/python_test.cc

cpp/src/arrow/util/hashing.h

cpp/src/arrow/util/converter.h

cpp/src/arrow/python/python_to_arrow.cc

cpp/src/arrow/util/converter.h

kszucs · 2020-09-08T09:05:44Z

@romainfrancois since the conversion code in python_to_arrow.cc and r/src/array_from_vector.cpp are pretty similar it would be nice to reuse as much code as we could (including new features like auto chunking large inputs).

Could you please take a look at the base classes in converter.h to see whether would it help/improve the R side implementation? I have not documented it yet because it's still under discussion but you can take a look how I'm using it in the python bindings.

kszucs · 2020-09-15T12:58:14Z

@github-actions crossbow submit test-spark

github-actions · 2020-09-15T13:22:28Z

Revision: 79e8229d6828f7f63f80db75a4ef89af9deb03d3

Submitted crossbow builds: ursa-labs/crossbow @ actions-541

Task	Status
test-conda-python-3.7-spark-branch-3.0
test-conda-python-3.8-spark-master

kszucs · 2020-09-15T13:57:53Z

@ursabot build

python/pyarrow/scalar.pxi

bkietz

Chunker is neat, thanks for working on this

My main concern here is overcomplexity of the Converter<> interface, since it currently bears responsibilities both as the abstract base class for conversion and as a trait mapping from DataType subclasses to concrete converters (used to enable generic construction).

I think Converter<>, Chunker<> and the mixins PrimitiveConverter<> etc can remain in util/converter.h but costruction logic (MakeConverterImpl) should be moved back to python_to_arrow.h. Fully generic construction is definitely a worthwhile goal but I also think it's worth doing so in a dedicated follow up. Establishing the base classes for generic conversion is already a great step forward

cpp/src/arrow/array/builder_binary.h

cpp/src/arrow/python/python_to_arrow.h

cpp/src/arrow/scalar_test.cc

cpp/src/arrow/util/converter.h

bkietz

This is looking much better, just a few more simplifications to construction logic and I think this will be ready to go

python/pyarrow/array.pxi

cpp/src/arrow/python/python_to_arrow.cc

cpp/src/arrow/util/converter.h

bkietz · 2020-09-20T01:57:08Z

cpp/src/arrow/util/converter.h

The converter's Reserve() might fail due to overflow, but this shouldn't be an error for Chunker::Reserve since an overflow can be averted by finishing the chunk. See for example the logic surrounding ChunkedBinaryBuilder::extra_capacity_. This doesn't need to be handled in this PR since Reserve is a performance hint; for now I think we can just ignore errors emitted by this call to Reserve

cpp/src/arrow/util/converter.h

bkietz

Just a few more comments:

cpp/src/arrow/util/converter.h

cpp/src/arrow/python/python_to_arrow.cc

cpp/src/arrow/util/converter.h

jorisvandenbossche

(only briefly looked at the python code + tests, impressive set of changes ;-))

python/pyarrow/tests/test_types.py

python/pyarrow/array.pxi

python/pyarrow/scalar.pxi

jorisvandenbossche · 2020-09-21T19:36:56Z

python/pyarrow/tests/test_convert_builtin.py

It doesn't raise on "d" being present here?

It tolerates extra items (similarly when converting from dicts). I can restrict it if you think that would be more appropiate. We may want to alter this behavior with specific conversion flags.

Since the order matters when constructing from pairs (based on the test below), I would personally expect additional elements would also not be allowed (BTW, what actually happens if the additional element is not in the last position? Maybe add a test for that as well?).
While when constructing from dicts also the order does not matter I suppose? So this could be more tolerant

It will raise since the key field equality is checked explicitly. It also supports converting less elements than the number of fields.

python/pyarrow/tests/test_convert_builtin.py

cpp/src/arrow/python/python_to_arrow.cc

bkietz

Merging

pitrou reviewed Sep 1, 2020

View reviewed changes

cpp/src/arrow/array/builder_base.h Outdated Show resolved Hide resolved

cpp/src/arrow/ipc/metadata_internal.cc Outdated Show resolved Hide resolved

cpp/src/arrow/util/converter.h Outdated Show resolved Hide resolved

cpp/src/arrow/util/converter.h Outdated Show resolved Hide resolved

kszucs changed the title ~~[C++][Python] Refactor python to arrow conversions based on a reusable conversion API [WIP]~~ [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Sep 1, 2020

kszucs force-pushed the py2ar branch from 9336c74 to 124f917 Compare September 1, 2020 16:01

bkietz self-requested a review September 2, 2020 15:53

bkietz requested changes Sep 4, 2020

View reviewed changes

kszucs force-pushed the py2ar branch from 0a384d5 to 728ea49 Compare September 10, 2020 11:55

kszucs mentioned this pull request Sep 10, 2020

ARROW-9369: [Python] Support conversion from python sequence to dictionary type #8008

Closed

kszucs force-pushed the py2ar branch from 9d2125c to f1edc10 Compare September 11, 2020 07:15

kszucs changed the title ~~[C++][Python] Refactor python to arrow conversions based on a reusable conversion API~~ ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Sep 14, 2020

apache deleted a comment from github-actions bot Sep 14, 2020

kszucs force-pushed the py2ar branch from 86b791b to 0de4046 Compare September 15, 2020 12:36

kszucs force-pushed the py2ar branch from 4fcfd1b to db92c29 Compare September 15, 2020 14:08

kszucs commented Sep 15, 2020

View reviewed changes

python/pyarrow/scalar.pxi Outdated Show resolved Hide resolved

bkietz requested changes Sep 19, 2020

View reviewed changes

kszucs force-pushed the py2ar branch 2 times, most recently from 33c5aa7 to f3be748 Compare September 19, 2020 10:18

bkietz requested changes Sep 20, 2020

View reviewed changes

kszucs force-pushed the py2ar branch from 76f6b55 to 1e7e80a Compare September 21, 2020 16:56

bkietz requested changes Sep 21, 2020

View reviewed changes

cpp/src/arrow/util/converter.h Outdated Show resolved Hide resolved

cpp/src/arrow/python/python_to_arrow.cc Outdated Show resolved Hide resolved

cpp/src/arrow/python/python_to_arrow.cc Outdated Show resolved Hide resolved

bkietz reviewed Sep 21, 2020

View reviewed changes

cpp/src/arrow/util/converter.h Outdated Show resolved Hide resolved

jorisvandenbossche reviewed Sep 21, 2020

View reviewed changes

kszucs commented Sep 22, 2020

View reviewed changes

python/pyarrow/tests/test_convert_builtin.py Outdated Show resolved Hide resolved

kszucs added 16 commits September 25, 2020 16:10

address review comments

21e932b

fix specialization

5545281

clang format

c52c367

fix overflow checking

4b426d5

address review comments

d823c64

implement ValidateOverflow for map type; test list overflow chunking

57e6c7f

address review comments

45a93cf

test PYARROW_IGNORE_TIMEZONE; test map array overflow

3359f47

address remaining review comments

e8a7475

remove leftover

e04e245

test cases for two additional auto chunking issues

fe3ceb2

additional overflow tests

01a89fc

missing newline in docstrings

aaf92b8

conversion optimizations

3115303

please gcc 4.8

6c709e2

some inline notes describing performance improvements

b551d12

kszucs force-pushed the py2ar branch from 56b01c2 to 61255ba Compare September 25, 2020 14:10

bkietz reviewed Sep 25, 2020

View reviewed changes

cpp/src/arrow/python/python_to_arrow.cc Outdated Show resolved Hide resolved

try to please compilers

806e4c0

kszucs force-pushed the py2ar branch from a3c2502 to 806e4c0 Compare September 25, 2020 14:44

kszucs added 2 commits September 25, 2020 16:48

fix msvc issues

b527a0e

cast to offset type when appending to string builders

2213c25

bkietz reviewed Sep 26, 2020

View reviewed changes

bkietz closed this in 87dd7e9 Sep 26, 2020

bkietz mentioned this pull request Oct 27, 2020

ARROW-10399: [R] Fix performance regression from cpp11::r_string #8536

Closed

asfimport mentioned this pull request Sep 26, 2020

[C++][Python] Refactor python to arrow conversions based on a reusable conversion API #26017

Closed

ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088

ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088

Uh oh!

Conversation

kszucs commented Aug 31, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Targets of the refactoring:

Fixes

Backward incompatibility

TODOs:

Library size

Benchmarks

Uh oh!

wesm commented Sep 1, 2020

Uh oh!

kszucs commented Sep 1, 2020

Uh oh!

kszucs commented Sep 1, 2020

Uh oh!

pitrou left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kszucs commented Sep 1, 2020

Uh oh!

bkietz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kszucs commented Sep 8, 2020

Uh oh!

kszucs commented Sep 15, 2020

Uh oh!

github-actions bot commented Sep 15, 2020

Uh oh!

kszucs commented Sep 15, 2020

Uh oh!

Uh oh!

bkietz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bkietz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bkietz Sep 20, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bkietz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kszucs commented Aug 31, 2020 •

edited

Loading

kszucs Sep 22, 2020 •

edited

Loading