Skip to content

Conversation

@kszucs
Copy link
Member

@kszucs kszucs commented Aug 31, 2020

Targets of the refactoring:

  • PythonToArrow converters based on a common API
  • PyBytesView to use Result return values and contain is_utf8 flag
  • PyConversionOptions is now available from all converters so we can honor its flags

Fixes

  • ARROW-9993 [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
  • ARROW-9994 [C++][Python] Auto chunking nested array containing binary-like fields result malformed output
  • ARROW-9996 [C++] Dictionary is unset when calling DictionaryArray.GetScalar for null values
  • ARROW-9997 [Python] StructScalar.as_py() fails if the type has duplicate field names
  • ARROW-9999 [Python] Support constructing dictionary array directly through pa.array()
  • ARROW-10000 [C++][Python] Support constructing StructArray from list of key-value pairs
  • ARROW-9593 [Python] Add custom pickle reducers for DictionaryScalar
  • ARROW-6281 [Python] Produce chunked arrays for nested types in pyarrow.array
  • ARROW-2367 [Python] ListArray has trouble with sizes greater than kMaximumCapacity
  • ARROW-9976: [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe

Backward incompatibility

Since a struct type can contain duplicated field names we cannot return a struct scalar as a mapping, so I had to change the .as_py() representation to return with a list of key-value pairs.

TODOs:

  • ensure that the large memory tests are passing
  • benchmark and check binary size again

Library size

Before:

 12M Sep 25 15:05 libarrow.200.0.0.dylib
2.7M Sep 25 15:07 libarrow_python.200.0.0.dylib

After:

 12M Sep 25 15:46 libarrow.200.0.0.dylib
2.1M Sep 25 15:50 libarrow_python.200.0.0.dylib

Benchmarks

Executed the following ASV benchmark:

asv continuous --bench convert_builtins master py2ar --no-only-changed --split

After some optimization:

Benchmarks that have improved:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
-     2.78±0.03ms      2.45±0.03ms     0.88  convert_builtins.ConvertPyListToArray.time_convert('bool')
-     3.59±0.01ms      3.12±0.02ms     0.87  convert_builtins.ConvertPyListToArray.time_convert('int32')
-     3.37±0.01ms      2.73±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('uint32')
-     3.74±0.02ms      3.03±0.01ms     0.81  convert_builtins.ConvertPyListToArray.time_convert('int64')
-     3.38±0.01ms      2.69±0.01ms     0.80  convert_builtins.ConvertPyListToArray.time_convert('uint64')
-     2.83±0.01ms      2.24±0.01ms     0.79  convert_builtins.ConvertPyListToArray.time_convert('float32')
-     3.92±0.02ms      2.99±0.02ms     0.76  convert_builtins.ConvertPyListToArray.time_convert('binary10')
-     14.1±0.04ms      8.89±0.05ms     0.63  convert_builtins.ConvertPyListToArray.time_convert('unicode')
-     5.60±0.01ms      3.24±0.03ms     0.58  convert_builtins.ConvertPyListToArray.time_convert('ascii')
-     5.37±0.02ms      2.91±0.04ms     0.54  convert_builtins.ConvertPyListToArray.time_convert('binary')

Benchmarks that have stayed the same:

       before           after         ratio
     [f358a29b]       [18d1c052]
     <master>         <py2ar>
      14.8±0.02ms       15.5±0.1ms     1.05  convert_builtins.ConvertPyListToArray.time_convert('decimal')
       16.4±0.7ms       15.1±0.6ms     0.92  convert_builtins.ConvertPyListToArray.time_convert('struct from tuples')
       34.4±0.3ms       31.5±0.4ms     0.92  convert_builtins.ConvertPyListToArray.time_convert('int64 list')
       16.7±0.7ms       15.1±0.6ms    ~0.91  convert_builtins.ConvertPyListToArray.time_convert('struct')
      2.42±0.02ms      2.05±0.03ms    ~0.85  convert_builtins.ConvertPyListToArray.time_convert('float64')

@wesm
Copy link
Member

wesm commented Sep 1, 2020

Are there some related JIRAs that will be fixed by this?

@kszucs
Copy link
Member Author

kszucs commented Sep 1, 2020

Not directly, but this will help to solve multiple ones. I'm going to create one.

@kszucs
Copy link
Member Author

kszucs commented Sep 1, 2020

Just executed the ASV benchmarks on this PR and the master branch with the two sample sizes and the results showed slighty better performance in both cases:

==================== ========= =========
        type            PR      master 
-------------------- --------- ---------
       int32          136±0ms   138±0ms
       uint32         129±0ms   133±0ms
       int64          131±0ms   138±0ms
       uint64         131±0ms   137±0ms
      float32         131±0ms   131±0ms
      float64         130±0ms   132±0ms
        bool          128±0ms   131±0ms
      decimal         232±0ms   227±0ms
       binary         149±0ms   180±0ms
      binary10        141±0ms   151±0ms
       ascii          158±0ms   160±0ms
      unicode         300±0ms   319±0ms
     int64 list       310±0ms   307±0ms
       struct         224±0ms   241±0ms
 struct from tuples   231±0ms   230±0ms
==================== ========= =========

Also compared the libarrow_python binary size:
master: 2.7M
this PR: 2.2M

So far there is no performance degradation - if we can trust the ASV benchmarks.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why separate PyArrayConverter and ArrayConverter?

@kszucs
Copy link
Member Author

kszucs commented Sep 1, 2020

Why separate PyArrayConverter and ArrayConverter?

My hope is that the classes in converter.h can be reused in the future to implement different converter APIs, as an example it could be used in the R bindings as well. It would be even better if we could centralize converter agnostic logic here like auto chunking.

@kszucs kszucs changed the title [C++][Python] Refactor python to arrow conversions based on a reusable conversion API [WIP] [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Sep 1, 2020
@bkietz bkietz self-requested a review September 2, 2020 15:53
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a neat cleanup of the conversion code, and potentially reusable as well. Thanks for doing this! I have a few comments on the overall structure below

@kszucs
Copy link
Member Author

kszucs commented Sep 8, 2020

@romainfrancois since the conversion code in python_to_arrow.cc and r/src/array_from_vector.cpp are pretty similar it would be nice to reuse as much code as we could (including new features like auto chunking large inputs).

Could you please take a look at the base classes in converter.h to see whether would it help/improve the R side implementation? I have not documented it yet because it's still under discussion but you can take a look how I'm using it in the python bindings.

@kszucs kszucs changed the title [C++][Python] Refactor python to arrow conversions based on a reusable conversion API ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Sep 14, 2020
@apache apache deleted a comment from github-actions bot Sep 14, 2020
@kszucs
Copy link
Member Author

kszucs commented Sep 15, 2020

@github-actions crossbow submit test-spark

@github-actions
Copy link

Revision: 79e8229d6828f7f63f80db75a4ef89af9deb03d3

Submitted crossbow builds: ursa-labs/crossbow @ actions-541

Task Status
test-conda-python-3.7-spark-branch-3.0 Github Actions
test-conda-python-3.8-spark-master Github Actions

@kszucs
Copy link
Member Author

kszucs commented Sep 15, 2020

@ursabot build

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chunker is neat, thanks for working on this

My main concern here is overcomplexity of the Converter<> interface, since it currently bears responsibilities both as the abstract base class for conversion and as a trait mapping from DataType subclasses to concrete converters (used to enable generic construction).

I think Converter<>, Chunker<> and the mixins PrimitiveConverter<> etc can remain in util/converter.h but costruction logic (MakeConverterImpl) should be moved back to python_to_arrow.h. Fully generic construction is definitely a worthwhile goal but I also think it's worth doing so in a dedicated follow up. Establishing the base classes for generic conversion is already a great step forward

@kszucs kszucs force-pushed the py2ar branch 2 times, most recently from 33c5aa7 to f3be748 Compare September 19, 2020 10:18
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking much better, just a few more simplifications to construction logic and I think this will be ready to go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The converter's Reserve() might fail due to overflow, but this shouldn't be an error for Chunker::Reserve since an overflow can be averted by finishing the chunk. See for example the logic surrounding ChunkedBinaryBuilder::extra_capacity_. This doesn't need to be handled in this PR since Reserve is a performance hint; for now I think we can just ignore errors emitted by this call to Reserve

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments:

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(only briefly looked at the python code + tests, impressive set of changes ;-))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't raise on "d" being present here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tolerates extra items (similarly when converting from dicts). I can restrict it if you think that would be more appropiate. We may want to alter this behavior with specific conversion flags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the order matters when constructing from pairs (based on the test below), I would personally expect additional elements would also not be allowed (BTW, what actually happens if the additional element is not in the last position? Maybe add a test for that as well?).
While when constructing from dicts also the order does not matter I suppose? So this could be more tolerant

Copy link
Member Author

@kszucs kszucs Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will raise since the key field equality is checked explicitly. It also supports converting less elements than the number of fields.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants