Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Mar 13, 2017

This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in pyarrow/adapters/pandas.cc. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.

The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be timestamp('ns', tz='US/Eastern'), then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions to integer from float because the type metadata is an input parameter.

Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.

I also changed the C++ pyarrow namespace to arrow::py which will make it less painful to move that code tree to cpp/src/arrow/python at some point

@wesm
Copy link
Member Author

wesm commented Mar 13, 2017

rebased

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference instead of pointer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google style guide recommends pointers for mutable arguments / not using mutable reference arguments https://google.github.io/styleguide/cppguide.html#Reference_Arguments -- our code isn't 100% consistent about this, but we should try to follow that convention. In the case of std::ostream you can see this applied in Protocol Buffers and other Google codebases: https://github.com/google/protobuf/blob/fd046f6263fb17383cafdbb25c361e3451c31105/src/google/protobuf/io/zero_copy_stream_impl.h#L265

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing it here as private seems like useless as I thought the use of the imports here was mainly to expose the public interface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was pretty annoying, after from pyarrow.schema import schema the statement import pyarrow.schema returns pyarrow.schema.schema instead of the module. Other ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, then we keep it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably reduce the usage of the reserved word type in our codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's __builtin__.type I haven't worried too much about it, but I'm OK with another naming convention, like type_, what do you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either type_ or dtype would be ok for me but as it's internally used, we can keep it. I wasn't 100% sure about the implications of using a reserved keyword.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is not thread safe, do we care?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, as we never delete anything from the cache.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is not thread safe here? The GIL is held here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed the part where you hold the GIL

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, we could have two threads adding unit to the cache. Yeah I think it's fine. I was thinking about C-level threadsafety

wesm added 3 commits March 13, 2017 13:18
… using

passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow
namespace to arrow::py
Change-Id: Ia05a8eba9833874b48726ed6a221577937920eed
Change-Id: Ib8841ed34d6153f354f3364f20464e42ae885d7d
@wesm
Copy link
Member Author

wesm commented Mar 13, 2017

Thanks -- rebased, will merge on green build

@wesm
Copy link
Member Author

wesm commented Mar 13, 2017

Looks like the OS X builds are consistently failing (https://travis-ci.org/wesm/arrow/jobs/210644296). This seems to be something wrong with the miniconda environment bootstrap. Merging and we can investigate further if it is not transient

@asfgit asfgit closed this in 00df40c Mar 13, 2017
@wesm wesm deleted the ARROW-618 branch March 13, 2017 20:20
jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Mar 15, 2017
This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in `pyarrow/adapters/pandas.cc`. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.

The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be `timestamp('ns', tz='US/Eastern')`, then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions to integer from float because the type metadata is an input parameter.

Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.

I also changed the C++ `pyarrow` namespace to `arrow::py` which will make it less painful to move that code tree to `cpp/src/arrow/python` at some point

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#375 from wesm/ARROW-618 and squashes the following commits:

4b18bfa [Wes McKinney] Fix rebase conflict
5bc3724 [Wes McKinney] Fix rebase issues
870986f [Wes McKinney] Refactor ArrowSerializer to not be a template and use visitor pattern using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
… 90-character line width

The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#375 from wesm/PARQUET-1068 and squashes the following commits:

b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width

Change-Id: Ib789b38872430bb3903f233a795b84357df6385a
wesm added a commit to wesm/arrow that referenced this pull request Sep 27, 2018
… 90-character line width

The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#375 from wesm/PARQUET-1068 and squashes the following commits:

b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width

Change-Id: If8345d1d2a03d785ed41a5848de2c40e4bf53b5b
wesm added a commit that referenced this pull request Sep 27, 2018
… 90-character line width

The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #375 from wesm/PARQUET-1068 and squashes the following commits:

b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width

Change-Id: If8345d1d2a03d785ed41a5848de2c40e4bf53b5b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants