-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-618: [Python/C++] Support timestamp+timezone conversion to pandas #375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
rebased |
cpp/src/arrow/type.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference instead of pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Google style guide recommends pointers for mutable arguments / not using mutable reference arguments https://google.github.io/styleguide/cppguide.html#Reference_Arguments -- our code isn't 100% consistent about this, but we should try to follow that convention. In the case of std::ostream you can see this applied in Protocol Buffers and other Google codebases: https://github.com/google/protobuf/blob/fd046f6263fb17383cafdbb25c361e3451c31105/src/google/protobuf/io/zero_copy_stream_impl.h#L265
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's some examples in TensorFlow too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that makes sense. Thanks!
python/pyarrow/__init__.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Importing it here as private seems like useless as I thought the use of the imports here was mainly to expose the public interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was pretty annoying, after from pyarrow.schema import schema the statement import pyarrow.schema returns pyarrow.schema.schema instead of the module. Other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, then we keep it.
python/pyarrow/array.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably reduce the usage of the reserved word type in our codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's __builtin__.type I haven't worried too much about it, but I'm OK with another naming convention, like type_, what do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either type_ or dtype would be ok for me but as it's internally used, we can keep it. I wasn't 100% sure about the implications of using a reserved keyword.
python/pyarrow/schema.pyx
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is not thread safe, do we care?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, as we never delete anything from the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is not thread safe here? The GIL is held here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed the part where you hold the GIL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, we could have two threads adding unit to the cache. Yeah I think it's fine. I was thinking about C-level threadsafety
… using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py
Change-Id: Ia05a8eba9833874b48726ed6a221577937920eed
Change-Id: Ib8841ed34d6153f354f3364f20464e42ae885d7d
|
Thanks -- rebased, will merge on green build |
|
Looks like the OS X builds are consistently failing (https://travis-ci.org/wesm/arrow/jobs/210644296). This seems to be something wrong with the miniconda environment bootstrap. Merging and we can investigate further if it is not transient |
This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in `pyarrow/adapters/pandas.cc`. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.
The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be `timestamp('ns', tz='US/Eastern')`, then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions to integer from float because the type metadata is an input parameter.
Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.
I also changed the C++ `pyarrow` namespace to `arrow::py` which will make it less painful to move that code tree to `cpp/src/arrow/python` at some point
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes apache#375 from wesm/ARROW-618 and squashes the following commits:
4b18bfa [Wes McKinney] Fix rebase conflict
5bc3724 [Wes McKinney] Fix rebase issues
870986f [Wes McKinney] Refactor ArrowSerializer to not be a template and use visitor pattern using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py
… 90-character line width The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#375 from wesm/PARQUET-1068 and squashes the following commits: b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width Change-Id: Ib789b38872430bb3903f233a795b84357df6385a
… 90-character line width The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#375 from wesm/PARQUET-1068 and squashes the following commits: b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width Change-Id: If8345d1d2a03d785ed41a5848de2c40e4bf53b5b
… 90-character line width The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #375 from wesm/PARQUET-1068 and squashes the following commits: b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width Change-Id: If8345d1d2a03d785ed41a5848de2c40e4bf53b5b
This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in
pyarrow/adapters/pandas.cc. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be
timestamp('ns', tz='US/Eastern'), then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions to integer from float because the type metadata is an input parameter.Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.
I also changed the C++
pyarrownamespace toarrow::pywhich will make it less painful to move that code tree tocpp/src/arrow/pythonat some point