-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-452: [C++/Python] Incorporate C++ and Python codebases for Feather file format #361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Another issue I have to deal with is that Feather uses int32 days since UNIX epoch for dates, and Arrow currently is int64 millis since the epoch, so zero copy date reads isn't possible at the moment |
|
@xhochy I added a 32-bit date type. While it adds complexity having 2 different date types, since the int32 days-since-UNIX-epoch occurs quite frequently in the wild, not being able to load with zero copy would be a deficiency. Luckily adding new logical types doesn't involve too much new code |
|
@wesm If we want to keep this type, we should add a two JIRAs:
(I will probably have some time on Friday to review the changes here. Given the scale, Gerrit would be preferable but I'm ok with Github if it turns out to be problematic currently.) |
|
@xhochy we need to turn on Forge Committer in Gerrit because it's rejecting the commits by authors that aren't registered in the system. I'll reach out to some folks at Cloudera to see if that's possible. I don't think we necessarily need to add the Date32 type to the Java implementation so long as we have a function to cast to the proper Date type in an IPC setting, we should certainly have a discussion about it though. Agree about having casting functions |
|
I should have the timestamp-with-tz issues resolved today, and I sent a message to ASF committers at Cloudera to change the settings on the Arrow project there so I can upload the review |
|
@xhochy if it helps, I think I can divide this patch into two pieces, the first is the IPC refactor (that creates |
|
Completing the timestamp+tz functionality is going to be a bit invasive because of pandas's "synthetic" dtypes, which means we need to create the Arrow type object before calling |
…lass. Add Date32Type These are various changes introduced to support the Feather merge in ARROW-452 #361 Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #365 from wesm/array-loader and squashes the following commits: bc22872 [Wes McKinney] Revert Array::type_id to type_enum since Parquet uses this API 344e6b1 [Wes McKinney] fix compiler warning 997b7a2 [Wes McKinney] Refactor IPC adapter code into generic ArrayLoader class. Add Date32Type
…oring to maximize code reuse with IPC reader/writer classes. Get C++ test suite passing.
|
Rebased. I will complete the datetime-with-tz issue when I can, maybe tomorrow or Saturday |
xhochy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good here 👍
|
I need to do some refactoring for the datetime+tz issue to get the test suite passing, so I'll ping you again when I have that all working |
|
Actually, I'm going to disable those tests and merge this patch once the build is green. That way we can review the refactoring work more easily as a new patch |
Change-Id: Idad749d82d9a999f449a621eca7ee3eabeca5b20
Change-Id: Ia6cf85453c5926205fbbb180536e75e5a331a21a
Change-Id: I3a4dd6a03caad000ffe077514ef540cf53c9af17
Closes #345. I had mostly done this in #361 so this adds tests to `ipc-adapter-test` Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #371 from wesm/ARROW-534 and squashes the following commits: cab6d4f [Wes McKinney] Add functions to make record batches for date, date32, timestamp, time. Fix bugs
…her file format The goal for this patch is to provide an eventual migration path for Feather (https://github.com/wesm/feather) users to use the batch and streaming Arrow file formats internally. Eventually the Feather metadata can be deprecated, but we will need to wait for the R community to build and ship Arrow bindings for R before that can happen. In the meantime, we won't need to maintain multiple Python/C++ codebases for the Python side of things. The test suite isn't yet passing because support for timestamps with time zones has not been implemented in the conversion to pandas.DataFrame, so I will do that when I can, but this can be reviewed in the meantime. I would upload a Gerrit code review, but there are some access control settings on gerrit.cloudera.org that need changing. Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#361 from wesm/ARROW-452 and squashes the following commits: b7bfd30 [Wes McKinney] Add missing license header 06cbdca [Wes McKinney] Fix -Wconversion error 244959c [Wes McKinney] Mark datetime+tz tests as xfail 9a95094 [Wes McKinney] Incorporate Feather C++ and Python codebases and do associated refactoring to maximize code reuse with IPC reader/writer classes. Get C++ test suite passing.
Closes apache#345. I had mostly done this in apache#361 so this adds tests to `ipc-adapter-test` Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#371 from wesm/ARROW-534 and squashes the following commits: cab6d4f [Wes McKinney] Add functions to make record batches for date, date32, timestamp, time. Fix bugs
Also fixes a gcc 4.8 buglet (std::vector::erase does not accept a const_iterator in gcc 4.8) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#361 from wesm/PARQUET-1043 and squashes the following commits: 4451c5c [Wes McKinney] Raise minimum CMake version to 3.2, delete cruft. Fix gcc 4.8 buglet with std::vector::erase Change-Id: I41793599f5ea8e07d16a850e9b43af812a1814a3
The goal for this patch is to provide an eventual migration path for Feather (https://github.com/wesm/feather) users to use the batch and streaming Arrow file formats internally. Eventually the Feather metadata can be deprecated, but we will need to wait for the R community to build and ship Arrow bindings for R before that can happen. In the meantime, we won't need to maintain multiple Python/C++ codebases for the Python side of things.
The test suite isn't yet passing because support for timestamps with time zones has not been implemented in the conversion to pandas.DataFrame, so I will do that when I can, but this can be reviewed in the meantime.
I would upload a Gerrit code review, but there are some access control settings on gerrit.cloudera.org that need changing.