Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Mar 7, 2017

The goal for this patch is to provide an eventual migration path for Feather (https://github.com/wesm/feather) users to use the batch and streaming Arrow file formats internally. Eventually the Feather metadata can be deprecated, but we will need to wait for the R community to build and ship Arrow bindings for R before that can happen. In the meantime, we won't need to maintain multiple Python/C++ codebases for the Python side of things.

The test suite isn't yet passing because support for timestamps with time zones has not been implemented in the conversion to pandas.DataFrame, so I will do that when I can, but this can be reviewed in the meantime.

I would upload a Gerrit code review, but there are some access control settings on gerrit.cloudera.org that need changing.

@wesm
Copy link
Member Author

wesm commented Mar 7, 2017

Another issue I have to deal with is that Feather uses int32 days since UNIX epoch for dates, and Arrow currently is int64 millis since the epoch, so zero copy date reads isn't possible at the moment

@wesm
Copy link
Member Author

wesm commented Mar 8, 2017

@xhochy I added a 32-bit date type. While it adds complexity having 2 different date types, since the int32 days-since-UNIX-epoch occurs quite frequently in the wild, not being able to load with zero copy would be a deficiency. Luckily adding new logical types doesn't involve too much new code

@xhochy
Copy link
Member

xhochy commented Mar 8, 2017

@wesm If we want to keep this type, we should add a two JIRAs:

  • Add this type to the JVM implementation
  • Add a convenience cast function between the two types. (We should probably soon add a general cast interface, at least in parquet-cpp that would reduce some lines of code).

(I will probably have some time on Friday to review the changes here. Given the scale, Gerrit would be preferable but I'm ok with Github if it turns out to be problematic currently.)

@wesm
Copy link
Member Author

wesm commented Mar 8, 2017

@xhochy we need to turn on Forge Committer in Gerrit because it's rejecting the commits by authors that aren't registered in the system. I'll reach out to some folks at Cloudera to see if that's possible.

I don't think we necessarily need to add the Date32 type to the Java implementation so long as we have a function to cast to the proper Date type in an IPC setting, we should certainly have a discussion about it though.

Agree about having casting functions

@wesm
Copy link
Member Author

wesm commented Mar 8, 2017

I should have the timestamp-with-tz issues resolved today, and I sent a message to ASF committers at Cloudera to change the settings on the Arrow project there so I can upload the review

@wesm
Copy link
Member Author

wesm commented Mar 8, 2017

@xhochy if it helps, I think I can divide this patch into two pieces, the first is the IPC refactor (that creates arrow/loader.h) and support machinery (i.e. the Date32Array etc.), and the 2nd part is the Feather code itself

@wesm
Copy link
Member Author

wesm commented Mar 8, 2017

@xhochy I split the unrelated changes into ARROW-605, so let's review that first and then I will rebase. going to get this branch passing now

@wesm wesm changed the title WIP ARROW-452: [C++/Python] Incorporate C++ and Python codebases for Feather file format ARROW-452: [C++/Python] Incorporate C++ and Python codebases for Feather file format Mar 9, 2017
@wesm
Copy link
Member Author

wesm commented Mar 9, 2017

Completing the timestamp+tz functionality is going to be a bit invasive because of pandas's "synthetic" dtypes, which means we need to create the Arrow type object before calling PandasToArrow. After #364 and #365 are in I will take the opportunity to do some refactoring in pandas.cc to do this more cleanly

asfgit pushed a commit that referenced this pull request Mar 9, 2017
…lass. Add Date32Type

These are various changes introduced to support the Feather merge in ARROW-452 #361

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #365 from wesm/array-loader and squashes the following commits:

bc22872 [Wes McKinney] Revert Array::type_id to type_enum since Parquet uses this API
344e6b1 [Wes McKinney] fix compiler warning
997b7a2 [Wes McKinney] Refactor IPC adapter code into generic ArrayLoader class. Add Date32Type
…oring to

maximize code reuse with IPC reader/writer classes. Get C++ test suite
passing.
@wesm
Copy link
Member Author

wesm commented Mar 9, 2017

Rebased. I will complete the datetime-with-tz issue when I can, maybe tomorrow or Saturday

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good here 👍

@wesm
Copy link
Member Author

wesm commented Mar 11, 2017

I need to do some refactoring for the datetime+tz issue to get the test suite passing, so I'll ping you again when I have that all working

@wesm
Copy link
Member Author

wesm commented Mar 11, 2017

Actually, I'm going to disable those tests and merge this patch once the build is green. That way we can review the refactoring work more easily as a new patch

wesm added 3 commits March 11, 2017 12:25
Change-Id: Idad749d82d9a999f449a621eca7ee3eabeca5b20
Change-Id: Ia6cf85453c5926205fbbb180536e75e5a331a21a
Change-Id: I3a4dd6a03caad000ffe077514ef540cf53c9af17
@asfgit asfgit closed this in d99958d Mar 11, 2017
@wesm wesm deleted the ARROW-452 branch March 11, 2017 22:00
asfgit pushed a commit that referenced this pull request Mar 12, 2017
Closes #345. I had mostly done this in #361 so this adds tests to `ipc-adapter-test`

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #371 from wesm/ARROW-534 and squashes the following commits:

cab6d4f [Wes McKinney] Add functions to make record batches for date, date32, timestamp, time. Fix bugs
jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Mar 15, 2017
…her file format

The goal for this patch is to provide an eventual migration path for Feather (https://github.com/wesm/feather) users to use the batch and streaming Arrow file formats internally. Eventually the Feather metadata can be deprecated, but we will need to wait for the R community to build and ship Arrow bindings for R before that can happen. In the meantime, we won't need to maintain multiple Python/C++ codebases for the Python side of things.

The test suite isn't yet passing because support for timestamps with time zones has not been implemented in the conversion to pandas.DataFrame, so I will do that when I can, but this can be reviewed in the meantime.

I would upload a Gerrit code review, but there are some access control settings on gerrit.cloudera.org that need changing.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#361 from wesm/ARROW-452 and squashes the following commits:

b7bfd30 [Wes McKinney] Add missing license header
06cbdca [Wes McKinney] Fix -Wconversion error
244959c [Wes McKinney] Mark datetime+tz tests as xfail
9a95094 [Wes McKinney] Incorporate Feather C++ and Python codebases and do associated refactoring to maximize code reuse with IPC reader/writer classes. Get C++ test suite passing.
jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Mar 15, 2017
Closes apache#345. I had mostly done this in apache#361 so this adds tests to `ipc-adapter-test`

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#371 from wesm/ARROW-534 and squashes the following commits:

cab6d4f [Wes McKinney] Add functions to make record batches for date, date32, timestamp, time. Fix bugs
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
Also fixes a gcc 4.8 buglet (std::vector::erase does not accept a const_iterator in gcc 4.8)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#361 from wesm/PARQUET-1043 and squashes the following commits:

4451c5c [Wes McKinney] Raise minimum CMake version to 3.2, delete cruft. Fix gcc 4.8 buglet with std::vector::erase

Change-Id: I41793599f5ea8e07d16a850e9b43af812a1814a3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants