ARROW-3762: [C++/Python] Support reading Parquet BYTE_ARRAY columns containing over 2GB of data #3171
@kszucs we aren't running the "large_memory" unit tests in Travis CI. What do you think about having a Docker target where we can run these so they can at least be spot-checked periodically?
pitrou
left a comment
Partial review.
I'm all done here; I'll just make sure the build is passing.
@@ -79,23 +80,6 @@ Status BinaryBuilder::AppendNextOffset() {
  return offsets_builder_.Append(static_cast<int32_t>(num_bytes));
}

Status BinaryBuilder::Append(const uint8_t* value, int32_t length) {
I don't think it's useful to inline those if AppendNextOffset is not inlined.
Good point. I'll inline AppendNextOffset then
This PR seems basically fine to me. I posted a few minor comments.
…s. add failing test case for ARROW-3762. Add ChunkedBinaryBuilder, make BinaryBuilder Append methods inline
Change-Id: I0eced60a1f8e16096a1b441b622ba750d1d59ca6
…ction of arrow::compute::Datum Change-Id: I483059a545c69a9b25d543faad641785da6bea29
…row test suite passing Change-Id: Icb260f6ffc4f41ee7519653bf8d3f48c2da30091
Change-Id: I35ab3ace0e4ca7a80fc7d85e55ac55ea222b15dc
Change-Id: I8f0a35ae4e8581790f7731ee2ed023a54caf0f31
Change-Id: I7fac456a34aa81683fa7315ae1b287be7f0d16e0
Change-Id: I47f93c7d8561b83414ab34f709fec66a6eb462d2
Change-Id: I8266354f04c8e14819fe4c72d28474e09843c13c
…tOffset Change-Id: Ibfc09617b365c937e7af6a4943c274843f6e7a33
The inlining of BinaryBuilder methods produces a meaningful benchmark improvement (before / after).
Nice :-)
Change-Id: I48147645784402e7cf004a82151d66f337d1664e
+1
@wesm I created an issue for running the large-memory tests: ARROW-4046
thanks =)
I still see this error with 0.13.0, and also with 0.12.0. The code I tested with is exactly the same as in ARROW-4046:
You mean a different JIRA than https://issues.apache.org/jira/browse/ARROW-4046, right? Can you post this on the appropriate JIRA issue or create a new one so we can track this? Thanks
Right, my mistake. I meant this one: https://issues.apache.org/jira/browse/ARROW-3762
Hi @wesm, big fan of your work!
@yogeshg can you open a JIRA issue in either Arrow or Spark? I think this is something that will have to be handled on the Spark side. cc @BryanCutler
I'll do that soon.
@yogeshg, this might be the related issue from the Java MR Parquet Reader that Spark uses: https://issues.apache.org/jira/browse/PARQUET-980. Please open another JIRA if it is not.
This patch ended up being a bit more of a bloodbath than I planned: please accept my apologies.
Associated changes in this patch:
As far as what code to review, focus efforts on
I'm going to tackle ARROW-2970, which should not be complicated after this patch; I will submit that as a PR after this one is reviewed and merged.
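For readers new to the 2GB limit in the title: the diff above casts the running byte count to `int32_t`, so a single binary array can address at most 2^31 - 1 bytes of value data. The sketch below illustrates the general idea of splitting values across several chunks so that each chunk stays under that bound. It is plain Python written for illustration only, not the `ChunkedBinaryBuilder` code from this patch; the names `chunk_binary_values`, `offsets_for`, and `max_chunk_bytes` are hypothetical.

```python
INT32_MAX = 2**31 - 1  # largest offset an int32 can represent


def chunk_binary_values(values, max_chunk_bytes=INT32_MAX):
    """Group byte strings into chunks whose total size fits in max_chunk_bytes.

    Hypothetical helper: a new chunk is started whenever appending the next
    value would push the current chunk's byte count past the limit.
    """
    chunks = []
    current = []
    current_bytes = 0
    for value in values:
        size = len(value)
        if size > max_chunk_bytes:
            # A single value that cannot fit in any chunk is an error here.
            raise ValueError("single value exceeds the chunk capacity")
        if current and current_bytes + size > max_chunk_bytes:
            chunks.append(current)
            current = []
            current_bytes = 0
        current.append(value)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


def offsets_for(chunk):
    """Compute the offsets (start positions plus final end) for one chunk."""
    offsets = [0]
    for value in chunk:
        offsets.append(offsets[-1] + len(value))
    return offsets
```

With a small limit the behaviour is easy to check by hand: `chunk_binary_values([b"aaaa", b"bb", b"ccc"], max_chunk_bytes=5)` puts `b"aaaa"` in its own chunk and the other two values together, and the last offset of every chunk stays at or below the limit.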