-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-3762: [C++] manage ChunkedArrayBuilder capacity explicitly #4695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3762: [C++] manage ChunkedArrayBuilder capacity explicitly #4695
Conversation
095862e to
fd35c7f
Compare
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks okay, but I think we need to add some unit tests (that aren't disabled) to exercise the various code paths in ChunkedBinaryBuilder::Reserve to assert that they do what we expect. Otherwise if we refactor later we're going to be playing whackamole
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a reasonable improvement to me. Unfortunately it'll be difficult to come up with a test that doesn't blow 16+GB of memory.
|
I'll add some tests to src/arrow/array-binary-test.cc |
fd35c7f to
9ae140f
Compare
|
Hm. Can we make the 2GB limit something that's configurable rather than hard-coded? In theory that's the purpose of the max chunk size. We can't have such a long-running test in CI |
|
We could certainly add another to apply to the # of strings in the string_array, or alternatively use |
|
Oh okay, sorry for my confusion. Can we make the number of elements per chunk configurable, then? I can see a use case for having no more than 64K elements per chunk, no matter what their size. Then this is much easier to test |
|
Will do |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks @bkietz! Will merge once CI has run
16baa10 to
95a4e30
Compare
|
Merging, the failure is the cloud.r-project.org DNS issue |
ChunkedArrayBuilder should never request BinaryBuilder reserve space for more strings than it can hold (`kListMaximumElements`); with this change it will instead of begin a new chunk Author: Benjamin Kietzman <bengilgit@gmail.com> Closes apache#4695 from bkietz/3762-Parquet-arrowTable-reads-error-when-over and squashes the following commits: 95a4e30 <Benjamin Kietzman> add configurable maximum element count to ChunkedBinaryBuilder 5331a2c <Benjamin Kietzman> add test for ChunkedBinaryBuilder's new reserve behavior 3eb5719 <Benjamin Kietzman> manage ChunkedBinaryBuilder's capacity d626c07 <Benjamin Kietzman> attempted fix, killed: OOM cb000ab <Benjamin Kietzman> add (failing) test which repros issue in c++
ChunkedArrayBuilder should never request BinaryBuilder reserve space for more strings than it can hold (
kListMaximumElements); with this change it will instead of begin a new chunk