-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1533: [JAVA] realloc should consider the existing buffer capacity for computing target memory requirement #1112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -370,13 +370,20 @@ public void reset() { | |
| } | ||
|
|
||
| public void reAlloc() { | ||
| final long newAllocationSize = allocationSizeInBytes*2L; | ||
| long baseSize = allocationSizeInBytes; | ||
| final int currentBufferCapacity = data.capacity(); | ||
| if (baseSize < (long)currentBufferCapacity) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems to be same as
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I don't understand the comment. The same problem is fixed across reAlloc() of all vectors -- bit vector, fixed width and variable width.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I meant, since these logic looks similar, I am wondering if we can refactor the shared logic into the base class.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had thought about it but since with ARROW-1463 the inheritance hierarchy and templates might look different, I thought may there is not much gain in refactoring now?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see. Yeah I don't love it but if we are going to fix this in ARROW-1463 then I am ok. |
||
| baseSize = (long)currentBufferCapacity; | ||
| } | ||
| long newAllocationSize = baseSize * 2L; | ||
| newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize); | ||
|
|
||
| if (newAllocationSize > MAX_ALLOCATION_SIZE) { | ||
| throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); | ||
| } | ||
|
|
||
| final ArrowBuf newBuf = allocator.buffer((int)newAllocationSize); | ||
| newBuf.setBytes(0, data, 0, data.capacity()); | ||
| newBuf.setBytes(0, data, 0, currentBufferCapacity); | ||
| data.release(); | ||
| data = newBuf; | ||
| allocationSizeInBytes = (int)newAllocationSize; | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is not very straight forward to me:
allocationSizeInBytesis less thandata.capacity()?data.capacity() * 2instead ofallocationSizeInBytes * 2?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a vector A, write data to it to an extent that you trigger 2 reallocs at least.
Now transfer the vector to vector B.
Now do something that triggers reAlloc() for vector B -- the reAlloc() will segfault because allocateSizeInBytes is still at the initial default value whereas this vector's buffer is probably 4x the size of that.
reAlloc will try to copy data from 128K sized buffer (at least) onto a 64K buffer and segfault
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...Thanks for the explanation.
What is the semantics of transfer and why doesn't it set allocateSizeInBytes for vector B in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We saw the problem while dealing with complex JSON schema -- detailed problem description is here #1097
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transfer() function is aimed at transferring the data buffer (along with ownership) of one vector to target vector of same type.
It is not clear to me if we would want to transfer more state in the function. Secondly, there could be other cases where allocationSizeInBytes < buffer capacity. So in any case we are probably better off safeguarding reAlloc() regardless.
One case could be vector reset() which resets the value of allocationSizeInBytes as well. If we reAlloc() a vector after doing reset(), I think we will run into the same problem.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, didn't realize there is another PR, thanks for the context.
Edit: Didn't see the second comment which answers my question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. But doesn't this change cause weird
reAlloc()effect afterreset()though? i.e. if we reset a 128Kb double vector to 32Kb and thenreAlloc(), it would be 256Kb instead of 64Kb, and it would also have the old value (incorrect) from 32-128Kb?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reset() already has a problem that I think we should fix across all vector types. reset() should actually not re-initialize allocationSizeInBytes at all because reset() is typically aimed at zeroing out the buffer, resetting mutator/accessor etc -- the underlying buffer capacity still remains the same after reset.
So for your example, 128KB buffer remains a 128Kb buffer zeroed out upon reset(). On a subsequent reAlloc(), we will go to 256KB. There won't be any incorrect or garbage bits lying around on the data buffer because the entire buffer is zeroed out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yeah you are right. This is correct.
However, in general, I find the usage between
allocationSizeInBytesanddata.capacity()confusing. They seem to be the same thing but is inconsistent in various places.For instance, it's not clear to me why don't we just double
data.capacity()inreAlloc()instead of checking bothallocationSizeInBytesanddata.capacity().Maybe we should have a follow up Jira to:
allocationSizeInBytesanddata.capacity()are used correctly in all places