-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5970: [Java] Provide pointer to Arrow buffer #4897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar logic already exists in https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java#L44
Can we directly use it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tianchen92 Thanks for your kind reminder.
This functionality will be used in the new design of the dictionary encoding, and possibly other parts of the code base.
The logic in ByteFunctionHelpers is based on static methods. So the scenario that is based on ByteFunctionHelpers can also use ArrowBufPointer, but not vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this. If ByteFunctionHelpers was moved to this package couldn't it be used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield , to use ByteFunctionHelpers, we should move it to the arrow-memory module. Do you think it is OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider adding hashcode and equality to ArrowBuf directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield good question.
Adding hash code & equality to ArrowBuf is also a good choice.
I think there are several reasons for this data structure:
- We want to compute and compare an arbitrary sub-area of the ArrowBuf, not the complete buffer.
- The algorithm to compute the hash code should be configurable to be suitable for different scenarios.
- We need a way to show that the data area is invalid.
- ArrowBuf is the key data structure, so we do not want to add overhead (like new members) to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 1 and 4 can be solved by using a slice of an arrow buf (address/length adjusted to the data element).
- 3 can be solved by using a null value (same as for ArrowBufPointer)
The problem with slices is that there is perf overhead due to the refCnt incr/decr. So, I'm fine with the ArrowBufPointer approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so is this approach dangerous then in the sense that we could have a dangling pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield I think you are right.
It is possible to have dangling pointers (as in C++). The users can check it by examining the reference count of the underlying ArrowBuf.
However, for most scenarios, I think the users have sufficient knowledge about the underlying ArrowBuf, so the checks can be avoided.
|
@siddharthteotia @pravindra I think having your input on this would be helpful. |
|
@liyafan82 can you please provide some context or explain the use-case for this ? |
@pravindra Sure. Good question. In some scenarios (e.g. in dictionary encoding), we need to consider a memory segment as the basic unit for comparing, computation, etc. Therefore, such a data structure is required. For instance, we may place such data structures in the heap, binary search tree, hash table, etc. The most important is that, there is little overhead in this, as no memory copy is involved. |
|
thanks @liyafan82. Couple of related questions
dremio optimises this by morphing the relevant vectors, one small batch at a time, to a transient row format (we call this pivoting), and then, computing the hash/equality of contiguous byte ranges. Have you considered this approach ? |
@pravindra Thanks a lot for your valuable feedback. Please see my reply in line.
An invalid element is represented by setting ArrowBufPointer#buf to null.
You are right. It only works for primitive types, because for such types, each element is based on a consecutive memory region.
I agree with you that the solution provided by this PR may not be efficient for the scenario you described. For the scenario, it can be better to use the get/set methods, or use the method you have given below. For scenarios where a piece of memory needs to be placed in to search tree/heap/hash table, this data structure is required.
This is definitely a good idea. It has been widely used in SQL engines. We have another PR to work towards this goal (#4844). Would you please give some comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 1 and 4 can be solved by using a slice of an arrow buf (address/length adjusted to the data element).
- 3 can be solved by using a null value (same as for ArrowBufPointer)
The problem with slices is that there is perf overhead due to the refCnt incr/decr. So, I'm fine with the ArrowBufPointer approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return getDataPointer(index, new ArrowBufPointer());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Thanks a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid duplication by
getDataPointer(index, new ArrowBufPointer())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised. Thank you so much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pls add a doc comment here that this fn returning null implies that it's a null element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thank you for the good suggestion.
pravindra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
|
i'll wait to hear what @emkornfield says about the question on ByteFunctionHelpers, and then, merge this. |
Codecov Report
@@ Coverage Diff @@
## master #4897 +/- ##
==========================================
+ Coverage 87.44% 89.58% +2.14%
==========================================
Files 995 661 -334
Lines 140460 96645 -43815
Branches 1418 0 -1418
==========================================
- Hits 122820 86580 -36240
+ Misses 17278 10065 -7213
+ Partials 362 0 -362Continue to review full report at Codecov.
|
|
My suggested approach: As much as possible I think we should be trying to get in the habit of having a 1 release cycle grace-period where we try to preserve public API so clients have warnings of the change. |
@emkornfield thanks for your suggestions. |
java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java
Outdated
Show resolved
Hide resolved
|
There has conflicts, otherwise looks good. |
Thanks for your kind reminder. The conflicts have been resolved. |
java/memory/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java
Show resolved
Hide resolved
ed180da to
85fe336
Compare
|
Looks like there is a conflict now. @pravindra if you are happy with the changes, go ahead and merge. Thanks. |
Conflict resolved. Thanks a lot. |
|
thanks @liyafan82 and @emkornfield |
|
Dumb question... Why not just ArrowBuf to point? ArrowBuf is already a
pointer with a length. Why do we need a new class?
…On Wed, Jul 17, 2019, 5:17 AM liyafan82 ***@***.***> wrote:
Introduce pointer to a memory region within an ArrowBuf.
This pointer will be used as the basis for calculating the hash code
within a vector, and equality determination.
------------------------------
You can view, comment on, or merge this pull request online at:
#4897
Commit Summary
- [ARROW-5970][Java] Provide pointer to Arrow buffer
File Changes
- *A*
java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java
<https://github.com/apache/arrow/pull/4897/files#diff-0> (132)
- *A*
java/memory/src/test/java/org/apache/arrow/memory/util/TestArrowBufPointer.java
<https://github.com/apache/arrow/pull/4897/files#diff-1> (70)
- *M*
java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
<https://github.com/apache/arrow/pull/4897/files#diff-2> (22)
- *M*
java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
<https://github.com/apache/arrow/pull/4897/files#diff-3> (26)
- *M*
java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java
<https://github.com/apache/arrow/pull/4897/files#diff-4> (16)
- *M*
java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java
<https://github.com/apache/arrow/pull/4897/files#diff-5> (17)
- *M*
java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
<https://github.com/apache/arrow/pull/4897/files#diff-6> (81)
Patch Links:
- https://github.com/apache/arrow/pull/4897.patch
- https://github.com/apache/arrow/pull/4897.diff
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#4897?email_source=notifications&email_token=AABMYNXBK55IUB3PIJ54MVLP74EVJA5CNFSM4IEPWZN2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G7WZVNQ>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABMYNQFYZM6CXGI4Z2MVPTP74EVJANCNFSM4IEPWZNQ>
.
|
@pravindra thanks for your effort. |
@jacques-n good question. |
|
@jacques-n - this alternative was discussed in the code review comments. The use case that convinced me was an iterator over all the elements in a vector - the APIs added here will let us do that in a very low cost manner (no heap allocations, no refcnts, ..). |
|
Ok, thanks
…On Wed, Jul 24, 2019, 9:00 PM Pindikura Ravindra ***@***.***> wrote:
@jacques-n <https://github.com/jacques-n> - this alternative was
discussed in the code review comments. The use case that convinced me was
an iterator over all the elements in a vector - the APIs added here will
let us do that in a very low cost manner (no heap allocations, no refcnts,
..).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4897?email_source=notifications&email_token=AABMYNTXNBEAHIH5PPSAAVTQBEQOFA5CNFSM4IEPWZN2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2YIUOI#issuecomment-514886201>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABMYNX354L4ATGYDMQTYE3QBEQOFANCNFSM4IEPWZNQ>
.
|
Introduce pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and equality determination. This data structure can be considered as a "universal value holder". Closes apache#4897 from liyafan82/fly_0717_ptr and squashes the following commits: f9b0ee4 <liyafan82> Merge branch 'master' into fly_0717_ptr b2fa206 <liyafan82> Merge branch 'master' into fly_0717_ptr 394b356 <liyafan82> Move ByteFunctionHelpers class to memory module 7e34ae0 <liyafan82> Provide pointer to Arrow buffer Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>
Introduce pointer to a memory region within an ArrowBuf.
This pointer will be used as the basis for calculating the hash code within a vector, and equality determination.
This data structure can be considered as a "universal value holder".