ARROW-5970: [Java] Provide pointer to Arrow buffer #4897

liyafan82 · 2019-07-17T12:17:22Z

Introduce pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for computing hash codes and determining equality of elements within a vector.

This data structure can be considered as a "universal value holder".
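
To illustrate the idea for readers of this thread, here is a minimal sketch of a pointer to a region of a buffer with content-based hashing and equality. It is not the code from this PR: `RegionPointer` is a hypothetical name, and `java.nio.ByteBuffer` is used as a stand-in for ArrowBuf so the example runs on its own.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in: a view over [offset, offset + length) of a backing buffer.
// java.nio.ByteBuffer replaces ArrowBuf here so the sketch has no Arrow dependency.
final class RegionPointer {
    private final ByteBuffer buf;   // null means "no value"
    private final int offset;
    private final int length;

    RegionPointer(ByteBuffer buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof RegionPointer)) {
            return false;
        }
        RegionPointer other = (RegionPointer) o;
        if (buf == null || other.buf == null) {
            return buf == null && other.buf == null;
        }
        if (length != other.length) {
            return false;
        }
        // Compare the bytes in place; nothing is copied.
        for (int i = 0; i < length; i++) {
            if (buf.get(offset + i) != other.buf.get(other.offset + i)) {
                return false;
            }
        }
        return true;
    }

    @Override
    public int hashCode() {
        if (buf == null) {
            return 0;
        }
        // Simple polynomial hash over the region's bytes only.
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf.get(offset + i);
        }
        return h;
    }
}
```

Two pointers into different buffers compare equal whenever the bytes they cover are equal, which is what lets such an object act as a key in hash-based or ordered containers.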

tianchen92 · 2019-07-17T12:31:00Z

java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java

Similar logic already exists in https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java#L44
Can we directly use it here?

@tianchen92 Thanks for your kind reminder.

This functionality will be used in the new design of the dictionary encoding, and possibly other parts of the code base.

The logic in ByteFunctionHelpers is exposed only as static methods. Any scenario that uses ByteFunctionHelpers could use ArrowBufPointer instead, but not vice versa.

I'm not sure I understand this. If ByteFunctionHelpers was moved to this package couldn't it be used here?

@emkornfield , to use ByteFunctionHelpers, we should move it to the arrow-memory module. Do you think it is OK?

emkornfield · 2019-07-18T06:03:19Z

java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java

Did you consider adding hashcode and equality to ArrowBuf directly?

@emkornfield good question.
Adding hash code & equality to ArrowBuf is also a good choice.

I think there are several reasons for this data structure:

1. We want to compute and compare an arbitrary sub-area of the ArrowBuf, not the complete buffer.

2. The algorithm to compute the hash code should be configurable to be suitable for different scenarios.

3. We need a way to show that the data area is invalid.

4. ArrowBuf is the key data structure, so we do not want to add overhead (like new members) to it.
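
As a rough illustration of the second and third reasons, the sketch below shows one way a hash algorithm could be made pluggable, with a null buffer standing for an invalid area. `RegionHasher` and `RegionHashers` are hypothetical names and `java.nio.ByteBuffer` stands in for ArrowBuf; none of this is taken from the PR.

```java
import java.nio.ByteBuffer;

// Hypothetical strategy interface: hashes the bytes in [offset, offset + length).
interface RegionHasher {
    int hash(ByteBuffer buf, int offset, int length);
}

final class RegionHashers {
    // Simple polynomial hash, cheap and adequate for small keys.
    static final RegionHasher SIMPLE = (buf, offset, length) -> {
        int h = 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf.get(offset + i);
        }
        return h;
    };

    // 32-bit FNV-1a, an alternative with better dispersion on short inputs.
    static final RegionHasher FNV1A = (buf, offset, length) -> {
        int h = 0x811C9DC5;                       // FNV offset basis
        for (int i = 0; i < length; i++) {
            h ^= (buf.get(offset + i) & 0xFF);
            h *= 0x01000193;                      // FNV prime
        }
        return h;
    };

    // A null buffer marks an invalid (null) element and always hashes to 0.
    static int hashOrZero(RegionHasher hasher, ByteBuffer buf, int offset, int length) {
        return buf == null ? 0 : hasher.hash(buf, offset, length);
    }

    private RegionHashers() {}
}
```

Because the hasher is passed in rather than fixed on the buffer type, different callers can pick different algorithms without adding fields to the buffer class itself.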

1 and 4 can be solved by using a slice of an arrow buf (address/length adjusted to the data element).

3 can be solved by using a null value (same as for ArrowBufPointer)

The problem with slices is that there is perf overhead due to the refCnt incr/decr. So, I'm fine with the ArrowBufPointer approach.

OK, so is this approach dangerous then in the sense that we could have a dangling pointer?

@emkornfield I think you are right.
It is possible to have dangling pointers (as in C++). The users can check it by examining the reference count of the underlying ArrowBuf.

However, for most scenarios, I think the users have sufficient knowledge about the underlying ArrowBuf, so the checks can be avoided.
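
The dangling-pointer concern can be sketched as follows. `CountedBuffer` and `UncountedPointer` are hypothetical classes written for this example (they are not ArrowBuf or ArrowBufPointer); the point is only that a pointer which does not participate in reference counting can outlive its buffer, and that a caller can detect this by inspecting the buffer's reference count.

```java
// Hypothetical reference-counted buffer; a stand-in, not ArrowBuf.
final class CountedBuffer {
    private final byte[] data;
    private int refCnt = 1;

    CountedBuffer(int size) {
        this.data = new byte[size];
    }

    int refCnt() {
        return refCnt;
    }

    void release() {
        if (refCnt > 0) {
            refCnt--;
        }
    }

    byte getByte(int index) {
        if (refCnt == 0) {
            throw new IllegalStateException("buffer already released");
        }
        return data[index];
    }
}

// Hypothetical pointer that does NOT retain the buffer, so it can dangle.
final class UncountedPointer {
    private final CountedBuffer buf;
    private final int offset;
    private final int length;

    UncountedPointer(CountedBuffer buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    int length() {
        return length;
    }

    // The optional check mentioned above: look at the underlying reference count.
    boolean isDangling() {
        return buf.refCnt() == 0;
    }

    byte firstByte() {
        return buf.getByte(offset);
    }
}
```

Skipping the check is only safe when the caller controls the buffer's lifetime, which is the assumption made in the comment above.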

emkornfield · 2019-07-18T06:04:21Z

@siddharthteotia @pravindra I think having your input on this would be helpful.

pravindra · 2019-07-18T07:21:01Z

@liyafan82 can you please provide some context or explain the use-case for this ?

liyafan82 · 2019-07-18T07:28:39Z

@pravindra Sure. Good question.

In some scenarios (e.g. in dictionary encoding), we need to treat a memory segment as the basic unit for comparison, computation, etc. Therefore, such a data structure is required.

For instance, we may place such data structures in a heap, binary search tree, hash table, etc.
Another benefit is that we can interpret a vector as a collection/iterable of arrow buffer pointers. This will facilitate some operations.

Most importantly, this adds little overhead, as no memory copy is involved.
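
A small sketch of the hash-table use case, under stated assumptions: it uses plain `byte[]` data with an offsets array instead of an Arrow vector, and it relies on `java.nio.ByteBuffer` views as keys (their `equals` and `hashCode` are defined over the remaining bytes) rather than on ArrowBufPointer. The class name `DictionaryEncodingSketch` is made up for this example.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class DictionaryEncodingSketch {

    // Values are stored back to back in `data`; value i occupies
    // [offsets[i], offsets[i + 1]). Returns one dictionary index per value.
    static int[] encode(byte[] data, int[] offsets) {
        Map<ByteBuffer, Integer> dictionary = new HashMap<>();
        int[] indices = new int[offsets.length - 1];
        for (int i = 0; i < indices.length; i++) {
            // A view over the value's bytes: no bytes are copied.
            ByteBuffer region =
                ByteBuffer.wrap(data, offsets[i], offsets[i + 1] - offsets[i]);
            Integer id = dictionary.get(region);
            if (id == null) {
                id = dictionary.size();       // next unused dictionary index
                dictionary.put(region, id);
            }
            indices[i] = id;
        }
        return indices;
    }

    private DictionaryEncodingSketch() {}
}
```

Unlike the pointer proposed here, this stand-in still allocates a small view object per element; it only demonstrates that the byte contents themselves never need to be copied to serve as hash keys.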

pravindra · 2019-07-18T10:18:39Z

thanks @liyafan82. Couple of related questions

how is an element that is not valid represented ? null value for ArrowBufPointer ?
this doesn't work for the complex types (list/struct/map), right ?
how would hash/equality work when multiple vector types are involved ? eg. groupBy(intColumnA, longColumnB) ? The random-access, function calls, pointer indirections for doing this computation on every cell add up to a significant amount in cpu cost.

dremio optimises this by morphing the relevant vectors, one small batch at a time, to a transient row format (we call this pivoting), and then, computing the hash/equality of contiguous byte ranges. Have you considered this approach ?
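
For readers unfamiliar with pivoting, here is a minimal sketch of the general idea, assuming an int column and a long column held in plain arrays. It is not Dremio's implementation; `PivotSketch` and its methods are hypothetical, and a real engine would also have to handle nulls and variable-width data.

```java
import java.nio.ByteBuffer;

final class PivotSketch {
    // One row = 4 bytes of column A followed by 8 bytes of column B.
    static final int ROW_WIDTH = Integer.BYTES + Long.BYTES;

    // Copies a batch of two columns into a contiguous row-wise buffer.
    static ByteBuffer pivot(int[] colA, long[] colB) {
        ByteBuffer rows = ByteBuffer.allocate(colA.length * ROW_WIDTH);
        for (int i = 0; i < colA.length; i++) {
            rows.putInt(colA[i]).putLong(colB[i]);
        }
        rows.flip();
        return rows;
    }

    // Hashes the contiguous bytes of one row (both key columns at once).
    static int hashRow(ByteBuffer rows, int row) {
        int start = row * ROW_WIDTH;
        int h = 1;
        for (int i = 0; i < ROW_WIDTH; i++) {
            h = 31 * h + rows.get(start + i);
        }
        return h;
    }

    // Compares two rows byte by byte.
    static boolean rowsEqual(ByteBuffer rows, int r1, int r2) {
        int s1 = r1 * ROW_WIDTH;
        int s2 = r2 * ROW_WIDTH;
        for (int i = 0; i < ROW_WIDTH; i++) {
            if (rows.get(s1 + i) != rows.get(s2 + i)) {
                return false;
            }
        }
        return true;
    }

    private PivotSketch() {}
}
```

After pivoting, a multi-column key such as groupBy(intColumnA, longColumnB) becomes a single contiguous byte range per row, so hashing and equality need one pass over adjacent memory instead of one lookup per column.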

liyafan82 · 2019-07-18T11:00:22Z

@pravindra Thanks a lot for your valuable feedback. Please see my reply in line.

how is an element that is not valid represented ? null value for ArrowBufPointer ?

An invalid element is represented by setting ArrowBufPointer#buf to null.

this doesn't work for the complex types (list/struct/map), right ?

You are right. It only works for primitive types, because for such types each element occupies a contiguous memory region.

how would hash/equality work when multiple vector types are involved ? eg. groupBy(intColumnA, longColumnB) ? The random-access, function calls, pointer indirections for doing this computation on every cell add up to a significant amount in cpu cost.

I agree with you that the solution provided by this PR may not be efficient for the scenario you described. For the scenario, it can be better to use the get/set methods, or use the method you have given below.

For scenarios where a piece of memory needs to be placed into a search tree, heap, or hash table, this data structure is required.

dremio optimises this by morphing the relevant vectors, one small batch at a time, to a transient row format (we call this pivoting), and then, computing the hash/equality of contiguous byte ranges. Have you considered this approach ?

This is definitely a good idea. It has been widely used in SQL engines. We have another PR to work towards this goal (#4844). Would you please give some comments?

pravindra · 2019-07-18T11:24:18Z

java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java

1 and 4 can be solved by using a slice of an arrow buf (address/length adjusted to the data element).

3 can be solved by using a null value (same as for ArrowBufPointer)

The problem with slices is that there is perf overhead due to the refCnt incr/decr. So, I'm fine with the ArrowBufPointer approach.

pravindra · 2019-07-18T11:26:42Z

java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java

return getDataPointer(index, new ArrowBufPointer());

Good suggestion. Thanks a lot.

pravindra · 2019-07-18T11:28:07Z

java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java

avoid duplication by

getDataPointer(index, new ArrowBufPointer())

Revised. Thank you so much.
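
The shape of this suggestion can be sketched as below: the no-argument-pointer overload delegates to the overload that fills a caller-supplied pointer, so the lookup logic lives in one place. `ReusablePointer` and `FixedWidthVectorSketch` are hypothetical stand-ins built on `byte[]`; the actual vector classes and ArrowBufPointer are not reproduced here.

```java
// Hypothetical mutable pointer; buf == null marks a null element.
final class ReusablePointer {
    byte[] buf;
    int offset;
    int length;

    void set(byte[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }
}

// Hypothetical fixed-width vector backed by a byte array and a validity array.
final class FixedWidthVectorSketch {
    private final byte[] data;
    private final boolean[] valid;
    private final int typeWidth;

    FixedWidthVectorSketch(byte[] data, boolean[] valid, int typeWidth) {
        this.data = data;
        this.valid = valid;
        this.typeWidth = typeWidth;
    }

    // Convenience overload: allocates a pointer, then delegates.
    ReusablePointer getDataPointer(int index) {
        return getDataPointer(index, new ReusablePointer());
    }

    // Fills and returns the supplied pointer; a null buf means the element is null.
    ReusablePointer getDataPointer(int index, ReusablePointer reuse) {
        if (valid[index]) {
            reuse.set(data, index * typeWidth, typeWidth);
        } else {
            reuse.set(null, 0, 0);
        }
        return reuse;
    }
}
```

Callers on a hot path can pass the same pointer repeatedly to avoid allocations, while occasional callers use the shorter overload.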

pravindra · 2019-07-18T12:10:53Z

java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java

can you pls add a doc comment here that this fn returning null implies that it's a null element.

Sure. Thank you for the good suggestion.

pravindra

lgtm.

pravindra · 2019-07-18T12:26:17Z

i'll wait to hear what @emkornfield says about the question on ByteFunctionHelpers, and then, merge this.

codecov-io · 2019-07-18T14:37:39Z

Codecov Report

Merging #4897 into master will increase coverage by 2.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4897      +/-   ##
==========================================
+ Coverage   87.44%   89.58%   +2.14%     
==========================================
  Files         995      661     -334     
  Lines      140460    96645   -43815     
  Branches     1418        0    -1418     
==========================================
- Hits       122820    86580   -36240     
+ Misses      17278    10065    -7213     
+ Partials      362        0     -362

Impacted Files	Coverage Δ
cpp/src/gandiva/precompiled/arithmetic_ops_test.cc	`100% <0%> (ø)`	⬆️
cpp/src/gandiva/function_registry_arithmetic.cc	`100% <0%> (ø)`	⬆️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
... and 327 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c1f25e8...7e34ae0.

emkornfield · 2019-07-19T03:20:05Z

My suggested approach:
Since it looks like ByteFunctionHelpers is public, I think the approach we should take is make a copy of the class and place it in this package. Update the arrow code base to point to this new class here. Keep ByteFunctionHelpers class where it is but update the implementations to point to the new class here. Mark the existing class and methods as deprecated, and remove it after the next release.

As much as possible I think we should be trying to get in the habit of having a 1 release cycle grace-period where we try to preserve public API so clients have warnings of the change.
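
The deprecation pattern described above could look roughly like the following. The class and method names (`MemoryByteHelpers`, `LegacyByteHelpers`, `compare`) and the `byte[]` signatures are illustrative assumptions, not the real ByteFunctionHelpers API; in a real code base the two classes would sit in different packages rather than in one file.

```java
// New home of the logic (hypothetical name, stands for the class in the memory module).
final class MemoryByteHelpers {

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    static int compare(byte[] left, int lStart, int lEnd,
                       byte[] right, int rStart, int rEnd) {
        int n = Math.min(lEnd - lStart, rEnd - rStart);
        for (int i = 0; i < n; i++) {
            int l = left[lStart + i] & 0xFF;
            int r = right[rStart + i] & 0xFF;
            if (l != r) {
                return l < r ? -1 : 1;
            }
        }
        // Equal common prefix: the shorter range sorts first.
        return Integer.compare(lEnd - lStart, rEnd - rStart);
    }

    private MemoryByteHelpers() {}
}

// Old location (hypothetical name): kept for one release, forwarding to the new class.
final class LegacyByteHelpers {

    /**
     * @deprecated use {@link MemoryByteHelpers#compare} instead; this forwarder
     *     is kept only so existing callers get a warning before removal.
     */
    @Deprecated
    static int compare(byte[] left, int lStart, int lEnd,
                       byte[] right, int rStart, int rEnd) {
        return MemoryByteHelpers.compare(left, lStart, lEnd, right, rStart, rEnd);
    }

    private LegacyByteHelpers() {}
}
```

Existing callers keep compiling and see a deprecation warning, which gives them one release cycle to switch before the old class is deleted.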

liyafan82 · 2019-07-19T04:12:36Z


@emkornfield thanks for your suggestions.
I have revised it accordingly. Please take a look.

java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java

tianchen92 · 2019-07-19T04:17:17Z

There are conflicts; otherwise looks good.

liyafan82 · 2019-07-19T04:48:59Z

Thanks for your kind reminder. The conflicts have been resolved.

java/memory/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java

emkornfield · 2019-07-24T04:20:31Z

Looks like there is a conflict now. @pravindra if you are happy with the changes, go ahead and merge. Thanks.

liyafan82 · 2019-07-24T04:42:43Z

Conflict resolved. Thanks a lot.

pravindra · 2019-07-24T13:51:13Z

thanks @liyafan82 and @emkornfield

jacques-n · 2019-07-25T01:15:12Z

Dumb question... Why not just use ArrowBuf as the pointer? ArrowBuf is already a pointer with a length. Why do we need a new class?

On Wed, Jul 17, 2019, 5:17 AM liyafan82 wrote:

Introduce pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and equality determination.

Commit Summary
- [ARROW-5970][Java] Provide pointer to Arrow buffer

File Changes
- A java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java (132)
- A java/memory/src/test/java/org/apache/arrow/memory/util/TestArrowBufPointer.java (70)
- M java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java (22)
- M java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java (26)
- M java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java (16)
- M java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java (17)
- M java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java (81)

Patch Links:
- https://github.com/apache/arrow/pull/4897.patch
- https://github.com/apache/arrow/pull/4897.diff

liyafan82 · 2019-07-25T01:44:40Z

@pravindra thanks for your effort.

liyafan82 · 2019-07-25T01:49:10Z


@jacques-n good question.
ArrowBufPointer points to a contiguous region within an ArrowBuf. It is more lightweight than ArrowBuf because it does not maintain a reference count or read/write indices, and it never needs to be closed, so it cannot leak resources.

pravindra · 2019-07-25T04:00:33Z

@jacques-n - this alternative was discussed in the code review comments. The use case that convinced me was an iterator over all the elements in a vector - the APIs added here will let us do that in a very low cost manner (no heap allocations, no refcnts, ..).
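
The iteration use case can be sketched as follows, assuming fixed-width elements stored in a plain `byte[]`. `RegionCursor` and `CursorIterationSketch` are hypothetical names for this example; the real APIs added by the PR operate on Arrow vectors and ArrowBufPointer.

```java
// Hypothetical mutable cursor over a region of a byte array.
final class RegionCursor {
    byte[] buf;
    int offset;
    int length;
}

final class CursorIterationSketch {

    // Counts fixed-width elements whose bytes equal `probe`,
    // moving a single cursor across the data.
    static int countMatches(byte[] data, int typeWidth, byte[] probe) {
        RegionCursor cursor = new RegionCursor();   // allocated once, reused per element
        cursor.buf = data;
        cursor.length = typeWidth;
        int matches = 0;
        for (int off = 0; off + typeWidth <= data.length; off += typeWidth) {
            cursor.offset = off;                    // re-point instead of re-allocating
            if (regionEquals(cursor, probe)) {
                matches++;
            }
        }
        return matches;
    }

    private static boolean regionEquals(RegionCursor c, byte[] probe) {
        if (c.length != probe.length) {
            return false;
        }
        for (int i = 0; i < c.length; i++) {
            if (c.buf[c.offset + i] != probe[i]) {
                return false;
            }
        }
        return true;
    }

    private CursorIterationSketch() {}
}
```

The loop touches every element without creating a per-element object or adjusting any reference count, which is the low-cost property described in the comment above.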

jacques-n · 2019-07-25T04:39:58Z

Ok, thanks


Introduce pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and equality determination. This data structure can be considered as a "universal value holder".

Closes apache#4897 from liyafan82/fly_0717_ptr and squashes the following commits:

f9b0ee4 <liyafan82> Merge branch 'master' into fly_0717_ptr
b2fa206 <liyafan82> Merge branch 'master' into fly_0717_ptr
394b356 <liyafan82> Move ByteFunctionHelpers class to memory module
7e34ae0 <liyafan82> Provide pointer to Arrow buffer

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>

tianchen92 reviewed Jul 17, 2019

emkornfield reviewed Jul 18, 2019

emkornfield added the Component: Java label Jul 18, 2019

pravindra reviewed Jul 18, 2019

liyafan82 force-pushed the fly_0717_ptr branch from 5f50791 to 26d992b, July 18, 2019 11:47

pravindra reviewed Jul 18, 2019

pravindra approved these changes Jul 18, 2019

[ARROW-5970][Java] Provide pointer to Arrow buffer

7e34ae0

liyafan82 force-pushed the fly_0717_ptr branch from 26d992b to 7e34ae0, July 18, 2019 12:30

[ARROW-5970][Java] Move ByteFunctionHelpers class to memory module

394b356

emkornfield reviewed Jul 19, 2019

java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java (outdated)

Merge branch 'master' into fly_0717_ptr

b2fa206

tianchen92 approved these changes Jul 19, 2019

emkornfield reviewed Jul 20, 2019

java/memory/src/main/java/org/apache/arrow/memory/util/ByteFunctionHelpers.java

kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336, July 22, 2019 19:29

Merge branch 'master' into fly_0717_ptr

f9b0ee4

pravindra closed this in 065d9dc Jul 24, 2019

asfimport mentioned this pull request Aug 1, 2019

[Java] Provide pointer to Arrow buffer #22378 (Closed)
