@liyafan82
Contributor

liyafan82 commented Jul 17, 2019

Introduce pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for calculating the hash code within a vector, and equality determination.

This data structure can be considered as a "universal value holder".

Contributor Author

@tianchen92 Thanks for your kind reminder.

This functionality will be used in the new design of the dictionary encoding, and possibly other parts of the code base.

The logic in ByteFunctionHelpers is based on static methods. So the scenario that is based on ByteFunctionHelpers can also use ArrowBufPointer, but not vice versa.

Contributor

I'm not sure I understand this. If ByteFunctionHelpers was moved to this package couldn't it be used here?

Contributor Author

@emkornfield, to use ByteFunctionHelpers here, we would need to move it to the arrow-memory module. Do you think that is OK?

Contributor

Did you consider adding hashcode and equality to ArrowBuf directly?

Contributor Author

@emkornfield good question.
Adding hash code & equality to ArrowBuf is also a good choice.

I think there are several reasons for this data structure:

  1. We want to compute and compare an arbitrary sub-area of the ArrowBuf, not the complete buffer.
  2. The algorithm to compute the hash code should be configurable to be suitable for different scenarios.
  3. We need a way to show that the data area is invalid.
  4. ArrowBuf is the key data structure, so we do not want to add overhead (like new members) to it.
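
The four reasons above can be illustrated with a minimal sketch. This is not the code from this PR: `RegionPointer` and `RegionHasher` are hypothetical names, and `java.nio.ByteBuffer` stands in for ArrowBuf so that the example runs without Arrow on the classpath.

```java
import java.nio.ByteBuffer;

/** Hypothetical pluggable hash strategy (reason 2). */
interface RegionHasher {
  int hash(ByteBuffer buf, int offset, int length);
}

/** Hypothetical stand-in for ArrowBufPointer; ByteBuffer plays the role of ArrowBuf. */
final class RegionPointer {
  /** Simple polynomial hash, used here as one possible strategy. */
  static final RegionHasher DEFAULT_HASHER = (buf, offset, length) -> {
    int h = 1;
    for (int i = 0; i < length; i++) {
      h = 31 * h + buf.get(offset + i);
    }
    return h;
  };

  private final RegionHasher hasher;
  private ByteBuffer buf; // null marks an invalid region (reason 3)
  private int offset;
  private int length;

  RegionPointer(RegionHasher hasher) {
    this.hasher = hasher;
  }

  /** Re-points at a sub-region (reason 1); the buffer itself gains no new fields (reason 4). */
  void set(ByteBuffer buf, int offset, int length) {
    this.buf = buf;
    this.offset = offset;
    this.length = length;
  }

  boolean isValid() {
    return buf != null;
  }

  /** Two pointers are equal when both are invalid or both regions hold the same bytes. */
  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof RegionPointer)) {
      return false;
    }
    RegionPointer other = (RegionPointer) o;
    if (buf == null || other.buf == null) {
      return buf == null && other.buf == null;
    }
    if (length != other.length) {
      return false;
    }
    for (int i = 0; i < length; i++) {
      if (buf.get(offset + i) != other.buf.get(other.offset + i)) {
        return false;
      }
    }
    return true;
  }

  @Override
  public int hashCode() {
    return buf == null ? 0 : hasher.hash(buf, offset, length);
  }
}
```

Pointers that are compared with each other (for example as keys in the same hash table) would need to share one hasher, otherwise equal regions could produce different hash codes.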

Contributor

  • 1 and 4 can be solved by using a slice of an arrow buf (address/length adjusted to the data element).
  • 3 can be solved by using a null value (same as for ArrowBufPointer)

The problem with slices is that there is perf overhead due to the refCnt incr/decr. So, I'm fine with the ArrowBufPointer approach.

Contributor

OK, so is this approach dangerous then in the sense that we could have a dangling pointer?

Contributor Author

@emkornfield I think you are right.
It is possible to have dangling pointers (as in C++). The users can check it by examining the reference count of the underlying ArrowBuf.

However, for most scenarios, I think the users have sufficient knowledge about the underlying ArrowBuf, so the checks can be avoided.
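
The reference-count check mentioned above might look like the following sketch. `CountedBuffer` and `BorrowedPointer` are hypothetical stand-ins (not Arrow classes); the only assumption is that the underlying buffer exposes its current reference count.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted buffer standing in for ArrowBuf. */
final class CountedBuffer {
  private final AtomicInteger refCnt = new AtomicInteger(1);
  final byte[] data;

  CountedBuffer(int size) {
    this.data = new byte[size];
  }

  int refCnt() {
    return refCnt.get();
  }

  /** Drops one reference; at zero a real allocator would reclaim the memory. */
  void release() {
    refCnt.decrementAndGet();
  }
}

/** Hypothetical pointer that borrows a region without taking a reference of its own. */
final class BorrowedPointer {
  final CountedBuffer buf;
  final int offset;
  final int length;

  BorrowedPointer(CountedBuffer buf, int offset, int length) {
    this.buf = buf;
    this.offset = offset;
    this.length = length;
  }

  /** True once the owner has released the buffer, so reading through the pointer is unsafe. */
  boolean isDangling() {
    return buf.refCnt() <= 0;
  }
}
```

Because the pointer never increments the count, the check is cheap but only advisory: callers that control the lifetime of the buffer can skip it.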

@emkornfield
Contributor

@siddharthteotia @pravindra I think having your input on this would be helpful.

@pravindra
Contributor

@liyafan82 can you please provide some context or explain the use-case for this ?

@liyafan82
Contributor Author

> @liyafan82 can you please provide some context or explain the use-case for this ?

@pravindra Sure. Good question.

In some scenarios (e.g. dictionary encoding), we need to treat a memory segment as the basic unit for comparison, computation, etc. Therefore, such a data structure is required.

For instance, we may place such data structures in a heap, binary search tree, hash table, etc.
Another benefit is that we can interpret a vector as a collection/iterable of Arrow buffer pointers. This will facilitate some operations.

Most importantly, there is little overhead, as no memory copy is involved.

@pravindra
Contributor

thanks @liyafan82. Couple of related questions

  • how is an element that is not valid represented ? null value for ArrowBufPointer ?
  • this doesn't work for the complex types (list/struct/map), right ?
  • how would hash/equality work when multiple vector types are involved ? eg. groupBy(intColumnA, longColumnB) ? The random-access, function calls, pointer indirections for doing this computation on every cell add up to a significant amount in cpu cost.

dremio optimises this by morphing the relevant vectors, one small batch at a time, to a transient row format (we call this pivoting), and then, computing the hash/equality of contiguous byte ranges. Have you considered this approach ?
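
To make the pivoting idea concrete, here is a minimal sketch under simplifying assumptions. It is not Dremio's implementation: `PivotSketch` is a hypothetical name, plain Java arrays stand in for the int and long vectors, and null handling is left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: pivot two columns into fixed-width rows, then hash/compare rows. */
final class PivotSketch {
  /** One row holds a 4-byte int followed by an 8-byte long. */
  static final int ROW_WIDTH = Integer.BYTES + Long.BYTES;

  private PivotSketch() {}

  /** Copies one batch of the two columns into a row-major buffer. */
  static ByteBuffer pivot(int[] colA, long[] colB) {
    ByteBuffer rows = ByteBuffer.allocate(colA.length * ROW_WIDTH);
    for (int i = 0; i < colA.length; i++) {
      rows.putInt(i * ROW_WIDTH, colA[i]);
      rows.putLong(i * ROW_WIDTH + Integer.BYTES, colB[i]);
    }
    return rows;
  }

  /** Hashes the contiguous bytes of one row in a single pass. */
  static int hashRow(ByteBuffer rows, int row) {
    int start = row * ROW_WIDTH;
    int h = 1;
    for (int i = 0; i < ROW_WIDTH; i++) {
      h = 31 * h + rows.get(start + i);
    }
    return h;
  }

  /** Compares two rows byte by byte. */
  static boolean rowsEqual(ByteBuffer rows, int r1, int r2) {
    for (int i = 0; i < ROW_WIDTH; i++) {
      if (rows.get(r1 * ROW_WIDTH + i) != rows.get(r2 * ROW_WIDTH + i)) {
        return false;
      }
    }
    return true;
  }
}
```

After the one-time copy per batch, each group-by key is a single contiguous byte range, so hashing and equality no longer touch the individual column vectors.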

@liyafan82
Contributor Author

liyafan82 commented Jul 18, 2019

> thanks @liyafan82. Couple of related questions
>
>   • how is an element that is not valid represented ? null value for ArrowBufPointer ?
>   • this doesn't work for the complex types (list/struct/map), right ?
>   • how would hash/equality work when multiple vector types are involved ? eg. groupBy(intColumnA, longColumnB) ? The random-access, function calls, pointer indirections for doing this computation on every cell add up to a significant amount in cpu cost.
>
> dremio optimises this by morphing the relevant vectors, one small batch at a time, to a transient row format (we call this pivoting), and then, computing the hash/equality of contiguous byte ranges. Have you considered this approach ?

@pravindra Thanks a lot for your valuable feedback. Please see my replies inline.

>   • how is an element that is not valid represented ? null value for ArrowBufPointer ?

An invalid element is represented by setting ArrowBufPointer#buf to null.

>   • this doesn't work for the complex types (list/struct/map), right ?

You are right. It only works for primitive types, because for such types each element occupies a contiguous memory region.

>   • how would hash/equality work when multiple vector types are involved ? eg. groupBy(intColumnA, longColumnB) ? The random-access, function calls, pointer indirections for doing this computation on every cell add up to a significant amount in cpu cost.

I agree that the solution provided by this PR may not be efficient for the scenario you described. For that scenario, it may be better to use the get/set methods, or the approach you describe below.

For scenarios where a piece of memory needs to be placed into a search tree, heap, or hash table, this data structure is required.

> dremio optimises this by morphing the relevant vectors, one small batch at a time, to a transient row format (we call this pivoting), and then, computing the hash/equality of contiguous byte ranges. Have you considered this approach ?

This is definitely a good idea. It has been widely used in SQL engines. We have another PR to work towards this goal (#4844). Would you please give some comments?

Contributor

return getDataPointer(index, new ArrowBufPointer());

Contributor Author

Good suggestion. Thanks a lot.

Contributor

avoid duplication by

getDataPointer(index, new ArrowBufPointer())

Contributor Author

Revised. Thank you so much.

Contributor

can you pls add a doc comment here that this fn returning null implies that it's a null element.

Contributor Author

Sure. Thank you for the good suggestion.

Contributor

pravindra left a comment

lgtm.

@pravindra
Contributor

i'll wait to hear what @emkornfield says about the question on ByteFunctionHelpers, and then, merge this.

@codecov-io

Codecov Report

Merging #4897 into master will increase coverage by 2.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4897      +/-   ##
==========================================
+ Coverage   87.44%   89.58%   +2.14%     
==========================================
  Files         995      661     -334     
  Lines      140460    96645   -43815     
  Branches     1418        0    -1418     
==========================================
- Hits       122820    86580   -36240     
+ Misses      17278    10065    -7213     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/gandiva/precompiled/arithmetic_ops_test.cc 100% <0%> (ø) ⬆️
cpp/src/gandiva/function_registry_arithmetic.cc 100% <0%> (ø) ⬆️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
... and 327 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c1f25e8...7e34ae0.

@emkornfield
Contributor

My suggested approach:
Since it looks like ByteFunctionHelpers is public, I think the approach we should take is make a copy of the class and place it in this package. Update the arrow code base to point to this new class here. Keep ByteFunctionHelpers class where it is but update the implementations to point to the new class here. Mark the existing class and methods as deprecated, and remove it after the next release.

As much as possible I think we should be trying to get in the habit of having a 1 release cycle grace-period where we try to preserve public API so clients have warnings of the change.
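
The migration path suggested above could look roughly like this. The class names and the simplified `equal` signature are hypothetical and chosen only to keep the sketch self-contained; they do not reproduce the actual ByteFunctionHelpers API.

```java
/** Hypothetical new home of the helper (e.g. in the memory module). */
final class NewByteHelpers {
  private NewByteHelpers() {}

  /** Returns true when the two byte ranges [start, end) have equal length and content. */
  static boolean equal(byte[] left, int lStart, int lEnd, byte[] right, int rStart, int rEnd) {
    int n = lEnd - lStart;
    if (n != rEnd - rStart) {
      return false;
    }
    for (int i = 0; i < n; i++) {
      if (left[lStart + i] != right[rStart + i]) {
        return false;
      }
    }
    return true;
  }
}

/**
 * Hypothetical old class, kept in its original location for one release.
 *
 * @deprecated use {@link NewByteHelpers} instead.
 */
@Deprecated
final class OldByteHelpers {
  private OldByteHelpers() {}

  /**
   * Delegates to the new implementation so behaviour stays identical.
   *
   * @deprecated use {@link NewByteHelpers#equal} instead.
   */
  @Deprecated
  static boolean equal(byte[] left, int lStart, int lEnd, byte[] right, int rStart, int rEnd) {
    return NewByteHelpers.equal(left, lStart, lEnd, right, rStart, rEnd);
  }
}
```

Existing callers keep compiling (with a deprecation warning) for one release cycle, while code inside the project can switch to the new class immediately.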

@liyafan82
Contributor Author

> My suggested approach:
> Since it looks like ByteFunctionHelpers is public, I think the approach we should take is make a copy of the class and place it in this package. Update the arrow code base to point to this new class here. Keep ByteFunctionHelpers class where it is but update the implementations to point to the new class here. Mark the existing class and methods as deprecated, and remove it after the next release.
>
> As much as possible I think we should be trying to get in the habit of having a 1 release cycle grace-period where we try to preserve public API so clients have warnings of the change.

@emkornfield thanks for your suggestions.
I have revised it accordingly. Please take a look.

@tianchen92
Contributor

There are conflicts; otherwise looks good.

@liyafan82
Contributor Author

> There are conflicts; otherwise looks good.

Thanks for your kind reminder. The conflicts have been resolved.

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@emkornfield
Contributor

Looks like there is a conflict now. @pravindra if you are happy with the changes, go ahead and merge. Thanks.

@liyafan82
Contributor Author

> Looks like there is a conflict now. @pravindra if you are happy with the changes, go ahead and merge. Thanks.

Conflict resolved. Thanks a lot.

@pravindra pravindra closed this in 065d9dc Jul 24, 2019
@pravindra
Contributor

thanks @liyafan82 and @emkornfield

@jacques-n
Contributor

jacques-n commented Jul 25, 2019 via email

@liyafan82
Contributor Author

> thanks @liyafan82 and @emkornfield

@pravindra thanks for your effort.

@liyafan82
Contributor Author

> Dumb question... Why not just ArrowBuf to point? ArrowBuf is already a pointer with a length. Why do we need a new class?
>
> On Wed, Jul 17, 2019, 5:17 AM liyafan82 wrote:
> Introduce pointer to a memory region within an ArrowBuf. This pointer will be used as the basis for calculating the hash code within a vector, and equality determination.
>
> You can view, comment on, or merge this pull request online at: #4897
>
> Commit Summary
>   • [ARROW-5970][Java] Provide pointer to Arrow buffer
>
> File Changes
>   • A java/memory/src/main/java/org/apache/arrow/memory/util/ArrowBufPointer.java https://github.com/apache/arrow/pull/4897/files#diff-0 (132)
>   • A java/memory/src/test/java/org/apache/arrow/memory/util/TestArrowBufPointer.java https://github.com/apache/arrow/pull/4897/files#diff-1 (70)
>   • M java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java https://github.com/apache/arrow/pull/4897/files#diff-2 (22)
>   • M java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java https://github.com/apache/arrow/pull/4897/files#diff-3 (26)
>   • M java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java https://github.com/apache/arrow/pull/4897/files#diff-4 (16)
>   • M java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java https://github.com/apache/arrow/pull/4897/files#diff-5 (17)
>   • M java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java https://github.com/apache/arrow/pull/4897/files#diff-6 (81)
>
> Patch Links:
>   • https://github.com/apache/arrow/pull/4897.patch
>   • https://github.com/apache/arrow/pull/4897.diff

@jacques-n good question.
ArrowBufPointer points to a contiguous region within an ArrowBuf. It is more lightweight than ArrowBuf: it does not maintain a reference count or read/write indices, and it does not need to be closed, so it does not introduce resource leaks.

@pravindra
Contributor

@jacques-n - this alternative was discussed in the code review comments. The use case that convinced me was an iterator over all the elements in a vector - the APIs added here will let us do that in a very low cost manner (no heap allocations, no refcnts, ..).
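
The iterator use case can be sketched as follows. `ElementPointer` and `IntVectorSketch` are hypothetical stand-ins for ArrowBufPointer and a fixed-width vector, with `java.nio.ByteBuffer` in place of ArrowBuf and a `boolean[]` in place of the validity bitmap; only the `getDataPointer(index, reuse)` shape is taken from the review comments.

```java
import java.nio.ByteBuffer;

/** Hypothetical mutable pointer; a null buf marks a null element. */
final class ElementPointer {
  ByteBuffer buf;
  int offset;
  int length;
}

/** Hypothetical fixed-width vector of 4-byte ints. */
final class IntVectorSketch {
  static final int WIDTH = Integer.BYTES;

  private final ByteBuffer data;
  private final boolean[] valid; // stand-in for the validity bitmap
  private final int count;

  /** Builds the vector; a null argument becomes a null element. */
  IntVectorSketch(Integer... values) {
    count = values.length;
    data = ByteBuffer.allocate(count * WIDTH);
    valid = new boolean[count];
    for (int i = 0; i < count; i++) {
      if (values[i] != null) {
        valid[i] = true;
        data.putInt(i * WIDTH, values[i]);
      }
    }
  }

  int getValueCount() {
    return count;
  }

  /** Allocating overload that delegates to the reusing one to avoid duplicated logic. */
  ElementPointer getDataPointer(int index) {
    return getDataPointer(index, new ElementPointer());
  }

  /** Re-points {@code reuse} at element {@code index}; no bytes are copied. */
  ElementPointer getDataPointer(int index, ElementPointer reuse) {
    if (valid[index]) {
      reuse.buf = data;
      reuse.offset = index * WIDTH;
      reuse.length = WIDTH;
    } else {
      reuse.buf = null;
      reuse.offset = 0;
      reuse.length = 0;
    }
    return reuse;
  }
}
```

A caller can allocate one `ElementPointer` before the loop and pass it to `getDataPointer(i, pointer)` for every index, so iterating over the whole vector needs no per-element heap allocation and no reference-count updates.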

@jacques-n
Contributor

jacques-n commented Jul 25, 2019 via email

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Introduce pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for calculating the hash code within a vector, and equality determination.

This data structure can be considered as a "universal value holder".

Closes apache#4897 from liyafan82/fly_0717_ptr and squashes the following commits:

f9b0ee4 <liyafan82> Merge branch 'master' into fly_0717_ptr
b2fa206 <liyafan82> Merge branch 'master' into fly_0717_ptr
394b356 <liyafan82>  Move ByteFunctionHelpers class to memory module
7e34ae0 <liyafan82>  Provide pointer to Arrow buffer

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>