ARROW-5898: [Java] Provide functionality to efficiently compute hash code for arbitrary memory segment #4844

liyafan82 · 2019-07-10T10:36:24Z

This issue adds functionality to efficiently compute the hash code for a contiguous memory region. This is important in practical scenarios because it helps:

* Avoid unnecessary memory copies.
* Avoid repeated conversions between Java objects & Arrow buffers.

Since the algorithm for calculating the hash code has significant performance implications, we need to design an interface so that different algorithms can easily be introduced as plug-ins.
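
As an illustration of the plug-in idea, here is a minimal sketch built on `java.nio.ByteBuffer` rather than on Arrow's own buffer classes. The `SegmentHasher` interface, its `hash` method, and the `SIMPLE` implementation are hypothetical names made up for this example; they are not the API added by this PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PluggableHasherSketch {
  /** Hypothetical plug-in point: hashes a region of a buffer in place, without copying it. */
  public interface SegmentHasher {
    int hash(ByteBuffer buf, int offset, int length);
  }

  /** One possible plug-in: folds 4-byte words, then trailing bytes, with a 31 * h + v step. */
  public static final SegmentHasher SIMPLE = (buf, offset, length) -> {
    // duplicate() shares the memory; only the view's byte order is changed.
    ByteBuffer view = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    int h = 0;
    int i = 0;
    for (; i + 4 <= length; i += 4) {
      h = 31 * h + view.getInt(offset + i);
    }
    for (; i < length; i++) {
      h = 31 * h + view.get(offset + i);
    }
    return h;
  };

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(16);
    for (int i = 0; i < 16; i++) {
      buf.put(i, (byte) (i % 8));
    }
    // Bytes [0, 8) and [8, 16) hold the same content, so the two hashes match.
    System.out.println(SIMPLE.hash(buf, 0, 8) == SIMPLE.hash(buf, 8, 8));
  }
}
```

A different algorithm can be swapped in by supplying another `SegmentHasher`, and callers never materialize a `byte[]` or a boxed object for the region being hashed.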

praveenbingo · 2019-07-11T07:09:40Z

java/algorithm/src/main/java/org/apache/arrow/algorithm/hash/ArrowBufHasher.java

This can be replaced with PlatformDependent.getLong(), which interprets the bytes according to the platform endianness.

@praveenbingo thanks a lot for your comments.

Here we force the algorithm to interpret the data as little endian, in a platform-independent way. This makes sure that, if the data is sent from a little-endian machine to a big-endian machine, its hash code remains unchanged.

I see, sounds good.
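
The platform-independent little-endian read discussed in this thread can be sketched with plain JDK classes. This is only an illustration under the assumption that the data sits in a `byte[]`; the class and method names are hypothetical and this is not the code from `ArrowBufHasher`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianRead {
  /** Reads 8 bytes at offset as a little-endian long, whatever the host byte order is. */
  public static long readLongLE(byte[] data, int offset) {
    return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
  }

  /** Same result by hand: the byte at the lowest index is the least significant one. */
  public static long readLongLEManual(byte[] data, int offset) {
    long value = 0;
    for (int i = 7; i >= 0; i--) {
      value = (value << 8) | (data[offset + i] & 0xFFL);
    }
    return value;
  }

  public static void main(String[] args) {
    byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};
    System.out.println(Long.toHexString(readLongLE(data, 0)));
    System.out.println(readLongLE(data, 0) == readLongLEManual(data, 0));
  }
}
```

Because the byte order is fixed explicitly, the same bytes produce the same `long`, and therefore the same hash input, on little-endian and big-endian hosts.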

liyafan82 · 2019-07-15T03:38:56Z

This functionality should be placed in the memory module to avoid potential cyclic dependency.

praveenbingo

Lgtm.

praveenbingo · 2019-07-18T11:58:10Z

java/memory/src/main/java/org/apache/arrow/memory/util/hash/DirectHasher.java

We can avoid Integer.hashCode(..) here, since it returns the value as is anyway.

Good catch! Thanks a lot.
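
A quick check of the reviewer's point, using only the JDK: the static `Integer.hashCode(int)` returns its argument unchanged, and `Byte.hashCode(byte)` returns the byte widened to `int`. The class and method names below are hypothetical.

```java
public class BoxedHashIdentity {
  /** True if Integer.hashCode(v) equals v for every sampled int. */
  public static boolean intHashIsIdentity() {
    int[] samples = {0, 1, -1, 12345, Integer.MIN_VALUE, Integer.MAX_VALUE};
    for (int v : samples) {
      if (Integer.hashCode(v) != v) {
        return false;
      }
    }
    return true;
  }

  /** True if Byte.hashCode(b) equals (int) b for every possible byte value. */
  public static boolean byteHashIsCast() {
    for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
      byte b = (byte) i;
      if (Byte.hashCode(b) != (int) b) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(intHashIsIdentity());
    System.out.println(byteHashIsCast());
  }
}
```

So the direct value, or a plain `(int)` cast for a byte, gives the same result without the extra call.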

praveenbingo · 2019-07-18T11:58:46Z

java/memory/src/main/java/org/apache/arrow/memory/util/hash/DirectHasher.java

Same here: we can directly use (int) byteValue.

Revised. Thank you.

tianchen92 · 2019-07-19T04:20:38Z

java/memory/src/main/java/org/apache/arrow/memory/util/hash/DirectHasher.java

@emkornfield Since the hashCode & equals API is already checked in, could you take a look at this PR? Something like finalizeHashCode is useful in actual applications.

@tianchen92 I'm not sure I understand the question exactly. What were you thinking about doing with it?

I think an algorithm like Murmur hashing will significantly reduce hash collisions.

It could. This code doesn't seem to implement the full murmur hash though, and I don't know how much the finalization step alone helps.

You are right, this is not the full murmur hash. However, it is simple and effective:

The hash codes computed on int/byte/long values are of poor quality (they are not spread evenly across the hash space), but they are fast to compute.

The murmur finalization step makes the result evenly distributed across the hash space. It is relatively costly, as it involves a few integer multiplications. However, it is carried out only once.

Without the finalization step, the quality of the hash code can be poor. For example, suppose we have an integer array with small positive numbers (this is common in practice, and many cases can be converted to an equivalent one). Without the finalization step, the hash code would also be a small integer (though larger), so it is not evenly distributed across the hash space, and may cause problems (e.g. in an open addressing hash table).
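
To make the argument concrete, here is a small sketch of a cheap combining step followed by a single finalization pass. The finalizer is the standard 32-bit MurmurHash3 finalization mix (often called fmix32); the `31 * h + v` combining step and all class and method names are assumptions for illustration and are not taken from `DirectHasher`.

```java
public class FinalizeHashSketch {
  /** Standard MurmurHash3 32-bit finalization mix (fmix32). */
  public static int fmix32(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
  }

  /** Cheap combining step (an assumption for this sketch): h = 31 * h + v. */
  public static int rawHash(int[] values) {
    int h = 0;
    for (int v : values) {
      h = 31 * h + v;
    }
    return h;
  }

  /** Combine cheaply for every element, then finalize exactly once. */
  public static int finalizedHash(int[] values) {
    return fmix32(rawHash(values));
  }

  public static void main(String[] args) {
    // Small positive inputs: the raw hashes stay small, the finalized ones use all 32 bits.
    for (int i = 1; i <= 4; i++) {
      int[] values = {i, i + 1};
      System.out.printf("raw=%08x finalized=%08x%n", rawHash(values), finalizedHash(values));
    }
  }
}
```

Each step of the mix is invertible, so distinct raw hashes remain distinct after finalization; the multiplications are paid once per hashed segment rather than once per element.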

codecov-io · 2019-07-19T12:47:44Z

Codecov Report

Merging #4844 into master will increase coverage by 2.15%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4844      +/-   ##
==========================================
+ Coverage   87.46%   89.61%   +2.15%     
==========================================
  Files         994      660     -334     
  Lines      140389    96546   -43843     
  Branches     1418        0    -1418     
==========================================
- Hits       122785    86518   -36267     
+ Misses      17242    10028    -7214     
+ Partials      362        0     -362

Impacted Files	Coverage Δ
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 324 more

Last update 85fe336...8f8b6c5.

emkornfield · 2019-07-20T05:21:53Z

java/memory/src/main/java/org/apache/arrow/memory/util/hash/ArrowBufHasher.java

can you document how consumers are expected to use this class?

Sure. Good suggestion.

emkornfield · 2019-07-20T05:22:33Z

java/memory/src/main/java/org/apache/arrow/memory/util/hash/ArrowBufHasher.java

should all of the abstract methods here be protected?

I think for some scenarios, the users just want to know the hash code of a small memory segment (like 1-byte, 4-byte, or 8-byte segments). So making the methods public can be helpful for them.

What do you think?

I'm not an expert on this, but I don't think hashes are necessarily considered "good" until they've been finalized, so it might lead to poorer hashes.

Sure. It depends on the specific situation.
In our experience, the approach in this implementation strikes a good balance between computational efficiency and uniformity. The finalization step is critical. Please see my comments above.

emkornfield · 2019-07-23T02:41:51Z

java/memory/src/main/java/org/apache/arrow/memory/util/hash/ArrowBufHasher.java

Suggested change

* A default light-weight implementation of this class is given in {@link DirectHasher}.However, the users can

* A default light-weight implementation of this class is given in {@link DirectHasher}. However, the users can

Given this explanation, I think the methods should be protected.

@emkornfield Sure. Agreed.
Let's make them protected for now. If we need them to be public in the future, we can have a discussion then.
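
The outcome of this thread, protected building blocks behind a public entry point that always finalizes, can be sketched as a template-method design. `BaseHasher`, `SimpleHasher`, and every method name below are hypothetical and only illustrate the visibility choice; they are not the signatures of `ArrowBufHasher`.

```java
public class ProtectedHooksSketch {
  /** Hypothetical base class: per-value hooks are protected, only the finalized hash is public. */
  public abstract static class BaseHasher {
    protected abstract int hashInt(int value);

    protected abstract int combine(int current, int next);

    protected abstract int finalizeHash(int hash);

    /** The only public entry point, so callers cannot obtain an unfinalized hash. */
    public final int hash(int[] values) {
      int h = 0;
      for (int v : values) {
        h = combine(h, hashInt(v));
      }
      return finalizeHash(h);
    }
  }

  /** One possible implementation: identity per value, 31 * h + v, MurmurHash3 fmix32 at the end. */
  public static final class SimpleHasher extends BaseHasher {
    @Override
    protected int hashInt(int value) {
      return value;
    }

    @Override
    protected int combine(int current, int next) {
      return 31 * current + next;
    }

    @Override
    protected int finalizeHash(int h) {
      h ^= h >>> 16;
      h *= 0x85ebca6b;
      h ^= h >>> 13;
      h *= 0xc2b2ae35;
      h ^= h >>> 16;
      return h;
    }
  }

  public static void main(String[] args) {
    BaseHasher hasher = new SimpleHasher();
    System.out.println(Integer.toHexString(hasher.hash(new int[] {1, 2, 3})));
  }
}
```

Subclasses can still plug in a different algorithm, while code outside the package only ever sees hashes that have gone through the finalization step.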

emkornfield · 2019-07-24T04:33:01Z

+1, LGTM.

…code for arbitrary memory segment

This issue adds a functionality to efficiently compute the hash code for a consecutive memory region. This functionality is important in practical scenarios because it helps:

* Avoid unnecessary memory copy.
* Avoid repeated conversions between Java objects & Arrow buffers.

Since the algorithm for calculating hash code has significant performance implications, we need to design an interface so that different algorithms can be easily introduced as plug-ins.

Author: liyafan82 <fan_li_ya@foxmail.com>

Closes apache#4844 from liyafan82/fly_0710_hash1 and squashes the following commits:

b1b6f78 <liyafan82> Provide functionality to efficiently compute hash code for arbitrary memory segment

liyafan82 mentioned this pull request Jul 11, 2019

ARROW-5902: [Java] Implement hash table and equals & hashCode API for dictionary encoding #4846

Closed

praveenbingo reviewed Jul 11, 2019

tianchen92 mentioned this pull request Jul 12, 2019

ARROW-5835: [Java] Support Dictionary Encoding for binary type #4792

Closed

fsaintjacques added the Component: Java label Jul 12, 2019

liyafan82 force-pushed the fly_0710_hash1 branch from 29c0e6b to 6ccaa03 Compare July 15, 2019 03:37

liyafan82 mentioned this pull request Jul 18, 2019

ARROW-5970: [Java] Provide pointer to Arrow buffer #4897

Closed

praveenbingo approved these changes Jul 18, 2019

praveenbingo reviewed Jul 18, 2019

liyafan82 force-pushed the fly_0710_hash1 branch from 6ccaa03 to 5e10ecc Compare July 18, 2019 12:18

tianchen92 reviewed Jul 19, 2019

liyafan82 closed this Jul 19, 2019

liyafan82 reopened this Jul 19, 2019

emkornfield reviewed Jul 20, 2019

liyafan82 force-pushed the fly_0710_hash1 branch from 5e10ecc to 8f8b6c5 Compare July 22, 2019 04:00

kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29

emkornfield reviewed Jul 23, 2019

[ARROW-5898][Java] Provide functionality to efficiently compute hash code for arbitrary memory segment (b1b6f78)

liyafan82 force-pushed the fly_0710_hash1 branch from 8f8b6c5 to b1b6f78 Compare July 23, 2019 03:20

emkornfield closed this in c27c29e Jul 24, 2019

asfimport mentioned this pull request Aug 1, 2019

[Java] Provide functionality to efficiently compute hash code for arbitrary memory segment #22311

Closed
