ARROW-5898: [Java] Provide functionality to efficiently compute hash code for arbitrary memory segment #4844
This can be replaced with PlatformDependent.getLong(), which interprets the bytes according to the platform endianness.
@praveenbingo thanks a lot for your comments.
Here we force the algorithm to interpret the data as little endian, in a platform-independent way. This ensures that, if the data is sent from a little-endian machine to a big-endian machine, its hash code remains unchanged.
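As a minimal sketch of the idea in this reply (not the code under review; the class and method names are hypothetical), a long can be assembled from bytes in little-endian order with shifts, so the result does not depend on the native byte order of the machine. The `ByteBuffer` call is used only as a cross-check.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianRead {
    /** Assembles 8 bytes starting at offset into a long, least significant byte first. */
    static long getLongLittleEndian(byte[] data, int offset) {
        long value = 0;
        // Walk from the most significant byte (highest address) down to the lowest.
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (data[offset + i] & 0xFFL);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] data = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
        long manual = getLongLittleEndian(data, 0);
        // Cross-check against an explicitly little-endian ByteBuffer view.
        long viaBuffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
        System.out.println(Long.toHexString(manual)); // 807060504030201
        System.out.println(manual == viaBuffer);      // true
    }
}
```

Because the byte order is fixed in the code rather than taken from the platform, the same bytes produce the same value, and therefore the same hash input, on any machine.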
I see, sounds good.
29c0e6b to 6ccaa03
This functionality should be placed in the memory module to avoid potential cyclic dependency.
praveenbingo left a comment
Lgtm.
We can avoid Integer.hashCode(..) here, since it returns the value as is anyway.
Good catch! Thanks a lot.
Same here: we can cast directly with (int) byteValue.
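The two review remarks above rest on standard JDK behaviour: `Integer.hashCode(int)` returns its argument unchanged, and `Byte.hashCode(byte)` returns the byte widened to an int. The sketch below only illustrates that equivalence; the helper names are hypothetical and the exact code being reviewed is not shown in this thread.

```java
public class PrimitiveHashIdentity {
    /** Equivalent to Integer.hashCode(value): the value is its own hash code. */
    static int hashInt(int value) {
        return value;
    }

    /** Equivalent to Byte.hashCode(value): a plain widening cast. */
    static int hashByte(byte value) {
        return (int) value;
    }

    public static void main(String[] args) {
        int intValue = 123_456;
        byte byteValue = (byte) -7;
        System.out.println(hashInt(intValue) == Integer.hashCode(intValue)); // true
        System.out.println(hashByte(byteValue) == Byte.hashCode(byteValue)); // true
    }
}
```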
Revised. Thank you.
6ccaa03 to 5e10ecc
@emkornfield Since the hashCode & equals API is already checked in, could we take a look at this PR? Something like finalizeHashCode is useful in actual applications.
@tianchen92 I'm not sure I understand the question exactly. What were you thinking about doing with it?
I think an algorithm like Murmur hashing will significantly reduce hash collisions.
It could. This code doesn't seem to implement the full Murmur hash, though, and I don't know how much the finalization step alone helps.
You are right, this is not the full Murmur hash. However, it is simple and effective:
- the hash codes for int/byte/long values are poor (they are not spread evenly over the hash space), but fast to compute
- the Murmur finalization step spreads the result evenly over the hash space. This is costly, as it involves a few integer multiplications. However, it is carried out only once.
Without the finalization step, the quality of the hash code can be poor. For example, suppose we have an integer array of small positive numbers (this is common in practice, and many cases can be converted to an equivalent one). Without the finalization step, the hash code would also be a small integer (though larger), so it is not evenly distributed over the hash space, and may cause problems (e.g. in an open addressing hash table).
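To make the argument above concrete, here is a small sketch using the well-known MurmurHash3 32-bit finalizer (`fmix32`). Whether the `finalizeHashCode` in this PR uses exactly these constants is an assumption; the class name, the `bucket` helper, and the table size are hypothetical and chosen only for illustration.

```java
public class FinalizeDemo {
    /** MurmurHash3 32-bit finalizer: xor-shifts interleaved with two multiplications. */
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Maps a hash code to a slot of a power-of-two sized table. */
    static int bucket(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        int tableSize = 1024;
        // Small keys used as their own hash land in adjacent slots;
        // after finalization the slots are scattered over the table.
        for (int key = 1; key <= 8; key++) {
            System.out.printf("key=%d raw bucket=%d finalized bucket=%d%n",
                key, bucket(key, tableSize), bucket(fmix32(key), tableSize));
        }
    }
}
```

Each step of the finalizer is invertible, so distinct inputs never collide after finalization; the cost is the two multiplications, paid once per hash code rather than once per chunk of input.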
Codecov Report
@@            Coverage Diff             @@
##           master    #4844      +/-   ##
==========================================
+ Coverage   87.46%   89.61%   +2.15%
==========================================
  Files         994      660     -334
  Lines      140389    96546   -43843
  Branches     1418        0    -1418
==========================================
- Hits       122785    86518   -36267
+ Misses      17242    10028    -7214
+ Partials      362        0     -362
Continue to review full report at Codecov.
can you document how consumers are expected to use this class?
Sure. Good suggestion.
should all of the abstract methods here be protected?
I think for some scenarios, the users just want to know the hash code of a small memory segment (like 1-byte, 4-byte, or 8-byte segments). So making the methods public can be helpful for them.
What do you think?
I'm not an expert on this, but I don't think hashes are necessarily considered "good" until they've been finalized, so it might lead to poorer hashes.
Sure. It depends on the specific situation.
In our experience, the implementation provided here may strike a good balance between computational efficiency and uniformity. The finalization step is critical. Please see my comments above.
5e10ecc to 8f8b6c5
ed180da to 85fe336
- * A default light-weight implementation of this class is given in {@link DirectHasher}.However, the users can
+ * A default light-weight implementation of this class is given in {@link DirectHasher}. However, the users can
Given this explanation, I think the method should be protected.
@emkornfield Sure. Agreed.
Let's make them protected now. In the future, if we need them to be public, we can have a discussion then.
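The outcome of this thread can be sketched as a template-method design: the per-chunk hashing steps are protected hooks, and the only public entry point always combines and finalizes, so callers never see an unfinalized hash. This is a hypothetical illustration, not the class from this PR; every name below, the combine rule, and the use of `ByteBuffer` in place of an Arrow buffer are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ProtectedHooksDemo {
    /** Hypothetical base class: the public API is one method, the hooks are protected. */
    abstract static class SegmentHasher {
        public final int hash(ByteBuffer buf, int offset, int length) {
            // Read through a little-endian view so results do not depend on the platform.
            ByteBuffer view = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
            int hash = 0;
            int index = offset;
            int end = offset + length;
            while (index + 8 <= end) {          // 8-byte chunks first
                hash = combine(hash, hashLong(view.getLong(index)));
                index += 8;
            }
            while (index < end) {               // then the remaining bytes
                hash = combine(hash, hashByte(view.get(index)));
                index++;
            }
            return finalizeHash(hash);          // finalization is never skipped
        }

        protected abstract int hashLong(long value);
        protected abstract int hashByte(byte value);
        protected abstract int combine(int current, int next);
        protected abstract int finalizeHash(int hash);
    }

    /** Hypothetical light-weight implementation with a Murmur-style finalizer. */
    static class SimpleSegmentHasher extends SegmentHasher {
        @Override protected int hashLong(long value) { return (int) (value ^ (value >>> 32)); }
        @Override protected int hashByte(byte value) { return (int) value; }
        @Override protected int combine(int current, int next) { return current * 37 + next; }
        @Override protected int finalizeHash(int h) {
            h ^= h >>> 16;
            h *= 0x85ebca6b;
            h ^= h >>> 13;
            h *= 0xc2b2ae35;
            h ^= h >>> 16;
            return h;
        }
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11});
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        for (int i = 0; i < 11; i++) {
            direct.put(3 + i, (byte) (i + 1));  // same bytes, different offset
        }
        SegmentHasher hasher = new SimpleSegmentHasher();
        System.out.println(hasher.hash(heap, 0, 11) == hasher.hash(direct, 3, 11)); // true
    }
}
```

Keeping the hooks protected leaves room to make them public later without breaking callers, whereas narrowing a public method afterwards would be an incompatible change.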
…code for arbitrary memory segment
8f8b6c5 to b1b6f78
+1, LGTM.
…code for arbitrary memory segment
This issue adds functionality to efficiently compute the hash code for a contiguous memory region. This functionality is important in practical scenarios because it helps:
* Avoid unnecessary memory copies.
* Avoid repeated conversions between Java objects & Arrow buffers.
Since the algorithm for calculating the hash code has significant performance implications, we need to design an interface so that different algorithms can be easily introduced as plug-ins.
Author: liyafan82 <fan_li_ya@foxmail.com>
Closes apache#4844 from liyafan82/fly_0710_hash1 and squashes the following commits:
b1b6f78 <liyafan82> Provide functionality to efficiently compute hash code for arbitrary memory segment
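The plug-in idea in the description can be sketched as a small strategy interface: consumers depend only on the interface, and the hashing algorithm is swapped without touching them. The sketch below is an illustration under assumptions, not the interface added by this PR: `BufferHasher`, `bucketFor`, and both implementations are hypothetical, and a `java.nio.ByteBuffer` stands in for an Arrow buffer. Both hashers read the region in place, without copying it into a temporary array.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class PluggableHasherDemo {
    /** Hypothetical plug-in point: hashes a region of a buffer without copying it. */
    interface BufferHasher {
        int hash(ByteBuffer buf, int offset, int length);
    }

    /** Cheap polynomial hash, one byte at a time, using absolute reads. */
    static final BufferHasher POLYNOMIAL = (buf, offset, length) -> {
        int h = 0;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + buf.get(i);
        }
        return h;
    };

    /** Heavier alternative backed by the JDK's CRC-32 implementation. */
    static final BufferHasher CRC = (buf, offset, length) -> {
        ByteBuffer view = buf.duplicate();  // shares content, independent position and limit
        view.limit(offset + length);
        view.position(offset);
        CRC32 crc = new CRC32();
        crc.update(view);
        return (int) crc.getValue();
    };

    /** A consumer that only knows the interface, not the algorithm behind it. */
    static int bucketFor(BufferHasher hasher, ByteBuffer buf, int offset, int length, int bucketCount) {
        return Math.floorMod(hasher.hash(buf, offset, length), bucketCount);
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(32);
        for (int i = 0; i < 32; i++) {
            direct.put(i, (byte) i);
        }
        System.out.println(bucketFor(POLYNOMIAL, direct, 4, 16, 64));
        System.out.println(bucketFor(CRC, direct, 4, 16, 64));
    }
}
```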