Skip to content

Conversation

@liyafan82
Copy link
Contributor

This issue adds a functionality to efficiently compute the hash code for a consecutive memory region. This functionality is important in practical scenarios because it helps:

*Avoid unnecessary memory copy.
*Avoid repeated conversions between Java objects & Arrow buffers.

Since the algorithm for calculating hash code has significant performance implications, we need to design an interface so that different algorithms can be easily introduces as plug-ins.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be replaced with PlatformDependent.getLong() interprets according to the platform endianess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@praveenbingo thanks a lot for your comments.

Here we force the algorithm to interpret the data in little endian, in a platform independent way. This is to make sure that, if the data is sent from a little endian machine to a big endian machine, its hash code remain unchanged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, sounds good.

@liyafan82
Copy link
Contributor Author

This functionality should be placed in the memory module to avoid potential cyclic dependency.

Copy link
Contributor

@praveenbingo praveenbingo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can avoid Integer.hashCode(..) since it returns the value as is anyways..

Copy link
Contributor Author

@liyafan82 liyafan82 Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks a lot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same can directly do (int)byteValue..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Since we hashCode & equals API already checked in, could we take a look at this PR? Something like finalizeHashCode is useful in actual applications.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tianchen92 I'm not sure I understand the question exactly? What were you thinking about doing with ti?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think algorithm like Murmur hashing will significantly reduce hash collision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, this code doesn't seem to implement full murmur hash though, I don't know how much just the finalization steps helps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is not full murmur hash. However, it is simple and effective:

  1. the hash code on int/byte/long are poor (they are not spread evenly in the universe), but fast enough
  2. the hash code finalized by murmur hash makes the result evenly distributed in the universe. This is costly, as there are a few integer multiplications. However, it is carried out only once.

Without the finalization step, the quality of the hash code can be poor. For example, suppose we have a integer array with small positive numbers (this is common in practice, and many cases can be converted to an equivalent one). Without the finalization step, the hash code would also be a small integer (though larger), so it is not evenly distributed in the universe, and may cause problems (e.g. in an open addressing hash table).

@liyafan82 liyafan82 closed this Jul 19, 2019
@liyafan82 liyafan82 reopened this Jul 19, 2019
@codecov-io
Copy link

codecov-io commented Jul 19, 2019

Codecov Report

Merging #4844 into master will increase coverage by 2.15%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4844      +/-   ##
==========================================
+ Coverage   87.46%   89.61%   +2.15%     
==========================================
  Files         994      660     -334     
  Lines      140389    96546   -43843     
  Branches     1418        0    -1418     
==========================================
- Hits       122785    86518   -36267     
+ Misses      17242    10028    -7214     
+ Partials      362        0     -362
Impacted Files Coverage Δ
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
go/arrow/array/bufferbuilder.go
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
go/arrow/ipc/file_reader.go
... and 324 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85fe336...8f8b6c5. Read the comment docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document how consumers are expected to use this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Good suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should all of the abstract methods here be protected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for some scenarios, the users just want to know the hash code of a small memory segment (like 1-byte, 4-byte, or 8-byte segments). So making the methods public can be helpful for them.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on this, but I don't think hashes are necessarily considered "good" until they've been finalized, so it might lead to poorer hashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. It depends on the specific situation.
According to our experience, the one provided in this implementation may be a good balance between computational efficiency and uniformness. The finalization step is critical. Please see my comments above.

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A default light-weight implementation of this class is given in {@link DirectHasher}.However, the users can
* A default light-weight implementation of this class is given in {@link DirectHasher}. However, the users can

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this explantation I think the method should be protected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield Sure. Aggreed.
Let's make them protected now. In the future, if we will need them to be public, we can have a discussion then.

@emkornfield
Copy link
Contributor

+1, LGTM.

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…code for arbitrary memory segment

This issue adds a functionality to efficiently compute  the hash code for a consecutive memory region. This functionality is important in practical scenarios because it helps:

*Avoid unnecessary memory copy.
*Avoid repeated conversions between Java objects & Arrow buffers.

Since the algorithm for calculating hash code has  significant performance implications, we need to design an interface so that different algorithms can be easily introduces as plug-ins.

Author: liyafan82 <fan_li_ya@foxmail.com>

Closes apache#4844 from liyafan82/fly_0710_hash1 and squashes the following commits:

b1b6f78 <liyafan82>  Provide functionality to efficiently compute hash code for arbitrary memory segment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants