-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5814: [Java] Implement a <Object, int> HashMap for DictionaryEncoder #4765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4765 +/- ##
==========================================
+ Coverage 86.44% 89.47% +3.03%
==========================================
Files 992 659 -333
Lines 138020 95218 -42802
Branches 1418 0 -1418
==========================================
- Hits 119307 85196 -34111
+ Misses 18351 10022 -8329
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good please, add javadoc and also please clarify why this is in memory. I think vector might be more appropriate but could be convinced otherwise. Another possible extension that can be done in a follow-up is intead of make the dictionary from Object->Int you can make it only store hash and then wrap the dictionary array, and do comparisons directly between array contents.
| */ | ||
| public interface ObjectIntMap<K> { | ||
|
|
||
| int put(K key, int value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc on the methods please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.arrow.memory; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you choose this package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason, removed to vector module already.
| static class Entry<K> { | ||
| final K key; | ||
| int value; | ||
| Entry<K> next; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open addressing can perform better under some circumstances I think but this is a good start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I have tested ObjectIntMap in eclipse-collections which use open addressing but seems performance is worse than HashMap, BWT, some attempts can be made as follows. Thanks!
|
Sorry, also some unit tests would be nice |
Thanks a lot for your reminder, fixed. |
|
+1, LGTM. I assume there will be a follow-up PR to incorporate this into the actual dictionary logic. |
@emkornfield |
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes #4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes #4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of apache#4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
Related to ARROW-5814.
As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.
Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op