@tianchen92
Contributor

Related to ARROW-5814.

As a follow-up to #4698, implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op
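
For readers unfamiliar with the idea, here is a minimal sketch of a chained hash map that keeps its values as primitive `int`, so `put` and `get` never box. It is not the code from this PR: the class name `ChainedObjectIntMap`, the `NO_VALUE` sentinel, the initial capacity, and the load factor are all assumptions made for illustration.

```java
import java.util.Objects;

/** Hypothetical sketch of a chained hash map from object keys to primitive int values. */
final class ChainedObjectIntMap<K> {
  /** Returned when a key is absent (an assumption of this sketch). */
  static final int NO_VALUE = -1;

  /** One node in a bucket's singly linked chain; the value is a primitive int. */
  private static final class Entry<K> {
    final K key;
    int value;
    Entry<K> next;

    Entry(K key, int value, Entry<K> next) {
      this.key = key;
      this.value = value;
      this.next = next;
    }
  }

  private Entry<K>[] table;
  private int size;

  @SuppressWarnings("unchecked")
  ChainedObjectIntMap() {
    table = (Entry<K>[]) new Entry[16]; // capacity stays a power of two
  }

  /** Stores the mapping and returns the previous value, or NO_VALUE if there was none. */
  int put(K key, int value) {
    int idx = indexFor(key, table.length);
    for (Entry<K> e = table[idx]; e != null; e = e.next) {
      if (Objects.equals(e.key, key)) {
        int old = e.value;
        e.value = value;
        return old;
      }
    }
    table[idx] = new Entry<>(key, value, table[idx]);
    if (++size > table.length * 3 / 4) {
      resize();
    }
    return NO_VALUE;
  }

  /** Returns the value mapped to the key, or NO_VALUE if the key is absent. */
  int get(K key) {
    for (Entry<K> e = table[indexFor(key, table.length)]; e != null; e = e.next) {
      if (Objects.equals(e.key, key)) {
        return e.value;
      }
    }
    return NO_VALUE;
  }

  private static int indexFor(Object key, int length) {
    int h = Objects.hashCode(key);
    return (h ^ (h >>> 16)) & (length - 1);
  }

  /** Doubles the table and relinks every existing entry into its new bucket. */
  @SuppressWarnings("unchecked")
  private void resize() {
    Entry<K>[] bigger = (Entry<K>[]) new Entry[table.length * 2];
    for (Entry<K> head : table) {
      for (Entry<K> e = head; e != null; ) {
        Entry<K> next = e.next;
        int idx = indexFor(e.key, bigger.length);
        e.next = bigger[idx];
        bigger[idx] = e;
        e = next;
      }
    }
    table = bigger;
  }

  public static void main(String[] args) {
    ChainedObjectIntMap<String> map = new ChainedObjectIntMap<>();
    map.put("a", 0);
    map.put("b", 1);
    System.out.println(map.get("a"));   // prints 0
    System.out.println(map.get("zzz")); // prints -1
  }
}
```

With `java.util.HashMap<Object, Integer>`, each `put` and `get` goes through `Integer` boxing and unboxing; keeping `value` as a plain `int` field avoids that.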

@codecov-io

codecov-io commented Jul 1, 2019

Codecov Report

Merging #4765 into master will increase coverage by 3.03%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4765      +/-   ##
==========================================
+ Coverage   86.44%   89.47%   +3.03%     
==========================================
  Files         992      659     -333     
  Lines      138020    95218   -42802     
  Branches     1418        0    -1418     
==========================================
- Hits       119307    85196   -34111     
+ Misses      18351    10022    -8329     
+ Partials      362        0     -362
| Impacted Files | Coverage Δ | |
|---|---|---|
| cpp/src/gandiva/projector.cc | `90.75% <0%> (-5.12%)` | ⬇️ |
| cpp/src/gandiva/gdv_function_stubs.cc | `94.8% <0%> (-3.48%)` | ⬇️ |
| cpp/src/arrow/io/readahead.cc | `95.91% <0%> (-1.03%)` | ⬇️ |
| cpp/src/gandiva/expr_validator.cc | `100% <0%> (ø)` | ⬆️ |
| cpp/src/gandiva/annotator_test.cc | `100% <0%> (ø)` | ⬆️ |
| cpp/src/gandiva/field_descriptor.h | `100% <0%> (ø)` | ⬆️ |
| cpp/src/gandiva/annotator.cc | `100% <0%> (ø)` | ⬆️ |
| cpp/src/arrow/csv/parser-test.cc | `100% <0%> (ø)` | ⬆️ |
| r/src/recordbatch.cpp | | |
| r/R/Table.R | | |
| ... and 338 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 91b4cbc...38ee5a4.

Contributor

@emkornfield emkornfield left a comment

Mostly looks good. Please add javadoc, and please clarify why this is in the memory module. I think vector might be more appropriate, but I could be convinced otherwise. Another possible extension that could be done in a follow-up: instead of making the dictionary map Object -> int, you could make it store only the hash, wrap the dictionary array, and do comparisons directly between array contents.

*/
public interface ObjectIntMap<K> {

int put(K key, int value);
Contributor

javadoc on the methods please.

Contributor Author

fixed.

* limitations under the License.
*/

package org.apache.arrow.memory;
Contributor

why did you choose this package?

Contributor Author

No particular reason. I have already moved it to the vector module.

static class Entry<K> {
final K key;
int value;
Entry<K> next;
Contributor

Open addressing can perform better under some circumstances, I think, but this is a good start.

Contributor Author

Agreed. I tested the ObjectIntMap in eclipse-collections, which uses open addressing, but its performance seemed worse than HashMap. BTW, some further attempts can be made as follow-ups. Thanks!
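
To make the chaining versus open addressing distinction concrete, here is a minimal sketch of an open-addressing variant with linear probing. It is neither the eclipse-collections implementation nor code from this PR: the class name `OpenAddressingObjectIntMap`, the `NO_VALUE` sentinel, and the 0.5 load factor are assumptions, and the sketch supports neither removal nor null keys.

```java
/** Hypothetical sketch of an Object-to-int map using open addressing with linear probing. */
final class OpenAddressingObjectIntMap<K> {
  /** Returned when a key is absent (an assumption of this sketch). */
  static final int NO_VALUE = -1;

  // Keys and values live in parallel arrays; a null key slot means "empty".
  private Object[] keys = new Object[16];
  private int[] values = new int[16];
  private int size;

  /** Stores the mapping and returns the previous value, or NO_VALUE if there was none. */
  int put(K key, int value) {
    if (key == null) {
      throw new NullPointerException("null keys are not supported in this sketch");
    }
    int mask = keys.length - 1;
    int i = spread(key.hashCode()) & mask;
    while (keys[i] != null) {
      if (keys[i].equals(key)) {
        int old = values[i];
        values[i] = value;
        return old;
      }
      i = (i + 1) & mask; // linear probing: try the next slot
    }
    keys[i] = key;
    values[i] = value;
    if (++size * 2 > keys.length) {
      grow(); // keep the table at most half full so probes always hit an empty slot
    }
    return NO_VALUE;
  }

  /** Returns the value mapped to the key, or NO_VALUE if the key is absent. */
  int get(Object key) {
    if (key == null) {
      return NO_VALUE;
    }
    int mask = keys.length - 1;
    int i = spread(key.hashCode()) & mask;
    while (keys[i] != null) {
      if (keys[i].equals(key)) {
        return values[i];
      }
      i = (i + 1) & mask;
    }
    return NO_VALUE;
  }

  /** Doubles both arrays and re-inserts every occupied slot. */
  private void grow() {
    Object[] oldKeys = keys;
    int[] oldValues = values;
    keys = new Object[oldKeys.length * 2];
    values = new int[oldKeys.length * 2];
    int mask = keys.length - 1;
    for (int j = 0; j < oldKeys.length; j++) {
      if (oldKeys[j] == null) {
        continue;
      }
      int i = spread(oldKeys[j].hashCode()) & mask;
      while (keys[i] != null) {
        i = (i + 1) & mask;
      }
      keys[i] = oldKeys[j];
      values[i] = oldValues[j];
    }
  }

  private static int spread(int h) {
    return h ^ (h >>> 16);
  }

  public static void main(String[] args) {
    OpenAddressingObjectIntMap<String> map = new OpenAddressingObjectIntMap<>();
    map.put("a", 0);
    map.put("b", 1);
    System.out.println(map.get("b"));   // prints 1
    System.out.println(map.get("zzz")); // prints -1
  }
}
```

This layout allocates no per-entry node, which can help cache locality, but clustering under linear probing and the lower load factor can offset that gain. Which design is faster would need measuring, as the comments above suggest.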

@emkornfield
Contributor

Sorry, also some unit tests would be nice.

@tianchen92
Contributor Author

> Sorry, also some unit tests would be nice

Thanks a lot for the reminder. Fixed.

@emkornfield
Contributor

+1, LGTM. I assume there will be a follow-up PR to incorporate this into the actual dictionary logic.

@tianchen92
Contributor Author

tianchen92 commented Jul 3, 2019

> Another possible extension that could be done in a follow-up: instead of making the dictionary map Object -> int, you could make it store only the hash, wrap the dictionary array, and do comparisons directly between array contents.

@emkornfield
I don't quite understand what you mean here. Could you give some more information? Thanks!
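
One possible reading of the quoted suggestion, sketched under stated assumptions: the table stores only a hash and an index per slot, never the key objects, and resolves a lookup by comparing the candidate against the contents of the wrapped dictionary at that index. A plain `byte[][]` stands in for the dictionary array here and no Arrow API is used. The class name `DictionaryIndexTable` and its methods are hypothetical, and this is not necessarily what the reviewer had in mind.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/**
 * Hypothetical sketch: a hash table that stores only (hash, index) pairs and
 * compares candidates directly against the contents of a wrapped dictionary.
 */
final class DictionaryIndexTable {
  private static final int EMPTY = -1;

  private final byte[][] dictionary; // stand-in for the dictionary array
  private final int[] hashes;        // cached hash per occupied slot
  private final int[] indices;       // dictionary index per slot, or EMPTY
  private final int mask;

  DictionaryIndexTable(byte[][] dictionary) {
    this.dictionary = dictionary;
    int capacity = 4;
    while (capacity < dictionary.length * 2) {
      capacity <<= 1; // power of two, at most half full
    }
    hashes = new int[capacity];
    indices = new int[capacity];
    mask = capacity - 1;
    Arrays.fill(indices, EMPTY);
    for (int i = 0; i < dictionary.length; i++) {
      insert(i);
    }
  }

  private void insert(int dictIndex) {
    int h = Arrays.hashCode(dictionary[dictIndex]);
    int slot = spread(h) & mask;
    while (indices[slot] != EMPTY) {
      if (hashes[slot] == h
          && Arrays.equals(dictionary[indices[slot]], dictionary[dictIndex])) {
        return; // duplicate dictionary value: keep the first index
      }
      slot = (slot + 1) & mask;
    }
    hashes[slot] = h;
    indices[slot] = dictIndex;
  }

  /** Returns the dictionary index of the value, or -1 if it is not in the dictionary. */
  int indexOf(byte[] value) {
    int h = Arrays.hashCode(value);
    int slot = spread(h) & mask;
    while (indices[slot] != EMPTY) {
      // Compare the cheap cached hash first, then the dictionary contents.
      if (hashes[slot] == h && Arrays.equals(dictionary[indices[slot]], value)) {
        return indices[slot];
      }
      slot = (slot + 1) & mask;
    }
    return -1;
  }

  private static int spread(int h) {
    return h ^ (h >>> 16);
  }

  private static byte[] bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[][] dict = {bytes("foo"), bytes("bar"), bytes("baz")};
    DictionaryIndexTable table = new DictionaryIndexTable(dict);
    System.out.println(table.indexOf(bytes("bar"))); // prints 1
    System.out.println(table.indexOf(bytes("qux"))); // prints -1
  }
}
```

Because the keys stay in the dictionary itself, the table holds only two `int` arrays, and neither keys nor values are boxed.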

kou pushed a commit that referenced this pull request Jul 4, 2019
…coder

Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814).

As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4765 from tianchen92/map and squashes the following commits:

38ee5a4 <tianchen> add UT
f620033 <tianchen> add javadoc and change package
10596ad <tianchen> fix style
86eb350 <tianchen> add interface
98f4c55 <tianchen> init
kszucs pushed a commit that referenced this pull request Jul 22, 2019
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
3 participants