Skip to content

Conversation

@tianchen92
Copy link
Contributor

Related to ARROW-5726.
Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values.

Set values by reflection is not efficient and code structure is not elegant such as

Method setter = null;
for (Class<?> c : Arrays.asList(int.class, long.class)) {
try {
setter = indices.getClass().getMethod("setSafe", int.class, c);
break;
} catch (NoSuchMethodException e) {
// ignore
}
}

Implement a common interface for int vectors to directly get set method and set values seems a good choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if you made this a long, then down-casted rather then boxing/unboxing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative might be to use a visitor pattern, but i'm not sure how efficient those can be made in java.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, the indices from Dictionary is explicitly int type, Object type is not nedded, fixed now. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be technically be long (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L260). Note "Int" is a type which can be 8, 16, 32 or 64 bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method sets the indices(which called dictionary encoded values in DictionaryEncoder) which is explicit int type, see https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java#L55

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, implementation detail, that seems like it might cause us problems at some point but not ncessary to fix now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add asserts or precondition.checkArguments to ensure the value doesn't overflow for all types with smaller bitwidths

@emkornfield
Copy link
Contributor

Small improvement on overflow, otherwise looks ok. could you rebase when you've made the change. (I think the CI errors will go away)

@tianchen92
Copy link
Contributor Author

Small improvement on overflow, otherwise looks ok. could you rebase when you've made the change. (I think the CI errors will go away)

Sure, fixed now.

@codecov-io
Copy link

codecov-io commented Jun 28, 2019

Codecov Report

Merging #4698 into master will increase coverage by 3.2%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4698      +/-   ##
==========================================
+ Coverage   86.25%   89.46%    +3.2%     
==========================================
  Files         890      659     -231     
  Lines      131546    95130   -36416     
==========================================
- Hits       113468    85107   -28361     
+ Misses      17726    10023    -7703     
+ Partials      352        0     -352
Impacted Files Coverage Δ
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
cpp/src/arrow/scalar-test.cc 100% <0%> (ø) ⬆️
r/src/recordbatch.cpp
r/R/Table.R
go/arrow/math/uint64_amd64.go
r/src/symbols.cpp
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
go/arrow/ipc/file_reader.go
r/src/arrow_types.h
... and 228 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27cf503...37ec9cc. Read the comment docs.


@Override
public void setEncodedValue(int index, int value) {
Preconditions.checkArgument(value <= Byte.MAX_VALUE, "value is overflow:" + value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Preconditions.checkArgument(value <= Byte.MAX_VALUE, "value is overflow:" + value);
Preconditions.checkArgument(value <= Byte.MAX_VALUE, "value is overflow: %s" , value);

Otherwise you pay the cost of string concatenation on every iteration (same applies below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


@Override
public void setEncodedValue(int index, int value) {
Preconditions.checkArgument(value <= 0xFFFF, "value is overflow:" + value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Preconditions.checkArgument(value <= 0xFFFF, "value is overflow:" + value);
Preconditions.checkArgument(value <= Character.MAX_VALUE, "value is overflow:" + value);

I think this should work here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Object value = vector.getObject(i);
if (value != null) { // if it's null leave it null
// note: this may fail if value was not included in the dictionary
Integer encoded = lookUps.get(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a follow-up to this PR you could implement a Map<Object, int> to avoid the unboxing costs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good suggestion!

@emkornfield
Copy link
Contributor

+1, LGTM. Thank you

}

@Override
public void setEncodedValue(int index, int value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not do this as long and we can have just a generic setWithPossibleTruncate(int index, long value)? Could be generally useful as this naming/sizing is very dictionary specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a good suggestion and I suggested a long as well. @tianchen92 do you want to open a JIRA/PR to refactor the name/size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

emkornfield pushed a commit that referenced this pull request Jul 3, 2019
…coder

Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814).

As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4765 from tianchen92/map and squashes the following commits:

38ee5a4 <tianchen> add UT
f620033 <tianchen> add javadoc and change package
10596ad <tianchen> fix style
86eb350 <tianchen> add interface
98f4c55 <tianchen> init
kou pushed a commit that referenced this pull request Jul 4, 2019
Related to [ARROW-5726](https://issues.apache.org/jira/browse/ARROW-5726).
Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values.

Set values by reflection is not efficient and code structure is not elegant such as

Method setter = null;
for (Class<?> c : Arrays.asList(int.class, long.class)) {
try {
setter = indices.getClass().getMethod("setSafe", int.class, c);
break;
} catch (NoSuchMethodException e) {
// ignore
}
}

Implement a common interface for int vectors to directly get set method and set values seems a good choice.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4698 from tianchen92/ARROW-5726 and squashes the following commits:

37ec9cc <tianchen> resolve comments
0219282 <tianchen> fix overflow
b97cbef <tianchen> fix
a6f351f <tianchen> resolve comments
5e58c52 <tianchen> Implement a common interface for int vectors
kou pushed a commit that referenced this pull request Jul 4, 2019
…coder

Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814).

As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4765 from tianchen92/map and squashes the following commits:

38ee5a4 <tianchen> add UT
f620033 <tianchen> add javadoc and change package
10596ad <tianchen> fix style
86eb350 <tianchen> add interface
98f4c55 <tianchen> init
kszucs pushed a commit that referenced this pull request Jul 22, 2019
Related to [ARROW-5726](https://issues.apache.org/jira/browse/ARROW-5726).
Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values.

Set values by reflection is not efficient and code structure is not elegant such as

Method setter = null;
for (Class<?> c : Arrays.asList(int.class, long.class)) {
try {
setter = indices.getClass().getMethod("setSafe", int.class, c);
break;
} catch (NoSuchMethodException e) {
// ignore
}
}

Implement a common interface for int vectors to directly get set method and set values seems a good choice.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4698 from tianchen92/ARROW-5726 and squashes the following commits:

37ec9cc <tianchen> resolve comments
0219282 <tianchen> fix overflow
b97cbef <tianchen> fix
a6f351f <tianchen> resolve comments
5e58c52 <tianchen> Implement a common interface for int vectors
kszucs pushed a commit that referenced this pull request Jul 22, 2019
…coder

Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814).

As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op

Author: tianchen <niki.lj@alibaba-inc.com>

Closes #4765 from tianchen92/map and squashes the following commits:

38ee5a4 <tianchen> add UT
f620033 <tianchen> add javadoc and change package
10596ad <tianchen> fix style
86eb350 <tianchen> add interface
98f4c55 <tianchen> init
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Related to [ARROW-5726](https://issues.apache.org/jira/browse/ARROW-5726).
Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values.

Set values by reflection is not efficient and code structure is not elegant such as

Method setter = null;
for (Class<?> c : Arrays.asList(int.class, long.class)) {
try {
setter = indices.getClass().getMethod("setSafe", int.class, c);
break;
} catch (NoSuchMethodException e) {
// ignore
}
}

Implement a common interface for int vectors to directly get set method and set values seems a good choice.

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4698 from tianchen92/ARROW-5726 and squashes the following commits:

37ec9cc <tianchen> resolve comments
0219282 <tianchen> fix overflow
b97cbef <tianchen> fix
a6f351f <tianchen> resolve comments
5e58c52 <tianchen> Implement a common interface for int vectors
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…coder

Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814).

As a follow-up of apache#4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations.

Benchmark:
DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op
DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op

Author: tianchen <niki.lj@alibaba-inc.com>

Closes apache#4765 from tianchen92/map and squashes the following commits:

38ee5a4 <tianchen> add UT
f620033 <tianchen> add javadoc and change package
10596ad <tianchen> fix style
86eb350 <tianchen> add interface
98f4c55 <tianchen> init
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants