-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5726: [Java] Implement a common interface for int vectors #4698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if you made this a long, then down-casted rather then boxing/unboxing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative might be to use a visitor pattern, but i'm not sure how efficient those can be made in java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, the indices from Dictionary is explicitly int type, Object type is not nedded, fixed now. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be technically be long (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L260). Note "Int" is a type which can be 8, 16, 32 or 64 bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method sets the indices(which called dictionary encoded values in DictionaryEncoder) which is explicit int type, see https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java#L55
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, implementation detail, that seems like it might cause us problems at some point but not ncessary to fix now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add asserts or precondition.checkArguments to ensure the value doesn't overflow for all types with smaller bitwidths
|
Small improvement on overflow, otherwise looks ok. could you rebase when you've made the change. (I think the CI errors will go away) |
Sure, fixed now. |
Codecov Report
@@ Coverage Diff @@
## master #4698 +/- ##
==========================================
+ Coverage 86.25% 89.46% +3.2%
==========================================
Files 890 659 -231
Lines 131546 95130 -36416
==========================================
- Hits 113468 85107 -28361
+ Misses 17726 10023 -7703
+ Partials 352 0 -352
Continue to review full report at Codecov.
|
|
|
||
| @Override | ||
| public void setEncodedValue(int index, int value) { | ||
| Preconditions.checkArgument(value <= Byte.MAX_VALUE, "value is overflow:" + value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Preconditions.checkArgument(value <= Byte.MAX_VALUE, "value is overflow:" + value); | |
| Preconditions.checkArgument(value <= Byte.MAX_VALUE, "value is overflow: %s" , value); |
Otherwise you pay the cost of string concatenation on every iteration (same applies below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
|
|
||
| @Override | ||
| public void setEncodedValue(int index, int value) { | ||
| Preconditions.checkArgument(value <= 0xFFFF, "value is overflow:" + value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Preconditions.checkArgument(value <= 0xFFFF, "value is overflow:" + value); | |
| Preconditions.checkArgument(value <= Character.MAX_VALUE, "value is overflow:" + value); | |
I think this should work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
| Object value = vector.getObject(i); | ||
| if (value != null) { // if it's null leave it null | ||
| // note: this may fail if value was not included in the dictionary | ||
| Integer encoded = lookUps.get(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a follow-up to this PR you could implement a Map<Object, int> to avoid the unboxing costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good suggestion!
|
+1, LGTM. Thank you |
| } | ||
|
|
||
| @Override | ||
| public void setEncodedValue(int index, int value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not do this as long and we can have just a generic setWithPossibleTruncate(int index, long value)? Could be generally useful as this naming/sizing is very dictionary specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a good suggestion and I suggested a long as well. @tianchen92 do you want to open a JIRA/PR to refactor the name/size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes #4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
Related to [ARROW-5726](https://issues.apache.org/jira/browse/ARROW-5726). Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values. Set values by reflection is not efficient and code structure is not elegant such as Method setter = null; for (Class<?> c : Arrays.asList(int.class, long.class)) { try { setter = indices.getClass().getMethod("setSafe", int.class, c); break; } catch (NoSuchMethodException e) { // ignore } } Implement a common interface for int vectors to directly get set method and set values seems a good choice. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4698 from tianchen92/ARROW-5726 and squashes the following commits: 37ec9cc <tianchen> resolve comments 0219282 <tianchen> fix overflow b97cbef <tianchen> fix a6f351f <tianchen> resolve comments 5e58c52 <tianchen> Implement a common interface for int vectors
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes #4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
Related to [ARROW-5726](https://issues.apache.org/jira/browse/ARROW-5726). Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values. Set values by reflection is not efficient and code structure is not elegant such as Method setter = null; for (Class<?> c : Arrays.asList(int.class, long.class)) { try { setter = indices.getClass().getMethod("setSafe", int.class, c); break; } catch (NoSuchMethodException e) { // ignore } } Implement a common interface for int vectors to directly get set method and set values seems a good choice. Author: tianchen <niki.lj@alibaba-inc.com> Closes #4698 from tianchen92/ARROW-5726 and squashes the following commits: 37ec9cc <tianchen> resolve comments 0219282 <tianchen> fix overflow b97cbef <tianchen> fix a6f351f <tianchen> resolve comments 5e58c52 <tianchen> Implement a common interface for int vectors
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of #4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes #4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
Related to [ARROW-5726](https://issues.apache.org/jira/browse/ARROW-5726). Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values. Set values by reflection is not efficient and code structure is not elegant such as Method setter = null; for (Class<?> c : Arrays.asList(int.class, long.class)) { try { setter = indices.getClass().getMethod("setSafe", int.class, c); break; } catch (NoSuchMethodException e) { // ignore } } Implement a common interface for int vectors to directly get set method and set values seems a good choice. Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4698 from tianchen92/ARROW-5726 and squashes the following commits: 37ec9cc <tianchen> resolve comments 0219282 <tianchen> fix overflow b97cbef <tianchen> fix a6f351f <tianchen> resolve comments 5e58c52 <tianchen> Implement a common interface for int vectors
…coder Related to [ARROW-5814](https://issues.apache.org/jira/browse/ARROW-5814). As a follow-up of apache#4698. Implement a Map<Object, int> for DictionaryEncoder to reduce boxing/unboxing operations. Benchmark: DictionaryEncodeHashMapBenchmarks.testHashMap: avgt 5 31151.345 ± 1661.878 ns/op DictionaryEncodeHashMapBenchmarks.testDictionaryEncodeHashMap: avgt 5 15549.902 ± 771.647 ns/op Author: tianchen <niki.lj@alibaba-inc.com> Closes apache#4765 from tianchen92/map and squashes the following commits: 38ee5a4 <tianchen> add UT f620033 <tianchen> add javadoc and change package 10596ad <tianchen> fix style 86eb350 <tianchen> add interface 98f4c55 <tianchen> init
Related to ARROW-5726.
Now in DictionaryEncoder#encode it use reflection to pull out the set method and then set values.
Set values by reflection is not efficient and code structure is not elegant such as
Method setter = null;
for (Class<?> c : Arrays.asList(int.class, long.class)) {
try {
setter = indices.getClass().getMethod("setSafe", int.class, c);
break;
} catch (NoSuchMethodException e) {
// ignore
}
}
Implement a common interface for int vectors to directly get set method and set values seems a good choice.