-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6210: [Java] remove equals API from ValueVector #5065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @pravindra |
| @@ -135,7 +139,8 @@ public int getIndex(int indexInArray, ValueVector toEncode) { | |||
| for (DictionaryHashTable.Entry e = table[index]; e != null ; e = e.next) { | |||
| if ((e.hash == hash)) { | |||
| int dictIndex = e.index; | |||
| if (dictionary.equals(dictIndex, toEncode, indexInArray)) { | |||
| visitor.reset(indexInArray, dictIndex, 1); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This saves the repeated allocations. Can we also save the repeated type checks ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you careful check, right, here checks type for each value compare.
The dictionary vector is bind with hashTable but the toEncode vector is not(that is to see a same dictionary could be used for multiple toEncode vectors), so it's hard to check which is the first value of toEncode to compare. But as a workaround, I would like to disable all type check here as I think they will always have same type, or we could make a check outside hash table. What do you think?
| @@ -135,7 +139,8 @@ public int getIndex(int indexInArray, ValueVector toEncode) { | |||
| for (DictionaryHashTable.Entry e = table[index]; e != null ; e = e.next) { | |||
| if ((e.hash == hash)) { | |||
| int dictIndex = e.index; | |||
| if (dictionary.equals(dictIndex, toEncode, indexInArray)) { | |||
| visitor.reset(indexInArray, dictIndex, 1); | |||
| if (toEncode.accept(visitor)) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we hide the fact that the impl uses a visitor from the APIs ? Maybe,
matcher = new RangeMatcher(srcVector, dstVector);
if (matcher.match(srcIndex, dstIndex, 1)) {
return dictIndex;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I add a ValueMatcher to do this. As the same reason mentioned above, this matcher only holds dictionary vector instead both.
|
Should the same be done for hash? |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tianchen92 . I think the ValueMather class you add is not necessary, and also add a test for the new flag in RangeEqualsVisitor.
java/vector/src/main/java/org/apache/arrow/vector/compare/RangeEqualsVisitor.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * Reset start indices and length for reuse purpose. | ||
| */ | ||
| public void reset(int leftStart, int rightStart, int length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to just create a new RangeEqualsVisitor than add a reset method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
java/vector/src/main/java/org/apache/arrow/vector/compare/RangeEqualsVisitor.java
Outdated
Show resolved
Hide resolved
| @@ -135,7 +138,7 @@ public int getIndex(int indexInArray, ValueVector toEncode) { | |||
| for (DictionaryHashTable.Entry e = table[index]; e != null ; e = e.next) { | |||
| if ((e.hash == hash)) { | |||
| int dictIndex = e.index; | |||
| if (dictionary.equals(dictIndex, toEncode, indexInArray)) { | |||
| if (matcher.match(dictIndex, toEncode, indexInArray)) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to just make a RangeEqualsVistor each time here, e.g.
if (dictionary.accept(new RangeEqualVisitor(toEncode, dictIndex, indexInArray, 1))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create an instsnce each time seems less effecient compared to reuse(The same reason for reset methed)? is there a special reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks much better to me. I prefer simplified code and to prevent class bloat, unless it's on a hot path and optimization is crucial. Even then, I don't think creating a new RangeEqualsVisitor would cause any noticeable performance loss.
java/vector/src/main/java/org/apache/arrow/vector/dictionary/ValueMatcher.java
Outdated
Show resolved
Hide resolved
|
|
||
| private boolean validateTypesAndIndices(ValueVector left) { | ||
|
|
||
| if (needCheckType && !compareValueVector(left, right)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assumes that compareValueVector only checks the type. This should be called regardless of the needCheckType flag, and then in compareValueVector you can look at the flag before comparing types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, fixed.
Do you mean also apply visitor mode for hash so that could apply different hash algorithm(like accept(HashVisitor visitor)? otherwise now hashCode is used for encoding and not a proper time to remove it I think. |
Weren't the equal methods also used for encoding? |
Yes, they are also used. Since #4933 (visitor to compare two vectors) already merged. Pravindra and Bryan suggested to remove equals(int, ValueVector, int) and replace this logic in encoding hash table with RangeEqualsVisitor(with length=1) to reduce duplicated logic. And hashCode could not be replaced at the moment. |
|
I updated this PR according to Bryan's newest comments, also I port equals related tests to TestRangeEqualsVisitor to make it more clean. |
Codecov Report
@@ Coverage Diff @@
## master #5065 +/- ##
==========================================
+ Coverage 88.06% 89.73% +1.67%
==========================================
Files 921 672 -249
Lines 136032 100352 -35680
Branches 1418 0 -1418
==========================================
- Hits 119799 90053 -29746
+ Misses 16223 10299 -5924
+ Partials 10 0 -10
Continue to review full report at Codecov.
|
| this.leftStart = leftStart; | ||
| this.rightStart = rightStart; | ||
| this.right = right; | ||
| this.length = length; | ||
| this.typeCheckNeeded = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.typeCheckNeeded = typeCheckNeeded ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, fixed.
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple minor things, otherwise looks good
|
|
||
| protected boolean typeCheckNeeded = true; | ||
|
|
||
| public void setTypeCheckNeeded(boolean typeCheckNeeded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have this flag in the constructor, then do you need this API also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, removed.
| * Constructs a new instance. | ||
| */ | ||
| public RangeEqualsVisitor(ValueVector right, int leftStart, int rightStart, int length) { | ||
| this (right, rightStart, leftStart, length, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove space after this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
| import org.junit.Before; | ||
| import org.junit.Test; | ||
|
|
||
| public class TestRangeEqualsVisitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test that just checks a range in the middle of a vector, using RangeEqualsVisitor and not VectorEqualsVisitor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, fixed.
| public void testZeroVectorNotEquals() { | ||
| try (final IntVector intVector = new IntVector("int", allocator); | ||
| final ZeroVector zeroVector = new ZeroVector()) { | ||
| final ZeroVector zeroVector = new ZeroVector()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but the alignment seemed right to me before so that final is after the above parenthesis. I see the file has both ways, so no need to fix now. I'm not sure why the checkstyle didn't error one way or the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now checkstyle in java code (/dev/checkstyle/checkstyle.xml) is not strong enough to catch some problem like you mentioned above and something like 'more than one empty line'. I think we could to take a close watch to see if it‘s possible to enhance this.
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @emkornfield did you have any concerns with removing the equals API here?
|
@BryanCutler no, I believe it was net new since 0.14.0 so it should be fine to remove. |
|
merged to master, thanks @tianchen92 ! |
Related to [ARROW-6210](https://issues.apache.org/jira/browse/ARROW-6210). This is a follow-up from apache#4933 The callers should be fixed to use the RangeEquals API instead. Closes apache#5065 from tianchen92/ARROW-6210 and squashes the following commits: cae440a <tianchen> resolve comments ad2277b <tianchen> resolve comments and port tests fb1510e <tianchen> add ValueMatcher 14b3e67 <tianchen> ARROW-6210: remove equals API from ValueVector Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Related to ARROW-6210.
This is a follow-up from #4933
The callers should be fixed to use the RangeEquals API instead.