-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1184: [Java] Dictionary.equals is not working correctly #4843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! Can you add a test with some encoded strings also?
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
java/vector/src/test/java/org/apache/arrow/vector/TestDictionary.java
Outdated
Show resolved
Hide resolved
Thanks, Encoded strings equals are not supported yet and I remarked it as TODO and will do it after #4792 merged, that PR introduces a ByteArrayBuffer for byte[] compare. |
|
@BryanCutler Thanks a lot for your feedback, I updated this PR, would you mind taking a look again? thanks! |
|
@emkornfield would you mind take a look at this one? thanks :) |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tianchen92 thanks for doing this. I don't want to open a can of worms, but I think to do this right we need to address equals for vectors first and reconcile with what is already in vector.util.Validator. There should also be some unit tests for these across vector types.
If you want to fix up Dictionary.equals to call vector.util.Validator for now as a bandaid, I guess that will be ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly implemented in vector.util.Validator, which is used in Integration tests. It should probably be refactored, but it is ok to just call the functions for now, e.g. compareFieldVectors.
|
@BryanCutler Thanks a lot for your comments. |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good in general, but I think you should also be doing the same thing for DictionaryEncoding equals. Also, the integration test failure looks relevant, can you check what went wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change? I don't the the exception message should be changing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this change, Validator could throw Exception with different message, since we make it return boolean value, log the explicit message instead of throwing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to compare encodings with the Validator also, e.g. a function Validator.compareDictionaryEncodings. Isn't Object.equals incorrect here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Object.equals works fine with DictionaryEncoding since it implemented hashCode & equals already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it further, DictionaryEncoder implements equals so I think you should call that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean DictionaryEncoding? I think Object.equals will call DictionaryEncoding#equals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, you are correct
Thanks a lot for your comments, the relevant test failure is fixed now. |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tianchen92 I didn't realize you would need to change the exceptions to logs to use the Validator. I don't think we should make that drastic of a change here because it is pretty important to integration tests. I think you should revert the changes to Validator/Integration and we have to options to work around:
-
As a temporary fix, just call
Validator.compareFieldVectorsand handle the exception inDictionary.equals -
Move the logic in
Validator.compareFieldVectorstoFieldVector.equalsand call that fromValidatorandDictionary.equals
You could also do (1) now and (2) as a followup. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it further, DictionaryEncoder implements equals so I think you should call that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty important to be able to see the different error messages when an integration test fails and I'm not sure only logging them is good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to keep original error message, it might not proper to refactor Validator to make it has return type and use it in Dictionary, just copy it logic to Dictionary as previous commit does? Or any good suggestions? Thanks!
Sure, sounds reasonable, thanks! |
|
@BryanCutler Thanks for your effort, PR updated. |
Codecov Report
@@ Coverage Diff @@
## master #4843 +/- ##
==========================================
+ Coverage 87.44% 89.58% +2.13%
==========================================
Files 995 661 -334
Lines 140479 96636 -43843
Branches 1418 0 -1418
==========================================
- Hits 122841 86570 -36271
+ Misses 17276 10066 -7210
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
1 similar comment
Codecov Report
@@ Coverage Diff @@
## master #4843 +/- ##
==========================================
+ Coverage 87.44% 89.58% +2.13%
==========================================
Files 995 661 -334
Lines 140479 96636 -43843
Branches 1418 0 -1418
==========================================
- Hits 122841 86570 -36271
+ Misses 17276 10066 -7210
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
|
@tianchen92 looks like there was a problem with your merge, can you rebase with master? |
Sure, rebased. |
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but there is just one change I'm not sure of and seem to be a test failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Field implemented equals method but it also compare names, that is to say, two same type vector with different name will not equal here even they have same elements, so compare with getClass is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to compare Field names for the integration tests. Ideally, we should be able to compare just the data in ValueVectors and the entire FieldVector also. For the dictionary vector, I don't think the field name is important, but shouldn't hurt to include the name check as well. Were you seeing a test failure if you revert this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like https://github.com/apache/arrow/pull/4843/files#diff-945722295a5f95e681897e48eedfcb08R459 vectors with different names ("v1" "v2") will not equals.
If you think we shouldn't change behavior of Validator, maybe we could refactor this PR and do not use Validator logic after #4933 merged, that PR introduces equals API in ValueVector. What do you think and could you also take a look? thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not change the behavior of Validator here, can you revert this and we can make it less restrictive later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a TODO note here to clean this up once we can compare vectors without using the Validator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
BryanCutler
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
CI tests all passed, merged to master. Thanks @tianchen92 ! |
Related to [ARROW-1184](https://issues.apache.org/jira/browse/ARROW-1184). The Dictionary.equals method does not return True when the dictionaries are equal. This is because equals is not implemented for FieldVector and so that comparison defaults to comparing the two objects only and not the vector data. Closes apache#4843 from tianchen92/ARROW-1184 and squashes the following commits: 3511857 <tianchen> revert 527a9d7 <tianchen> add TODO 0c69588 <tianchen> fix equals 1f41366 <tianchen> fix build 87b7f66 <tianchen> fix 3fe9389 <tianchen> use Validator logic in Dictionary#equals and add UT be74915 <tianchen> move UT 314d697 <tianchen> ARROW-1184: Dictionary.equals is not working correctly Authored-by: tianchen <niki.lj@alibaba-inc.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Related to ARROW-1184.
The Dictionary.equals method does not return True when the dictionaries are equal. This is because equals is not implemented for FieldVector and so that comparison defaults to comparing the two objects only and not the vector data.