Skip to content

Conversation

@tianchen92
Copy link
Contributor

Related to ARROW-6210.
This is a follow-up from #4933
The callers should be fixed to use the RangeEquals API instead.

@tianchen92
Copy link
Contributor Author

cc @pravindra

@@ -135,7 +139,8 @@ public int getIndex(int indexInArray, ValueVector toEncode) {
for (DictionaryHashTable.Entry e = table[index]; e != null ; e = e.next) {
if ((e.hash == hash)) {
int dictIndex = e.index;
if (dictionary.equals(dictIndex, toEncode, indexInArray)) {
visitor.reset(indexInArray, dictIndex, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This saves the repeated allocations. Can we also save the repeated type checks ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you careful check, right, here checks type for each value compare.
The dictionary vector is bind with hashTable but the toEncode vector is not(that is to see a same dictionary could be used for multiple toEncode vectors), so it's hard to check which is the first value of toEncode to compare. But as a workaround, I would like to disable all type check here as I think they will always have same type, or we could make a check outside hash table. What do you think?

@@ -135,7 +139,8 @@ public int getIndex(int indexInArray, ValueVector toEncode) {
for (DictionaryHashTable.Entry e = table[index]; e != null ; e = e.next) {
if ((e.hash == hash)) {
int dictIndex = e.index;
if (dictionary.equals(dictIndex, toEncode, indexInArray)) {
visitor.reset(indexInArray, dictIndex, 1);
if (toEncode.accept(visitor)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we hide the fact that the impl uses a visitor from the APIs ? Maybe,

matcher = new RangeMatcher(srcVector, dstVector);
if (matcher.match(srcIndex, dstIndex, 1)) {
return dictIndex;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I add a ValueMatcher to do this. As the same reason mentioned above, this matcher only holds dictionary vector instead both.

@emkornfield
Copy link
Contributor

Should the same be done for hash?

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tianchen92 . I think the ValueMather class you add is not necessary, and also add a test for the new flag in RangeEqualsVisitor.

/**
* Reset start indices and length for reuse purpose.
*/
public void reset(int leftStart, int rightStart, int length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to just create a new RangeEqualsVisitor than add a reset method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@@ -135,7 +138,7 @@ public int getIndex(int indexInArray, ValueVector toEncode) {
for (DictionaryHashTable.Entry e = table[index]; e != null ; e = e.next) {
if ((e.hash == hash)) {
int dictIndex = e.index;
if (dictionary.equals(dictIndex, toEncode, indexInArray)) {
if (matcher.match(dictIndex, toEncode, indexInArray)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to just make a RangeEqualsVistor each time here, e.g.
if (dictionary.accept(new RangeEqualVisitor(toEncode, dictIndex, indexInArray, 1))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create an instsnce each time seems less effecient compared to reuse(The same reason for reset methed)? is there a special reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks much better to me. I prefer simplified code and to prevent class bloat, unless it's on a hot path and optimization is crucial. Even then, I don't think creating a new RangeEqualsVisitor would cause any noticeable performance loss.


private boolean validateTypesAndIndices(ValueVector left) {

if (needCheckType && !compareValueVector(left, right)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assumes that compareValueVector only checks the type. This should be called regardless of the needCheckType flag, and then in compareValueVector you can look at the flag before comparing types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed.

@tianchen92
Copy link
Contributor Author

Should the same be done for hash?

Do you mean also apply visitor mode for hash so that could apply different hash algorithm(like accept(HashVisitor visitor)? otherwise now hashCode is used for encoding and not a proper time to remove it I think.

@emkornfield
Copy link
Contributor

Do you mean also apply visitor mode for hash so that could apply different hash algorithm(like accept(HashVisitor visitor)? otherwise now hashCode is used for encoding and not a proper time to remove it I think.

Weren't the equal methods also used for encoding?

@tianchen92
Copy link
Contributor Author

Do you mean also apply visitor mode for hash so that could apply different hash algorithm(like accept(HashVisitor visitor)? otherwise now hashCode is used for encoding and not a proper time to remove it I think.

Weren't the equal methods also used for encoding?

Yes, they are also used. Since #4933 (visitor to compare two vectors) already merged. Pravindra and Bryan suggested to remove equals(int, ValueVector, int) and replace this logic in encoding hash table with RangeEqualsVisitor(with length=1) to reduce duplicated logic. And hashCode could not be replaced at the moment.

@tianchen92
Copy link
Contributor Author

I updated this PR according to Bryan's newest comments, also I port equals related tests to TestRangeEqualsVisitor to make it more clean.
@BryanCutler @pravindra Please take another look, thanks!

@codecov-io
Copy link

codecov-io commented Aug 13, 2019

Codecov Report

Merging #5065 into master will increase coverage by 1.67%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5065      +/-   ##
==========================================
+ Coverage   88.06%   89.73%   +1.67%     
==========================================
  Files         921      672     -249     
  Lines      136032   100352   -35680     
  Branches     1418        0    -1418     
==========================================
- Hits       119799    90053   -29746     
+ Misses      16223    10299    -5924     
+ Partials       10        0      -10
Impacted Files Coverage Δ
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
python/pyarrow/tests/test_parquet.py 96.33% <0%> (-0.07%) ⬇️
r/src/recordbatch.cpp
r/R/Table.R
js/src/util/fn.ts
r/src/symbols.cpp
rust/datafusion/src/execution/projection.rs
rust/datafusion/src/execution/filter.rs
rust/arrow/src/csv/writer.rs
rust/datafusion/src/bin/main.rs
... and 242 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cafb00f...cae440a. Read the comment docs.

this.leftStart = leftStart;
this.rightStart = rightStart;
this.right = right;
this.length = length;
this.typeCheckNeeded = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.typeCheckNeeded = typeCheckNeeded ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple minor things, otherwise looks good


protected boolean typeCheckNeeded = true;

public void setTypeCheckNeeded(boolean typeCheckNeeded) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have this flag in the constructor, then do you need this API also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, removed.

* Constructs a new instance.
*/
public RangeEqualsVisitor(ValueVector right, int leftStart, int rightStart, int length) {
this (right, rightStart, leftStart, length, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove space after this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

import org.junit.Before;
import org.junit.Test;

public class TestRangeEqualsVisitor {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test that just checks a range in the middle of a vector, using RangeEqualsVisitor and not VectorEqualsVisitor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, fixed.

@tianchen92 tianchen92 closed this Aug 14, 2019
@tianchen92 tianchen92 reopened this Aug 14, 2019
@tianchen92 tianchen92 closed this Aug 14, 2019
@tianchen92 tianchen92 reopened this Aug 14, 2019
public void testZeroVectorNotEquals() {
try (final IntVector intVector = new IntVector("int", allocator);
final ZeroVector zeroVector = new ZeroVector()) {
final ZeroVector zeroVector = new ZeroVector()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but the alignment seemed right to me before so that final is after the above parenthesis. I see the file has both ways, so no need to fix now. I'm not sure why the checkstyle didn't error one way or the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now checkstyle in java code (/dev/checkstyle/checkstyle.xml) is not strong enough to catch some problem like you mentioned above and something like 'more than one empty line'. I think we could to take a close watch to see if it‘s possible to enhance this.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @emkornfield did you have any concerns with removing the equals API here?

@emkornfield
Copy link
Contributor

@BryanCutler no, I believe it was net new since 0.14.0 so it should be fine to remove.

@BryanCutler
Copy link
Member

merged to master, thanks @tianchen92 !

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
Related to [ARROW-6210](https://issues.apache.org/jira/browse/ARROW-6210).
This is a follow-up from apache#4933
The callers should be fixed to use the RangeEquals API instead.

Closes apache#5065 from tianchen92/ARROW-6210 and squashes the following commits:

cae440a <tianchen> resolve comments
ad2277b <tianchen> resolve comments and port tests
fb1510e <tianchen> add ValueMatcher
14b3e67 <tianchen> ARROW-6210:  remove equals API from ValueVector

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants