Skip to content

Conversation

@siddharthteotia
Copy link
Contributor

cc @jacques-n , @BryanCutler , @icexelloss

bug fixes and clean up done as part of testing in Dremio -- this is the 3rd patch I was referring to

@wesm
Copy link
Member

wesm commented Nov 14, 2017

I just rebased this after ARROW-1717 was merged

@wesm
Copy link
Member

wesm commented Nov 14, 2017

Can we try to fix up the checkstyle warnings as part of this process as well? There seems to be a bunch of indentation problems etc.

@wesm
Copy link
Member

wesm commented Nov 14, 2017

We'll want to rebase the vector refactor branch after this is merged as some of the tests are failing due to reasons that have been fixed in master

@siddharthteotia siddharthteotia force-pushed the ARROW-1476 branch 2 times, most recently from 68066db to 9b18879 Compare November 14, 2017 04:51
@siddharthteotia
Copy link
Contributor Author

siddharthteotia commented Nov 14, 2017

@wesm, yes we need to rebase this branch before merging into master. I have addressed indentation and line length issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference of this function and

public ArrowBuf[] getBuffers(boolean clear)

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia What's your thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @siddharthteotia, can you explain on the difference between the two methods and how do we need both as public API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to check valueCount > 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we haven't done setValueCount() on the source vector before invoking transfer(), then valueCount is essentially 0. Doing target.setValueCount(this.valueCount) will corrupt the target vector if this.valueCount is 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Copy link
Contributor

@icexelloss icexelloss Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems naming conversion wise, all methods that handles allocates new buffers are called "xxxSafe".

Since we have

public void safeSafe(int index, ...)

Does

public void setValiditySafe(int index, boolean valid)

Makes more sense for this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure. Since this has always been setIndexDefined(), I just left it as is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious what's the use case of this API, i.e, when do we want to just change the length?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia what's your thought on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @siddharthteotia , can you explain why do we need this as a public API? When would a user want to just sent length for an element?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know if it's github issue, is the license format correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected. Not a github issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia The license format still doesn't look correct.

Copy link
Contributor

@icexelloss icexelloss Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this class be under test or o.a.a.tools?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally under tools but for some weird reason, I wasn't able to resolve the dependency w.r.t arrow tools in Dremio -- I was definitely missing something but for saving time just put it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer if we can move this out of vector public class. If we don't have time to do it here, can we track this somewhere so we don't forget to move it before release?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @siddharthteotia. Looks like this class is not used anywhere. Do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use this in Dremio

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's only used for Dremio testing, does it make more sense to stay in Dremio codebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not something new. Vector classes have had the API generateTestData() from the beginning. As part of refactoring, we decided to move it out of the main API of vectors and provide a separate API for populating vectors with dummy data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old generateTestData() API in vectors is deprecated and the class description implies that any users of the former should now be using GenerateSampleData.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It makes more sense now, thanks for the explanation. Let's keep it then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there unit tests for this kind of stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have unit tests for splitAndTransfer of bit vector (validity buffer). Also we have unit tests for split and transfer of other vectors (nullable fixed, var width and complex) and we exercise all the conditions.

Copy link
Contributor

@icexelloss icexelloss Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like

BitVectorHelper.setValidityBit(validityBuffer, thisIndex, from.get(fromIndex));

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also a bit confused about

set(thisIndex, ...) 

vs

BitVectorHelper.setValidityBit(validityBuffer, thisIndex, ...)

What's the difference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, nvm. It seems all copyFrom follows the same pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a unit test for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some copy() related unit tests but in any case, I am planning to write more unit tests in the next couple of weeks to get overall good coverage. I don't think so I can get to them now because of time constraints.

Note that behavior of copyFrom() hasn't changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "offset" is well defined in format, so this method can probably just be:

getOffset(int index)

Copy link
Contributor

@icexelloss icexelloss Nov 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or just:

offsetBuffer.getInt(index * OFFSET_WIDTH)

Probably it consist with holder.end = ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startOffset implies the starting point of VARCHAR element in the data buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation seems incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation seems incorrect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be lifted to base class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably but I thought BaseRepeatedValueVector on itself is of no use -- at least doesn't represent a materialized field. Moreover, even if we move to base class, we still have to set reader and writer index for validity buffer in sub class since the validity buffer is a property of list vector and not BaseRepeatedValueVector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this method? Can valueCapacity and validityCapacity mismatch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validity and offset move together regardless of the buffers in underlying data vector of list vector so functions like setNotNull(), startNewValue(), setValueCount() which are essentially safe methods need to determine if the inner buffers of list vector need to be reallocated or not. This decision can't be made based on getValueCapacity() since that also accounts for the value capacity of data vector. I think by mistake it is public. that needs to be corrected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah let's make this private.

@icexelloss
Copy link
Contributor

@siddharthteotia This looks good. I left some comments.

One thing is that it seems this patch has quite bit of bug fixes but not many new tests. Should we add tests for those bug fixes?

@siddharthteotia
Copy link
Contributor Author

There was one more problem. When JsonFileReader was refactored recently while removing the static interfaces, we did not set the writer index on the buffers we are allocating for the vector. As we are parsing JSON and writing to buffers, we should be setting writer index before calling loadFieldBuffers().

The static interfaces were doing so but we missed doing that in the recent change

@BryanCutler
Copy link
Member

FYI, I updated Spark to use these changes and tests pass for the most part. There are still a couple things I need to work out, but overall it is looking good from this side!

@icexelloss
Copy link
Contributor

Oops, I can take a look at JsonFileReader tomorrow. @siddharthteotia which tests are failing because of that?

@siddharthteotia
Copy link
Contributor Author

siddharthteotia commented Nov 15, 2017

Fixing checkstyle for warning is probably a wasted effort especially when the warnings are coming from code that hasn't been modified as well. I have filed a JIRA to address this separately. With so many warnings dumped by mvn package checkstyle:check, it is slightly difficult to identify what to fix and what not.

Essentially everything needs to be fixed -- indentation, order of imports etc throughout the code base. I have done it for most parts of newly written code.

I am fine with doing it but can we do it separately? I have filed a JIRA to track this
https://issues.apache.org/jira/browse/ARROW-1813

@wesm
Copy link
Member

wesm commented Nov 15, 2017

Yes, let's do the style fixes separately. I think it's long overdue

@icexelloss
Copy link
Contributor

@siddharthteotia How about we fix at least indentation, header and white spaces, end of file new lines for new files?

Other checkstyle warning we can fix it later, but I am concerned with adding lots of new files with incorrect indentation and white spaces.

@siddharthteotia
Copy link
Contributor Author

@icexelloss , I have fixed indentation, license header and white spaces in new files. Going through the code again. Travis CI build passes.

@siddharthteotia
Copy link
Contributor Author

Build passed.

import org.apache.arrow.vector.schema.ArrowRecordBatch;
import org.apache.arrow.vector.schema.ArrowVectorType;

import javax.annotation.Nullable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

buf.writeByte(parser.getByteValue());
}

buf.writerIndex(size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? buf.writeByte seems to update writerIndex already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we don't need.

setNotNull(index);
lastSet = index + 1;
return offsets.getAccessor().get(lastSet);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia I think we want to make this private, is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@icexelloss
Copy link
Contributor

Awesome.

@siddharthteotia I have re commented on all my previous comments that are not addressed. Can you take a look? Thanks!

@icexelloss
Copy link
Contributor

icexelloss commented Nov 15, 2017

@siddharthteotia There are still some indentation and white space warning on the some new vector files

org/apache/arrow/vector/BaseNullableFixedWidthVector.java
org/apache/arrow/vector/BaseNullableVariableWidthVector.java

you can see this by running:

mvn package checkstyle:checkstyle

and check the html file under

vector/target/site/checkstyle.html

Can you double check those please? Thanks. Otherwise style looks correct.

@wesm
Copy link
Member

wesm commented Nov 15, 2017

Looks like we are on the home stretch. I can merge later today when these last small items are addressed

* vector instance.
*/
@Override
public ArrowBuf[] getBuffers(boolean clear) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous comments are squashed because of code change.

Why do we need both

public ArrowBuf[] getBuffers(boolean clear)

and

public List<ArrowBuf> getFieldBuffers()

Both as public API? They look very similar to me and might confuse users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, getBuffers() is an old method for getting a batch of buffers for serialization. We use it for maintaining compatibility with Drill client. I think in future, we should consider removing it from Arrow and writing a wrapper over it in Dremio. This way we will just have getFieldBuffers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Does it make sense to mark getBuffers() as deprecated so it's less confusing for the other users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not worry about it right now. These two APIs have not been introduced as part of refactoring and have been in Arrow code for quite some time.

I will file a JIRA and we can brainstorm w.r.t appropriate steps to address the ambiguity (if any)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Sounds good.

* @param index position of the element to set
* @param length length of the element
*/
public void setValueLengthSafe(int index, int length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not used.

Also, I am not sure if we need an API for just setting value length. When would this be useful?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddharthteotia what's your thought on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use it in our Parquet code -- I may have to evaluate its usage but the API has been there in variable width vector from the beginning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification - I wasn't sure which APIs are from the old vectors and which APIs are added.

Looks good then.

@siddharthteotia
Copy link
Contributor Author

@wesm , let me know if I need to squash the latest 7 commits for this JIRA or you can take care of it while merging. I think we would be merging 4 commits into master:

ARROW-1473
ARROW-1474
ARROW-1717
ARROW-1476 (squashed latest commits)

@siddharthteotia
Copy link
Contributor Author

Build passed.
I think we are all set for merge.

@icexelloss
Copy link
Contributor

LGTM. +1.

Thanks @siddharthteotia for the epic effort!

@wesm
Copy link
Member

wesm commented Nov 15, 2017

Sweet, thanks all! I will take care of merging to master (and squashing the commits between the last refactor patch and now)

@BryanCutler
Copy link
Member

+1, great work @siddharthteotia !

@wesm
Copy link
Member

wesm commented Nov 15, 2017

merged to master. thank you!

@wesm wesm closed this Nov 15, 2017
wesm pushed a commit that referenced this pull request Nov 16, 2017
…s zero null count and optional validity buffer

Currently when a Field has null count = 0, C++ will omit the validity buffer as it is optional in this case.  Testing for #1316 was failing because Java was not handling this properly.  This PR adds an explicit test to ensure this is being tested and easier to locate possible issues.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #1323 from BryanCutler/integration-test-optional-validity-buffer-ARROW-1821 and squashes the following commits:

4295660 [Bryan Cutler] expanded simple.json static test file to include field with null count = 0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants