-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1476: [JAVA] Implement Final ValueVector Updates #1316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I just rebased this after ARROW-1717 was merged |
|
Can we try to fix up the checkstyle warnings as part of this process as well? There seems to be a bunch of indentation problems etc. |
|
We'll want to rebase the vector refactor branch after this is merged as some of the tests are failing due to reasons that have been fixed in master |
68066db to
9b18879
Compare
|
@wesm, yes we need to rebase this branch before merging into master. I have addressed indentation and line length issues. |
Close apache#1164 Close apache#1198 Change-Id: If18e42d2edfdfef83e83621334a5b65a390e9db9
Implementation of all scalar types and complex types with corresponding legacy versions. Closes apache#1203
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference of this function and
public ArrowBuf[] getBuffers(boolean clear)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia What's your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @siddharthteotia, can you explain on the difference between the two methods and how do we need both as public API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to check valueCount > 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we haven't done setValueCount() on the source vector before invoking transfer(), then valueCount is essentially 0. Doing target.setValueCount(this.valueCount) will corrupt the target vector if this.valueCount is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems naming conversion wise, all methods that handles allocates new buffers are called "xxxSafe".
Since we have
public void safeSafe(int index, ...)
Does
public void setValiditySafe(int index, boolean valid)
Makes more sense for this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure. Since this has always been setIndexDefined(), I just left it as is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious what's the use case of this API, i.e, when do we want to just change the length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia what's your thought on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @siddharthteotia , can you explain why do we need this as a public API? When would a user want to just sent length for an element?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know if it's github issue, is the license format correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected. Not a github issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia The license format still doesn't look correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this class be under test or o.a.a.tools?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally under tools but for some weird reason, I wasn't able to resolve the dependency w.r.t arrow tools in Dremio -- I was definitely missing something but for saving time just put it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer if we can move this out of vector public class. If we don't have time to do it here, can we track this somewhere so we don't forget to move it before release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @siddharthteotia. Looks like this class is not used anywhere. Do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use this in Dremio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's only used for Dremio testing, does it make more sense to stay in Dremio codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not something new. Vector classes have had the API generateTestData() from the beginning. As part of refactoring, we decided to move it out of the main API of vectors and provide a separate API for populating vectors with dummy data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old generateTestData() API in vectors is deprecated and the class description implies that any users of the former should now be using GenerateSampleData.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It makes more sense now, thanks for the explanation. Let's keep it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there unit tests for this kind of stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we have unit tests for splitAndTransfer of bit vector (validity buffer). Also we have unit tests for split and transfer of other vectors (nullable fixed, var width and complex) and we exercise all the conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like
BitVectorHelper.setValidityBit(validityBuffer, thisIndex, from.get(fromIndex));
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also a bit confused about
set(thisIndex, ...)
vs
BitVectorHelper.setValidityBit(validityBuffer, thisIndex, ...)
What's the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, nvm. It seems all copyFrom follows the same pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a unit test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some copy() related unit tests but in any case, I am planning to write more unit tests in the next couple of weeks to get overall good coverage. I don't think so I can get to them now because of time constraints.
Note that behavior of copyFrom() hasn't changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "offset" is well defined in format, so this method can probably just be:
getOffset(int index)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or just:
offsetBuffer.getInt(index * OFFSET_WIDTH)
Probably it consist with holder.end = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startOffset implies the starting point of VARCHAR element in the data buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation seems incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation seems incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be lifted to base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably but I thought BaseRepeatedValueVector on itself is of no use -- at least doesn't represent a materialized field. Moreover, even if we move to base class, we still have to set reader and writer index for validity buffer in sub class since the validity buffer is a property of list vector and not BaseRepeatedValueVector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this method? Can valueCapacity and validityCapacity mismatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validity and offset move together regardless of the buffers in underlying data vector of list vector so functions like setNotNull(), startNewValue(), setValueCount() which are essentially safe methods need to determine if the inner buffers of list vector need to be reallocated or not. This decision can't be made based on getValueCapacity() since that also accounts for the value capacity of data vector. I think by mistake it is public. that needs to be corrected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yeah let's make this private.
|
@siddharthteotia This looks good. I left some comments. One thing is that it seems this patch has quite bit of bug fixes but not many new tests. Should we add tests for those bug fixes? |
|
There was one more problem. When JsonFileReader was refactored recently while removing the static interfaces, we did not set the writer index on the buffers we are allocating for the vector. As we are parsing JSON and writing to buffers, we should be setting writer index before calling loadFieldBuffers(). The static interfaces were doing so but we missed doing that in the recent change |
|
FYI, I updated Spark to use these changes and tests pass for the most part. There are still a couple things I need to work out, but overall it is looking good from this side! |
|
Oops, I can take a look at JsonFileReader tomorrow. @siddharthteotia which tests are failing because of that? |
|
Fixing checkstyle for warning is probably a wasted effort especially when the warnings are coming from code that hasn't been modified as well. I have filed a JIRA to address this separately. With so many warnings dumped by mvn package checkstyle:check, it is slightly difficult to identify what to fix and what not. Essentially everything needs to be fixed -- indentation, order of imports etc throughout the code base. I have done it for most parts of newly written code. I am fine with doing it but can we do it separately? I have filed a JIRA to track this |
|
Yes, let's do the style fixes separately. I think it's long overdue |
af14dd7 to
9706172
Compare
|
@siddharthteotia How about we fix at least indentation, header and white spaces, end of file new lines for new files? Other checkstyle warning we can fix it later, but I am concerned with adding lots of new files with incorrect indentation and white spaces. |
|
@icexelloss , I have fixed indentation, license header and white spaces in new files. Going through the code again. Travis CI build passes. |
|
Build passed. |
| import org.apache.arrow.vector.schema.ArrowRecordBatch; | ||
| import org.apache.arrow.vector.schema.ArrowVectorType; | ||
|
|
||
| import javax.annotation.Nullable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| buf.writeByte(parser.getByteValue()); | ||
| } | ||
|
|
||
| buf.writerIndex(size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? buf.writeByte seems to update writerIndex already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we don't need.
| setNotNull(index); | ||
| lastSet = index + 1; | ||
| return offsets.getAccessor().get(lastSet); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia I think we want to make this private, is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
|
Awesome. @siddharthteotia I have re commented on all my previous comments that are not addressed. Can you take a look? Thanks! |
|
@siddharthteotia There are still some indentation and white space warning on the some new vector files you can see this by running: and check the html file under Can you double check those please? Thanks. Otherwise style looks correct. |
|
Looks like we are on the home stretch. I can merge later today when these last small items are addressed |
| * vector instance. | ||
| */ | ||
| @Override | ||
| public ArrowBuf[] getBuffers(boolean clear) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous comments are squashed because of code change.
Why do we need both
public ArrowBuf[] getBuffers(boolean clear)
and
public List<ArrowBuf> getFieldBuffers()
Both as public API? They look very similar to me and might confuse users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, getBuffers() is an old method for getting a batch of buffers for serialization. We use it for maintaining compatibility with Drill client. I think in future, we should consider removing it from Arrow and writing a wrapper over it in Dremio. This way we will just have getFieldBuffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Does it make sense to mark getBuffers() as deprecated so it's less confusing for the other users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not worry about it right now. These two APIs have not been introduced as part of refactoring and have been in Arrow code for quite some time.
I will file a JIRA and we can brainstorm w.r.t appropriate steps to address the ambiguity (if any)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Sounds good.
| * @param index position of the element to set | ||
| * @param length length of the element | ||
| */ | ||
| public void setValueLengthSafe(int index, int length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is not used.
Also, I am not sure if we need an API for just setting value length. When would this be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siddharthteotia what's your thought on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use it in our Parquet code -- I may have to evaluate its usage but the API has been there in variable width vector from the beginning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification - I wasn't sure which APIs are from the old vectors and which APIs are added.
Looks good then.
|
@wesm , let me know if I need to squash the latest 7 commits for this JIRA or you can take care of it while merging. I think we would be merging 4 commits into master: ARROW-1473 |
|
Build passed. |
|
LGTM. +1. Thanks @siddharthteotia for the epic effort! |
|
Sweet, thanks all! I will take care of merging to master (and squashing the commits between the last refactor patch and now) |
|
+1, great work @siddharthteotia ! |
|
merged to master. thank you! |
…s zero null count and optional validity buffer Currently when a Field has null count = 0, C++ will omit the validity buffer as it is optional in this case. Testing for #1316 was failing because Java was not handling this properly. This PR adds an explicit test to ensure this is being tested and easier to locate possible issues. Author: Bryan Cutler <cutlerb@gmail.com> Closes #1323 from BryanCutler/integration-test-optional-validity-buffer-ARROW-1821 and squashes the following commits: 4295660 [Bryan Cutler] expanded simple.json static test file to include field with null count = 0
cc @jacques-n , @BryanCutler , @icexelloss
bug fixes and clean up done as part of testing in Dremio -- this is the 3rd patch I was referring to