-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1692: [Java] UnionArray round trip not working #7290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
It looks like the integration tests are still failing for union arrays between C++ and Java. |
yeah, something w/ Flight, though the IPC works fine. Checking what is different in flight now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please write some comments about the indices & values of the array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to multiply by 2 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed. Probably a copy/paste error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should be a space after the cast, or the style check would fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe we can use
setValidityBit(ArrowBuf validityBuffer, int index, int value)
here to make the code more concise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are still relying on the MinorType ordinal as the type ID?
If so, the sparse union vector is still not aligning with the specification, and that causes the integration test to fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch uses the typeIds array to extract the MinorType associated with the logical type. If it does not exist in the array (usually because a logical type wasn't previously assigned) it defaults to the id of the minor type. So Java always uses the MinorType as the type id and allows c++ (and others) to send arbitrary type mappings.
|
@jacques-n I think you had some concerns the last time sparse unions were made. Do these changes raise the same issues? |
14db89f to
0781a84
Compare
|
Assuming CI is still passing, is this good to merge? What is left to do, or who needs to approve? |
|
It would be great if @jacques-n could confirm he is OK with these changes. I believe on a previous PR the extra indirection was a concern. |
|
I'm really struggling with these changes. I don't understand why there is a validity buffer at the union level as well as at the cell level. I'm not sure what it even means that a a union value is valid but the inner value is not. I think there was an inconsistent understanding on my part when that change was proposed in the format (versus what the Java implementation was already doing). @wesm why would we have validity at both the top level and the inner level? Or is the C++ impl only at the top level and not the inner level (which means that this java code is incorrect since the inner vectors have their own validity vectors. |
|
Adding to my previous comments: if only at the top level, I'm not sure what the ramification of that would mean at the Java codebase. I think it would require a fairly massive refactoring. |
Well, the way the specification is written
We can decide to stipulate that union types never have non-valid values at the Union cell level, only at the child cell level. But then a union value cannot be "made null" by changing the validity bitmap of the Union. From a purely algebraic / design perspective it isn't great. From the get-go in the project I've been striving for algebraic consistency, e.g. enabling well-formed arrays to be composed to created composite types without alteration. Since Unions are one of the more seldom used part of the project we can decide to explicitly deviate from that, but we need to do it now if we're going to do that and not delay longer in reconciling the issue. |
|
FTR I'm OK with dropping the top-level validity bitmap from Union, especially if it helps us move forward |
I believe a union can still express this with a child type null type, no? I think that is how we either modeled it or planned to model it no the java side.
I'm in agreement on this. Decomposing would be complex.
I think it's where the model breaks down because of the weird situation where you actually need to evaluate two validity buffers to determine whether something is valid: the parent and the child. And an inconsistency would be really weird. As such, I'm think it would be better to avoid the top-level validity buffer.
That would be my preference. It seems to ultimately reduce the risk of inconsistency and doesn't seem to have any functional loss (given the use of null type to indicate a non-alternatively-typed value). I also think this works well in the most common case of union types, e.g. two files where one has fieldA with schemaA and another where you have fieldA with schemaB. Compositing those two doesn't require some kind of introspection and AND'ing of the individual children to build an additional validity buffer (or simply setting true for all and then having an inconsistency with the child array) and allows a fast set of the type vector for each independent chunk. |
I'm OK with this. We would need to act quickly to try to pull this off for the release. I can start a DISCUSS thread and work up a patch with the proposed changes to the specification |
|
Thanks @wesm and @jacques-n for the review. I will leave this up until consensus is reached on the format change. Please let me know if I can help w/ the c++ patch, would be happy to contribute! |
|
I sent an e-mail to dev@ -- let's discuss there |
|
In addition to the problem of top level validity buffer, I think there is another problem to discuss: Java is using the ordinal of the minor type as the type id, and this is not aligning with the C++ implementation/specification. This PR works around the problem by supporting two modes. However, the C++ and Java implementations are not really aligning, IMO. Do we have a plan to mitigate this? |
|
@liyafan82 I think that is a good point. If it supports both modes I think that is a reasonable compromise for now as long as @jacques-n is OK with it. But we can maybe discuss further if needed. |
|
I think this is now inline with the spec. The Union/DenseUnion types now uses logical type ids in Java. Which is the same as in c++. The difference in a java created Union is that the type ids are chosen from a known index (minor type). But the java implementation doesn't rely on minor type ordinals strict ordering like before. Or at least the intention was to be fully in line with spec while maintaining a convenient default choice for type id. Drawing any meaning from type buffers alone in client code would be an error in my opinion. Hope that clarifies my intention w/ this change. |
|
This has been modified to incorporate the changes to Unions as proposed on the mailing list |
cc11a2e to
0c8f9e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems NullVector throws UnsupportedOperationException in this case - should we be consistent with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, yes. I copied UnionVector (which returns typeBuffer) but I didn't really like it. Fixed on both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inconsistent with getNullCount; this vector has nullCount == 0 but can have isNull return true regardless. C++ seems to always return false when there is no validity buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with changing this to always return false. However I think this means anyone using a Union vector would have to extract the child and check for null as an extra step. All other nested vectors have a top level validity buffer. I don't know what the ramifications of changing this are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set isNull to always be false and all the tests seem to pass (with minor modifications). Take a look and let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think from the viewpoint of code generically manipulating vectors, this makes more sense - the entries of the union are never null, but they may reference something that is null. I agree it's less convenient when working with a union array explicitly though.
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @liyafan82 do you have other comments?
|
Just rebased to include the memory related changes. Avoid any surprise broken master builds. @liyafan82 think its ready to go? |
|
Looks like unions fail in integration again. |
|
Seems like Java needs to be excluded from testing the 0.17.1 "golden" reference union file. Also, Java produces data that C++ can't read, but Java can read the C++ files; this fails the IPC tests with Java producing, and the Flight tests with both Java producing and consuming (since Flight tests upload and download). |
Wonder if its the MetadataVersion change that went in for c++ today? Looking now. |
|
It is indeed the V4 patch that has been submitted on the c++ side and not on the Java side. I think this means we need to submit your patch first @lidavidm? |
|
Hmm - Java doesn't check the metadata version and the error makes it sound like it's layout-related? Oh wait. So now C++'s view of the union layout depends on the metadata version. Ok, then let's merge the Java V5 metadata change first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a minor hack to remove Java from 0.17.1 union gold tests. Not the most elegant way to do it but doing it correctly would be a non-trivial change to archery imho. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO I think this is OK for now.
This fixes the integration tests for Union Arrays. Both Dense and Sparse unions now work: * validity buffer added to Sparse Union * both Union types now track logical types rather than MinorType enum ordinal
|
Rebased since #7685 was merged |
|
Tests pass so I'm merging. Hooray! |
|
Thanks for the rebate @wesm! Glad we have finished this one! |
This fixes the integration tests for Union Arrays. Both Dense and Sparse unions now work: * validity buffer added to Sparse Union * both Union types now track logical types rather than MinorType enum ordinal Dependent on apache#7289 for duplicate field names. Closes apache#7290 from rymurr/ARROW-1692 Authored-by: Ryan Murray <rymurr@dremio.com> Signed-off-by: Wes McKinney <wesm@apache.org>
This fixes the integration tests for Union Arrays.
Both Dense and Sparse unions now work:
Dependent on #7289 for duplicate field names.