Skip to content

Conversation

@rymurr
Copy link
Contributor

@rymurr rymurr commented May 27, 2020

This fixes the integration tests for Union Arrays.

Both Dense and Sparse unions now work:

  • validity buffer added to Sparse Union
  • both Union types now track logical types rather than MinorType enum ordinal

Dependent on #7289 for duplicate field names.

@github-actions
Copy link

@rymurr rymurr marked this pull request as ready for review June 10, 2020 18:09
@lidavidm
Copy link
Member

It looks like the integration tests are still failing for union arrays between C++ and Java.

@rymurr
Copy link
Contributor Author

rymurr commented Jun 11, 2020

It looks like the integration tests are still failing for union arrays between C++ and Java.

yeah, something w/ Flight, though the IPC works fine. Checking what is different in flight now...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please write some comments about the indices & values of the array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to multiply by 2 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed. Probably a copy/paste error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be a space after the cast, or the style check would fail?

Copy link
Contributor

@liyafan82 liyafan82 Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe we can use

setValidityBit(ArrowBuf validityBuffer, int index, int value)

here to make the code more concise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are still relying on the MinorType ordinal as the type ID?
If so, the sparse union vector is still not aligning with the specification, and that causes the integration test to fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch uses the typeIds array to extract the MinorType associated with the logical type. If it does not exist in the array (usually because a logical type wasn't previously assigned) it defaults to the id of the minor type. So Java always uses the MinorType as the type id and allows c++ (and others) to send arbitrary type mappings.

@emkornfield
Copy link
Contributor

@jacques-n I think you had some concerns the last time sparse unions were made. Do these changes raise the same issues?

@rymurr rymurr force-pushed the ARROW-1692 branch 3 times, most recently from 14db89f to 0781a84 Compare June 22, 2020 15:11
@nealrichardson
Copy link
Member

Assuming CI is still passing, is this good to merge? What is left to do, or who needs to approve?

@emkornfield
Copy link
Contributor

It would be great if @jacques-n could confirm he is OK with these changes. I believe on a previous PR the extra indirection was a concern.

@jacques-n
Copy link
Contributor

jacques-n commented Jun 23, 2020

I'm really struggling with these changes. I don't understand why there is a validity buffer at the union level as well as at the cell level. I'm not sure what it even means that a a union value is valid but the inner value is not. I think there was an inconsistent understanding on my part when that change was proposed in the format (versus what the Java implementation was already doing). @wesm why would we have validity at both the top level and the inner level? Or is the C++ impl only at the top level and not the inner level (which means that this java code is incorrect since the inner vectors have their own validity vectors.

@jacques-n
Copy link
Contributor

Adding to my previous comments: if only at the top level, I'm not sure what the ramification of that would mean at the Java codebase. I think it would require a fairly massive refactoring.

@wesm
Copy link
Member

wesm commented Jun 23, 2020

@wesm why would we have validity at both the top level and the inner level?

Well, the way the specification is written

  • All nested types including union are composed from well-formed child arrays which may have null values.
  • Additionally, all array types, including all nested array types, have their own validity bitmap
  • In the case of union, a null at the top level would indicate that the type of the child is not known. This seems algebraically consistent to me.

We can decide to stipulate that union types never have non-valid values at the Union cell level, only at the child cell level. But then a union value cannot be "made null" by changing the validity bitmap of the Union. From a purely algebraic / design perspective it isn't great. From the get-go in the project I've been striving for algebraic consistency, e.g. enabling well-formed arrays to be composed to created composite types without alteration. Since Unions are one of the more seldom used part of the project we can decide to explicitly deviate from that, but we need to do it now if we're going to do that and not delay longer in reconciling the issue.

@wesm
Copy link
Member

wesm commented Jun 23, 2020

FTR I'm OK with dropping the top-level validity bitmap from Union, especially if it helps us move forward

@jacques-n
Copy link
Contributor

We can decide to stipulate that union types never have non-valid values at the Union cell level, only at the child cell level. But then a union value cannot be "made null" by changing the validity bitmap of the Union.

I believe a union can still express this with a child type null type, no? I think that is how we either modeled it or planned to model it no the java side.

All nested types including union are composed from well-formed child arrays which may have null values.

I'm in agreement on this. Decomposing would be complex.

In the case of union, a null at the top level would indicate that the type of the child is not known. This seems algebraically consistent to me.

I think it's where the model breaks down because of the weird situation where you actually need to evaluate two validity buffers to determine whether something is valid: the parent and the child. And an inconsistency would be really weird. As such, I'm think it would be better to avoid the top-level validity buffer.

FTR I'm OK with dropping the top-level validity bitmap from Union, especially if it helps us move forward

That would be my preference. It seems to ultimately reduce the risk of inconsistency and doesn't seem to have any functional loss (given the use of null type to indicate a non-alternatively-typed value). I also think this works well in the most common case of union types, e.g. two files where one has fieldA with schemaA and another where you have fieldA with schemaB. Compositing those two doesn't require some kind of introspection and AND'ing of the individual children to build an additional validity buffer (or simply setting true for all and then having an inconsistency with the child array) and allows a fast set of the type vector for each independent chunk.

@wesm
Copy link
Member

wesm commented Jun 24, 2020

That would be my preference.

I'm OK with this. We would need to act quickly to try to pull this off for the release. I can start a DISCUSS thread and work up a patch with the proposed changes to the specification

@rymurr
Copy link
Contributor Author

rymurr commented Jun 24, 2020

Thanks @wesm and @jacques-n for the review. I will leave this up until consensus is reached on the format change. Please let me know if I can help w/ the c++ patch, would be happy to contribute!

@wesm
Copy link
Member

wesm commented Jun 24, 2020

I sent an e-mail to dev@ -- let's discuss there

@liyafan82
Copy link
Contributor

liyafan82 commented Jun 30, 2020

In addition to the problem of top level validity buffer, I think there is another problem to discuss: Java is using the ordinal of the minor type as the type id, and this is not aligning with the C++ implementation/specification.

This PR works around the problem by supporting two modes. However, the C++ and Java implementations are not really aligning, IMO. Do we have a plan to mitigate this?

@emkornfield
Copy link
Contributor

@liyafan82 I think that is a good point. If it supports both modes I think that is a reasonable compromise for now as long as @jacques-n is OK with it. But we can maybe discuss further if needed.

@rymurr
Copy link
Contributor Author

rymurr commented Jun 30, 2020

I think this is now inline with the spec. The Union/DenseUnion types now uses logical type ids in Java. Which is the same as in c++. The difference in a java created Union is that the type ids are chosen from a known index (minor type). But the java implementation doesn't rely on minor type ordinals strict ordering like before. Or at least the intention was to be fully in line with spec while maintaining a convenient default choice for type id. Drawing any meaning from type buffers alone in client code would be an error in my opinion. Hope that clarifies my intention w/ this change.

@rymurr
Copy link
Contributor Author

rymurr commented Jul 1, 2020

This has been modified to incorporate the changes to Unions as proposed on the mailing list

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems NullVector throws UnsupportedOperationException in this case - should we be consistent with that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, yes. I copied UnionVector (which returns typeBuffer) but I didn't really like it. Fixed on both

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inconsistent with getNullCount; this vector has nullCount == 0 but can have isNull return true regardless. C++ seems to always return false when there is no validity buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with changing this to always return false. However I think this means anyone using a Union vector would have to extract the child and check for null as an extra step. All other nested vectors have a top level validity buffer. I don't know what the ramifications of changing this are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set isNull to always be false and all the tests seem to pass (with minor modifications). Take a look and let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think from the viewpoint of code generically manipulating vectors, this makes more sense - the entries of the union are never null, but they may reference something that is null. I agree it's less convenient when working with a union array explicitly though.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @liyafan82 do you have other comments?

@rymurr
Copy link
Contributor Author

rymurr commented Jul 10, 2020

Just rebased to include the memory related changes. Avoid any surprise broken master builds. @liyafan82 think its ready to go?

@lidavidm
Copy link
Member

Looks like unions fail in integration again.

@lidavidm
Copy link
Member

Seems like Java needs to be excluded from testing the 0.17.1 "golden" reference union file. Also, Java produces data that C++ can't read, but Java can read the C++ files; this fails the IPC tests with Java producing, and the Flight tests with both Java producing and consuming (since Flight tests upload and download).

@rymurr
Copy link
Contributor Author

rymurr commented Jul 10, 2020

Seems like Java needs to be excluded from testing the 0.17.1 "golden" reference union file. Also, Java produces data that C++ can't read, but Java can read the C++ files; this fails the IPC tests with Java producing, and the Flight tests with both Java producing and consuming (since Flight tests upload and download).

Wonder if its the MetadataVersion change that went in for c++ today? Looking now.

@rymurr
Copy link
Contributor Author

rymurr commented Jul 10, 2020

It is indeed the V4 patch that has been submitted on the c++ side and not on the Java side. I think this means we need to submit your patch first @lidavidm?

@lidavidm
Copy link
Member

Hmm - Java doesn't check the metadata version and the error makes it sound like it's layout-related? Oh wait. So now C++'s view of the union layout depends on the metadata version. Ok, then let's merge the Java V5 metadata change first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a minor hack to remove Java from 0.17.1 union gold tests. Not the most elegant way to do it but doing it correctly would be a non-trivial change to archery imho. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I think this is OK for now.

This fixes the integration tests for Union Arrays.

Both Dense and Sparse unions now work:

* validity buffer added to Sparse Union
* both Union types now track logical types rather than MinorType enum ordinal
@wesm
Copy link
Member

wesm commented Jul 11, 2020

Rebased since #7685 was merged

@wesm
Copy link
Member

wesm commented Jul 11, 2020

Tests pass so I'm merging. Hooray!

@wesm wesm closed this in d2ddaa6 Jul 11, 2020
@rymurr
Copy link
Contributor Author

rymurr commented Jul 12, 2020

Thanks for the rebate @wesm! Glad we have finished this one!

@rymurr rymurr deleted the ARROW-1692 branch July 13, 2020 14:16
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
This fixes the integration tests for Union Arrays.

Both Dense and Sparse unions now work:

* validity buffer added to Sparse Union
* both Union types now track logical types rather than MinorType enum ordinal

Dependent on apache#7289 for duplicate field names.

Closes apache#7290 from rymurr/ARROW-1692

Authored-by: Ryan Murray <rymurr@dremio.com>
Signed-off-by: Wes McKinney <wesm@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants