-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8948: [Java][Integration] enable duplicate field names integration tests #7289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
94a221b to
d9dba4c
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took an initial pass. I'm not super familiar with the internals here so I can't comment on the approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not mistaken - this means the "correct" or "compatible" behavior is opt-in via a JVM flag? Are these flags clearly documented somewhere, I think we have a few others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @lidavidm, yes the 'compatible' behaviour is opt-in. This may be controversial but through a mix of opinionated, ignorant and lazy I don't understand the value of duplicate names in a struct. Given the scope of implementing such a change fully and the unintended consequences downstream I have opted to give the library user the option to be compatible with the c++ IPC or maintain backwards compatibility with Java. I am happy to hear what the community thinks, especially if this approach is seen as too heavy handed.
This flag wasn't document anywhere so I have added a note to the 'Java Properties' section in the Java README.md.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think as long as it's easily controllable in code (so a single application can work with both behaviors) and well-documented, that should be OK. I dislike only having global flags, but that's not the case here. (And having global flags can be useful to tweak the behavior of an application that's otherwise agnostic to the default behavior.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does change the complexity - though presumably it's not an issue unless someone has tens of thousands of columns or is repeatedly fetching vectors in a tight loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this now matches with the Schema implementation. I think we should either:
i) deprecate the name based methods
ii) document that these methods are unsuitable for tight loops.
Any thoughts on which is more sensible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This no longer extends the Map<K,V> interface - is that an acceptable break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Presumably this was for internal use only.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt it was acceptable to break from Map as it is internal and only used by Structs. I surmise that the intention was for this class to be a more generic utility and that need never materialised
java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java
Outdated
Show resolved
Hide resolved
java/README.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we've been using the policy that java properties should also be settable via environment variable (I believe in some context environment variables are easier to set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason no to change this to a Map<String, List>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to lose the type information. This still allows you to address Field("x", int) compared to Field("x", long).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you move the enum above this block, are you able to access the CONFLICT_REPLACE enum directly, to use it instead oa constant here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
…ion tests This addresses the integration test error where Java cannot process duplicate field names. This extends StructVector and VectorSchemaRoot to support duplicate field names.
|
@rymurr @lidavidm @emkornfield is this good to go now? |
|
Thanks @nealrichardson , I am happy if @lidavidm and @emkornfield are! |
|
I am good with this. |
…ion tests This addresses the integration test error where Java cannot process duplicate field names. This extends StructVector and VectorSchemaRoot to support duplicate field names. Closes apache#7289 from rymurr/ARROW-8948 Authored-by: Ryan Murray <rymurr@dremio.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This fixes the integration tests for Union Arrays. Both Dense and Sparse unions now work: * validity buffer added to Sparse Union * both Union types now track logical types rather than MinorType enum ordinal Dependent on #7289 for duplicate field names. Closes #7290 from rymurr/ARROW-1692 Authored-by: Ryan Murray <rymurr@dremio.com> Signed-off-by: Wes McKinney <wesm@apache.org>
…ion tests This addresses the integration test error where Java cannot process duplicate field names. This extends StructVector and VectorSchemaRoot to support duplicate field names. Closes apache#7289 from rymurr/ARROW-8948 Authored-by: Ryan Murray <rymurr@dremio.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This fixes the integration tests for Union Arrays. Both Dense and Sparse unions now work: * validity buffer added to Sparse Union * both Union types now track logical types rather than MinorType enum ordinal Dependent on apache#7289 for duplicate field names. Closes apache#7290 from rymurr/ARROW-1692 Authored-by: Ryan Murray <rymurr@dremio.com> Signed-off-by: Wes McKinney <wesm@apache.org>
This addresses the integration test error where Java cannot process duplicate field names.
This extends StructVector and VectorSchemaRoot to support duplicate field names.