Skip to content

Conversation

@rymurr
Copy link
Contributor

@rymurr rymurr commented May 27, 2020

This addresses the integration test error where Java cannot process duplicate field names.

This extends StructVector and VectorSchemaRoot to support duplicate field names.

@github-actions
Copy link

@rymurr rymurr marked this pull request as draft May 28, 2020 13:23
@rymurr rymurr marked this pull request as ready for review May 28, 2020 13:41
@rymurr rymurr force-pushed the ARROW-8948 branch 2 times, most recently from 94a221b to d9dba4c Compare May 29, 2020 08:37
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took an initial pass. I'm not super familiar with the internals here so I can't comment on the approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken - this means the "correct" or "compatible" behavior is opt-in via a JVM flag? Are these flags clearly documented somewhere, I think we have a few others?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lidavidm, yes the 'compatible' behaviour is opt-in. This may be controversial but through a mix of opinionated, ignorant and lazy I don't understand the value of duplicate names in a struct. Given the scope of implementing such a change fully and the unintended consequences downstream I have opted to give the library user the option to be compatible with the c++ IPC or maintain backwards compatibility with Java. I am happy to hear what the community thinks, especially if this approach is seen as too heavy handed.

This flag wasn't document anywhere so I have added a note to the 'Java Properties' section in the Java README.md.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as long as it's easily controllable in code (so a single application can work with both behaviors) and well-documented, that should be OK. I dislike only having global flags, but that's not the case here. (And having global flags can be useful to tweak the behavior of an application that's otherwise agnostic to the default behavior.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does change the complexity - though presumably it's not an issue unless someone has tens of thousands of columns or is repeatedly fetching vectors in a tight loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this now matches with the Schema implementation. I think we should either:
i) deprecate the name based methods
ii) document that these methods are unsuitable for tight loops.
Any thoughts on which is more sensible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This no longer extends the Map<K,V> interface - is that an acceptable break?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Presumably this was for internal use only.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt it was acceptable to break from Map as it is internal and only used by Structs. I surmise that the intention was for this class to be a more generic utility and that need never materialised

java/README.md Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've been using the policy that java properties should also be settable via environment variable (I believe in some context environment variables are easier to set).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason no to change this to a Map<String, List>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to lose the type information. This still allows you to address Field("x", int) compared to Field("x", long).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you move the enum above this block, are you able to access the CONFLICT_REPLACE enum directly, to use it instead oa constant here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

…ion tests

This addresses the integration test error where Java cannot process duplicate field names.

This extends StructVector and VectorSchemaRoot to support duplicate field names.
@nealrichardson
Copy link
Member

@rymurr @lidavidm @emkornfield is this good to go now?

@rymurr
Copy link
Contributor Author

rymurr commented Jun 10, 2020

Thanks @nealrichardson , I am happy if @lidavidm and @emkornfield are!

@lidavidm
Copy link
Member

I am good with this.

sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Jun 11, 2020
…ion tests

This addresses the integration test error where Java cannot process duplicate field names.

This extends StructVector and VectorSchemaRoot to support duplicate field names.

Closes apache#7289 from rymurr/ARROW-8948

Authored-by: Ryan Murray <rymurr@dremio.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
@rymurr rymurr deleted the ARROW-8948 branch July 1, 2020 10:02
wesm pushed a commit that referenced this pull request Jul 11, 2020
This fixes the integration tests for Union Arrays.

Both Dense and Sparse unions now work:

* validity buffer added to Sparse Union
* both Union types now track logical types rather than MinorType enum ordinal

Dependent on #7289 for duplicate field names.

Closes #7290 from rymurr/ARROW-1692

Authored-by: Ryan Murray <rymurr@dremio.com>
Signed-off-by: Wes McKinney <wesm@apache.org>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…ion tests

This addresses the integration test error where Java cannot process duplicate field names.

This extends StructVector and VectorSchemaRoot to support duplicate field names.

Closes apache#7289 from rymurr/ARROW-8948

Authored-by: Ryan Murray <rymurr@dremio.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
This fixes the integration tests for Union Arrays.

Both Dense and Sparse unions now work:

* validity buffer added to Sparse Union
* both Union types now track logical types rather than MinorType enum ordinal

Dependent on apache#7289 for duplicate field names.

Closes apache#7290 from rymurr/ARROW-1692

Authored-by: Ryan Murray <rymurr@dremio.com>
Signed-off-by: Wes McKinney <wesm@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants