Skip to content

Conversation

@BryanCutler
Copy link
Member

This merges the two MapVector classes and removes the NonNullableMapVector class as a followup to #1341. There is only a nullable version of MapVector now.

@BryanCutler
Copy link
Member Author

This is a WIP, still need to make another pass through to clean up. Also, I'm not too sure about the purpose of SingleMapWriter, is it optimized for a single field or for the non-nullable MapVector? I'll have to check into it later
cc @icexelloss @siddharthteotia

Copy link
Member Author

@BryanCutler BryanCutler Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any point to have a SingleMapWriter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could not have this as final anymore because it depends on fieldType being set in the constructor. Previously it was set under the super class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check on this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now need to make sure the validity buffer is allocated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, it was in NonNullableMapVector so I brought it over. This doesn't seem to be used anywhere and overrides from AbstractContainerVector that returns false. @siddharthteotia do you know what this was used for and if it is still needed?

@icexelloss
Copy link
Contributor

LGTM at high level. @BryanCutler I don't know about SingleMapWriter either. Is that orthogonal or related to this change?

@BryanCutler
Copy link
Member Author

I don't know about SingleMapWriter either. Is that orthogonal or related to this change?

With this change SingleMapWriter is now the same same as NullableMapWriter so they can also be combined and SingleMapWriter can be removed. This is also the same for SingleMapReaderImpl.

@siddharthteotia and @jacques-n do you have any objections to removing the SingleMapWriter/Reader classes here as well?

@BryanCutler
Copy link
Member Author

I could also remove the codegen file MapWriters.java and just have one non codegen class MapWriter

@icexelloss
Copy link
Contributor

I could also remove the codegen file MapWriters.java and just have one non codegen class MapWriter

I think we probably still need to codegen the MapWriter class because it codegen methods for different types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this should be public? This class has getValueCount and setValueCount

Copy link
Member Author

@BryanCutler BryanCutler Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valueCount was public in NonNullableMapVector but I don't think it should be here so I made it private

@BryanCutler BryanCutler changed the title [WIP] ARROW-1866: [Java] Combine MapVector classes and remove NonNullableMapVector ARROW-1866: [Java] Combine MapVector classes and remove NonNullableMapVector Nov 29, 2017
@BryanCutler
Copy link
Member Author

Removing WIP, I think this should be ok to merge (pending tests) unless we decide to remove SingleMapWriter here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do something with the name here? i.e MapReaderImpl and MapWriterImpl ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the change, but I held off on renaming classes outside of the vectors until we have a consensus

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullableMapTransferPair -> MapTransferPair?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be ok to rename since its a protected inner class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this.mapVector?

@icexelloss
Copy link
Contributor

I looked at SingleMap reader and writer. I think the purpose of these two classes is to handle non nullable map vectors so we should probably remove them.

@BryanCutler it seems your PR make SingleMap and NullableMap reader writer identical, is that correct?

@siddharthteotia Does Dremio code uses SingleMap reader writer?

@siddharthteotia
Copy link
Contributor

siddharthteotia commented Nov 30, 2017

@BryanCutler , @icexelloss , yes Dremio uses SingleMapWriter/Reader. What is the concern here?

@siddharthteotia
Copy link
Contributor

siddharthteotia commented Nov 30, 2017

Let us hold off making any changes to NullableMapVector and MapVector. I think this PR should just remove the newly added NonNullableMapVector class in the hierarchy and bring the hierarchy to the state it was before.

Meanwhile we can assess the usage of both types of vectors in Dremio and then document the changes (if any) required to these 2 vectors.

I don't think there is any harm in keeping these 2 vectors around for the time being. They are not breaking anything.

@icexelloss
Copy link
Contributor

icexelloss commented Nov 30, 2017

Let us hold off making any changes to NullableMapVector and MapVector. I think this PR should just remove the newly added NonNullableMapVector class in the hierarchy and bring the hierarchy to the state it was before.
Meanwhile we can assess the usage of both types of vectors in Dremio and then document the changes (if any) required to these 2 vectors.
I don't think there is any harm in keeping these 2 vectors around for the time being. They are not breaking anything.

There are only two vectors currently - MapVector and NonNullableMapVector. @siddharthteotia what do you suggest we to do?

@siddharthteotia
Copy link
Contributor

My suggestion is to keep the old pair of NullableMapVector and MapVector as is and just remove the new NonNullable* class -- basically let's have the MapVector hierarchy back to what it was before with MapVector being the base class and NullableMapVector being the subclass.

@BryanCutler
Copy link
Member Author

@siddharthteotia just to be clear, the end result of this PR is to flatten the 2 map vectors classes into 1 MapVector and no functionality was added or removed. If you prefer we revert back to these vectors before #1341 to assess the impact of the change in Dremio, then I'm ok with that. Is it primarily a question of performance you would like to check? My thought is that it would be good to be consistent with the rest of the vectors that no longer have a Nullable prefix and get as much of the refactoring done as possible for the 0.8 release.

@BryanCutler
Copy link
Member Author

@siddharthteotia perhaps you can elaborate more on if combining the 2 MapVectors is something you would be ok with in the future and when you might be able to sign off on that? Also, are ok with the renaming to StructVector that was discussed for ARROW-1815?

@icexelloss
Copy link
Contributor

I thought we've reached consensus about this but maybe not. I have also put up https://docs.google.com/document/d/1n4qjO20wZyS7wSpISgYdIVuD22zstLgP-7gXxS5_7_E/edit?usp=sharing to help discuss vector naming so we can make progress.

@BryanCutler
Copy link
Member Author

Thanks for putting that together @icexelloss , it looks good but I think the question we need to answer first is if we plan to combine MapVector and NonNullableMapVector and if not then what to name these 2 classes? @siddharthteotia would like to revert back to the previous names of NullableMapVector and MapVector, what are your thoughts on that?

@icexelloss
Copy link
Contributor

I still don't see why we need to keep nullable and non nullable version of MapVector. I am also waiting on @siddharthteotia for his comment.

@siddharthteotia
Copy link
Contributor

siddharthteotia commented Nov 30, 2017

My suggestion is to keep the old pair of NullableMapVector and MapVector as is and just remove the new NonNullable* class -- basically let's have the MapVector hierarchy back to what it was before with MapVector being the base class and NullableMapVector being the subclass.

Since the implications of combining them are to the reader and writer as well and @BryanCutler proposed to remove them, I would like to assess their usage in Dremio. For example, we have a data structure writer called VectorContainerWriter which has a MapVector inside and a SingleMapWriter to populate the data structure.

I am just asking to keep these 2 vectors around as is for the time being. Later on we can combine them and decide what to do with reader/writer.

@BryanCutler
Copy link
Member Author

I did not remove anything outside of the vector hierarchy, I just brought up that we could combine the readers and writers as well. However, doing this is out of scope for what we have discussed so I only did minimal changes required to get things working. Does that alleviate some of your concern?

@icexelloss
Copy link
Contributor

icexelloss commented Nov 30, 2017

Seems there are some code in Dremio that is going to be affected by removing non nullable MapVector and SingleMapWriter but not too many: (Hopefully this is up to date)

https://github.com/dremio/dremio-oss/search?p=2&q=MapVector&type=&utf8=%E2%9C%93
https://github.com/dremio/dremio-oss/search?utf8=%E2%9C%93&q=SingleMapWriter&type=

I don't know what's the best way to proceed. Maybe we can give it a couple of days for @siddharthteotia to evaluate if removing non nullable vectors is OK for Dremio.

Any thoughts?

@BryanCutler
Copy link
Member Author

BryanCutler commented Nov 30, 2017 via email

@icexelloss
Copy link
Contributor

@BryanCutler yeah I see your point but @siddharthteotia seems to be saying Dremio uses non-nullable MapVector as well so he wants to take a look.

@BryanCutler
Copy link
Member Author

Yes, I just wanted to be clear that removing the non-nullable MapVector was the only refactoring done here. Hopefully that can help assess the impact.

@wesm
Copy link
Member

wesm commented Dec 1, 2017

N.B. Dremio is open source, so others are also free to look at this code: https://github.com/dremio/dremio-oss -- I have been asking for someone to set up integration tests to be able to run Dremio's tests against Arrow master, that would be hugely helpful in situations like these. There might be some unpublished changes to Dremio trunk, though.

@wesm wesm force-pushed the java-combine-MapVectors-ARROW-1866 branch from 0988161 to 59647f9 Compare December 1, 2017 16:55
@icexelloss
Copy link
Contributor

+1.

I also feel the issue is also we (Bryan and I) don't know enough about Dremio codebase to assess the impact of certain changes to Dremio such as removing non nullable map vectors. I also feel maybe @siddharthteotia is too resource bound on keeping up with master refactoring PRs. I am willing to help make downstream changes to Dremio just as we make downstream changes to Spark but I don't know if https://github.com/dremio/dremio-oss is update to date.

@BryanCutler
Copy link
Member Author

BryanCutler commented Dec 1, 2017

I'm also willing to make downstream changes to Dremio if that would help. I think the underlying issue here is that I (and I believe Li too) was under the assumption that removing the non-nullable MapVector was part of the agenda for the Java refactoring, while @siddharthteotia wants keep as they were before ARROW-1710. Here are the options we have, as I see it, to proceed:

  1. Keep this PR open and assess the impact on downstream projects. If all checks out, then hopefully we can get this in before 0.8.0 is cut.

  2. Close this PR for now. Since we are planning to change the name of MapVector (this is still part of the refactoring plan for 0.8.0?) then start with that and rename to NullableStructVector and StructVector or some variant.

  3. Close this PR for now, change the names back to NullableMapVector and MapVector, hold off on renaming to StructVector, and put these in a future plan to discuss later.

Of course I prefer (1) but I understand if constraints prevent this, it would just be nice to get confirmation if that is the case and if there is disagreement with doing (2).

@siddharthteotia
Copy link
Contributor

siddharthteotia commented Dec 1, 2017

My suggestion is to go with 3rd option here #1371 (comment)
Same as what I mentioned here #1371 (comment)

@jacques-n
Copy link
Contributor

What do people think of?

  • Keep the two level hierarchy
  • Rename NonNullableMapVector > StructVector
  • Mark existing MapVector as deprecated

@BryanCutler, I think the SingleMapWriter has more differences than Nullability. I think it is specifically used to treat the top of a record as a pseudo-map where needed.

We do some fairly complex runtime code generation associated with this structure in Dremio. It relates to these pieces of code [1][2][3] (among others) and it will take some time to really remember & understand the impact of what you're proposing to provide good feedback.

[1] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/org/apache/arrow/vector/complex/FieldIdUtil2.java
[2] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/expr/EvaluationVisitor.java#L341
[3] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/expr/EvaluationVisitor.java#L405

@wesm
Copy link
Member

wesm commented Dec 3, 2017

In

Rename NonNullableMapVector > StructVector

Structs are nullable in general, so the StructVector needs to accommodate null structs

@BryanCutler
Copy link
Member Author

Thanks for the details @jacques-n, I am fine with keeping the 2 level hierarchy for now and it seems like we need some more discussion to move forward with StructType naming. I'll proceed with option (3) and restore the previous MapVector class names.

@BryanCutler BryanCutler closed this Dec 4, 2017
@BryanCutler BryanCutler deleted the java-combine-MapVectors-ARROW-1866 branch November 19, 2018 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants