-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-725: [Formats/Java] FixedSizeList message and java implementation #452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @@ -47,6 +47,7 @@ private static void writeValue(FieldReader reader, FieldWriter writer) { | |||
| switch (mt) { | |||
|
|
|||
| case LIST: | |||
| case TUPLE_2: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some way to make the particular size not a special case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I could figure out... I think this goes back to the problem with the java minor types - I believe @julienledem was planning to remove them eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this partially or fully addressed in #409? Can you take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think #409 would fix it - I'll wait until that is merged and rebase
|
It's safe to rebase now -- I also renamed FixedWidthBinary to FixedSizeBinary for better conformity with this |
f6aa6f1 to
f42dac9
Compare
|
thanks, I've rebased |
| @Override | ||
| public ListWriter list(int size) { | ||
| fail("FixedSizeList"); | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's hold off adding the fixed width types to the writer APIs.
One of the goals of those APIs is to be able to ingest arbitrary data (for example JSON) and infer the schema on the fly (hence the PromotableWriter promoting vectors to union if the type is not homogenous). So it won't work well with fixed width list since nothing prevents having more than one size in this api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't add it to the apis, how would we write fixed size lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, arbitrary data should still work fine and default to regular lists - you have to specifically set up the fixed size list if you want to use it. If you know you have a fixed size list, you can use that optimization, otherwise you can just use regular lists that will work the same with slightly more overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- you can write to the vector directly through its mutator. The Writer api is a facility to write in a row-oriented fashion to the vectors.
- How does this interact with the PromotableWriter? what happens if we write both fixed lists and variable size lists to the same field?
| * @return | ||
| */ | ||
| abstract protected FieldWriter getWriter(MinorType type); | ||
| abstract protected FieldWriter getWriter(ArrowType type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 things:
- this kind of change should be in its own pull request.
- For performance reason we may not want to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only performance penalty is comparing an ArrowType instance instead of an enum compare - seems pretty minor.
Would you like me to split out into a separate PR?
MinorType is still used extensively in the writer framework - I think we ultimately need to pull it out since it doesn't capture the full arrow type. I noticed that the writers don't really handle decimal types, for that reason.
The other alternative is to create minor types for various fixed size lists - that would work for me, as I really only need 2-tuples at the moment. But it isn't especially flexible and @wesm asked for a more generic implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that we need a more flexible solution. A Minor type for every list size is very inconvenient.
Regarding perf: There are a few places were we convert between minorType and arrowType each way. We also use ArrowType.equals instead of == between MinorTypes in getWriter(ArrowType) which is called in startList() which is called for each list creation. Just saying that we need to actually measure the consequence of this. I'm not trying to make you do more work than necessary. Separating the refactoring from the new type and vector allows to get the later merged while we finalize the former.
|
@julienledem do you have any more comments, or what do I need to do to get this approved? |
julienledem
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good but we need to figure out the change to getWriter(ArrowType). Possibly we can have more than one method (getFixedWidthListWriter...) to avoid creating objects and calling equals. The UnionWriter can be limited to just arbitrary json writing type use case.
| @Override | ||
| public ListWriter list(int size) { | ||
| fail("FixedSizeList"); | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- you can write to the vector directly through its mutator. The Writer api is a facility to write in a row-oriented fashion to the vectors.
- How does this interact with the PromotableWriter? what happens if we write both fixed lists and variable size lists to the same field?
| } else if (!Objects.equals(type, this.type)) { | ||
| promoteToUnion(); | ||
| ((UnionWriter)writer).getWriter(type); | ||
| ((UnionWriter) writer).getWriter(Types.getMinorTypeForArrowType(type)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a discrepancy here. We compare ArrowType but then we create the writer based on MinorType which the two types may have the same even if they are not equal (Decimal or FixedWidthList for example).
It looks like if we start writing fixed width lists and then we write a different width we will promote but write the different width lists to the old fixed width vector. Possibly we should fail in this case.
| * @return | ||
| */ | ||
| abstract protected FieldWriter getWriter(MinorType type); | ||
| abstract protected FieldWriter getWriter(ArrowType type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that we need a more flexible solution. A Minor type for every list size is very inconvenient.
Regarding perf: There are a few places were we convert between minorType and arrowType each way. We also use ArrowType.equals instead of == between MinorTypes in getWriter(ArrowType) which is called in startList() which is called for each list creation. Just saying that we need to actually measure the consequence of this. I'm not trying to make you do more work than necessary. Separating the refactoring from the new type and vector allows to get the later merged while we finalize the former.
| @Override | ||
| public ListWriter list(int size) { | ||
| // prime the writer with the fixed size list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't really need to got Through ArrowType nor UnionType to call getWriter(...).
We could have a getFixedWidthList(size) in writer.
| protected FieldWriter getWriter(ArrowType type) { | ||
| if (state == State.UNION) { | ||
| ((UnionWriter)writer).getWriter(type); | ||
| ((UnionWriter) writer).getWriter(Types.getMinorTypeForArrowType(type)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this work for fixed width lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnionWriter doesn't work for Decimal it looks like (https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/UnionWriter.java#L124)
It's fine not to support FixedWidthList in union either. (at least for now)
|
|
||
| public FieldWriter createNewFieldWriter(ValueVector vector) { | ||
| MinorType minorType = Types.getMinorTypeForArrowType(type); | ||
| return minorType.getNewFieldWriter(vector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be done from the type in the vector?
|
@julienledem I will revert the writer changes and just add some unit tests that show how to use the mutators directly. |
|
reverted - ready for another review |
|
@julienledem @elahrvivaz I'll be happy to provide a C++ implementation whenever this goes in, then we can set up integration tests. Let me know |
julienledem
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@jacques-n ?
|
Thank you @elahrvivaz ! |
jacques-n
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there is one small spurious import reorder. Otherwise, lgtm.
| import java.util.Collections; | ||
| import java.util.List; | ||
|
|
||
| import com.google.common.collect.ImmutableList; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reordering on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's just how my ide (intellij) sorts imports alphabetically. I'll revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted
|
hold off merging - I need to add reAlloc to FixeSizeListVector now that #534 is merged |
|
Sounds good |
…using vector mutators
|
updated and rebased - should be good now |
|
+1. This is great, thanks @elahrvivaz! |
~Currently only added minor type for 2-tuples~ Author: Emilio Lahr-Vivaz <elahrvivaz@ccri.com> Closes apache#452 from elahrvivaz/ARROW-725 and squashes the following commits: b139d3d [Emilio Lahr-Vivaz] adding reAlloc to FixedSizeListVector 229e24a [Emilio Lahr-Vivaz] re-ordering imports 594c0a2 [Emilio Lahr-Vivaz] simplifying writing of list vectors through mutator 7cb2324 [Emilio Lahr-Vivaz] reverting writer changes, adding examples of writing fixed size list using vector mutators 756dc8a [Emilio Lahr-Vivaz] ARROW-725: [Formats/Java] FixedSizeList message and java implementation
Author: Deepak Majeti <deepak.majeti@hpe.com> Closes apache#452 from majetideepak/PARQUET-1265 and squashes the following commits: 2972197 [Deepak Majeti] fix initialization error ab07a47 [Deepak Majeti] Simplify construction of static variables b23d3b3 [Deepak Majeti] PARQUET-1265: Segfault on static ApplicationVersion initialization Change-Id: I012fc78dd90686309bc1f9feac446c9324e931f5
~Currently only added minor type for 2-tuples~ Author: Emilio Lahr-Vivaz <elahrvivaz@ccri.com> Closes apache#452 from elahrvivaz/ARROW-725 and squashes the following commits: b139d3d [Emilio Lahr-Vivaz] adding reAlloc to FixedSizeListVector 229e24a [Emilio Lahr-Vivaz] re-ordering imports 594c0a2 [Emilio Lahr-Vivaz] simplifying writing of list vectors through mutator 7cb2324 [Emilio Lahr-Vivaz] reverting writer changes, adding examples of writing fixed size list using vector mutators 756dc8a [Emilio Lahr-Vivaz] ARROW-725: [Formats/Java] FixedSizeList message and java implementation
Update latest Java version used by Github workflows to Java 23.
Currently only added minor type for 2-tuples