ARROW-1047: [Java] Add Generic Reader Interface for Stream Format #1259
Conversation
I'm proposing some re-organization of the packages for reading/writing to hopefully better group related classes. Here is what I'm proposing for where most of the files would fall (this also tries to follow some of the arrow-cpp structure):

I think this should be mostly orthogonal to the Java vector refactoring that is going on now. That should take priority, but it would be great if we could get this in for 0.8 if possible.

@wesm @icexelloss @siddharthteotia what are your thoughts on this?

@BryanCutler at a high level this sounds great to me. cc @nongli also to take a look.
Sometimes it's useful to be able to just read the schema out of a message, without loading up any dictionaries or record batches. Is there a way to preserve that functionality somehow?
Yeah, we could still do that. I think it just comes down to either reading the dictionaries after the schema, or reading them before the first data batch. I thought it made a little more sense to read them with the schema; otherwise the user could create the reader, load the schema, and try to decode with it but fail.
Would it work for you to overload ArrowReader.readSchema so that it can return the original schema before loading the dictionaries? Similarly, if using the stream format, you could make a subclass of MessageReader (introduced here) and react after reading a schema message. If not, I'm ok with reading them before the data batches and documenting for the user that you can't decode until batches are read.
Yeah, an overloaded method would be fine. I agree that having to load a batch before reading dictionaries is a bit confusing for the general use case.
At a high level, @BryanCutler, what do you feel about having

I'm not sure; all of the current messages are geared towards vectors, so it makes sense to keep it there. Are you thinking of possible messages in the future that might not be vector related?

Longer term, I kind of think we can improve the current package hierarchy, where all of the API is under the namespace
What do you feel about getting rid of the "file" and "stream" sub-namespaces, i.e.
`org.apache.arrow.vector.ipc.ArrowFileWriter`
`org.apache.arrow.vector.ipc.ArrowStreamWriter`
These two namespaces, file and stream, are not very complicated; they can probably be combined.
@BryanCutler This looks great! What do people feel about having fewer sub-namespaces? Original vs. fewer sub-namespaces:

Also maybe

Backward-compatibility-wise, I think we should probably make this change along with the vector changes in one Arrow release?
(Force-pushed from 907e348 to 0e07e28.)
Thanks @elahrvivaz, @icexelloss and @wesm!
I sort of prefer having separate packages for the different readers/writers. There are some supporting files that are specific to certain formats, like

+1 for me on renaming this

IMO, I like the current package layout with file, stream, json, message.

Sounds good to me.
Add `@Override`?
Maybe add a bit of doc about what these methods are supposed to do? It's not very clear how to use readNextMessage and readMessageBody.
Yeah, I meant to say that I still need to go through these changes and make sure everything is documented properly.
This method seems closer to reading a schema than deserializing one;
public static Schema deserializeSchema(Message message)
seems to make more sense to me.
Maybe this method could be made into:
public static Schema readSchema(MessageReader reader) {
    Message message = reader.readNextMessage();
    return deserializeSchema(message);
}
@BryanCutler, what do you think?
I think it's ok to include reading the message as part of deserialization, and some messages also require reading another chunk after the message. I do think the behavior of these functions could be made more consistent, but we should probably do that as a follow-up.
Ok, agreed this can be a follow-up.
The word "Batch" in the function name is a bit unintuitive. I kind of feel "Message" is a better term than "MessageBatch".
Should we maybe rename this to deserializeMessage?
Also, this method doesn't seem to exclude the schema message explicitly, which also feels a bit weird.
This method won't read any generic message; it only works with RecordBatches or DictionaryBatches, hence the name.
In the streaming format, the first message after the schema could be either a record batch or a dictionary batch; this method handles either case.
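The either/or handling described here can be pictured with a small sketch. The enum, class, and method names below are illustrative stand-ins, not the actual Arrow API:

```java
// Illustrative sketch (not the real Arrow API): after the schema, the next
// message in a stream may be either a record batch or a dictionary batch,
// so a batch-reading method dispatches on the message header type and
// rejects anything else.
public class BatchDispatch {
    enum HeaderType { SCHEMA, RECORD_BATCH, DICTIONARY_BATCH }

    static String deserializeMessageBatch(HeaderType header) {
        switch (header) {
            case RECORD_BATCH:
                return "decoded record batch";
            case DICTIONARY_BATCH:
                return "decoded dictionary batch";
            default:
                // A schema (or any other message) is not a batch; fail loudly,
                // which is why "Batch" appears in the method name.
                throw new IllegalArgumentException("expected a batch, got " + header);
        }
    }

    public static void main(String[] args) {
        System.out.println(deserializeMessageBatch(HeaderType.RECORD_BATCH));
        System.out.println(deserializeMessageBatch(HeaderType.DICTIONARY_BATCH));
    }
}
```

Either batch kind is accepted at that position in the stream; anything else is an error.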
Yeah, I think it's ok as is, but this seems to be used only in a test. How about we do a follow-up PR to refine these functions and discuss there?
Ok.
One thing I am not sure about is whether this patch will make java-refactor-branch hard to merge; cc @siddharthteotia for comment. Maybe we should keep all refactor changes in java-refactor-branch to make it easier to merge? Not sure though.

Yes, I am concerned that this will make patches in the java-vector-refactor branch hard to merge into master. Secondly, the nature of the changes suggests that we should be testing this with Dremio as well. I would have loved to offer help, but I am in the process of moving Dremio to the new code in the java-vector-refactor branch. I would prefer to have these changes merged after the java-vector-refactor changes are merged into master.

@siddharthteotia the Java refactoring is the priority right now, so I don't want to hinder that, but I would like to get this in for the 0.8 release if possible. I think the changes to ArrowReader should be mostly transparent, although there might be conflicts with some of the tests I had to change. What about if I try to cherry-pick this into the java-vector-refactor branch? If it doesn't go in cleanly, then we can put it on hold until the refactor branch is merged.

@BryanCutler, are you suggesting cherry-picking your changes into the refactor branch and reverting the commit in case things don't look good? I am not entirely sure what the best option is here, but I believe that adding an orthogonal set of changes to the java-vector-refactor branch at this point may not be a good idea. However, I don't want to block other work, so feel free to proceed based on your best judgement. Note that there are currently two patches in that branch. While making changes in Dremio and debugging test failures, I had to go back and make some changes in the vector code (minor only, no redesign). Currently those additional changes are in Dremio's fork (as I wanted to make quick progress), and I will put up a PR against the java-vector-refactor branch for the third patch very soon; better to do it last, when testing with Dremio completes.

@siddharthteotia that's fair enough, I don't want to complicate the refactoring. I mostly just want to make sure that these changes don't make it harder to merge the java-vector-refactor branch into master. I can try that out locally and report back.

@siddharthteotia is this something you would like to run with the Dremio suite of tests before merging?

@BryanCutler, I would like to test this with Dremio, but I am not sure how quickly I will be able to do that and get back to you after making the necessary changes in Dremio and doing proper testing. Part of me says you should go ahead and merge this, since you were already waiting for the refactor work to get done before this. Since we will have to rebase on Arrow master anyway after the ongoing timestamp vector related changes in #1330, we can take care of testing this out with Dremio at that time, unless @jacques-n thinks otherwise?
I would make this ipc.ArrowStreamReader but not ipc.stream
Do you think the same for the file and json readers, e.g. ipc.ArrowFileReader? I made these subpackages because there were some supporting files specific to just the file reader, so they could be grouped together. But I'm ok either way; @icexelloss brought this up here: #1259 (comment)
These classes are all quite similar (the file format is very nearly the stream format, plus a file footer and magic numbers at the start and end), so I think it would make sense to keep them in a flat package namespace (but I'm not a Java expert).
@icexelloss do you have an opinion on this? Would be good to get this patch in soon to facilitate testing
I prefer ipc.ArrowStreamReader to ipc.stream.ArrowStreamReader
Sure, I'm fine with this. I'll change it now
@siddharthteotia whatever is easier for this, but I would like to hear that I didn't break anything on your side :) It's pretty easy to rebase this, so no need to rush.

This looks good to me. Once the package-name hierarchy is settled, I think this should be good to go.

LGTM. +1
Change-Id: I7a59a24bd54339cd637ace36e991bc062ba1d4e1
(Force-pushed from 5b998e4 to 43314ca.)
Squashed and rebased so we can get a passing build. While we are waiting, do we also want the

I do not have a strong feeling either; I think

Reviewing the past comments: since these classes are generally internal, I think it's fine. master is broken right now (ARROW-1845), so I will merge this.

I'd like to keep the

Thanks @wesm @icexelloss @elahrvivaz and @siddharthteotia!
This change decouples the reading of messages from the ReadChannel so that it is possible to build a reader that is not tied to a specific stream. This adds a new interface `MessageReader` that will return a `Message` and the message body. The `MessageChannelReader` implements this interface to read from a `ReadChannel` to match the current functionality. A re-org of reading and writing packages is also done to better organize classes under an `ipc` package.

There is a slight change in behavior for the `ArrowFileReader` that should not affect any usage. Previously, the schema was read during initialization, and all dictionaries were read just before the first record batch. Now, all dictionaries are read directly after the schema, and not specifically tied to reading the first record batch.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1259 from BryanCutler/java-generic-stream-interfaces-ARROW-1047 and squashes the following commits:

43314ca [Bryan Cutler] ARROW-1047: [Java] Add Generic Reader Interface for Stream Format
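The decoupling in the commit message above can be sketched in miniature. This is an illustrative sketch only, with toy framing and simplified types, not the actual Arrow `MessageReader`/`MessageChannelReader` API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch (not the real Arrow API): a MessageReader abstraction
// yields messages without being tied to a specific stream, and one
// stream-backed implementation recovers the old channel-reading behavior.
public class MessageReaderSketch {
    /** Stand-in for Arrow's Message metadata. */
    static class Message {
        final byte[] metadata;
        Message(byte[] metadata) { this.metadata = metadata; }
    }

    interface MessageReader {
        /** Return the next message, or null when the stream is exhausted. */
        Message readNextMessage() throws IOException;

        /** Read the body bytes that follow the current message. */
        byte[] readMessageBody(int length) throws IOException;
    }

    /** Stream-backed implementation, analogous in spirit to MessageChannelReader. */
    static class StreamMessageReader implements MessageReader {
        private final InputStream in;

        StreamMessageReader(InputStream in) { this.in = in; }

        @Override
        public Message readNextMessage() throws IOException {
            int len = in.read();       // toy framing: a 1-byte length prefix
            if (len < 0) return null;  // end of stream
            return new Message(in.readNBytes(len));
        }

        @Override
        public byte[] readMessageBody(int length) throws IOException {
            return in.readNBytes(length);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two framed "messages": [2, 'h', 'i'] and [1, '!'].
        byte[] data = {2, 'h', 'i', 1, '!'};
        MessageReader reader = new StreamMessageReader(new ByteArrayInputStream(data));
        System.out.println(new String(reader.readNextMessage().metadata)); // hi
        System.out.println(new String(reader.readNextMessage().metadata)); // !
        System.out.println(reader.readNextMessage() == null);              // true
    }
}
```

Because callers depend only on the interface, a non-stream implementation (for example, one reading pre-parsed messages from memory) can be swapped in without changing the reader code above it.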