ARROW-1047: [Java] Add Generic Reader Interface for Stream Format #1259
Conversation
I'm proposing some re-organization of the packages for reading/writing to hopefully better group related classes. Here is what I'm proposing for where most of the files would fall (this also tries to follow some of the arrow-cpp structure):

I think this should be mostly orthogonal to the Java vector refactoring that is going on now. That should take priority, but it would be great if we could get this in for 0.8 if possible.

@wesm @icexelloss @siddharthteotia what are your thoughts on this?

@BryanCutler at a high level this sounds great to me. cc @nongli also to take a look.
Sometimes it's useful to be able to just read the schema out of a message, without loading up any dictionaries or record batches. Is there a way to preserve that functionality somehow?
Yeah, we could still do that. I think it just comes down to either reading the dictionaries after the schema, or reading them before the first data batch. I thought it made a little more sense to read them with the schema; otherwise the user could create the reader, load the schema, and try to decode with it but fail.
Would it work for you to overload ArrowReader.readSchema so that it can return the original schema before loading the dictionaries? Similarly, if using the stream format, you could make a subclass of MessageReader (introduced here) and react after reading a schema message. If not, I'm ok with reading them before the data batches and documenting for the user that you can't decode until batches are read.
Yeah, an overloaded method would be fine. I agree that having to load a batch before reading dictionaries is a bit confusing for the general use case.
At a high level, @BryanCutler, what do you feel about having

I'm not sure; all of the current messages are geared towards vectors, so it makes sense to keep it there. Are you thinking of possible messages in the future that might not be vector related?

Longer term, I kind of think we can improve the current package hierarchy, where all of the API is under the namespace
What do you feel about getting rid of the "file" and "stream" sub-namespaces, i.e.
`org.apache.arrow.vector.ipc.ArrowFileWriter`
`org.apache.arrow.vector.ipc.ArrowStreamWriter`
These two namespaces, file and stream, are not very complicated; they can probably be combined.
@BryanCutler This looks great! What do people feel about having fewer sub-namespaces? Original vs. fewer sub-namespaces:

Also maybe

Backward-compatibility-wise, I think we should probably make this change along with the vector changes in one Arrow release?
(Force-pushed from 907e348 to 0e07e28.)
Thanks @elahrvivaz, @icexelloss and @wesm!
I sort of prefer having separate packages for the different readers/writers. There are some supporting files that are specific to certain formats, like

+1 for me on renaming this

IMO, I like the current package layout with file, stream, json, message.

Sounds good to me.
Add `@Override`?
Maybe add a bit of doc about what these methods are supposed to do? It's not very clear how to use readNextMessage and readMessageBody.
Yeah, I meant to say that I still need to go through these changes and make sure everything is documented properly.
This method seems closer to reading a schema than deserializing one;
public static Schema deserializeSchema(Message message)
seems to make more sense to me.
Maybe this method could be made into:
public static Schema readSchema(MessageReader reader) {
    Message message = reader.readNextMessage();
    return deserializeSchema(message);
}
@BryanCutler, what do you think?
I think it's ok to include reading the message as part of deserialization, and some messages also require reading another chunk after the message. I do think the behavior of these functions could be made more consistent, but we should probably do that as a follow-up.
Ok, agreed this can be a follow-up.
The word "Batch" in the function name is a bit unintuitive. I kind of feel "Message" is a better term than "MessageBatch".
Should we maybe rename this to deserializeMessage?
Also, this method doesn't seem to exclude the schema message explicitly, which also feels a bit weird.
This method won't read any generic message; it only works with RecordBatches or DictionaryBatches, hence the name.
In the streaming format, the first message after the schema could be either a record batch or a dictionary batch; this method handles either case.
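The either/or handling described here can be pictured with a small sketch. The enum, class, and method names below are illustrative stand-ins, not the actual Arrow API:

```java
// Illustrative sketch (not the real Arrow API): after the schema, the next
// message in a stream may be either a record batch or a dictionary batch,
// so a batch-reading method dispatches on the message header type and
// rejects anything else.
public class BatchDispatch {
    enum HeaderType { SCHEMA, RECORD_BATCH, DICTIONARY_BATCH }

    static String deserializeMessageBatch(HeaderType header) {
        switch (header) {
            case RECORD_BATCH:
                return "decoded record batch";
            case DICTIONARY_BATCH:
                return "decoded dictionary batch";
            default:
                // A schema (or any other message) is not a batch; fail loudly,
                // which is why "Batch" appears in the method name.
                throw new IllegalArgumentException("expected a batch, got " + header);
        }
    }

    public static void main(String[] args) {
        System.out.println(deserializeMessageBatch(HeaderType.RECORD_BATCH));
        System.out.println(deserializeMessageBatch(HeaderType.DICTIONARY_BATCH));
    }
}
```

Either batch kind is accepted at that position in the stream; anything else is an error.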
Yeah, I think it's ok as is, but this seems to be used only in a test. How about we do a follow-up PR to refine these functions and discuss there?
Ok.
One thing I am not sure about is whether this patch will make java-refactor-branch hard to merge; cc @siddharthteotia for comment. Maybe we should keep all refactor changes in java-refactor-branch to make it easier to merge? Not sure though.

Yes, I am concerned that this will make patches in the java-vector-refactor branch hard to merge into master. Secondly, the nature of the changes suggests that we should be testing this with Dremio as well. I would have loved to offer help, but I am in the process of moving Dremio to the new code in the java-vector-refactor branch. I would prefer to have these changes merged after the java-vector-refactor changes are merged into master.

@siddharthteotia the Java refactoring is the priority right now, so I don't want to hinder that, but I would like to get this in for the 0.8 release if possible. I think the changes to ArrowReader should be mostly transparent, although there might be conflicts with some of the tests I had to change. What about if I try to cherry-pick this into the java-vector-refactor branch? If it doesn't go in cleanly, then we can put it on hold until the refactor branch is merged.

@BryanCutler, are you suggesting cherry-picking your changes into the refactor branch and reverting the commit in case things don't look good? I am not entirely sure what the best option is here, but I believe that adding an orthogonal set of changes to the java-vector-refactor branch at this point may not be a good idea. However, I don't want to block other work, so feel free to proceed based on your best judgement. Note that there are currently two patches in that branch. While making changes in Dremio and debugging test failures, I had to go back and make some changes in the vector code (minor only, no redesign). Currently those additional changes are in Dremio's fork (as I wanted to make quick progress), and I will put up a PR against the java-vector-refactor branch for the third patch very soon; better to do it last, when testing with Dremio completes.

@siddharthteotia that's fair enough, I don't want to complicate the refactoring. I mostly just want to make sure that these changes don't make it harder to merge the java-vector-refactor branch into master. I can try that out locally and report back.

@siddharthteotia is this something you would like to run with the Dremio suite of tests before merging?

@BryanCutler, I would like to test this with Dremio, but I am not sure how quickly I will be able to do that and get back to you after making the necessary changes in Dremio and doing proper testing. Part of me says you should go ahead and merge this, since you were already waiting for the refactor work to get done before this. Since we will have to rebase on Arrow master anyway after the ongoing timestamp vector related changes in #1330, we can take care of testing this out with Dremio at that time, unless @jacques-n thinks otherwise?
I would make this ipc.ArrowStreamReader but not ipc.stream
Do you think the same for the file and json readers, e.g. ipc.ArrowFileReader? I made these subpackages because there were some supporting files specific to just the file reader, so they could be grouped together. But I'm ok either way; @icexelloss brought this up here: #1259 (comment)
These classes are all quite similar (the file format is very nearly the stream format, plus a file footer and magic numbers at the start and end), so I think it would make sense to keep them in a flat package namespace (but I'm not a Java expert).
@icexelloss do you have an opinion on this? Would be good to get this patch in soon to facilitate testing
I prefer ipc.ArrowStreamReader to ipc.stream.ArrowStreamReader
Sure, I'm fine with this. I'll change it now
@siddharthteotia whatever is easier for this, but I would like to hear that I didn't break anything on your side :) It's pretty easy to rebase this, so no need to rush.

This looks good to me. Once the package-name hierarchy is settled, I think this should be good to go.

LGTM. +1
Change-Id: I7a59a24bd54339cd637ace36e991bc062ba1d4e1
(Force-pushed from 5b998e4 to 43314ca.)
Squashed and rebased so we can get a passing build. While we are waiting, do we also want the

I do not have a strong feeling either; I think

Reviewing the past comments: since these classes are generally internal, I think it's fine. master is broken right now (ARROW-1845), so I will merge this.

I'd like to keep the

Thanks @wesm @icexelloss @elahrvivaz and @siddharthteotia!
This change decouples the reading of messages from the ReadChannel so that it is possible to build a reader that is not tied to a specific stream. This adds a new interface `MessageReader` that will return a `Message` and the message body. The `MessageChannelReader` implements this interface to read from a `ReadChannel` to match the current functionality. A re-org of reading and writing packages is also done to better organize classes under an `ipc` package.

There is a slight change in behavior for the `ArrowFileReader` that should not affect any usage. Previously, the schema was read during initialization, and all dictionaries were read just before the first record batch. Now, all dictionaries are read directly after the schema, and not specifically tied to reading the first record batch.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1259 from BryanCutler/java-generic-stream-interfaces-ARROW-1047 and squashes the following commits:

43314ca [Bryan Cutler] ARROW-1047: [Java] Add Generic Reader Interface for Stream Format
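The decoupling in the commit message above can be sketched in miniature. This is an illustrative sketch only, with toy framing and simplified types, not the actual Arrow `MessageReader`/`MessageChannelReader` API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch (not the real Arrow API): a MessageReader abstraction
// yields messages without being tied to a specific stream, and one
// stream-backed implementation recovers the old channel-reading behavior.
public class MessageReaderSketch {
    /** Stand-in for Arrow's Message metadata. */
    static class Message {
        final byte[] metadata;
        Message(byte[] metadata) { this.metadata = metadata; }
    }

    interface MessageReader {
        /** Return the next message, or null when the stream is exhausted. */
        Message readNextMessage() throws IOException;

        /** Read the body bytes that follow the current message. */
        byte[] readMessageBody(int length) throws IOException;
    }

    /** Stream-backed implementation, analogous in spirit to MessageChannelReader. */
    static class StreamMessageReader implements MessageReader {
        private final InputStream in;

        StreamMessageReader(InputStream in) { this.in = in; }

        @Override
        public Message readNextMessage() throws IOException {
            int len = in.read();       // toy framing: a 1-byte length prefix
            if (len < 0) return null;  // end of stream
            return new Message(in.readNBytes(len));
        }

        @Override
        public byte[] readMessageBody(int length) throws IOException {
            return in.readNBytes(length);
        }
    }

    public static void main(String[] args) throws IOException {
        // Two framed "messages": [2, 'h', 'i'] and [1, '!'].
        byte[] data = {2, 'h', 'i', 1, '!'};
        MessageReader reader = new StreamMessageReader(new ByteArrayInputStream(data));
        System.out.println(new String(reader.readNextMessage().metadata)); // hi
        System.out.println(new String(reader.readNextMessage().metadata)); // !
        System.out.println(reader.readNextMessage() == null);              // true
    }
}
```

Because callers depend only on the interface, a non-stream implementation (for example, one reading pre-parsed messages from memory) can be swapped in without changing the reader code above it.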