Skip to content

Conversation

@BryanCutler
Copy link
Member

This change decouples the reading of messages from the ReadChannel so that it is possible to build a reader that is not tied to a specific stream. This adds a new interface MessageReader that will return a Message and the message body. The MessageChannelReader implements this interface to read from a ReadChannel to match the current functionality. A re-org of reading and writing packages is also done to better organize classes under an ipc package.

There is a slight change in behavior for the ArrowFileReader that should not affect any usage. Previously, the schema was read during initialization, and all dictionaries read just before the first record batch was read. Now, all dictionaries are read directly after the schema, and not specifically tied to reading the first record batch.

@BryanCutler
Copy link
Member Author

BryanCutler commented Oct 27, 2017

I'm proposing some re-organization of the packages for reading/writing to hopefully better group related classes. Here is what I'm proposing where most of the files would fall (this also tries to follow some of the arrow-cpp structure):

o.a.a.vector.ipc
      file
          ArrowFileReader
          ArrowFileWriter
          ArrowMagic
      stream
          ArrowStreamReader
          ArrowStreamWriter
      json
          JsonFileReader
          JsonFileWriter
      message
          ArrowBlock
          ArrowFooter
          ArrowMessage
          ArrowRecordBatch
          ArrowDictionaryBatch
          FBSerializable
          FBSerializables          
          MessageSerializer
      ArrowReader
      ArrowWriter
      ReadChannel
      WriteChannel

@BryanCutler
Copy link
Member Author

I think this should mostly be orthogonal to the Java vector refactoring that is going on now. That should take priority, but it would be great if we could get this in for 0.8 if possible.

@BryanCutler
Copy link
Member Author

@wesm @icexelloss @siddharthteotia what are your thoughts on this?

@wesm
Copy link
Member

wesm commented Oct 30, 2017

@BryanCutler at a high level this sounds great to me. cc @nongli also to take a look

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes it's useful to be able to just read the schema out of a message, without loading up any dictionaries or record batches. is there a way to preserve that functionality somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could still do that. I think it just comes down to either reading the dictionaries after the schema, or reading them before the first data batch. I thought it made a little more sense to read them with the schema, otherwise the user could create the reader, load the schema and try to decode it but fail.

Would it work for you to maybe overload ArrowReader.readSchema which will be able to return the original schema before loading the dictionaries? Similarly, if using the stream format, you could make a subclass of MessageReader (introduced here) and react after reading a schema message. If not, I'm ok with reading them before data batches and documenting for the user that you can't decode until batches are read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, an overloaded method would be fine. I agree that having to load a batch before reading dictionaries is a bit confusing for the general use case.

@icexelloss
Copy link
Contributor

icexelloss commented Oct 30, 2017

At the high level, @BryanCutler what do you feel about having ipc be a top level package rather than a subpackage under vector, i.e. org.apache.arrow.ipc

@BryanCutler
Copy link
Member Author

having ipc should be a top level package rather than a subpackage under vector, i.e. org.apache.arrow.ipc

I'm not sure, all of the current messages are geared towards vectors so it makes sense to keep it there. Are you thinking of possible messages in the future that might not be vector related?

@icexelloss
Copy link
Contributor

I'm not sure, all of the current messages are geared towards vectors so it makes sense to keep it there. Are you thinking of possible messages in the future that might not be vector related?
I think this is fine for now.

Longer term, I kind of think we can improve the current package hierarchy where all API is under the name space org.apache.arrow.vector. A hierarchy similar to C++ might make more sense - o.a.a.vector o.a.a.ipc and etc. But no need to do it here I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you feel about get rid of "file" and "stream" sub namespace, i.e.

org.apache.arrow.vector.ipc.ArrowFileWriter
org.apache.arrow.vector.ipc.ArrowStreamWriter

I think these two namespaces file and stream are not very complicated, they can probably be combined

@icexelloss
Copy link
Contributor

icexelloss commented Oct 30, 2017

@BryanCutler This looks great! What do people feel about having less sub namespaces?

Original,

o.a.a.vector.ipc
      file
          ArrowFileReader
          ArrowFileWriter
          ArrowMagic
      stream
          ArrowStreamReader
          ArrowStreamWriter
      json
          JsonFileReader
          JsonFileWriter
      message
          ArrowBlock
          ArrowFooter
          ArrowMessage
          ArrowRecordBatch
          ArrowDictionaryBatch
          FBSerializable
          FBSerializables          
          MessageSerializer
      ArrowReader
      ArrowWriter
      ReadChannel
      WriteChannel

Less sub namespaces:

o.a.a.vector.ipc
      message
          ArrowBlock
          ArrowFooter
          ArrowMessage
          ArrowRecordBatch
          ArrowDictionaryBatch
          FBSerializable
          FBSerializables          
          MessageSerializer
      ArrowReader
      ArrowWriter
      ArrowFileReader
      ArrowFileWriter
      ArrowMagic
      ArrowStreamReader
      ArrowStreamWriter
      ReadChannel
      WriteChannel
      JsonFileReader
      JsonFileWriter

@icexelloss
Copy link
Contributor

Also maybe JsonFileReader -> ArrowJsonReader for more consistent naming?

@icexelloss
Copy link
Contributor

Backward compatibility wise, I think we should probably change this along with vector changes in one arrow release?

@BryanCutler BryanCutler force-pushed the java-generic-stream-interfaces-ARROW-1047 branch from 907e348 to 0e07e28 Compare October 30, 2017 21:21
@BryanCutler
Copy link
Member Author

Thanks @elahrvivaz, @icexelloss and @wesm !

What do people feel about having less sub namespaces?

I sort of prefer having separate packages for the different readers/writers. There are some supporting files that are specific to certain formats, like ArrowMagic and InvalidArrowFileException, and I like pushing it down to the feature that uses them. I think users will be more likely to import reader/writer from 1 format for a particular use too. I'm not tied to this though, we can simplify if that's the consensus.

Also maybe JsonFileReader -> ArrowJsonReader for more consistent naming?

+1 for me on renaming this

@elahrvivaz
Copy link
Contributor

imo i like the current package layout with file, stream, json, message.

@icexelloss
Copy link
Contributor

I sort of prefer having separate packages for the different readers/writers. There are some supporting files that are specific to certain formats, like ArrowMagic and InvalidArrowFileException, and I like pushing it down to the feature that uses them. I think users will be more likely to import reader/writer from 1 format for a particular use too. I'm not tied to this though, we can simplify if that's the consensus.

imo i like the current package layout with file, stream, json, message.

Sounds good to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add override?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a bit doc of what these methods are supposed to do? It's not very clear how to use readNextMessage and readMessageBody

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I meant to say that I still need to go through these changes and make sure everything is documented properly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems to closer to read schema rather than deserialize schema

public static Schema deserializeSchema(Message message)

seem to make more sense to me

Maybe this method can be made into:

public static Schema readSchema(MessageReader reader) {

    Message message = reader.readNextMessage();
    return deserializeSchema(message);
}

?

@BryanCutler what do you think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to include reading the message as part of deserialization and some messages also require to read another chunk after the message. I do think the behavior of these functions could be made to be more consistent, but we should probably do that as a followup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Agree this can be a follow upl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word "Batch" in the function name is a bit unintuitive. I kind of feel "Message" is a better term than "MessageBatch".

Should we maybe rename this to deserializeMessage?

Also, this message doesn't seem to exclude schema message explicitly. Which also feels a bit weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method won't read any generic message, it only works with RecordBatches or DictionaryBatches, hence the name...
in the streaming format the first message after the schema could be either a record batch or a dictionary batch, this method is to handle either case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's ok as is but this seems to be used only in a test. How about we do a followup PR to refine these functions and we can discuss there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@icexelloss
Copy link
Contributor

One thing I am not sure is if this patch will make java-refactor-branch hard to merge - cc @siddharthteotia for comment.

Maybe we should keep all refactor changes in java-refactor-branch to make it easier to merge? Not sure though.

@siddharthteotia
Copy link
Contributor

Yes I am concerned that this will make patches in java-vector-refactor branch hard to merge into master, Secondly, the nature of changes suggest that we should be testing this with Dremio as well -- I would have loved to offer help but I am in the process of moving Dremio to new code in java-vector-refactor branch.

I would prefer to have these changes merged after java-vector-refactor changes are merged into master.

@BryanCutler
Copy link
Member Author

@siddharthteotia the Java refactoring is the priority right now so I don't want to hinder that, but I would like to get this in for the 0.8 release if possible. I think changes to the ArrowReader should be mostly transparent, although there might be conflicts with some of the tests I had to change. What about if I try to cherry pick this into the java-vector-refactor branch? If it doesn't go in cleanly then we can put it on hold until the refactor branch is merge.

@siddharthteotia
Copy link
Contributor

@BryanCutler, are you suggesting to cherry pick your changes in refactor branch and revert commit in case things don't look good?

I am not entirely sure what's the best option here but I believe that adding orthogonal set of changes to java-vector-refactor branch at this point may not be a good idea. However, I don't want to block other work. So feel free to proceed based on your best judgement.

Note that there are currently two patches in that branch. While making changes in Dremio and debugging test failures, I had to go back and make some changes in vector code (minor only, no redesign). Currently those additional changes are in Dremio's fork (as I wanted to make quick progress) and I will put a PR against java-vector-refactor branch for the third patch very soon -- better to do at last when testing with Dremio completes.

@BryanCutler
Copy link
Member Author

@siddharthteotia that's fair enough, I don't want to complicate the refactoring. I mostly just want to make sure that these changes don't make things harder to merge the java-vector-refactor branch into master. I can try that out locally and report back.

@BryanCutler
Copy link
Member Author

@siddharthteotia is this something you would like to run with the Dremio suite of tests before merging?

@siddharthteotia
Copy link
Contributor

@BryanCutler , I would like to test this with Dremio but I am not sure how quickly I will be able to do that and revert back after making necessary changes in Dremio and doing proper testing.

Part of me says you should go ahead and merge this since you were already waiting for the refactor work to get done before this.

Since we will anyway have to rebase on Arrow master after the ongoing timestamp vector related changes in #1330, we can take care of testing this out with Dremio at that time unless @jacques-n thinks otherwise?

Copy link
Member

@wesm wesm Nov 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this ipc.ArrowStreamReader but not ipc.stream

Copy link
Member Author

@BryanCutler BryanCutler Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the same for file and json readers, e.g. ipc.ArrowFileReader? I made these subpackages because there were some supporting files specific to just the file reader, so they could be grouped together. But I'm ok either way, @icexelloss brought this up here #1259 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These classes are all quite similar (the file format is very nearly the stream format, plus a file footer and magic numbers at start and end), I think it would make sense to keep them in a flat package namespace (but I'm not a Java expert)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@icexelloss do you have an opinion on this? Would be good to get this patch in soon to facilitate testing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer ipc.ArrowStreamReader to ipc.stream.ArrowStreamReader

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm fine with this. I'll change it now

@BryanCutler
Copy link
Member Author

@siddharthteotia what ever is easier for this, but I would like to hear that I didn't break anything on your side :) It's pretty easy to rebase this, so no need to rush

@icexelloss
Copy link
Contributor

This looks good to me. Once the package name hierarchy I think this should be good to go.

@icexelloss
Copy link
Contributor

LGTM. +1

Change-Id: I7a59a24bd54339cd637ace36e991bc062ba1d4e1
@wesm wesm force-pushed the java-generic-stream-interfaces-ARROW-1047 branch from 5b998e4 to 43314ca Compare November 22, 2017 18:01
@wesm
Copy link
Member

wesm commented Nov 22, 2017

Squashed and rebased so we can get a passing build. While we are waiting, do we also want the vector.ipc.message subnamespace? Do not have a strong feeling

@icexelloss
Copy link
Contributor

icexelloss commented Nov 22, 2017

I do not have a strong feeling either, I think vector.ipc.message subnamespace are fine. Although maybe we can move ArrowMagic to message subnamespace? Sorry for the oversight. @BryanCutler what do you think

@wesm
Copy link
Member

wesm commented Nov 22, 2017

Reviewing the past comments, since these classes are generally internal, I think it's fine. master is broken right now (ARROW-1845) so I will merge this

@wesm wesm closed this in 1516306 Nov 22, 2017
@BryanCutler
Copy link
Member Author

I'd like to keep the vector.ipc.message package, I think these generally define messages that serialize to FB.

@BryanCutler
Copy link
Member Author

Thanks @wesm @icexelloss @elahrvivaz and @siddharthteotia !

pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
This change decouples the reading of messages from the ReadChannel so that it is possible to build a reader that is not tied to a specific stream.  This adds a new interface `MessageReader` that will return a `Message` and the message body.  The `MessageChannelReader` implements this interface to read from a `ReadChannel` to match the current functionality.  A re-org of reading and writing packages is also done to better organize classes under an `ipc` package.

There is a slight change in behavior for the `ArrowFileReader` that should not affect any usage.  Previously,  the schema was read during initialization, and all dictionaries read just before the first record batch was read.  Now, all dictionaries are read directly after the schema, and not specifically tied to reading the first record batch.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#1259 from BryanCutler/java-generic-stream-interfaces-ARROW-1047 and squashes the following commits:

43314ca [Bryan Cutler] ARROW-1047: [Java] Add Generic Reader Interface for Stream Format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants