
Conversation


@kou kou commented Apr 2, 2020

This change adds the following push-style reader classes:

  • ipc::MessageEmitter
  • ipc::RecordBatchStreamEmitter

Push-style readers don't read data from a stream directly; they are fed
data that has already been read by the user. This style is useful with
event-driven I/O APIs, where we can't pull data from a stream ourselves
and instead just receive already-read data from the event loop, like:

void on_read(const uint8_t* data, size_t data_size) {
   process_data(data, data_size);
}
register_read_event(on_read);
run_event_loop();

The current reader API can't be used with an event-driven I/O API, but
this push-style reader can.

The current Message reader has been changed to use ipc::MessageEmitter
internally, so there is no duplicated reader implementation and no
performance regression in our benchmark.

Before:

Running release/arrow-ipc-read-write-benchmark
Run on (12 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
Load Average: 0.85, 0.84, 0.65
-----------------------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------
ReadRecordBatch/1/real_time           886 ns          886 ns       774286 bytes_per_second=1102.15G/s
ReadRecordBatch/4/real_time          1601 ns         1601 ns       436258 bytes_per_second=610.078G/s
ReadRecordBatch/16/real_time         4819 ns         4820 ns       143568 bytes_per_second=202.663G/s
ReadRecordBatch/64/real_time        18291 ns        18296 ns        38586 bytes_per_second=53.3893G/s
ReadRecordBatch/256/real_time       84852 ns        84872 ns         8317 bytes_per_second=11.5091G/s
ReadRecordBatch/1024/real_time     341091 ns       341168 ns         2049 bytes_per_second=2.86306G/s
ReadRecordBatch/4096/real_time    1368049 ns      1368361 ns          511 bytes_per_second=730.968M/s
ReadRecordBatch/8192/real_time    2676778 ns      2677341 ns          265 bytes_per_second=373.584M/s

After:

Running release/arrow-ipc-read-write-benchmark
Run on (12 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 256K (x6)
  L3 Unified 12288K (x1)
Load Average: 0.88, 0.85, 0.66
-----------------------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------
ReadRecordBatch/1/real_time           891 ns          891 ns       769579 bytes_per_second=1095.57G/s
ReadRecordBatch/4/real_time          1599 ns         1599 ns       435756 bytes_per_second=610.746G/s
ReadRecordBatch/16/real_time         4834 ns         4835 ns       144374 bytes_per_second=202.027G/s
ReadRecordBatch/64/real_time        18204 ns        18206 ns        38190 bytes_per_second=53.6465G/s
ReadRecordBatch/256/real_time       84142 ns        84154 ns         8309 bytes_per_second=11.6061G/s
ReadRecordBatch/1024/real_time     343105 ns       343148 ns         2035 bytes_per_second=2.84625G/s
ReadRecordBatch/4096/real_time    1399287 ns      1399484 ns          511 bytes_per_second=714.65M/s
ReadRecordBatch/8192/real_time    2641529 ns      2641845 ns          263 bytes_per_second=378.569M/s



pitrou commented Apr 2, 2020

Thank you. I think we should make the API as general as possible, so I would suggest the following:

class ARROW_EXPORT Receiver {
 public:
  // Subclasses should override the methods they're interested in.
  // Default implementations return NotImplemented.
  virtual Status RecordBatchReceived(std::shared_ptr<RecordBatch>);
  virtual Status TensorReceived(std::shared_ptr<Tensor>);
  virtual Status SparseTensorReceived(std::shared_ptr<SparseTensor>);
};


pitrou commented Apr 2, 2020

(this will also be useful for Flight @lidavidm )


lidavidm commented Apr 2, 2020

This will be very useful! Once this lands I'll see about wiring this up to the gRPC async APIs.

@wesm wesm self-requested a review April 2, 2020 17:27

kou commented Apr 3, 2020

@pitrou Thanks for the suggestion! It's a good idea.
I've added an arrow::Receiver with only MessageReceived() and RecordBatchReceived(). We can add more XXXReceived() methods as we need them.


kou commented Apr 3, 2020

@lidavidm Thanks! I can help you when you work on it.

The code will look like the following:

void on_read(const uint8_t* data, size_t data_size) {
  // Copy the received bytes into an Arrow buffer.
  std::shared_ptr<Buffer> chunk;
  arrow::Buffer(data, data_size).Copy(0, data_size, &chunk);
  emitter_.Consume(chunk);
  // Release leading chunks that the emitter no longer references.
  while (!chunks_.empty()) {
    if (chunks_[0].use_count() > 1) {
      break;
    }
    chunks_.erase(chunks_.begin());
  }
  // Keep this chunk alive while the emitter still references it.
  if (chunk.use_count() > 1) {
    chunks_.push_back(std::move(chunk));
  }
}

@pitrou pitrou self-requested a review April 3, 2020 11:57
@kou kou force-pushed the cpp-record-batch-emitter branch from d46b706 to 6784278 Compare April 3, 2020 22:14

wesm commented Apr 7, 2020

Sorry about not reviewing this yet, it's on my "short list".


pitrou commented Apr 7, 2020

Thanks for the update @kou.

I don't think it makes sense to have both MessageReceived and RecordBatchReceived, since message and record batch are different levels of abstraction. I don't see how MessageReceived can be useful, to be honest (do you expect the consumer to reimplement message decoding?).

Once we agree on the basic abstraction, I will make a more thorough review.


kou commented Apr 7, 2020

I thought you were suggesting that we add a single general receiver API, like the existing arrow::Iterator, instead of multiple receiver APIs for each received object type. I thought it was a good idea because it simplifies our API.

If we do split the API, I prefer receiver APIs for each object (MessageReceiver, RecordBatchReceiver and so on) over two receiver APIs (one for arrow::ipc::Message and one for everything else). With a receiver API per object, users can catch "forgot to implement" errors at compile time, because we can provide an abstract receiver class with virtual Status Received(...) = 0.

We don't have a data format that mixes RecordBatch, Tensor and SparseTensor for now. Users will want to implement one Received() API in most cases, and compile-time error detection will help them.

I don't see how MessageReceived can be useful, to be honest (do you expect the consumer to reimplement message decoding?).

This pull request implements the following push-style readers:

  • arrow::ipc::MessageEmitter for arrow::ipc::Message
  • arrow::ipc::RecordBatchStreamEmitter for arrow::RecordBatch

arrow::ipc::RecordBatchStreamEmitter is implemented on top of arrow::ipc::MessageEmitter, which uses the MessageReceived API.

For arrow::Tensor, we don't have a convenience API to read multiple arrow::Tensors. We need to call arrow::ipc::ReadTensor() multiple times, but this is not push style:

while (true) {
  auto tensor = arrow::ipc::ReadTensor(input);
  if (!tensor.status().ok()) {
    break;  // tensor.status() will be arrow::Status::Invalid
  }
  // process tensor
}

Users can implement a push-style arrow::Tensor reader with the MessageReceived API (I think we'd provide a convenience API instead if this use case makes sense):

class TensorProcessor : public arrow::Receiver {
  arrow::Status MessageReceived(std::unique_ptr<Message> message) override {
    ARROW_ASSIGN_OR_RAISE(auto tensor, arrow::ipc::ReadTensor(*message));
    // process tensor
    return arrow::Status::OK();
  }
};

TensorProcessor processor;
arrow::ipc::MessageEmitter emitter(&processor);
while (emitter.state() != arrow::ipc::MessageEmitter::State::EOS) {
  emitter.Consume(data, data_size);
}

Normally, users should not use the MessageReceived API, because arrow::ipc::Message is a low-level object. Advanced users may use it.

Do you prefer the following API?

// only for arrow::ipc::Message
class ARROW_EXPORT MessageReceiver {
 public:
  virtual Status Receive(std::unique_ptr<Message> message) = 0;
};

// for others
class ARROW_EXPORT Receiver {
 public:
  // Default implementations return NotImplemented.
  virtual Status RecordBatchReceived(std::shared_ptr<RecordBatch> record_batch);
  virtual Status TensorReceived(std::shared_ptr<Tensor> tensor);
  virtual Status SparseTensorReceived(std::shared_ptr<SparseTensor> tensor);
};

@kou kou force-pushed the cpp-record-batch-emitter branch from 6784278 to b31977d Compare April 7, 2020 21:53

pitrou commented Apr 7, 2020

Do you prefer the following API? [snip]

Yes. This is what I meant. Either you decode messages yourself and you implement MessageReceiver, or you let Arrow decode them and you implement Receiver.


kou commented Apr 8, 2020

OK. I've changed it to use that API.


wesm commented Apr 8, 2020

I started reviewing, will try to finish soon

@wesm wesm left a comment

Overall this looks good, thanks for working on this -- I think this will make it easier to implement delta dictionaries and dictionary replacements. Some minor stylistic comments

Member

The meaning of this parameter is not totally clear. Maybe "the number of bytes needed"?

Member Author

Yes.
I've changed the description to "the number of bytes needed".
Should we also improve the parameter name (next_required_size)?

Member

Does this function need to retain ownership of the Buffer (versus const Buffer&)?

Member Author

Yes.
If the given buffer doesn't have enough data, the emitter keeps it in chunks_ instead of using it immediately. If we didn't retain ownership of the given buffer, it might be destroyed before the emitter uses it.

Member

Style choice: We could use the same function name for all the receivers, like Receive, but with different input argument types. Not sure if all compilers would be happy about that.

Member Author

You mean the following API, right?

class ARROW_EXPORT Receiver {
  virtual Status Received(std::shared_ptr<RecordBatch> record_batch);
  virtual Status Received(std::shared_ptr<Tensor> tensor);
  virtual Status Received(std::shared_ptr<SparseTensor> sparse_tensor);
};

I don't have a preference for this.

@pitrou What do you think about this API?

Member

No preference, but Received alone sounds weird.

Member

cc @bkietz

Member Author

We can't use this API if we use Receiver::EosReceived(), because EosReceived() has no argument.

@kou kou left a comment

@wesm Thanks for your review!
I've fixed most of the problems.
What do you think about the next_required_size name?


@pitrou pitrou left a comment

I'm still only looking at the API.

Member

I would call this StreamDecoder or something. At some point we'll add other methods to Receiver, so it won't emit just record batches.

Member Author

OK. I'll rename this to arrow::ipc::StreamDecoder.

You mean that we will later extend https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format to support more data types such as tensors, right?

Member

Right, we would add TensorReceived (or OnTensor, or whatever the chosen naming is :-)).

Member

Instead this could be a EosReceived method on Receiver or something.
(note: the terminology I'm proposing is inspired by https://docs.python.org/3/library/asyncio-protocol.html#streaming-protocols , but YMMV)

Member Author

I'll add Receiver::EosReceived() and remove is_eos().

Member

I don't understand what this means. What is the next action?

Member

I think it's "advancing the state of the emitter". So if you just read the metadata size prefix then this would return the size of the metadata, or the size of the body

Member

Does it mean you're expected to give exactly that number of bytes to Consume? Or does the emitter do its own buffering inside? The docs should probably make that clear.

Member

If you feed the emitter too much data, it will retain a slice of it internally, yes. This can be made more clear in the docs indeed

Member Author

I'll add more documentation. Could you confirm it?

Member

The name of this function is okay with me

@kou kou left a comment

@pitrou Thanks for your review!
I've applied your suggestions except the StreamDecoder rename. I want to confirm what you meant.



wesm commented Apr 8, 2020

Bunch of CI jobs failed with "no space left on device".

Overall this patch looks good to me. I'll await @pitrou to make a final review / sign off per the comments above


kou commented Apr 9, 2020

If we may have callbacks that are not about receiving data, such as one called on error or one called when a dictionary is updated, Listener may be a better name than Receiver.
If we use Listener, On${Event} would be better than ${Target}Received, e.g. Listener::OnRecordBatch() (or Listener::OnRecordBatchReceived()?) and Listener::OnError().

@pitrou pitrou left a comment

I'll let @wesm comment on the naming.

High-level question: is it possible to reimplement RecordBatchStreamReader and MessageReader on top of this infrastructure? It's not terrific to duplicate the decoding logic in several places.

Member

Instead of this, I think it would be better to have an EosReceived method on MessageReceiver.

Member Author

I've added the EOS callback, but I want to keep this method because this information can be used to optimize performance. I've added documentation for this.


wesm commented Apr 9, 2020

I'm looking again. This needs to be rebased now after ARROW-7233, I'll see if I can perform the rebase

@wesm wesm force-pushed the cpp-record-batch-emitter branch from 9e280d0 to 247e118 Compare April 9, 2020 16:18

wesm commented Apr 9, 2020

High-level question: is it possible to reimplement RecordBatchStreamReader and MessageReader on top of this infrastructure? It's not terrific to duplicate the decoding logic in several places.

Unless I'm missing something, that's exactly what this patch does. MessageReader just calls ReadMessage(InputStream*) which uses RecordBatchStreamEmitter

I'm looking at the other naming issues

@wesm wesm force-pushed the cpp-record-batch-emitter branch from b23a193 to f80f953 Compare April 9, 2020 16:42

wesm commented Apr 9, 2020

Assorted thoughts:

  • We should probably mark these new APIs as experimental so we do not feel pressure to resolve all concerns in a single patch
  • arrow/util/receiver.h should probably be part of arrow/ipc
  • I don't have a strong opinion between Receiver and Listener name


pitrou commented Apr 9, 2020

It seems MessageReader calls the top-level ReadMessage(io::InputStream* file, MemoryPool* pool) for each message... which will instantiate a new Emitter and Receiver every time.


wesm commented Apr 9, 2020

I agree it would be good to improve that (persisting the emitter between calls to ReadNextMessage)


pitrou commented Apr 9, 2020

By the way, since this is a new API, perhaps it should incorporate per-message metadata as well? (the custom_metadata field). For example:

virtual Status RecordBatchReceived(
  std::shared_ptr<RecordBatch> record_batch,
  std::shared_ptr<KeyValueMetadata> custom_metadata);
virtual Status SchemaReceived(
  std::shared_ptr<Schema> schema,
  std::shared_ptr<KeyValueMetadata> custom_metadata);

kou added 10 commits April 10, 2020 09:24

Fix format

Fix lint errors

Fix lint errors

Fix sanitizer errors

Use AllocateBuffer to create empty 64-bit aligned buffer

Introduce general Receiver API

Add missing include

Fix error type

Use new Receiver API

Fix format

Split MessageReceiver again

Remove duplicated comments

Fix style

Don't use deprecated API

Don't use deprecated API

Add missing slice for non CPU buffer

Fix next_required_size parameter description

Use ABORT_NOT_OK()

Remove needless forward declaration

Use different test suite name

Fix include location

Fix a bug that next_required_size() doesn't care buffered_size_

Use std::shared_ptr<Receiver>

Add SchemaReceived

Add more documentation for next_required_size()

Use EosReceived() instead of is_eos()

Rebase

Remove unused variable from test
@kou kou force-pushed the cpp-record-batch-emitter branch from f80f953 to 7fab0e3 Compare April 10, 2020 00:25
@kou kou left a comment

I've applied all suggestions except per-message metadata.

Summary:

  • Renamed emitter to decoder
  • Marked new APIs experimental
  • Moved Receiver from arrow/util to arrow/ipc
  • Renamed Receiver to Listener
  • MessageReader reuses decoder

For per-message metadata, I'm not sure which message's metadata should be used when dictionary batch messages exist. The schema message's metadata? Should we merge the metadata from the schema message and the dictionary batch messages?

Can we do that as a follow-up task?

For RecordBatchStreamReader, we can't use StreamDecoder internally, because we have the RecordBatchStreamReader::Open(std::unique_ptr<MessageReader> message_reader, ...) API; if we used StreamDecoder, we wouldn't use MessageReader.
(We could use StreamDecoder with this API by extracting InputStreamMessageReader::stream_ from the MessageReader and creating a StreamDecoder from the extracted stream. Should we do this?)

Most of the core logic is shared between RecordBatchStreamReader and StreamDecoder in this pull request. Should we reimplement RecordBatchStreamReader on top of StreamDecoder in this pull request, or can we do it as a follow-up task?



wesm commented Apr 10, 2020

Thank you @kou

For per-message metadata, I'm not sure which message's metadata should be used when dictionary batch messages exist. The schema message's metadata? Should we merge the metadata from the schema message and the dictionary batch messages?

My initial thought was that we should only propagate the metadata from the Message::custom_metadata field, but you're right that it's unclear what to do with any metadata from a DictionaryBatch message. Let's think a bit more about it -- since the APIs are experimental we don't have to figure this out right now


wesm commented Apr 10, 2020

+1, merging. The CI failure https://github.com/apache/arrow/pull/6804/checks?check_run_id=575674263 does not appear to me to be related.

@wesm wesm closed this in 866e6a8 Apr 10, 2020
@kou kou deleted the cpp-record-batch-emitter branch April 10, 2020 22:09