[C++] RecordBatchFileReader performance really bad in S3 #29993

Description

We are using RecordBatchFileWriter to write Arrow data directly to S3 through S3FileSystem, then using RecordBatchFileReader to read it back. The write is efficient: writing a 50 MB file finishes within 0.2 s. But reading that file takes 30 s, which is far too long. I ran several tests:

  1. Using S3FileSystem to read the file into raw bytes takes only 1 s, which suggests the problem is in RecordBatchFileReader.
  2. At half the size (around 25 MB), reading with RecordBatchFileReader took 17 s; reading the raw bytes took 0.28 s.
  3. At double the size (around 100 MB), reading with RecordBatchFileReader took 61 s; reading the raw bytes took 2.3 s.
  4. Fetching all the bytes with S3FileSystem first, then creating a reader over those bytes and reading everything from it, took only 0.1 s.

Reporter: Lingkai Kong
Assignee: Weston Pace / @westonpace

Note: This issue was originally created as ARROW-14429. Please see the migration documentation for further details.
