Description
We are using RecordBatchFileWriter to write Arrow data directly to S3 via S3FileSystem, then using RecordBatchFileReader to read it back from S3. The write is efficient: writing a 50 MB file finishes within 0.2 s. But reading that file back takes 30 s, which is far too long. I then ran several tests (a repro sketch follows the list):
- Reading the same file into raw bytes with S3FileSystem alone takes only 1 s, which makes me believe the issue is with RecordBatchFileReader.
- At half the size (around 25 MB): 17 s with RecordBatchFileReader, 0.28 s without it.
- At double the size (around 100 MB): 61 s with RecordBatchFileReader, 2.3 s without it.
- Fetching all bytes with S3FileSystem first, then creating a reader over those bytes and reading all contents from it, takes only 0.1 s.
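A minimal sketch of the setup described above, assuming the Python (pyarrow) bindings; the bucket path, region, and table contents are placeholders, not from the original report:

```python
import time

import numpy as np
import pyarrow as pa
import pyarrow.fs as fs
import pyarrow.ipc as ipc

# Placeholder region and path; substitute your own bucket/key.
s3 = fs.S3FileSystem(region="us-east-1")
path = "my-bucket/data.arrow"

# Build a table of roughly 50 MB of float64 data for the test.
table = pa.table({"x": np.random.rand(6_500_000)})

# Write: fast in the report (~0.2 s for 50 MB).
with s3.open_output_stream(path) as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Slow path: open the IPC file directly against S3 and let
# RecordBatchFileReader issue its reads over the network.
start = time.time()
with s3.open_input_file(path) as source:
    reader = ipc.open_file(source)
    slow_table = reader.read_all()
print(f"direct read: {time.time() - start:.2f} s")

# Fast workaround from the last test: fetch all bytes in one
# bulk GET, then open the reader over the in-memory buffer.
start = time.time()
with s3.open_input_stream(path) as stream:
    buf = stream.read()
reader = ipc.open_file(pa.BufferReader(buf))
fast_table = reader.read_all()
print(f"buffered read: {time.time() - start:.2f} s")
```

The last block corresponds to the final test above: a single bulk download into memory followed by an in-memory reader, instead of the reader issuing its I/O directly against S3.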
Reporter: Lingkai Kong
Assignee: Weston Pace / @westonpace
Related issues:
- [C++] Enable fine grained IO for async IPC reader (duplicates)
PRs and other links:
Note: This issue was originally created as ARROW-14429. Please see the migration documentation for further details.