
Conversation

@lidavidm (Member) commented Mar 8, 2021

This provides an async-reentrant generator of record batches from an IPC file reader, intended to support Datasets once it becomes async itself.

IPC messages are read on an IO thread pool, then decoded on the CPU thread pool. All dictionaries must be read at the start, then record batches can be read independently.
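As a rough illustration of the intended usage (a sketch only; exact names and signatures such as GetRecordBatchGenerator may differ from the final API in this PR), a caller might consume the generator like this:

#include <memory>
#include <vector>

#include "arrow/io/interfaces.h"
#include "arrow/ipc/options.h"
#include "arrow/ipc/reader.h"
#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"
#include "arrow/util/future.h"

// Sketch: open an IPC file and collect every batch without blocking the caller.
arrow::Future<std::vector<std::shared_ptr<arrow::RecordBatch>>> ReadAllBatchesAsync(
    std::shared_ptr<arrow::io::RandomAccessFile> file) {
  auto reader = arrow::ipc::RecordBatchFileReader::Open(
                    std::move(file), arrow::ipc::IpcReadOptions::Defaults())
                    .ValueOrDie();
  // The generator is async-reentrant: the next batch can be requested before
  // the previous future completes, which is what makes readahead possible.
  auto generator = reader->GetRecordBatchGenerator().ValueOrDie();
  return arrow::CollectAsyncGenerator(std::move(generator));
}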

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from 45ab43f to e2c3cfa (March 8, 2021 16:48)
@lidavidm changed the title from "ARROW-11272: [C++] Provide reentrant IPC file reader" to "ARROW-11772: [C++] Provide reentrant IPC file reader" on Mar 8, 2021

@pitrou (Member) commented Mar 9, 2021

cc @westonpace

Member:

Given that we're talking about an asynchronous primitive, forcing the caller to deal with lifetime issues is a bit unfriendly IMHO. It may be simpler to make RecordBatchFileReader inherit from enable_shared_from_this and capture a strong reference inside the generator.
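A rough sketch of that pattern (class and member names here are hypothetical, not the PR's code): the reader derives from enable_shared_from_this and the generator captures a strong reference, so the caller can drop the reader without invalidating the generator.

#include <memory>

#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"
#include "arrow/util/future.h"

class FileReaderImpl : public std::enable_shared_from_this<FileReaderImpl> {
 public:
  arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> GetBatchGenerator() {
    auto self = shared_from_this();  // strong reference owned by the generator
    int next_index = 0;
    return [self, next_index]() mutable { return self->ReadBatchAsync(next_index++); };
  }

 private:
  arrow::Future<std::shared_ptr<arrow::RecordBatch>> ReadBatchAsync(int i) {
    // Placeholder: a real reader would issue the read/decode of batch i here.
    return arrow::Future<std::shared_ptr<arrow::RecordBatch>>::MakeFinished(nullptr);
  }
};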

Member Author:

Thanks - this was actually left over from an earlier approach. I've removed the note and adjusted the test to ensure the reader gets dropped before the generator.

Member:

Add a trivial test for this?

Member:

This looks a bit tedious. You're essentially copying most of those fields from RandomAccessFile?

Member Author:

Yes. I'll change it so that it just stores a RecordBatchFileReaderImpl and uses enable_shared_from_this instead, to avoid duplicating fields.

Member:

You could also avoid the "state" struct and put the desired fields directly in the generator classes (I'm not sure why you need two of them, by the way, unless you plan to do something else later with the IPC message generator).

Member Author:

The generators are distinct - one reads from the filesystem and one decompresses/decodes the batch. By separating them, we can also apply readahead independently to each stage.

The state struct was because AsyncGenerators need to be copyable, and since DictionaryMemo is move-only, we had to put it behind an indirection.
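A condensed illustration of that constraint (struct and member names hypothetical): an AsyncGenerator is just a copyable callable, so the move-only DictionaryMemo has to live behind a shared_ptr that every copy of the generator shares.

#include <memory>

#include "arrow/ipc/dictionary.h"
#include "arrow/record_batch.h"
#include "arrow/util/future.h"

// Shared, reference-counted state; DictionaryMemo itself is move-only.
struct ReaderState {
  arrow::ipc::DictionaryMemo dictionary_memo;
};

class BatchGenerator {
 public:
  explicit BatchGenerator(std::shared_ptr<ReaderState> state) : state_(std::move(state)) {}

  // Copies of the generator only bump the refcount on the shared state.
  arrow::Future<std::shared_ptr<arrow::RecordBatch>> operator()() {
    // A real implementation would decode the next batch using
    // state_->dictionary_memo; returning nullptr here just ends the stream.
    return arrow::Future<std::shared_ptr<arrow::RecordBatch>>::MakeFinished(nullptr);
  }

 private:
  std::shared_ptr<ReaderState> state_;
};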

Member:

Can you make those functions static or put them in the anonymous namespace?

Member:

Would be nice to avoid this.

Member:

I don't think it's desirable to force all processing to go to the global thread pool unconditionally.
(also, are you sure the processing is heavy enough that it benefits from it?)

Member Author:

Usually not, but it would be in the case of compressed buffers. I could also change it to not offload onto a secondary pool by default (and hence do the work on the same thread used to read data) and/or benchmark if there's any overhead to this.

Member:

The overhead of dispatching tasks to a thread pool is difficult to evaluate rigorously, but you can get an idea by running arrow-thread-pool-benchmark. IMHO the bottom line is that we should aim for tasks in the millisecond range (rather than microsecond or less).

Member Author:

Looking at the ReadFile benchmark, it seems reading a 1 MiB batch takes about 1 ms once there are >= 1024 columns.

------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
ReadFile/1/real_time                  8130 ns         8130 ns        86111 bytes_per_second=120.115G/s
ReadFile/4/real_time                 10734 ns        10734 ns        65153 bytes_per_second=90.9826G/s
ReadFile/16/real_time                21779 ns        21779 ns        32081 bytes_per_second=44.8389G/s
ReadFile/64/real_time                67087 ns        67086 ns        10189 bytes_per_second=14.5567G/s
ReadFile/256/real_time              274905 ns       274901 ns         2543 bytes_per_second=3.55236G/s
ReadFile/1024/real_time            1074018 ns      1074004 ns          650 bytes_per_second=931.083M/s
ReadFile/4096/real_time            4307403 ns      4307316 ns          164 bytes_per_second=232.158M/s
ReadFile/8192/real_time            8266500 ns      8266343 ns           84 bytes_per_second=120.97M/s

So I'll change this to not use a separate thread pool by default. (I'd also like to evaluate this benchmark when compression is involved, though.)

Member:

My learning yesterday (maybe; take this all with a grain of salt, it's always possible I was misreading/misinterpreting) was that even if the processing is light, it pays to get off the I/O thread pool quickly. If the I/O is slow, you want to trigger the next read as quickly as possible. However, even when the I/O is fast (buffered), the read only ensures the accessed pages are in memory; it does not load them into the CPU caches. That happens when you actually access the memory.

So if you do some very fast decompression on the I/O thread, you will be forcing the actual load into the CPU cache of the I/O thread's CPU. Later, you will have transferred to the CPU thread for parsing, decoding, filtering, or projecting. If you're unlucky, the CPU thread you are given is on a different core and you end up paying the memory load cost all over again.

I will set up some micro-benchmarks later this week to try to suss out how accurate this is.

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from dfdac6e to f11cf6a (March 9, 2021 18:24)
@lidavidm (Member Author) commented Mar 9, 2021

FWIW, here's a benchmark of reading a compressed file, with and without readahead:

-------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------
ReadCompressedFile/1/real_time          254476620 ns     11010761 ns            3 bytes_per_second=502.993M/s
ReadCompressedFile/4/real_time           73552088 ns      4851059 ns            8 bytes_per_second=1.69948G/s
ReadCompressedFile/16/real_time          46830653 ns     14603649 ns           15 bytes_per_second=2.66919G/s
ReadCompressedFile/64/real_time          69125517 ns     52987906 ns           10 bytes_per_second=1.8083G/s
ReadCompressedFile/256/real_time        180001740 ns    171520913 ns            4 bytes_per_second=711.104M/s
ReadCompressedFile/1024/real_time       670619546 ns    646285305 ns            1 bytes_per_second=190.868M/s
ReadaheadCompressedFile/1/real_time      66585787 ns      4212685 ns           10 bytes_per_second=1.87728G/s
ReadaheadCompressedFile/4/real_time      28566113 ns      3307862 ns           25 bytes_per_second=4.37581G/s
ReadaheadCompressedFile/16/real_time     31308089 ns      6213048 ns           22 bytes_per_second=3.99258G/s
ReadaheadCompressedFile/64/real_time     43825981 ns     23104706 ns           16 bytes_per_second=2.85219G/s
ReadaheadCompressedFile/256/real_time   125055297 ns     92982211 ns            6 bytes_per_second=1023.55M/s
ReadaheadCompressedFile/1024/real_time  510944855 ns    375741074 ns            1 bytes_per_second=250.516M/s

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from 48a67a2 to 67bb3e3 (March 10, 2021 15:14)
@westonpace (Member) left a comment

This is good, thanks for doing this.

Member:

Why arrow::All instead of your new fail-fast version? Seems that would work here.

Member Author:

Here, I actually want the results, not just the status. I've renamed this: this future just reads the dictionary messages; it doesn't actually parse them.

Member:

Nit: Unused?

Member:

Nit: Unused?

Member:

Nit: Does this readahead actually help given you have one at the message level? Also, you're consuming them pretty rapidly. My guess is you could probably do away with it.

Member Author:

I think the one at the message level is actually the redundant one here, since that step is trivial (the file's in memory), while the decoding work is nontrivial (decompression). Note that the decompression isn't done in the IpcMessageGenerator, but as part of decoding the record batch.

With both sets of readahead:

ReadaheadCompressedFile/1/real_time      65481827 ns      4493732 ns            8 bytes_per_second=1.90893G/s
ReadaheadCompressedFile/4/real_time      29108664 ns      3322772 ns           24 bytes_per_second=4.29425G/s
ReadaheadCompressedFile/16/real_time     31461325 ns      6222532 ns           22 bytes_per_second=3.97313G/s
ReadaheadCompressedFile/64/real_time     44750635 ns     24064817 ns           16 bytes_per_second=2.79326G/s
ReadaheadCompressedFile/256/real_time   122477440 ns     91270202 ns            6 bytes_per_second=1045.09M/s
ReadaheadCompressedFile/1024/real_time  515403669 ns    381340640 ns            1 bytes_per_second=248.349M/s

With only I/O-level readahead:

ReadaheadCompressedFile/1/real_time     533258427 ns     11385747 ns            1 bytes_per_second=240.034M/s
ReadaheadCompressedFile/4/real_time      71233474 ns      2145576 ns            9 bytes_per_second=1.75479G/s
ReadaheadCompressedFile/16/real_time     45455989 ns      3591831 ns           15 bytes_per_second=2.74991G/s
ReadaheadCompressedFile/64/real_time     64612808 ns      8730379 ns           11 bytes_per_second=1.9346G/s
ReadaheadCompressedFile/256/real_time   188120059 ns     18770867 ns            4 bytes_per_second=680.417M/s
ReadaheadCompressedFile/1024/real_time  699025221 ns     58355812 ns            1 bytes_per_second=183.112M/s

With only batch-level readahead:

ReadaheadCompressedFile/1/real_time      59513029 ns      3280963 ns           11 bytes_per_second=2.10038G/s
ReadaheadCompressedFile/4/real_time      27160986 ns      2378473 ns           25 bytes_per_second=4.60219G/s
ReadaheadCompressedFile/16/real_time     30018574 ns      3746521 ns           23 bytes_per_second=4.16409G/s
ReadaheadCompressedFile/64/real_time     41358054 ns     11515016 ns           17 bytes_per_second=3.02239G/s
ReadaheadCompressedFile/256/real_time    80396235 ns     26029905 ns            9 bytes_per_second=1.5548G/s
ReadaheadCompressedFile/1024/real_time  473883489 ns     96278350 ns            2 bytes_per_second=270.109M/s

With no readahead:

ReadaheadCompressedFile/1/real_time     544947010 ns      8763782 ns            2 bytes_per_second=234.885M/s
ReadaheadCompressedFile/4/real_time      73232625 ns      1522407 ns            8 bytes_per_second=1.70689G/s
ReadaheadCompressedFile/16/real_time     47781962 ns      2977593 ns           15 bytes_per_second=2.61605G/s
ReadaheadCompressedFile/64/real_time     64026555 ns      7882798 ns           11 bytes_per_second=1.95231G/s
ReadaheadCompressedFile/256/real_time   187098473 ns     18156513 ns            4 bytes_per_second=684.132M/s
ReadaheadCompressedFile/1024/real_time  696976648 ns     57355852 ns            1 bytes_per_second=183.65M/s

So I'll change this to only test batch-level readahead. I/O level readahead would help more on something like S3, which we could set up a benchmark for as well.
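For reference, the batch-level readahead kept here is just Arrow's generic generator wrapper; a minimal sketch (the surrounding function name is hypothetical) of applying it only at the decoding stage:

#include <memory>

#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"

// Wrap only the batch-decoding generator in readahead: up to `readahead`
// decode futures are kept in flight, so decompression of later batches can
// proceed while earlier batches are still being consumed.
arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> WithBatchReadahead(
    arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> batch_gen,
    int readahead = 8) {
  return arrow::MakeReadaheadGenerator(std::move(batch_gen), readahead);
}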

Member:

I don't think this is safe the way the macro works. Use ASSERT_FINISHES_OK_AND_ASSIGN. It also adds a 10 second timeout in case something gunks up for whatever reason.

Member Author:

I had to define an EXPECT_FINISHES_OK_AND_ASSIGN to get it to work here.

Member:

Same as below, ASSERT_FINISHES_OK_AND_ASSIGN

Member:

An unavoidable naming nit I think. This is the sync counterpart to RecordBatchFileReader and the two names aren't very parallel. Ideally the other could be named IpcFileRecordBatchReader.

@westonpace (Member)

Can you rebase this? I'm using it as part of ARROW-12014

@lidavidm (Member Author)

@westonpace done.

@lidavidm (Member Author) commented Mar 31, 2021

I've rebased this to use the background generator; however, it doesn't help much, and it makes us non-reentrant, so we also lose any advantage with compressed data since we can't parallelize the decompression anymore.

The async reader gets anywhere from 30-90% of the throughput of the synchronous one.

Cases here are numbered by the number of columns in the file. The cases with very few columns are a worst case for async, since decoding is basically zero-cost and async is pure overhead. Conversely, the cases with many columns are a best case, since decoding is expensive. However, async doesn't help because I/O is relatively cheap in all cases benchmarked here, so there is no pipelining to be had.

Frankly, the fastest approach I tested was to just wrap the synchronous reader in a Future and block the caller, which isn't encouraging. A flamegraph shows that using the thread pool for decoding work is still rather expensive, and so it might be better if we used something like the background generator for that as well. In that case it would be convenient if we could somehow pull directly from the background generator's queue instead of having to get and block on futures; also this still means we can't get any benefit from parallelizing decompression if needed. For datasets with files >= cores that's probably not a big deal if you only care about throughput (we'll still decode in parallel) but if you need results in order and/or you have few files relative to cores then it won't be optimal.

You may question why in-memory (ReadFile) is slower than a temp file (ReadTempFile). In the flamegraphs, the culprit appears to be BufferReader's use of MemoryAdviseWillNeed, which spends a significant amount of time in the kernel. Removing it improves performance drastically.

-------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------
ReadFile/1/real_time                         7858 ns         7858 ns        85629 bytes_per_second=124.269G/s
ReadFile/4/real_time                        10698 ns        10698 ns        64406 bytes_per_second=91.2852G/s
ReadFile/16/real_time                       21661 ns        21661 ns        32684 bytes_per_second=45.0839G/s
ReadFile/64/real_time                       67470 ns        67470 ns        10406 bytes_per_second=14.4741G/s
ReadFile/256/real_time                     275292 ns       275282 ns         2553 bytes_per_second=3.54738G/s
ReadFile/1024/real_time                   1071125 ns      1071065 ns          652 bytes_per_second=933.598M/s
ReadFile/4096/real_time                   4245107 ns      4245052 ns          165 bytes_per_second=235.565M/s
ReadFile/8192/real_time                   8157924 ns      8157957 ns           85 bytes_per_second=122.58M/s
ReadFileAsync/1/real_time                   23883 ns         7835 ns        29390 bytes_per_second=40.8887G/s
ReadFileAsync/4/real_time                   27242 ns         9040 ns        25836 bytes_per_second=35.8478G/s
ReadFileAsync/16/real_time                  40988 ns        14562 ns        17154 bytes_per_second=23.8253G/s
ReadFileAsync/64/real_time                  93104 ns        33633 ns         7334 bytes_per_second=10.489G/s
ReadFileAsync/256/real_time                303852 ns       116901 ns         2313 bytes_per_second=3.21394G/s
ReadFileAsync/1024/real_time              1430233 ns       531043 ns          546 bytes_per_second=699.187M/s
ReadFileAsync/4096/real_time              4589980 ns      1895584 ns          153 bytes_per_second=217.866M/s
ReadFileAsync/8192/real_time              8793373 ns      3865574 ns           82 bytes_per_second=113.722M/s
ReadTempFile/1/real_time                    70972 ns        70936 ns         9712 bytes_per_second=220.157G/s
ReadTempFile/4/real_time                    74053 ns        74022 ns         9243 bytes_per_second=210.997G/s
ReadTempFile/16/real_time                   85777 ns        85749 ns         8100 bytes_per_second=182.158G/s
ReadTempFile/64/real_time                  132803 ns       132783 ns         5331 bytes_per_second=117.656G/s
ReadTempFile/256/real_time                 333974 ns       333967 ns         2093 bytes_per_second=46.785G/s
ReadTempFile/1024/real_time               1131198 ns      1131179 ns          607 bytes_per_second=13.8128G/s
ReadTempFile/4096/real_time               4330575 ns      4330568 ns          161 bytes_per_second=3.60807G/s
ReadTempFile/8192/real_time               8270275 ns      8270100 ns           85 bytes_per_second=1.8893G/s
ReadTempFileAsync/1/real_time               88569 ns        12731 ns         7814 bytes_per_second=176.417G/s
ReadTempFileAsync/4/real_time               94127 ns        14422 ns         7477 bytes_per_second=165.998G/s
ReadTempFileAsync/16/real_time             104455 ns        20203 ns         6652 bytes_per_second=149.586G/s
ReadTempFileAsync/64/real_time             158604 ns        38862 ns         4443 bytes_per_second=98.516G/s
ReadTempFileAsync/256/real_time            372728 ns       122446 ns         1831 bytes_per_second=41.9207G/s
ReadTempFileAsync/1024/real_time          1347728 ns       485078 ns          520 bytes_per_second=11.5936G/s
ReadTempFileAsync/4096/real_time          4649311 ns      1930484 ns          151 bytes_per_second=3.36071G/s
ReadTempFileAsync/8192/real_time          8773800 ns      3815852 ns           80 bytes_per_second=1.78087G/s
ReadCompressedFile/1/real_time           30636840 ns      1421583 ns           23 bytes_per_second=522.247M/s
ReadCompressedFile/4/real_time            9529811 ns       628655 ns           65 bytes_per_second=1.63959G/s
ReadCompressedFile/16/real_time           5673642 ns      1863531 ns          122 bytes_per_second=2.75396G/s
ReadCompressedFile/64/real_time           8372634 ns      6633169 ns           84 bytes_per_second=1.8662G/s
ReadCompressedFile/256/real_time         22590210 ns     21607133 ns           28 bytes_per_second=708.271M/s
ReadCompressedFile/1024/real_time        84274350 ns     81412117 ns            9 bytes_per_second=189.856M/s
ReadCompressedFile/4096/real_time       330157333 ns    317542733 ns            2 bytes_per_second=48.4617M/s
ReadCompressedFile/8192/real_time       648075491 ns    627804731 ns            1 bytes_per_second=24.6885M/s
ReadCompressedFileAsync/1/real_time      57512529 ns      1849864 ns            9 bytes_per_second=278.2M/s
ReadCompressedFileAsync/4/real_time       9702801 ns       553906 ns           71 bytes_per_second=1.61036G/s
ReadCompressedFileAsync/16/real_time      6001873 ns      1765858 ns          114 bytes_per_second=2.60335G/s
ReadCompressedFileAsync/64/real_time      8414578 ns      6398791 ns           81 bytes_per_second=1.8569G/s
ReadCompressedFileAsync/256/real_time    22844448 ns     20703843 ns           30 bytes_per_second=700.389M/s
ReadCompressedFileAsync/1024/real_time   83260767 ns     75605439 ns            8 bytes_per_second=192.167M/s
ReadCompressedFileAsync/4096/real_time  329809506 ns    298760917 ns            2 bytes_per_second=48.5129M/s
ReadCompressedFileAsync/8192/real_time  643886356 ns    584995701 ns            1 bytes_per_second=24.8491M/s

@lidavidm (Member Author)

That said, the case here is not too relevant to Datasets, so the next thing will be to integrate this with the ARROW-7001 WIP and benchmark against local/S3 file systems with varying numbers of files.

@lidavidm (Member Author) commented Apr 1, 2021

I've tested on EC2/S3 now. In terms of Datasets, this doesn't affect scans when files <= cores, but improves performance when files > cores. It also fixes a regression compared to just using ARROW-7001 by itself. (The ARROW-11772 case in the charts is with ARROW-7001 combined with ARROW-11772.)

[Chart: Median Scan Time (seconds)]

The dataset tested was a 16-file feather dataset, scanning either 1, 4, or all 16 files. Each file had 32 columns of mixed primitive types with 128 batches of 4096 rows per batch. Files had LZ4 compression. The EC2 instance tested was a t3.xlarge (4 cores).

@lidavidm (Member Author) commented Apr 1, 2021

And these are all the cases tested:

arrow-3.0; local; # of files: 1
Min:     0.096s
Mean:    0.111s
Median:  0.112s
Max:     0.131s
arrow-3.0; local; # of files: 4
Min:     0.113s
Mean:    0.134s
Median:  0.133s
Max:     0.163s
arrow-3.0; local; # of files: 16
Min:     0.469s
Mean:    0.487s
Median:  0.482s
Max:     0.517s
arrow-3.0; s3; # of files: 1
Min:     5.026s
Mean:    5.709s
Median:  5.698s
Max:     6.325s
arrow-3.0; s3; # of files: 4
Min:     4.712s
Mean:    5.185s
Median:  5.200s
Max:     5.723s
arrow-3.0; s3; # of files: 16
Min:    21.941s
Mean:   23.111s
Median: 23.045s
Max:    25.167s
arrow-4.0-async; local; # of files: 1
Min:     0.098s
Mean:    0.126s
Median:  0.129s
Max:     0.152s
arrow-4.0-async; local; # of files: 4
Min:     0.330s
Mean:    0.354s
Median:  0.351s
Max:     0.386s
arrow-4.0-async; local; # of files: 16
Min:     1.256s
Mean:    1.319s
Median:  1.310s
Max:     1.412s
arrow-4.0-async; s3; # of files: 1
Min:     5.578s
Mean:    6.060s
Median:  5.796s
Max:     6.778s
arrow-4.0-async; s3; # of files: 4
Min:    22.715s
Mean:   24.146s
Median: 23.957s
Max:    26.078s
arrow-4.0-async; s3; # of files: 16
Min:    91.531s
Mean:   96.067s
Median: 94.825s
Max:    103.538s
arrow-4.0-full-async; local; # of files: 1
Min:     0.087s
Mean:    0.113s
Median:  0.116s
Max:     0.133s
arrow-4.0-full-async; local; # of files: 4
Min:     0.202s
Mean:    0.270s
Median:  0.242s
Max:     0.381s
arrow-4.0-full-async; local; # of files: 16
Min:     0.777s
Mean:    1.260s
Median:  1.172s
Max:     1.886s
arrow-4.0-full-async; s3; # of files: 1
Min:     5.060s
Mean:    5.672s
Median:  5.665s
Max:     6.526s
arrow-4.0-full-async; s3; # of files: 4
Min:     5.373s
Mean:    5.669s
Median:  5.667s
Max:     5.892s
arrow-4.0-full-async; s3; # of files: 16
Min:    10.891s
Mean:   12.102s
Median: 11.441s
Max:    15.285s
arrow-4.0; local; # of files: 1
Min:     0.089s
Mean:    0.115s
Median:  0.113s
Max:     0.143s
arrow-4.0; local; # of files: 4
Min:     0.157s
Mean:    0.166s
Median:  0.162s
Max:     0.188s
arrow-4.0; local; # of files: 16
Min:     0.608s
Mean:    0.631s
Median:  0.632s
Max:     0.663s
arrow-4.0; s3; # of files: 1
Min:     5.023s
Mean:    5.186s
Median:  5.163s
Max:     5.483s
arrow-4.0; s3; # of files: 4
Min:     5.264s
Mean:    6.013s
Median:  5.590s
Max:     8.655s
arrow-4.0; s3; # of files: 16
Min:    21.682s
Mean:   22.759s
Median: 22.432s
Max:    24.636s

@lidavidm (Member Author) commented Apr 1, 2021

However, it is still a large regression with local files:

[Chart: Median Scan Time (seconds)]

Member:

Ok, I don't think this is the right approach. Basically you are taking a blocking message reader (ReadMessageFromBlock) and pushing it to a thread pool to hide latencies. But when reading a message from a mmap'ed file, you are adding the thread pool overhead to a very small execution cost.

Instead, there should be a ReadMessageFromBlockAsync that returns a Future<Message>. It's up to the underlying IO object (memory mapped file, S3 object...) to decide whether the async read goes to a thread pool or is synchronous.
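A hedged sketch of that proposal (illustrative only, not the PR's code; the block and helper names mirror the sync reader's internals, everything else is assumed): the future-returning read lets each IO object decide where the work actually runs.

#include <cstdint>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/io/memory.h"
#include "arrow/ipc/message.h"
#include "arrow/result.h"
#include "arrow/util/future.h"

// Hypothetical block descriptor mirroring the reader's internal FileBlock.
struct FileBlock {
  int64_t offset;
  int32_t metadata_length;
  int64_t body_length;
};

// Hypothetical async counterpart to ReadMessageFromBlock. A memory-mapped file
// can satisfy ReadAsync synchronously, while S3 dispatches to its IO pool.
arrow::Future<std::shared_ptr<arrow::ipc::Message>> ReadMessageFromBlockAsync(
    FileBlock block, std::shared_ptr<arrow::io::RandomAccessFile> file) {
  return file->ReadAsync(block.offset, block.metadata_length + block.body_length)
      .Then([](const std::shared_ptr<arrow::Buffer>& buffer)
                -> arrow::Result<std::shared_ptr<arrow::ipc::Message>> {
        // Decode the message (metadata + body) from the bytes just read.
        arrow::io::BufferReader stream(buffer);
        ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::ipc::Message> message,
                              arrow::ipc::ReadMessage(&stream));
        return std::shared_ptr<arrow::ipc::Message>(std::move(message));
      });
}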

@lidavidm (Member Author)

Transferring off the I/O pool fixes the performance degradation (at some cost to the 16-file S3 case, but even then, we're still much faster than before). This also puts us in about the same position as when testing ARROW-7001.

[Chart: Local Median Scan Time (seconds)]
[Chart: S3 Median Scan Time (seconds)]

@westonpace (Member) left a comment

Just a few minor cleanup notes.

Member:

Nit: I'm pretty sure you can do const ScanOptions& scan_options here.

Member:

Same with format.

Member:

Why force use_threads to false?

Member:

Nit: Change reader to a const reference. Since there could be multiple callbacks, we can never move into a callback, so it's always a const reference or a copy.

Member:

Maybe kBatchSize? kTotalSize makes me think it is the size of all the batches.

Member:

This overload seems superfluous to me, given there is one that takes io::RandomAccessFile* already.

Member Author:

This is just mirroring the sync API. If we wanted to cut down on overloads, I'd rather have only this overload as it keeps the file alive (the other one would be prone to dangling pointers/isn't used by datasets).

Member:

Same as above, do we need both versions?

Member:

Why a macro instead of a small helper function?

Member Author:

I am not sure why it was originally done that way; I guess to be extra sure it was inlined?

Member:

I agree with the comment above that this forces read_dictionaries_ to be a rather odd future. Is there any reason you don't want to do:

auto read_messages = read_dictionaries_.Then([] (...) {return read_message;});

I think it cleans up the surrounding code nicely and you can change read_dictionaries_ to Future<>

Member:

This could trigger a lot of small reads depending on the record batch configuration, but I think this is tackled with coalescing?

Member Author:

I intend to rebase on ARROW-12522 and add coalescing.

@lidavidm (Member Author) left a comment

Thanks for the review. I've addressed the feedback. This is now rebased on the coalescing PR, so I will re-test this branch and verify whether coalescing helps and whether transferring is necessary.


Member Author:

Unfortunately yes, since MessageDecoder takes shared_ptr/unique_ptr directly.

Member Author:

Right, I meant "the generator will keep the reader alive for us".

Member Author:

The ASSERT variants don't work in non-void-returning contexts, so I'll add the EXPECT variant.

Member Author:

They aren't ever called, so I assume they are vestigial (I will delete them).



@lidavidm force-pushed the arrow-11772 branch 3 times, most recently from 3a4b67c to a1995f8 (April 28, 2021 14:09)
@lidavidm (Member Author)

I instrumented a bit of the scanner and took a look at pre-buffering's effect on local file performance, to see if I could figure out why it seems detrimental.

Without pre-buffering locally, we're utilizing the CPU thread pool (nearly) fully. (I'm not sure why one thread seems to receive little/no work, but I didn't fully instrument the scanner and perhaps it's assembling the table or something like that.)
[Chart: thread utilization]

Just enabling pre-buffer causes a slowdown, as it appears nearly all work gets forced onto a single thread!
[Chart: thread utilization]

Earlier I saw that transferring the CPU task helped with this, and indeed it does, though it doesn't quite get us on par. Also note I had to change the code to force a transfer by calling Spawn instead of just using Transfer.
[Chart: thread utilization]

Given this, I'd guess what happens is as follows: with coalescing, since multiple record batches get mapped onto one I/O future, later requests to read a record batch will find an already-completed Future and will synchronously run all the callbacks inline, causing everything to pile onto the main thread. Forcing transfer alleviates this and better utilizes the thread pool, but the CPU-bound tasks are very fast (<< 1ms) and so we face the overhead of creating extra futures/spawning tasks instead. (Note that the non-pre-buffer case probably isn't even touching the CPU thread pool, because we don't transfer in that case - that's why you only see 16 threads (8 I/O + 8 CPU) in the last case.)
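To make the "force transfer" idea concrete, here is a sketch of a hypothetical helper (not the code in this PR): instead of a plain transfer, which runs the continuation inline when the source future has already finished, the callback is always re-dispatched via Spawn.

#include <utility>

#include "arrow/result.h"
#include "arrow/util/future.h"
#include "arrow/util/thread_pool.h"

// Always re-dispatch onto `executor`, even if `source` has already finished;
// otherwise callbacks attached to completed futures run inline on the caller
// and pile up on one thread.
template <typename T>
arrow::Future<T> TransferAlways(arrow::Future<T> source,
                                arrow::internal::Executor* executor) {
  auto destination = arrow::Future<T>::Make();
  source.AddCallback([destination, executor](const arrow::Result<T>& result) mutable {
    auto status = executor->Spawn(
        [destination, result]() mutable { destination.MarkFinished(result); });
    if (!status.ok()) {
      destination.MarkFinished(status);
    }
  });
  return destination;
}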

@westonpace (Member)

Indeed. Check out ARROW-12560, which should achieve the best of both worlds (minimizing spawning while maximizing core utilization).

@lidavidm (Member Author)

I verified things on EC2. ARROW-12560 will let us just always 'do the right thing' so to speak, but for now, it looks like always transferring is beneficial in essentially all cases, so I've updated the PR.

[Chart: Local Median Scan Duration (s)]
[Chart: S3 Median Scan Duration (s)]

@westonpace (Member)

Keep in mind too that UseAsync is optional so you don't need to solve all the possible performance permutations in this PR.

@lidavidm (Member Author)

Yes, I may have gotten a bit carried away…I'm happy to split this up if that's more manageable.

@westonpace (Member)

I don't think that's necessary. I was going more for "this looks really good, let's get it in" and not so much "this is getting too big".

@lidavidm force-pushed the arrow-11772 branch 3 times, most recently from 2e0beca to 85a5d7c (May 7, 2021 13:58)
@lidavidm (Member Author)

Just to follow up, @pitrou @westonpace, did either of you have final comments here?

@westonpace (Member)

No final comments from me. I scanned through the last few updates and all seems good.

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from 10bfcb7 to f5ab4b5 (May 13, 2021 14:55)
@lidavidm (Member Author)

CI passes, so I'll go ahead and merge this.
