
Conversation

@lidavidm (Member) commented Mar 8, 2021

This provides an async-reentrant generator of record batches from an IPC file reader, intended to support Datasets once it becomes async itself.

IPC messages are read on an IO thread pool, then decoded on the CPU thread pool. All dictionaries must be read at the start, then record batches can be read independently.
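As a rough illustration of the intended usage (a sketch only; exact names and signatures such as GetRecordBatchGenerator may differ from the final API in this PR), a caller might consume the generator like this:

#include <memory>
#include <vector>

#include "arrow/io/interfaces.h"
#include "arrow/ipc/options.h"
#include "arrow/ipc/reader.h"
#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"
#include "arrow/util/future.h"

// Sketch: open an IPC file and collect every batch without blocking the caller.
arrow::Future<std::vector<std::shared_ptr<arrow::RecordBatch>>> ReadAllBatchesAsync(
    std::shared_ptr<arrow::io::RandomAccessFile> file) {
  auto reader = arrow::ipc::RecordBatchFileReader::Open(
                    std::move(file), arrow::ipc::IpcReadOptions::Defaults())
                    .ValueOrDie();
  // The generator is async-reentrant: the next batch can be requested before
  // the previous future completes, which is what makes readahead possible.
  auto generator = reader->GetRecordBatchGenerator().ValueOrDie();
  return arrow::CollectAsyncGenerator(std::move(generator));
}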

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from 45ab43f to e2c3cfa (March 8, 2021 16:48)
@lidavidm changed the title from "ARROW-11272: [C++] Provide reentrant IPC file reader" to "ARROW-11772: [C++] Provide reentrant IPC file reader" on Mar 8, 2021

@pitrou (Member) commented Mar 9, 2021

cc @westonpace

Member:

Given that we're talking about an asynchronous primitive, forcing the caller to deal with lifetime issues is a bit unfriendly IMHO. It may be simpler to make RecordBatchFileReader inherit from enable_shared_from_this and capture a strong reference inside the generator.
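A rough sketch of that pattern (class and member names here are hypothetical, not the PR's code): the reader derives from enable_shared_from_this and the generator captures a strong reference, so the caller can drop the reader without invalidating the generator.

#include <memory>

#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"
#include "arrow/util/future.h"

class FileReaderImpl : public std::enable_shared_from_this<FileReaderImpl> {
 public:
  arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> GetBatchGenerator() {
    auto self = shared_from_this();  // strong reference owned by the generator
    int next_index = 0;
    return [self, next_index]() mutable { return self->ReadBatchAsync(next_index++); };
  }

 private:
  arrow::Future<std::shared_ptr<arrow::RecordBatch>> ReadBatchAsync(int i) {
    // Placeholder: a real reader would issue the read/decode of batch i here.
    return arrow::Future<std::shared_ptr<arrow::RecordBatch>>::MakeFinished(nullptr);
  }
};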

Member Author:

Thanks - this was actually left over from an earlier approach. I've removed the note and adjusted the test to ensure the reader gets dropped before the generator.

Member:

Add a trivial test for this?

Member:

This looks a bit tedious. You're essentially copying most of those fields from RandomAccessFile?

Member Author:

Yes. I'll change it so that it just stores a RecordBatchFileReaderImpl and uses enable_shared_from_this instead, to avoid duplicating fields.

Member:

You could also avoid the "state" struct and put the desired fields directly in the generator classes (I'm not sure why you need two of them, by the way, unless you plan to do something else later with the IPC message generator).

Member Author:

The generators are distinct - one reads from the filesystem and one decompresses/decodes the batch. By separating them, we can also apply readahead independently to each stage.

The state struct was because AsyncGenerators need to be copyable, and since DictionaryMemo is move-only, we had to put it behind an indirection.
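A condensed illustration of that constraint (struct and member names hypothetical): an AsyncGenerator is just a copyable callable, so the move-only DictionaryMemo has to live behind a shared_ptr that every copy of the generator shares.

#include <memory>

#include "arrow/ipc/dictionary.h"
#include "arrow/record_batch.h"
#include "arrow/util/future.h"

// Shared, reference-counted state; DictionaryMemo itself is move-only.
struct ReaderState {
  arrow::ipc::DictionaryMemo dictionary_memo;
};

class BatchGenerator {
 public:
  explicit BatchGenerator(std::shared_ptr<ReaderState> state) : state_(std::move(state)) {}

  // Copies of the generator only bump the refcount on the shared state.
  arrow::Future<std::shared_ptr<arrow::RecordBatch>> operator()() {
    // A real implementation would decode the next batch using
    // state_->dictionary_memo; returning nullptr here just ends the stream.
    return arrow::Future<std::shared_ptr<arrow::RecordBatch>>::MakeFinished(nullptr);
  }

 private:
  std::shared_ptr<ReaderState> state_;
};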

Member:

Can you make those functions static or put them in the anonymous namespace?

Member:

Would be nice to avoid this.

Member:

I don't think it's desirable to force all processing to go to the global thread pool unconditionally.
(also, are you sure the processing is heavy enough that it benefits from it?)

Member Author:

Usually not, but it would be in the case of compressed buffers. I could also change it to not offload onto a secondary pool by default (and hence do the work on the same thread used to read data) and/or benchmark if there's any overhead to this.

Member:

The overhead of dispatching tasks to a thread pool is difficult to evaluate rigorously, but you can get an idea by running arrow-thread-pool-benchmark. IMHO the bottom line is that we should aim for tasks in the millisecond range (rather than microsecond or less).

Member Author:

Looking at the ReadFile benchmark, it seems reading a 1 MiB batch takes about 1 ms once there are >= 1024 columns.

------------------------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
ReadFile/1/real_time                  8130 ns         8130 ns        86111 bytes_per_second=120.115G/s
ReadFile/4/real_time                 10734 ns        10734 ns        65153 bytes_per_second=90.9826G/s
ReadFile/16/real_time                21779 ns        21779 ns        32081 bytes_per_second=44.8389G/s
ReadFile/64/real_time                67087 ns        67086 ns        10189 bytes_per_second=14.5567G/s
ReadFile/256/real_time              274905 ns       274901 ns         2543 bytes_per_second=3.55236G/s
ReadFile/1024/real_time            1074018 ns      1074004 ns          650 bytes_per_second=931.083M/s
ReadFile/4096/real_time            4307403 ns      4307316 ns          164 bytes_per_second=232.158M/s
ReadFile/8192/real_time            8266500 ns      8266343 ns           84 bytes_per_second=120.97M/s

So I'll change this to not use a separate thread pool by default. (I'd also like to evaluate this benchmark when compression is involved, though.)

Member:

My learning yesterday (maybe; take this all with a grain of salt, it's always possible I was misreading/misinterpreting) was that even if the processing is light, it pays to get off the I/O thread pool quickly. If the I/O is slow, you want to trigger the next read as quickly as possible. However, even when the I/O is fast (buffered), the read only ensures the accessed pages are in memory; it does not load them into the CPU caches. That happens when you actually access the memory.

So if you do some very fast decompression on the I/O thread, you will be forcing the actual load into the CPU cache of the I/O thread's CPU. Later, you will have transferred to the CPU thread for parsing, decoding, filtering, or projecting. If you're unlucky, the CPU thread you are given is on a different core and you end up paying the memory load cost all over again.

I will set up some micro-benchmarks later this week to try to suss out how accurate this is.

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from dfdac6e to f11cf6a (March 9, 2021 18:24)
@lidavidm (Member Author) commented Mar 9, 2021

FWIW, here's a benchmark of reading a compressed file, with and without readahead:

-------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------
ReadCompressedFile/1/real_time          254476620 ns     11010761 ns            3 bytes_per_second=502.993M/s
ReadCompressedFile/4/real_time           73552088 ns      4851059 ns            8 bytes_per_second=1.69948G/s
ReadCompressedFile/16/real_time          46830653 ns     14603649 ns           15 bytes_per_second=2.66919G/s
ReadCompressedFile/64/real_time          69125517 ns     52987906 ns           10 bytes_per_second=1.8083G/s
ReadCompressedFile/256/real_time        180001740 ns    171520913 ns            4 bytes_per_second=711.104M/s
ReadCompressedFile/1024/real_time       670619546 ns    646285305 ns            1 bytes_per_second=190.868M/s
ReadaheadCompressedFile/1/real_time      66585787 ns      4212685 ns           10 bytes_per_second=1.87728G/s
ReadaheadCompressedFile/4/real_time      28566113 ns      3307862 ns           25 bytes_per_second=4.37581G/s
ReadaheadCompressedFile/16/real_time     31308089 ns      6213048 ns           22 bytes_per_second=3.99258G/s
ReadaheadCompressedFile/64/real_time     43825981 ns     23104706 ns           16 bytes_per_second=2.85219G/s
ReadaheadCompressedFile/256/real_time   125055297 ns     92982211 ns            6 bytes_per_second=1023.55M/s
ReadaheadCompressedFile/1024/real_time  510944855 ns    375741074 ns            1 bytes_per_second=250.516M/s

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from 48a67a2 to 67bb3e3 (March 10, 2021 15:14)
@westonpace (Member) left a comment

This is good, thanks for doing this.

Member:

Why arrow::All instead of your new fail-fast version? Seems that would work here.

Member Author:

Here, I actually want the results, not just the status. I've renamed this: this future just reads the dictionary messages; it doesn't actually parse them.

Member:

Nit: Unused?

Member:

Nit: Unused?

Member:

Nit: Does this readahead actually help given you have one at the message level? Also, you're consuming them pretty rapidly. My guess is you could probably do away with it.

Member Author:

I think the one at the message level is actually the redundant one here, since that step is trivial (the file's in memory), while the decoding work is nontrivial (decompression). Note that the decompression isn't done in the IpcMessageGenerator, but as part of decoding the record batch.

With both sets of readahead:

ReadaheadCompressedFile/1/real_time      65481827 ns      4493732 ns            8 bytes_per_second=1.90893G/s
ReadaheadCompressedFile/4/real_time      29108664 ns      3322772 ns           24 bytes_per_second=4.29425G/s
ReadaheadCompressedFile/16/real_time     31461325 ns      6222532 ns           22 bytes_per_second=3.97313G/s
ReadaheadCompressedFile/64/real_time     44750635 ns     24064817 ns           16 bytes_per_second=2.79326G/s
ReadaheadCompressedFile/256/real_time   122477440 ns     91270202 ns            6 bytes_per_second=1045.09M/s
ReadaheadCompressedFile/1024/real_time  515403669 ns    381340640 ns            1 bytes_per_second=248.349M/s

With only I/O-level readahead:

ReadaheadCompressedFile/1/real_time     533258427 ns     11385747 ns            1 bytes_per_second=240.034M/s
ReadaheadCompressedFile/4/real_time      71233474 ns      2145576 ns            9 bytes_per_second=1.75479G/s
ReadaheadCompressedFile/16/real_time     45455989 ns      3591831 ns           15 bytes_per_second=2.74991G/s
ReadaheadCompressedFile/64/real_time     64612808 ns      8730379 ns           11 bytes_per_second=1.9346G/s
ReadaheadCompressedFile/256/real_time   188120059 ns     18770867 ns            4 bytes_per_second=680.417M/s
ReadaheadCompressedFile/1024/real_time  699025221 ns     58355812 ns            1 bytes_per_second=183.112M/s

With only batch-level readahead:

ReadaheadCompressedFile/1/real_time      59513029 ns      3280963 ns           11 bytes_per_second=2.10038G/s
ReadaheadCompressedFile/4/real_time      27160986 ns      2378473 ns           25 bytes_per_second=4.60219G/s
ReadaheadCompressedFile/16/real_time     30018574 ns      3746521 ns           23 bytes_per_second=4.16409G/s
ReadaheadCompressedFile/64/real_time     41358054 ns     11515016 ns           17 bytes_per_second=3.02239G/s
ReadaheadCompressedFile/256/real_time    80396235 ns     26029905 ns            9 bytes_per_second=1.5548G/s
ReadaheadCompressedFile/1024/real_time  473883489 ns     96278350 ns            2 bytes_per_second=270.109M/s

With no readahead:

ReadaheadCompressedFile/1/real_time     544947010 ns      8763782 ns            2 bytes_per_second=234.885M/s
ReadaheadCompressedFile/4/real_time      73232625 ns      1522407 ns            8 bytes_per_second=1.70689G/s
ReadaheadCompressedFile/16/real_time     47781962 ns      2977593 ns           15 bytes_per_second=2.61605G/s
ReadaheadCompressedFile/64/real_time     64026555 ns      7882798 ns           11 bytes_per_second=1.95231G/s
ReadaheadCompressedFile/256/real_time   187098473 ns     18156513 ns            4 bytes_per_second=684.132M/s
ReadaheadCompressedFile/1024/real_time  696976648 ns     57355852 ns            1 bytes_per_second=183.65M/s

So I'll change this to only test batch-level readahead. I/O level readahead would help more on something like S3, which we could set up a benchmark for as well.
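For reference, the batch-level readahead kept here is just Arrow's generic generator wrapper; a minimal sketch (the surrounding function name is hypothetical) of applying it only at the decoding stage:

#include <memory>

#include "arrow/record_batch.h"
#include "arrow/util/async_generator.h"

// Wrap only the batch-decoding generator in readahead: up to `readahead`
// decode futures are kept in flight, so decompression of later batches can
// proceed while earlier batches are still being consumed.
arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> WithBatchReadahead(
    arrow::AsyncGenerator<std::shared_ptr<arrow::RecordBatch>> batch_gen,
    int readahead = 8) {
  return arrow::MakeReadaheadGenerator(std::move(batch_gen), readahead);
}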

Member:

I don't think this is safe the way the macro works. Use ASSERT_FINISHES_OK_AND_ASSIGN. It also adds a 10 second timeout in case something gunks up for whatever reason.

Member Author:

I had to define an EXPECT_FINISHES_OK_AND_ASSIGN to get it to work here.

Member:

Same as below, ASSERT_FINISHES_OK_AND_ASSIGN

Member:

An unavoidable naming nit I think. This is the sync counterpart to RecordBatchFileReader and the two names aren't very parallel. Ideally the other could be named IpcFileRecordBatchReader.

@westonpace (Member)

Can you rebase this? I'm using it as part of ARROW-12014

@lidavidm (Member Author)

@westonpace done.

@lidavidm (Member Author) commented Mar 31, 2021

I've rebased this to use the background generator; however, it doesn't help much, and it makes us non-reentrant, so we also lose any advantage with compressed data since we can't parallelize the decompression anymore.

The async reader gets anywhere from 30-90% of the throughput of the synchronous one.

Cases here are numbered by the number of columns in the file. The cases with very few columns are a worst case for async, since decoding is basically zero-cost and async is pure overhead. Conversely, the cases with many columns are a best case, since decoding is expensive. However, async doesn't help because I/O is relatively cheap in all cases benchmarked here, so there is no pipelining to be had.

Frankly, the fastest approach I tested was to just wrap the synchronous reader in a Future and block the caller, which isn't encouraging. A flamegraph shows that using the thread pool for decoding work is still rather expensive, and so it might be better if we used something like the background generator for that as well. In that case it would be convenient if we could somehow pull directly from the background generator's queue instead of having to get and block on futures; also this still means we can't get any benefit from parallelizing decompression if needed. For datasets with files >= cores that's probably not a big deal if you only care about throughput (we'll still decode in parallel) but if you need results in order and/or you have few files relative to cores then it won't be optimal.

You may question why in-memory (ReadFile) is slower than a temp file (ReadTempFile). In the flamegraphs, the culprit appears to be BufferReader's use of MemoryAdviseWillNeed, which spends a significant amount of time in the kernel. Removing it improves performance drastically.

-------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------
ReadFile/1/real_time                         7858 ns         7858 ns        85629 bytes_per_second=124.269G/s
ReadFile/4/real_time                        10698 ns        10698 ns        64406 bytes_per_second=91.2852G/s
ReadFile/16/real_time                       21661 ns        21661 ns        32684 bytes_per_second=45.0839G/s
ReadFile/64/real_time                       67470 ns        67470 ns        10406 bytes_per_second=14.4741G/s
ReadFile/256/real_time                     275292 ns       275282 ns         2553 bytes_per_second=3.54738G/s
ReadFile/1024/real_time                   1071125 ns      1071065 ns          652 bytes_per_second=933.598M/s
ReadFile/4096/real_time                   4245107 ns      4245052 ns          165 bytes_per_second=235.565M/s
ReadFile/8192/real_time                   8157924 ns      8157957 ns           85 bytes_per_second=122.58M/s
ReadFileAsync/1/real_time                   23883 ns         7835 ns        29390 bytes_per_second=40.8887G/s
ReadFileAsync/4/real_time                   27242 ns         9040 ns        25836 bytes_per_second=35.8478G/s
ReadFileAsync/16/real_time                  40988 ns        14562 ns        17154 bytes_per_second=23.8253G/s
ReadFileAsync/64/real_time                  93104 ns        33633 ns         7334 bytes_per_second=10.489G/s
ReadFileAsync/256/real_time                303852 ns       116901 ns         2313 bytes_per_second=3.21394G/s
ReadFileAsync/1024/real_time              1430233 ns       531043 ns          546 bytes_per_second=699.187M/s
ReadFileAsync/4096/real_time              4589980 ns      1895584 ns          153 bytes_per_second=217.866M/s
ReadFileAsync/8192/real_time              8793373 ns      3865574 ns           82 bytes_per_second=113.722M/s
ReadTempFile/1/real_time                    70972 ns        70936 ns         9712 bytes_per_second=220.157G/s
ReadTempFile/4/real_time                    74053 ns        74022 ns         9243 bytes_per_second=210.997G/s
ReadTempFile/16/real_time                   85777 ns        85749 ns         8100 bytes_per_second=182.158G/s
ReadTempFile/64/real_time                  132803 ns       132783 ns         5331 bytes_per_second=117.656G/s
ReadTempFile/256/real_time                 333974 ns       333967 ns         2093 bytes_per_second=46.785G/s
ReadTempFile/1024/real_time               1131198 ns      1131179 ns          607 bytes_per_second=13.8128G/s
ReadTempFile/4096/real_time               4330575 ns      4330568 ns          161 bytes_per_second=3.60807G/s
ReadTempFile/8192/real_time               8270275 ns      8270100 ns           85 bytes_per_second=1.8893G/s
ReadTempFileAsync/1/real_time               88569 ns        12731 ns         7814 bytes_per_second=176.417G/s
ReadTempFileAsync/4/real_time               94127 ns        14422 ns         7477 bytes_per_second=165.998G/s
ReadTempFileAsync/16/real_time             104455 ns        20203 ns         6652 bytes_per_second=149.586G/s
ReadTempFileAsync/64/real_time             158604 ns        38862 ns         4443 bytes_per_second=98.516G/s
ReadTempFileAsync/256/real_time            372728 ns       122446 ns         1831 bytes_per_second=41.9207G/s
ReadTempFileAsync/1024/real_time          1347728 ns       485078 ns          520 bytes_per_second=11.5936G/s
ReadTempFileAsync/4096/real_time          4649311 ns      1930484 ns          151 bytes_per_second=3.36071G/s
ReadTempFileAsync/8192/real_time          8773800 ns      3815852 ns           80 bytes_per_second=1.78087G/s
ReadCompressedFile/1/real_time           30636840 ns      1421583 ns           23 bytes_per_second=522.247M/s
ReadCompressedFile/4/real_time            9529811 ns       628655 ns           65 bytes_per_second=1.63959G/s
ReadCompressedFile/16/real_time           5673642 ns      1863531 ns          122 bytes_per_second=2.75396G/s
ReadCompressedFile/64/real_time           8372634 ns      6633169 ns           84 bytes_per_second=1.8662G/s
ReadCompressedFile/256/real_time         22590210 ns     21607133 ns           28 bytes_per_second=708.271M/s
ReadCompressedFile/1024/real_time        84274350 ns     81412117 ns            9 bytes_per_second=189.856M/s
ReadCompressedFile/4096/real_time       330157333 ns    317542733 ns            2 bytes_per_second=48.4617M/s
ReadCompressedFile/8192/real_time       648075491 ns    627804731 ns            1 bytes_per_second=24.6885M/s
ReadCompressedFileAsync/1/real_time      57512529 ns      1849864 ns            9 bytes_per_second=278.2M/s
ReadCompressedFileAsync/4/real_time       9702801 ns       553906 ns           71 bytes_per_second=1.61036G/s
ReadCompressedFileAsync/16/real_time      6001873 ns      1765858 ns          114 bytes_per_second=2.60335G/s
ReadCompressedFileAsync/64/real_time      8414578 ns      6398791 ns           81 bytes_per_second=1.8569G/s
ReadCompressedFileAsync/256/real_time    22844448 ns     20703843 ns           30 bytes_per_second=700.389M/s
ReadCompressedFileAsync/1024/real_time   83260767 ns     75605439 ns            8 bytes_per_second=192.167M/s
ReadCompressedFileAsync/4096/real_time  329809506 ns    298760917 ns            2 bytes_per_second=48.5129M/s
ReadCompressedFileAsync/8192/real_time  643886356 ns    584995701 ns            1 bytes_per_second=24.8491M/s

@lidavidm (Member Author)

That said, the case here is not too relevant to Datasets, so the next thing will be to integrate this with the ARROW-7001 WIP and benchmark against local/S3 file systems with varying numbers of files.

@lidavidm (Member Author) commented Apr 1, 2021

I've tested on EC2/S3 now. In terms of Datasets, this doesn't affect scans when files <= cores, but improves performance when files > cores. It also fixes a regression compared to just using ARROW-7001 by itself. (The ARROW-11772 case in the charts is with ARROW-7001 combined with ARROW-11772.)

[Chart: Median Scan Time (seconds)]

The dataset tested was a 16-file feather dataset, scanning either 1, 4, or all 16 files. Each file had 32 columns of mixed primitive types with 128 batches of 4096 rows per batch. Files had LZ4 compression. The EC2 instance tested was a t3.xlarge (4 cores).

@lidavidm (Member Author) commented Apr 1, 2021

And these are all the cases tested:

arrow-3.0; local; # of files: 1
Min:     0.096s
Mean:    0.111s
Median:  0.112s
Max:     0.131s
arrow-3.0; local; # of files: 4
Min:     0.113s
Mean:    0.134s
Median:  0.133s
Max:     0.163s
arrow-3.0; local; # of files: 16
Min:     0.469s
Mean:    0.487s
Median:  0.482s
Max:     0.517s
arrow-3.0; s3; # of files: 1
Min:     5.026s
Mean:    5.709s
Median:  5.698s
Max:     6.325s
arrow-3.0; s3; # of files: 4
Min:     4.712s
Mean:    5.185s
Median:  5.200s
Max:     5.723s
arrow-3.0; s3; # of files: 16
Min:    21.941s
Mean:   23.111s
Median: 23.045s
Max:    25.167s
arrow-4.0-async; local; # of files: 1
Min:     0.098s
Mean:    0.126s
Median:  0.129s
Max:     0.152s
arrow-4.0-async; local; # of files: 4
Min:     0.330s
Mean:    0.354s
Median:  0.351s
Max:     0.386s
arrow-4.0-async; local; # of files: 16
Min:     1.256s
Mean:    1.319s
Median:  1.310s
Max:     1.412s
arrow-4.0-async; s3; # of files: 1
Min:     5.578s
Mean:    6.060s
Median:  5.796s
Max:     6.778s
arrow-4.0-async; s3; # of files: 4
Min:    22.715s
Mean:   24.146s
Median: 23.957s
Max:    26.078s
arrow-4.0-async; s3; # of files: 16
Min:    91.531s
Mean:   96.067s
Median: 94.825s
Max:    103.538s
arrow-4.0-full-async; local; # of files: 1
Min:     0.087s
Mean:    0.113s
Median:  0.116s
Max:     0.133s
arrow-4.0-full-async; local; # of files: 4
Min:     0.202s
Mean:    0.270s
Median:  0.242s
Max:     0.381s
arrow-4.0-full-async; local; # of files: 16
Min:     0.777s
Mean:    1.260s
Median:  1.172s
Max:     1.886s
arrow-4.0-full-async; s3; # of files: 1
Min:     5.060s
Mean:    5.672s
Median:  5.665s
Max:     6.526s
arrow-4.0-full-async; s3; # of files: 4
Min:     5.373s
Mean:    5.669s
Median:  5.667s
Max:     5.892s
arrow-4.0-full-async; s3; # of files: 16
Min:    10.891s
Mean:   12.102s
Median: 11.441s
Max:    15.285s
arrow-4.0; local; # of files: 1
Min:     0.089s
Mean:    0.115s
Median:  0.113s
Max:     0.143s
arrow-4.0; local; # of files: 4
Min:     0.157s
Mean:    0.166s
Median:  0.162s
Max:     0.188s
arrow-4.0; local; # of files: 16
Min:     0.608s
Mean:    0.631s
Median:  0.632s
Max:     0.663s
arrow-4.0; s3; # of files: 1
Min:     5.023s
Mean:    5.186s
Median:  5.163s
Max:     5.483s
arrow-4.0; s3; # of files: 4
Min:     5.264s
Mean:    6.013s
Median:  5.590s
Max:     8.655s
arrow-4.0; s3; # of files: 16
Min:    21.682s
Mean:   22.759s
Median: 22.432s
Max:    24.636s

@lidavidm (Member Author) commented Apr 1, 2021

However, it is still a large regression with local files:

[Chart: Median Scan Time (seconds)]

Member:

Ok, I don't think this is the right approach. Basically you are taking a blocking message reader (ReadMessageFromBlock) and pushing it to a thread pool to hide latencies. But when reading a message from a mmap'ed file, you are adding the thread pool overhead to a very small execution cost.

Instead, there should be a ReadMessageFromBlockAsync that returns a Future<Message>. It's up to the underlying IO object (memory mapped file, S3 object...) to decide whether the async read goes to a thread pool or is synchronous.
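A hedged sketch of that proposal (illustrative only, not the PR's code; the block and helper names mirror the sync reader's internals, everything else is assumed): the future-returning read lets each IO object decide where the work actually runs.

#include <cstdint>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/io/memory.h"
#include "arrow/ipc/message.h"
#include "arrow/result.h"
#include "arrow/util/future.h"

// Hypothetical block descriptor mirroring the reader's internal FileBlock.
struct FileBlock {
  int64_t offset;
  int32_t metadata_length;
  int64_t body_length;
};

// Hypothetical async counterpart to ReadMessageFromBlock. A memory-mapped file
// can satisfy ReadAsync synchronously, while S3 dispatches to its IO pool.
arrow::Future<std::shared_ptr<arrow::ipc::Message>> ReadMessageFromBlockAsync(
    FileBlock block, std::shared_ptr<arrow::io::RandomAccessFile> file) {
  return file->ReadAsync(block.offset, block.metadata_length + block.body_length)
      .Then([](const std::shared_ptr<arrow::Buffer>& buffer)
                -> arrow::Result<std::shared_ptr<arrow::ipc::Message>> {
        // Decode the message (metadata + body) from the bytes just read.
        arrow::io::BufferReader stream(buffer);
        ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::ipc::Message> message,
                              arrow::ipc::ReadMessage(&stream));
        return std::shared_ptr<arrow::ipc::Message>(std::move(message));
      });
}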

@lidavidm (Member Author)

Transferring off the I/O pool fixes the performance degradation (at some cost to the 16-file S3 case, but even then, we're still much faster than before). This also puts us in about the same position as when testing ARROW-7001.

[Chart: Local Median Scan Time (seconds)]
[Chart: S3 Median Scan Time (seconds)]

@westonpace (Member) left a comment

Just a few minor cleanup notes.

Member:

Nit: I'm pretty sure you can do const ScanOptions& scan_options here.

Member:

Same with format.

Member:

Why force use_threads to false?

Member:

Nit: Change reader to a const reference. Since there could be multiple callbacks, we can never move into a callback, so it's always a const reference or a copy.

Member:

Maybe kBatchSize? kTotalSize makes me think it is the size of all the batches.

Member:

This overload seems superfluous to me, given there is one that takes io::RandomAccessFile* already.

Member Author:

This is just mirroring the sync API. If we wanted to cut down on overloads, I'd rather have only this overload as it keeps the file alive (the other one would be prone to dangling pointers/isn't used by datasets).

Member:

Same as above, do we need both versions?

Member:

Why a macro instead of a small helper function?

Member Author:

I am not sure why it was originally done that way; I guess to be extra sure it was inlined?

Member:

I agree with the comment above that this forces read_dictionaries_ to be a rather odd future. Is there any reason you don't want to do:

auto read_messages = read_dictionaries_.Then([] (...) {return read_message;});

I think it cleans up the surrounding code nicely and you can change read_dictionaries_ to Future<>

Member:

This could trigger a lot of small reads depending on the record batch configuration, but I think this is tackled with coalescing?

Member Author:

I intend to rebase on ARROW-12522 and add coalescing.

@lidavidm (Member Author) left a comment

Thanks for the review. I've addressed the feedback. This is now rebased on the coalescing PR, so I will re-test this branch and verify whether coalescing helps and whether transferring is necessary.


Member Author:

Unfortunately yes, since MessageDecoder takes shared_ptr/unique_ptr directly.

Member Author:

Right, I meant "the generator will keep the reader alive for us".

Member Author:

The ASSERT variants don't work in non-void-returning contexts, so I'll add the EXPECT variant.

Member Author:

They aren't ever called, so I assume they are vestigial (I will delete them).



@lidavidm force-pushed the arrow-11772 branch 3 times, most recently from 3a4b67c to a1995f8 (April 28, 2021 14:09)
@lidavidm (Member Author)

I instrumented a bit of the scanner and took a look at pre-buffering's effect on local file performance, to see if I could figure out why it seems detrimental.

Without pre-buffering locally, we're utilizing the CPU thread pool (nearly) fully. (I'm not sure why one thread seems to receive little/no work, but I didn't fully instrument the scanner and perhaps it's assembling the table or something like that.)
[Chart: thread utilization]

Just enabling pre-buffer causes a slowdown, as it appears nearly all work gets forced onto a single thread!
[Chart: thread utilization]

Earlier I saw that transferring the CPU task helped with this, and indeed it does, though it doesn't quite get us on par. Also note I had to change the code to force a transfer by calling Spawn instead of just using Transfer.
[Chart: thread utilization]

Given this, I'd guess what happens is as follows: with coalescing, since multiple record batches get mapped onto one I/O future, later requests to read a record batch will find an already-completed Future and will synchronously run all the callbacks inline, causing everything to pile onto the main thread. Forcing transfer alleviates this and better utilizes the thread pool, but the CPU-bound tasks are very fast (<< 1ms) and so we face the overhead of creating extra futures/spawning tasks instead. (Note that the non-pre-buffer case probably isn't even touching the CPU thread pool, because we don't transfer in that case - that's why you only see 16 threads (8 I/O + 8 CPU) in the last case.)
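To make the "force transfer" idea concrete, here is a sketch of a hypothetical helper (not the code in this PR): instead of a plain transfer, which runs the continuation inline when the source future has already finished, the callback is always re-dispatched via Spawn.

#include <utility>

#include "arrow/result.h"
#include "arrow/util/future.h"
#include "arrow/util/thread_pool.h"

// Always re-dispatch onto `executor`, even if `source` has already finished;
// otherwise callbacks attached to completed futures run inline on the caller
// and pile up on one thread.
template <typename T>
arrow::Future<T> TransferAlways(arrow::Future<T> source,
                                arrow::internal::Executor* executor) {
  auto destination = arrow::Future<T>::Make();
  source.AddCallback([destination, executor](const arrow::Result<T>& result) mutable {
    auto status = executor->Spawn(
        [destination, result]() mutable { destination.MarkFinished(result); });
    if (!status.ok()) {
      destination.MarkFinished(status);
    }
  });
  return destination;
}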

@westonpace (Member)

Indeed. Check out ARROW-12560, which should achieve the best of both worlds (minimizing spawning while maximizing core utilization).

@lidavidm (Member Author)

I verified things on EC2. ARROW-12560 will let us just always 'do the right thing' so to speak, but for now, it looks like always transferring is beneficial in essentially all cases, so I've updated the PR.

[Chart: Local Median Scan Duration (s)]
[Chart: S3 Median Scan Duration (s)]

@westonpace (Member)

Keep in mind too that UseAsync is optional so you don't need to solve all the possible performance permutations in this PR.

@lidavidm (Member Author)

Yes, I may have gotten a bit carried away…I'm happy to split this up if that's more manageable.

@westonpace (Member)

I don't think that's necessary. I was going more for "this looks really good, let's get it in" and not so much "this is getting too big".

@lidavidm force-pushed the arrow-11772 branch 3 times, most recently from 2e0beca to 85a5d7c (May 7, 2021 13:58)
@lidavidm (Member Author)

Just to follow up, @pitrou @westonpace, did either of you have final comments here?

@westonpace (Member)

No final comments from me. I scanned through the last few updates and all seems good.

@lidavidm force-pushed the arrow-11772 branch 2 times, most recently from 10bfcb7 to f5ab4b5 (May 13, 2021 14:55)
@lidavidm (Member Author)

CI passes, so I'll go ahead and merge this.
