
Conversation

@marsupialtail
Contributor

This exposes the Fragment Readahead and Batch Readahead flags in the C++ Scanner to the user in Python.

These flags can be used to fine-tune RAM usage and IO utilization when downloading large files from S3 or other network sources. I believe the default settings are overly conservative (tuned for small-RAM machines), and I observe less than 20% IO utilization on some AWS instances.

The flags are exposed only on the Python methods where they make sense: scanning from a RecordBatchIterator neither needs these flags nor gives them any meaning, and only the latter flag (batch readahead) makes sense when making a scanner from a fragment.
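For reference, a minimal sketch of how the exposed keywords might be used from Python (the kwarg names `batch_readahead` and `fragment_readahead` come from this PR; the `Scanner.from_fragment` call reflects the paragraph above and is illustrative only):

```
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/dataset")  # e.g. the S3 TPC dataset used in the test below

# Dataset-level scans accept both knobs.
batches = dataset.to_batches(batch_size=1_000_000_000,
                             batch_readahead=32,      # batches read ahead within a fragment
                             fragment_readahead=16)   # fragments (files) read ahead

# Per the description above, a scanner built from a single fragment
# only takes the batch readahead knob.
fragment = next(dataset.get_fragments())
scanner = ds.Scanner.from_fragment(fragment, batch_readahead=32)
```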

To test this, set up an i3.2xlarge instance on AWS:

```
import pyarrow
import pyarrow.dataset as ds
import pyarrow.csv as csv
import time

pyarrow.set_cpu_count(8)
pyarrow.set_io_thread_count(16)

lineitem_scheme = ["l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity", "l_extendedprice",
                   "l_discount", "l_tax", "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
                   "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment", "null"]
csv_format = ds.CsvFileFormat(
    read_options=csv.ReadOptions(column_names=lineitem_scheme, block_size=32 * 1024 * 1024),
    parse_options=csv.ParseOptions(delimiter="|"))
dataset = ds.dataset("s3://TPC", format=csv_format)

s = dataset.to_batches(batch_size=1000000000)
count = 0
while count < 100:
    z = next(s)
    count += 1
```

For our purposes, let's make the TPC dataset consist of hundreds of Parquet files, each with one row group (something that Spark would generate). This script would get somewhere around 1 Gbps. If you now do

```
s = dataset.to_batches(batch_size=1000000000, fragment_readahead=16)
```

you can get to 2.5 Gbps, which is the advertised steady-state rate cap for this instance type.

@github-actions

github-actions bot commented Aug 4, 2022

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Member

@westonpace westonpace left a comment

Thanks for adding this. I took a quick pass at review.

```
_populate_builder(builder, columns=columns, filter=filter,
                  batch_size=batch_size, use_threads=use_threads,
                  batch_readahead=batch_readahead,
                  fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD,
```
Member

Suggested change
fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD,

I don't think we need to specify this kwarg if we're just going to specify the default.

Contributor Author

This is a Cython quirk. You have to specify all the arguments.
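A hypothetical pure-Python illustration of that quirk (the names and default values below are stand-ins for illustration, apart from `_DEFAULT_FRAGMENT_READAHEAD`, which appears in the diff above):

```
# Module-level sentinels for "use the default"; illustrative values only,
# not necessarily the library defaults.
_DEFAULT_BATCH_READAHEAD = 16
_DEFAULT_FRAGMENT_READAHEAD = 4

def _populate_builder_demo(*, batch_size, batch_readahead, fragment_readahead):
    # Stand-in for the Cython helper: every knob is a required keyword,
    # so each call site has to spell out all of them.
    return {"batch_size": batch_size,
            "batch_readahead": batch_readahead,
            "fragment_readahead": fragment_readahead}

# A code path that does not expose fragment_readahead still passes the default:
print(_populate_builder_demo(batch_size=1_000_000,
                             batch_readahead=32,
                             fragment_readahead=_DEFAULT_FRAGMENT_READAHEAD))
```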

@marsupialtail marsupialtail changed the title Arrow 17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters Arrow-17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters Aug 4, 2022
@pitrou
Member

pitrou commented Aug 9, 2022

@westonpace @bkietz Why exactly does ScannerBuilder allow setting the same things that can be set in ScanOptions?

@bkietz bkietz changed the title Arrow-17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters ARROW-17299: [C++][Python] Expose the Scanner kDefaultBatchReadahead and kDefaultFragmentReadahead parameters Aug 9, 2022
@github-actions

github-actions bot commented Aug 9, 2022

@bkietz
Member

bkietz commented Aug 9, 2022

@pitrou @westonpace IIUC, ScannerBuilder is at this point mostly a wrapper around ScanOptions. Once upon a time it was needed to mediate the difference between single-threaded and async scanners and to guard construction of a dataset wrapping a record batch reader, but this becomes less and less necessary as more datasets functionality is subsumed by the compute engine. (For example, I'd say there's no longer a motivation to support constructing datasets from record batch readers, since the compute engine can use them as sources directly.) In short, I think what you're observing is ScannerBuilder on a gentle walk toward deprecation.

@westonpace
Member

Yes, scanner builder is on its way out, I hope, as part of #13782 (well, probably a follow-up). At the moment it still serves a slight purpose in that the projection option is a little hard to specify and it is something of a thorn when it comes to augmented fields.

I also agree with your other point. We spent considerable effort at one point making various things look like a dataset because datasets were the primary interface to the compute engine (e.g. filtering & projection). The record batch reader is a good example. I'd even go so far as to say InMemoryDataset is probably superfluous and that a better option in the future would be a "table_source" node. The scanner should be reserved for the case where you have multiple sources of data, with the same (or devolved versions of the same) schema.

All that being said, I don't think readahead is going away. However, in the near future (again, #13782) I was pondering if we should reframe readahead as "roughly how many bytes of data should the scanner attempt to read ahead" instead of "batch readahead and fragment readahead".
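As a back-of-envelope illustration of the bytes framing (all numbers below are assumptions for illustration, not library defaults or measured behaviour):

```
# Very rough upper bound on what the count-based knobs can keep in flight,
# which is what a single bytes-based knob would express directly.
batch_size = 1_000_000        # rows per batch (assumed)
batch_readahead = 16          # batches read ahead within a fragment (assumed)
fragment_readahead = 4        # fragments read ahead (assumed)
avg_row_width_bytes = 200     # assumed average row width

approx_bytes = fragment_readahead * batch_readahead * batch_size * avg_row_width_bytes
print(f"~{approx_bytes / 2**30:.1f} GiB potentially buffered")  # ~11.9 GiB with these numbers
```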

@marsupialtail
Contributor Author

marsupialtail commented Aug 12, 2022

I believe this is ready to be merged. @pitrou @westonpace

@westonpace westonpace self-requested a review August 15, 2022 18:31
@marsupialtail
Contributor Author

There is a potential problem with this: you can't increase the fragment readahead too much, or else the first batch will be significantly delayed. Not sure how much of a problem this is, though.
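A small sketch of how one could check that tradeoff, timing the first batch for a few readahead values (the dataset path and the values tried are placeholders):

```
import time
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/dataset")  # e.g. the S3 TPC dataset from the PR description

for fragment_readahead in (4, 16, 64):
    batches = dataset.to_batches(batch_size=1_000_000_000,
                                 fragment_readahead=fragment_readahead)
    start = time.time()
    next(batches)  # time until the very first batch is yielded
    print(f"fragment_readahead={fragment_readahead}: "
          f"first batch after {time.time() - start:.1f}s")
```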

Member

@westonpace westonpace left a comment

A few grammatical suggestions but otherwise I think this is a good addition. I think this may change to bytes_readahead / fragment_readahead before the release but it will be nice to have this in place already.

marsupialtail and others added 4 commits August 19, 2022 16:06
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@marsupialtail
Contributor Author

Don't think the failed checks have anything to do with me.

@pitrou
Member

pitrou commented Aug 24, 2022

Don't think the failed checks have anything to do with me.

Indeed, they don't.

@pitrou
Member

pitrou commented Aug 24, 2022

@marsupialtail Would you like to address @westonpace 's suggestions? Then I think we're good to go.

marsupialtail and others added 3 commits August 25, 2022 14:18
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@marsupialtail
Contributor Author

OK, I committed all the changes. @pitrou @westonpace

Member

@pitrou pitrou left a comment

Thanks for the update, just two suggestions below.

@marsupialtail
Contributor Author

done

Member

@pitrou pitrou left a comment

LGTM. Thank you @marsupialtail !

@pitrou pitrou merged commit ec7e250 into apache:master Sep 1, 2022
@ursabot

ursabot commented Sep 1, 2022

Benchmark runs are scheduled for baseline = 46f38dc and contender = ec7e250. ec7e250 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.75% ⬆️0.11%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] ec7e250c ec2-t3-xlarge-us-east-2
[Failed] ec7e250c test-mac-arm
[Failed] ec7e250c ursa-i9-9960x
[Finished] ec7e250c ursa-thinkcentre-m75q
[Finished] 46f38dca ec2-t3-xlarge-us-east-2
[Failed] 46f38dca test-mac-arm
[Failed] 46f38dca ursa-i9-9960x
[Finished] 46f38dca ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…and kDefaultFragmentReadahead parameters (apache#13799)

Authored-by: Ziheng Wang <zihengw@stanford.edu>
Signed-off-by: Antoine Pitrou <antoine@python.org>
