ARROW-6398: [C++] Consolidate ScanOptions and ScanContext #5239

bkietz · 2019-08-30T17:17:24Z

Currently ScanOptions has two distinct responsibilities: it contains the data selector (and eventually the projection schema) for the current scan, and it serves as the base class for format-specific scan options. This makes it impossible to provide scan options for more than one format, since the merged options would need to inherit from, for example, both JsonScanOptions and ParquetScanOptions.

This patch removes ScanOptions and promotes FileScanOptions to the abstract base class for format-specific scan options.

ScanContext is now the root container of scan state: it contains the data selector, the projection schema, and a vector of FileScanOptions.
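
A minimal sketch of the layout described above, for illustration only. Every type here is a simplified, hypothetical stand-in rather than the declarations in this patch; in particular `file_type()` and `OptionsFor()` are assumed helpers. It shows how holding a vector of FileScanOptions lets options for several formats coexist without one merged class inheriting from each format's options.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; not the actual Arrow declarations.
struct Schema { std::vector<std::string> field_names; };
struct DataSelector { std::vector<std::string> filters; };

// Abstract base for format-specific options.
class FileScanOptions {
 public:
  virtual ~FileScanOptions() = default;
  virtual std::string file_type() const = 0;  // assumed accessor
};

class ParquetScanOptions : public FileScanOptions {
 public:
  std::string file_type() const override { return "parquet"; }
};

class JsonScanOptions : public FileScanOptions {
 public:
  std::string file_type() const override { return "json"; }
};

// Root container of scan state: one selector, one projection schema, and
// any number of per-format option objects held side by side.
struct ScanContext {
  std::shared_ptr<DataSelector> selector;
  std::shared_ptr<Schema> schema;
  std::vector<std::shared_ptr<FileScanOptions>> options;

  // Hypothetical helper: return the options registered for a format,
  // or nullptr if none were provided.
  std::shared_ptr<FileScanOptions> OptionsFor(const std::string& file_type) const {
    for (const auto& opts : options) {
      assert(opts != nullptr);
      if (opts->file_type() == file_type) return opts;
    }
    return nullptr;
  }
};
```

With this shape a scan over mixed Parquet and JSON files can carry both option objects in one ScanContext, and each format looks up its own entry.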

codecov-io · 2019-08-30T21:53:16Z

Codecov Report

Merging #5239 into master will increase coverage by 1.56%.
The diff coverage is 92.68%.

@@            Coverage Diff             @@
##           master    #5239      +/-   ##
==========================================
+ Coverage   87.64%    89.2%   +1.56%     
==========================================
  Files        1033      750     -283     
  Lines      148463   107583   -40880     
  Branches     1437        0    -1437     
==========================================
- Hits       130118    95969   -34149     
+ Misses      17983    11614    -6369     
+ Partials      362        0     -362

Impacted Files	Coverage Δ
cpp/src/arrow/dataset/dataset.h	`76.92% <0%> (ø)`	⬆️
cpp/src/arrow/dataset/file_base.cc	`92.85% <100%> (-0.33%)`	⬇️
cpp/src/arrow/dataset/file_parquet_test.cc	`93.75% <100%> (-0.1%)`	⬇️
cpp/src/arrow/dataset/dataset.cc	`100% <100%> (ø)`	⬆️
cpp/src/arrow/dataset/file_parquet.cc	`100% <100%> (ø)`	⬆️
cpp/src/arrow/dataset/scanner.h	`100% <100%> (ø)`	⬆️
cpp/src/arrow/dataset/file_parquet.h	`87.5% <100%> (ø)`	⬆️
cpp/src/arrow/dataset/file_base.h	`86.66% <50%> (-0.84%)`	⬇️
cpp/src/arrow/dataset/test_util.h	`94.36% <91.66%> (+0.08%)`	⬆️
cpp/src/arrow/json/converter.cc	`90.05% <0%> (-1.76%)`	⬇️
... and 286 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update beea8f9...709ea62.

pitrou · 2019-09-11T13:33:40Z

cpp/src/arrow/dataset/scanner.h

Not sure about the naming. Should this kind of method be called set_selector? @wesm

Or add_selector

pitrou · 2019-09-11T13:33:55Z

Can you rebase in order to fix the GLib / C issues?

fsaintjacques

While working on discovery, I noted a problem with this change. Let's distinguish between 2 phases.

Discovery: in the discovery phase, directories/services are crawled to find the available files/fragments. The schema may or may not be fixed yet, but the discovery phase allows inferring/enforcing the DataSource common schema before materializing the DataSource. To do this, it needs to read/peek at files, and thus it needs CsvReaderOptions, ParquetReaderOptions, etc. The discovery phase needs to create a DataSource with said "read" options and the schema fixed in the DataSource.
Scan: in the scan phase, each DataSource is iterated over to concatenate their fragments. The caller controls the filter clause (which fragments to skip), and the projected (subset or superset) schema.

The discovery options (how to open a file, the URL to connect to, the reconciliation schema) should not be passed by the caller in the Scan phase; they're fixed in the DataSource. The (Scan) user shouldn't care about CsvReaderOptions (that's the goal of the dataset project: to hide those details).

I think the existing classes (ScanOptions, ScanContext) convey this, but with unclear naming. It would probably make more sense to rename and reorganize them as follows:

ScanContext -> ScanOptions(proj: Schema, selector: Filter, ctx: (MemoryPool...))
ScanOptions -> DataSourceOptions(reconcile: Schema, format: FileFormat, options: FileFormatOptions(CsvReader...)).

The DataFragment constructor would take DataSourceOptions, while FileFragment::Scan would take the ScanOptions as parameter.
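
The split proposed above could look roughly like the sketch below. All types are hypothetical placeholders (this is not Arrow's API), the field names simply mirror the two mapping lines, and the body of `Scan` is an assumed simplification that only reports which columns would be produced.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical placeholder types, for illustration only.
struct Schema { std::vector<std::string> field_names; };
struct Filter { std::string expression; };
struct MemoryPool {};
struct FileFormatOptions { virtual ~FileFormatOptions() = default; };
struct CsvReaderOptions : FileFormatOptions { char delimiter = ','; };

// Fixed at discovery time: how to open files and which schema to reconcile to.
struct DataSourceOptions {
  std::shared_ptr<Schema> reconcile;
  std::string format;  // stands in for a FileFormat object
  std::shared_ptr<FileFormatOptions> options;
};

// Chosen by the caller for each scan; carries no format-specific settings.
struct ScanOptions {
  std::shared_ptr<Schema> projection;
  std::shared_ptr<Filter> selector;
  MemoryPool* pool = nullptr;
};

class FileFragment {
 public:
  // The fragment receives the data source options once, at construction.
  FileFragment(std::string path, std::shared_ptr<DataSourceOptions> source_options)
      : path_(std::move(path)), source_options_(std::move(source_options)) {
    assert(source_options_ != nullptr);
  }

  // Assumed simplification: return the column names this scan would yield,
  // using the caller's projection if set, else the reconciliation schema.
  std::vector<std::string> Scan(const ScanOptions& scan_options) const {
    std::shared_ptr<Schema> schema =
        scan_options.projection ? scan_options.projection : source_options_->reconcile;
    return schema ? schema->field_names : std::vector<std::string>{};
  }

 private:
  std::string path_;
  std::shared_ptr<DataSourceOptions> source_options_;
};
```

The point of the sketch is the ownership boundary: reader options such as CsvReaderOptions never appear in the `Scan` signature, so the scan caller only ever supplies a projection, a filter, and a context.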

pitrou · 2019-09-12T16:32:00Z

@fsaintjacques Is it DataSourceOptions or DiscoveryOptions?

bkietz · 2019-09-12T17:40:43Z

I think DataSourceOptions makes more sense; the options are determined during discovery but they primarily impact the behavior of a data source.

For example, the reconciliation schema declares which columns in a data source will be available and the type to which they will be cast. This would be generated when a discovered source's inferred schema does not match that of other discovered sources, for example when an older data file doesn't include columns present in newer files.
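
As a rough illustration of how a reconciliation schema might be applied to one file, the sketch below compares it with that file's inferred schema and decides, per column, whether to pass the data through, cast it, or fill it in. The `Field`, `Action`, `Plan`, and `Reconcile` names are hypothetical, types are represented as plain strings, and filling missing columns with nulls is an assumption that the discussion does not specify.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified field: a column name and a type name.
struct Field { std::string name; std::string type; };

enum class Action { kPassThrough, kCast, kFillNull };

struct Plan { std::string name; std::string target_type; Action action; };

// For each column in the reconciliation schema, decide how to obtain it
// from a file with the given inferred schema.
std::vector<Plan> Reconcile(const std::vector<Field>& reconcile,
                            const std::vector<Field>& file_schema) {
  std::map<std::string, std::string> present;
  for (const auto& field : file_schema) present[field.name] = field.type;

  std::vector<Plan> plan;
  for (const auto& field : reconcile) {
    // Assumption: a column absent from the file is filled with nulls.
    Action action = Action::kFillNull;
    auto it = present.find(field.name);
    if (it != present.end()) {
      action = (it->second == field.type) ? Action::kPassThrough : Action::kCast;
    }
    plan.push_back({field.name, field.type, action});
  }
  // Every reconciled column gets exactly one plan entry.
  assert(plan.size() == reconcile.size());
  return plan;
}
```

An older file that lacks a newer column would get a fill entry for that column, while a column stored with a different type would get a cast entry.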

bkietz · 2019-09-25T16:32:44Z

Closing for now, will try again later

bkietz mentioned this pull request Aug 30, 2019

ARROW-6244: [C++][Dataset] Add partition key to DataSource interface #5221

Closed

fsaintjacques changed the title ~~ARROW-6398: [C++] consolidate ScanOptions and ScanContext~~ ARROW-6398: [C++] Consolidate ScanOptions and ScanContext Aug 30, 2019

pitrou reviewed Sep 2, 2019

cpp/src/arrow/dataset/scanner.h (outdated)

pitrou reviewed Sep 2, 2019

cpp/src/arrow/dataset/scanner.h (outdated)

bkietz force-pushed the 6398-consolidate-ScanOptions-a branch 2 times, most recently from 134d853 to 35473cf Compare September 6, 2019 17:01

ARF1 mentioned this pull request Sep 7, 2019

ARROW-6465: [Python] Improvement to Windows build instructions #5294

Closed

bkietz force-pushed the 6398-consolidate-ScanOptions-a branch from 35473cf to 7317c15 Compare September 9, 2019 14:42

pitrou reviewed Sep 11, 2019

bkietz added 8 commits September 11, 2019 12:23

fold ScanOptions into ScanContext

7c9d467

remove ScanContext from shared_ptr

5ad3e0b

lint fix

a7f3b54

rename ScanContext to ScanOptions

25a2b90

renaming in ScanOptions

398a2d0

move FileScanOptions into ScanOptions

655a546

fix merge errors

7b243e7

clang-format

598ea6f

bkietz force-pushed the 6398-consolidate-ScanOptions-a branch from 3f8af91 to 598ea6f Compare September 11, 2019 16:24

fsaintjacques requested changes Sep 11, 2019

bkietz closed this Sep 25, 2019

bkietz deleted the 6398-consolidate-ScanOptions-a branch February 25, 2021 16:56

asfimport mentioned this pull request Apr 10, 2020

[C++] Consolidate ScanOptions and ScanContext #22771

Closed
