Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Aug 30, 2019

Currently ScanOptions has two distinct responsibilities: it contains the data selector (and eventually projection schema) for the current scan and it serves as the base class for format specific scan options. This makes providing scan options for more than a single format impossible (as the resulting merged options would for example need to inherit both JsonScanOptions and ParquetScanOptions).

This patch removes ScanOptions and promotes FileScanOptions to the abstract base class for format specific scan options.

ScanContext is now the root container of scan state: contains the data selector, projection schema, and a vector of FileScanOptions

@fsaintjacques fsaintjacques changed the title ARROW-6398: [C++] consolidate ScanOptions and ScanContext ARROW-6398: [C++] Consolidate ScanOptions and ScanContext Aug 30, 2019
@codecov-io
Copy link

Codecov Report

Merging #5239 into master will increase coverage by 1.56%.
The diff coverage is 92.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5239      +/-   ##
==========================================
+ Coverage   87.64%    89.2%   +1.56%     
==========================================
  Files        1033      750     -283     
  Lines      148463   107583   -40880     
  Branches     1437        0    -1437     
==========================================
- Hits       130118    95969   -34149     
+ Misses      17983    11614    -6369     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/dataset/dataset.h 76.92% <0%> (ø) ⬆️
cpp/src/arrow/dataset/file_base.cc 92.85% <100%> (-0.33%) ⬇️
cpp/src/arrow/dataset/file_parquet_test.cc 93.75% <100%> (-0.1%) ⬇️
cpp/src/arrow/dataset/dataset.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/dataset/file_parquet.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/dataset/scanner.h 100% <100%> (ø) ⬆️
cpp/src/arrow/dataset/file_parquet.h 87.5% <100%> (ø) ⬆️
cpp/src/arrow/dataset/file_base.h 86.66% <50%> (-0.84%) ⬇️
cpp/src/arrow/dataset/test_util.h 94.36% <91.66%> (+0.08%) ⬆️
cpp/src/arrow/json/converter.cc 90.05% <0%> (-1.76%) ⬇️
... and 286 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update beea8f9...709ea62. Read the comment docs.

@bkietz bkietz force-pushed the 6398-consolidate-ScanOptions-a branch 2 times, most recently from 134d853 to 35473cf Compare September 6, 2019 17:01
@bkietz bkietz force-pushed the 6398-consolidate-ScanOptions-a branch from 35473cf to 7317c15 Compare September 9, 2019 14:42
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the naming. Should this kind of method be called set_selector? @wesm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or add_selector

@pitrou
Copy link
Member

pitrou commented Sep 11, 2019

Can you rebase in order to fix the GLib / C issues?

@bkietz bkietz force-pushed the 6398-consolidate-ScanOptions-a branch from 3f8af91 to 598ea6f Compare September 11, 2019 16:24
Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While working on discovery, I noted a problem with this change. Let's distinguish between 2 phases.

  • Discovery: in the discovery phase, directories/services will be crawled to find available files/fragments. Schema may or not yet be fixed, but the discovery phase allows inferring/enforcing the DataSource common schema before materializing the DataSource. In order to do this, it needs to read/peek files, and thus it needs CsvReaderOptions, ParquetReaderOptions, etc... The discovery phase needs to create a DataSource with said "read" options + schema fixed in the DataSource.

  • Scan: in the scan phase, each DataSource is iterated over to concatenate their fragments. The caller controls the filter clause (which fragments to skip), and the projected (subset or superset) schema.

The discovery options (how to open a file, connect url, reconciliation schema) should not be passed by the caller at Scan phase, they're fixed in the DataSource. The (Scan) user shouldn't care about CsvReadOption (that's the goal of the dataset project, hide those details).

I think the existing classes (ScanOptions, ScanContext) convey this but with unclear naming, e.g. it would probably make more sense to rename and re-organize as follow:

  • ScanContext -> ScanOptions(proj: Schema, selector: Filter, ctx: (MemoryPool...))
  • ScanOptions -> DataSourceOptions(reconcile: Schema, format: FileFormat, options: FileFormatOptions(CsvReader...)).

The DataFragment constructor would take DataSourceOptions, while FileFragment::Scan would take the ScanOptions as parameter.

@pitrou
Copy link
Member

pitrou commented Sep 12, 2019

@fsaintjacques Is it DataSourceOptions or DiscoveryOptions?

@bkietz
Copy link
Member Author

bkietz commented Sep 12, 2019

I think DataSourceOptions makes more sense; the options are determined during discovery but they primarily impact the behavior of a data source.

For example the reconciliation schema declares which columns in a data source will be available and the type to which they will be cast. This would be generated when a discovered source's inferred schema does not match that of other discovered sources, for example when an older data file doesn't include columns present in newer files

@bkietz
Copy link
Member Author

bkietz commented Sep 25, 2019

Closing for now, will try again later

@bkietz bkietz closed this Sep 25, 2019
@bkietz bkietz deleted the 6398-consolidate-ScanOptions-a branch February 25, 2021 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants