ARROW-6398: [C++] Consolidate ScanOptions and ScanContext #5239
Codecov Report
@@ Coverage Diff @@
## master #5239 +/- ##
==========================================
+ Coverage 87.64% 89.2% +1.56%
==========================================
Files 1033 750 -283
Lines 148463 107583 -40880
Branches 1437 0 -1437
==========================================
- Hits 130118 95969 -34149
+ Misses 17983 11614 -6369
+ Partials 362 0 -362
134d853 to
35473cf
35473cf to
7317c15
cpp/src/arrow/dataset/scanner.h
Outdated
Not sure about the naming. Should this kind of method be called set_selector? @wesm
Or add_selector
Can you rebase in order to fix the GLib / C issues?
3f8af91 to
598ea6f
While working on discovery, I noticed a problem with this change. Let's distinguish between two phases.
- Discovery: in the discovery phase, directories/services are crawled to find available files/fragments. The schema may or may not be fixed yet, but the discovery phase allows inferring/enforcing the DataSource's common schema before materializing the DataSource. To do this, it needs to read/peek into files, and thus it needs CsvReaderOptions, ParquetReaderOptions, etc. The discovery phase needs to create a DataSource with said "read" options and with the schema fixed in the DataSource.
- Scan: in the scan phase, each DataSource is iterated over to concatenate its fragments. The caller controls the filter clause (which fragments to skip) and the projected (subset or superset) schema.
The discovery options (how to open a file, connection URL, reconciliation schema) should not be passed by the caller at the Scan phase; they're fixed in the DataSource. The (Scan) user shouldn't care about CsvReadOption (that's the goal of the dataset project: hide those details).
I think the existing classes (ScanOptions, ScanContext) convey this but with unclear naming, e.g. it would probably make more sense to rename and reorganize as follows:
- ScanContext -> ScanOptions(proj: Schema, selector: Filter, ctx: (MemoryPool...))
- ScanOptions -> DataSourceOptions(reconcile: Schema, format: FileFormat, options: FileFormatOptions(CsvReader...))
The DataFragment constructor would take DataSourceOptions, while FileFragment::Scan would take the ScanOptions as a parameter.
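To make the proposed split concrete, here is a minimal sketch of how the two option sets could be separated. Everything in it is a hypothetical stand-in, not Arrow's actual API: `Schema`, `Filter`, and `FileFormatOptions` are toy types, the format is a plain string, and `Scan` only reports which projected columns the source knows about instead of reading any data.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are Arrow's real classes.
struct Schema { std::vector<std::string> fields; };
struct Filter { std::string expression; };

struct FileFormatOptions { virtual ~FileFormatOptions() = default; };
struct CsvReaderOptions : FileFormatOptions { char delimiter = ','; };

// Fixed during discovery and owned by the data source.
struct DataSourceOptions {
  Schema reconcile;                            // schema every fragment is reconciled to
  std::string format;                          // e.g. "csv" (assumed representation)
  std::shared_ptr<FileFormatOptions> options;  // format-specific read options
};

// Supplied by the caller for each scan; no format details leak in here.
struct ScanOptions {
  Schema projection;
  Filter selector;
};

class DataFragment {
 public:
  explicit DataFragment(std::shared_ptr<const DataSourceOptions> source)
      : source_(std::move(source)) {
    assert(source_ != nullptr);
  }

  // Returns the projected column names that exist in the reconciliation schema.
  // A real scan would also apply the selector and yield record batches.
  std::vector<std::string> Scan(const ScanOptions& scan) const {
    const auto& known = source_->reconcile.fields;
    std::vector<std::string> resolved;
    for (const auto& name : scan.projection.fields) {
      if (std::find(known.begin(), known.end(), name) != known.end()) {
        resolved.push_back(name);
      }
    }
    return resolved;
  }

 private:
  std::shared_ptr<const DataSourceOptions> source_;
};
```

In this sketch the caller of `Scan` never touches `CsvReaderOptions`; the fragment receives them once at construction, which is the separation the comment argues for.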
@fsaintjacques Is it
I think For example, the reconciliation schema declares which columns in a data source will be available and the type to which they will be cast. This would be generated when a discovered source's inferred schema does not match that of other discovered sources, for example when an older data file doesn't include columns present in newer files.
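As a rough illustration of what a reconciliation schema could drive, the sketch below compares a file's columns against the target schema and decides, per column, whether to keep it, cast it, or fill it in. The names `ColumnTypes`, `Action`, and `PlanReconciliation` are hypothetical, types are plain strings, and filling absent columns with nulls is an assumption; the thread does not say how missing columns are materialized.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical representation: column name -> type name.
using ColumnTypes = std::map<std::string, std::string>;

enum class Action { kKeep, kCast, kFillNull };

// For every column in the reconciliation schema, decide how a given file
// has to be adapted. Columns that exist only in the file are dropped.
std::map<std::string, Action> PlanReconciliation(const ColumnTypes& target,
                                                 const ColumnTypes& file) {
  std::map<std::string, Action> plan;
  for (const auto& column : target) {
    auto found = file.find(column.first);
    if (found == file.end()) {
      plan[column.first] = Action::kFillNull;  // assumed: absent in older file
    } else if (found->second == column.second) {
      plan[column.first] = Action::kKeep;      // same type, nothing to do
    } else {
      plan[column.first] = Action::kCast;      // cast to the declared type
    }
  }
  assert(plan.size() == target.size());
  return plan;
}
```

An older file that lacks a newer column then gets a `kFillNull` entry for it, while a column stored with a different type gets `kCast`.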
Closing for now, will try again later.
Currently ScanOptions has two distinct responsibilities: it contains the data selector (and eventually the projection schema) for the current scan, and it serves as the base class for format-specific scan options. This makes it impossible to provide scan options for more than a single format, since the resulting merged options would, for example, need to inherit from both JsonScanOptions and ParquetScanOptions.
This patch removes ScanOptions and promotes FileScanOptions to the abstract base class for format-specific scan options.
ScanContext is now the root container of scan state: it contains the data selector, the projection schema, and a vector of FileScanOptions.
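The layout described above might look roughly like the following sketch. The class names come from this description, but the members are assumptions made for illustration: `file_type()` as the way to tell formats apart, strings in place of the real selector and schema types, and the `FindOptions` helper are all hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Abstract base for format-specific scan options (members are assumed).
class FileScanOptions {
 public:
  virtual ~FileScanOptions() = default;
  virtual std::string file_type() const = 0;
};

class ParquetScanOptions : public FileScanOptions {
 public:
  std::string file_type() const override { return "parquet"; }
};

class JsonScanOptions : public FileScanOptions {
 public:
  std::string file_type() const override { return "json"; }
};

// Root container of scan state. Holding a vector of options means several
// formats can coexist without one class inheriting from all of them.
struct ScanContext {
  std::string selector;                 // placeholder for the data selector
  std::vector<std::string> projection;  // placeholder for the projection schema
  std::vector<std::shared_ptr<FileScanOptions>> options;
};

// Hypothetical helper: first options entry matching a file type, or nullptr.
std::shared_ptr<FileScanOptions> FindOptions(const ScanContext& context,
                                             const std::string& file_type) {
  for (const auto& candidate : context.options) {
    assert(candidate != nullptr);
    if (candidate->file_type() == file_type) return candidate;
  }
  return nullptr;
}
```

A scan over mixed Parquet and JSON sources can then carry both option objects in one `ScanContext` and let each format look up its own entry.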