ARROW-8065: [C++][Dataset] Refactor ScanOptions and Fragment relation #7000
fe06310 to 2506eb5
Looks good! Browsed through the C++ code, looked a bit more in detail at the python code.
The skipped python tests are because the reconstruction of a fragment is not fully working yet? (didn't fully understand the relation to ARROW-8318)
Just to make sure I correctly understood it:
- A Fragment has only a physical schema, and is not aware of the dataset schema
- But when scanning the Fragment (or through to_table), you can still specify this dataset schema (so scanning dataset vs fragments can still give the same result, i.e. basically dataset.to_table() and pa.concat([f.to_table(schema=dataset.schema) for f in dataset.get_fragments()]))
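To make the two points above concrete, here is a minimal pure-Python sketch. It does not use pyarrow; `ToyFragment` and `ToyDataset` are hypothetical stand-ins, and filling missing columns with None is an assumption about how a read schema would be applied.

```python
from dataclasses import dataclass


@dataclass
class ToyFragment:
    """Hypothetical stand-in for a Fragment; knows only its physical schema."""
    physical_schema: list  # column names actually stored in the file
    rows: list             # list of dicts keyed by physical column name

    def to_table(self, schema=None):
        # Without an explicit read schema, fall back to the physical schema.
        columns = schema if schema is not None else self.physical_schema
        # Columns absent from the file are filled with None (assumed behaviour).
        return [{c: row.get(c) for c in columns} for row in self.rows]


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a Dataset; owns the dataset (read) schema."""
    schema: list
    fragments: list

    def get_fragments(self):
        return list(self.fragments)

    def to_table(self):
        # Scan every fragment with the dataset schema and concatenate.
        return [row for f in self.fragments for row in f.to_table(schema=self.schema)]


dataset = ToyDataset(
    schema=["a", "b"],
    fragments=[
        ToyFragment(["a"], [{"a": 1}]),
        ToyFragment(["a", "b"], [{"a": 2, "b": "x"}]),
    ],
)

# Scanning fragment by fragment with the dataset schema matches the dataset scan.
per_fragment = [
    row for f in dataset.get_fragments() for row in f.to_table(schema=dataset.schema)
]
physical_only = dataset.get_fragments()[0].to_table()
```

Here `per_fragment` equals `dataset.to_table()`, while `physical_only` contains only column `a`, because the first fragment was scanned without a read schema.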
python/pyarrow/_dataset.pyx
Outdated
When using Fragment.scan, it uses the Fragment's physical schema for the resulting table? (since the Fragment is not aware of the dataset "read" schema?)
If so, we should note that here in the docstring I think
Yes, but you can still pass schema (which is not documented). I'll add this.
python/pyarrow/tests/test_dataset.py
Outdated
Keep this, but where the column selection is passed to to_table?
Via kwargs to self._scanner then forwarding kwargs to Scanner.from_fragment
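The forwarding chain described in this reply could look roughly like the sketch below. Only the names `_scanner`, `to_table` and `from_fragment` come from the thread; the `Sketch*` classes and their bodies are hypothetical and are not the code in `_dataset.pyx`.

```python
class SketchScanner:
    """Hypothetical scanner that applies an optional column selection."""

    def __init__(self, fragment, columns=None):
        self.fragment = fragment
        self.columns = columns

    @classmethod
    def from_fragment(cls, fragment, **kwargs):
        # Scan options such as `columns` arrive here as keyword arguments.
        return cls(fragment, **kwargs)

    def to_table(self):
        rows = self.fragment.rows
        if self.columns is None:
            return [dict(r) for r in rows]
        return [{c: r[c] for c in self.columns} for r in rows]


class SketchFragment:
    """Hypothetical fragment holding rows as dicts."""

    def __init__(self, rows):
        self.rows = rows

    def _scanner(self, **kwargs):
        # Forward every scan option untouched to the scanner factory.
        return SketchScanner.from_fragment(self, **kwargs)

    def to_table(self, **kwargs):
        return self._scanner(**kwargs).to_table()


fragment = SketchFragment([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
selected = fragment.to_table(columns=["a"])
```

Because `to_table` never names the options itself, any option the scanner learns to accept becomes available on the fragment without touching `to_table`.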
9e9e553 to 0c8f006
bkietz left a comment
I'm impressed by how surgical this change was, looking great
python/pyarrow/_dataset.pyx
Outdated
This coercion is common, should we have _unwrap_expression_default_true()?
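A helper along the lines suggested here might be sketched as follows. The name `_unwrap_expression_default_true` is taken from the comment; `ToyExpression` and the function body are assumptions made for illustration, since the real helper would work with the Cython expression wrappers.

```python
class ToyExpression:
    """Hypothetical wrapper around an underlying expression object."""

    def __init__(self, payload):
        self.payload = payload

    def unwrap(self):
        return self.payload


def _unwrap_expression_default_true(expression):
    # A missing filter is treated as the literal `true`, i.e. "keep everything".
    if expression is None:
        expression = ToyExpression(True)
    return expression.unwrap()


no_filter = _unwrap_expression_default_true(None)
some_filter = _unwrap_expression_default_true(ToyExpression("x > 1"))
```

Centralizing the None check keeps each call site to a single line instead of repeating the same conditional.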
python/pyarrow/_dataset.pyx
Outdated
Suggested change:
- Return fragments matching the optional filter, either using the
- partition_expression or internal information like Parquet's
- statistics.
+ Return fragments matching the optional filter, using explicit
+ partition expressions and/or embedded information like Parquet's
+ statistics.
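As a rough illustration of what this docstring describes, the sketch below prunes fragments with an equality filter, first against an explicit partition value and then against embedded min/max statistics. The dict layout and the `(column, value)` filter form are hypothetical simplifications, not the pyarrow API.

```python
def get_fragments(fragments, filter=None):
    """Return fragments that may contain rows matching `filter`.

    `filter` is a hypothetical (column, value) equality predicate, or None.
    """
    if filter is None:
        return list(fragments)
    column, value = filter
    kept = []
    for frag in fragments:
        # Explicit partition expression, e.g. {"year": 2019}.
        partition = frag.get("partition", {})
        if column in partition and partition[column] != value:
            continue
        # Embedded information, e.g. Parquet-style (min, max) statistics.
        statistics = frag.get("statistics", {})
        if column in statistics:
            low, high = statistics[column]
            if not (low <= value <= high):
                continue
        kept.append(frag)
    return kept


fragments = [
    {"path": "year=2019/part-0", "partition": {"year": 2019}, "statistics": {"id": (0, 9)}},
    {"path": "year=2020/part-0", "partition": {"year": 2020}, "statistics": {"id": (10, 19)}},
]
by_partition = [f["path"] for f in get_fragments(fragments, ("year", 2020))]
by_statistics = [f["path"] for f in get_fragments(fragments, ("id", 3))]
```

A fragment is kept whenever neither source of information can rule it out, so the result may still include fragments with no matching rows.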
python/pyarrow/_dataset.pyx
Outdated
Suggested change:
- It produces a stream of ScanTasks which is meant to be a unit of work
- to be dispatched. The tasks are not executed automatically, the user is
- responsible to execute and dispatch the individual tasks, so custom
- local task scheduling can be implemented.
+ It produces a stream of ScanTasks. Each task is meant to be a unit of work
+ to be dispatched by the user; they are not executed automatically. This allows
+ customization of local task scheduling and execution.
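The pattern this docstring describes, a lazy stream of tasks that the caller schedules, can be sketched in plain Python. `scan` and `make_task` are hypothetical names; the thread pool stands in for whatever local scheduler the user prefers.

```python
from concurrent.futures import ThreadPoolExecutor


def make_task(rows):
    """Build one unit of work; nothing is read until the task is called."""
    def task():
        return list(rows)
    return task


def scan(fragments):
    # Yield tasks lazily: creating them does not execute them.
    for rows in fragments:
        yield make_task(rows)


fragments = [[1, 2], [3], [4, 5, 6]]

# The user decides how tasks are dispatched, here on a small thread pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    batches = list(pool.map(lambda task: task(), scan(fragments)))

# Running the same tasks serially is equally valid.
serial = [task() for task in scan(fragments)]
```

Both strategies yield the same batches in the same order, since `pool.map` returns results in submission order.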
0c8f006 to f711826
@ursabot build
Addressed most comments and updated followup ticket with what's missing. PTAL and merge quickly so we can unblock the blocked tickets :)
jorisvandenbossche left a comment
Looks good to me, defer to @bkietz to approve the C++ side ;)
7aba61f to 099a8c6
This is the first part of a refactor to make Fragment accessible without a Scan operation instance. This is a breaking change. It introduces the concepts of a physical schema and a read schema, which are analogous to Avro's writer and reader schemas.