Skip to content

Conversation

@fsaintjacques
Copy link
Contributor

@fsaintjacques fsaintjacques commented Apr 21, 2020

This is the first part of a refactor to make Fragment accessible without a Scan operation instance. This is a breaking change. It introduces the concept of a physical schema and read schema, these concepts are analogous to Avro writer and reader schema.

  • Move ScanOptions at Fragment::Scan instead of a property
  • Refactor Dataset::GetFragments(ScanOptions) to Dataset::GetFragments(Expression)
  • Add Fragment::ReadPhysicalSchema
  • Normalize Python's {Dataset,Fragment,Scanner}.{scan,to_table,to_batches}()

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Browsed through the C++ code, looked a bit more in detail at the python code.

The skipped python tests are because the reconstruction of a fragment is not fully working yet? (didn't fully understand the relation to ARROW-8318)

Just to make sure I correctly understood it:

  • A Fragment has only a physical schema, and is not aware of the dataset schema
  • But when scanning the Fragment (or through to_table), you can still specify this dataset schema (so scanning dataset vs fragments can still give the same result, i.e. basically dataset.to_table() and pa.concat([f.to_table(schema=dataset.schema) for f in dataset.get_fragments()]))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using Fragment.scan, it uses the Fragment's physical schema for the resulting table? (since the Fragment is not aware of the dataset "read" schema?)
If so, we should note that here in the docstring I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can still pass schema (which is not documented. I'll add this.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this but where the columns selection is passed to to_table ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Via kwargs to self._scanner then forwarding kwargs to Scanner.from_fragment

@fsaintjacques fsaintjacques force-pushed the ARROW-8065 branch 2 times, most recently from 9e9e553 to 0c8f006 Compare April 21, 2020 18:17
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm impressed by how surgical this change was, looking great

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This coercion is common, should we have _unwrap_expression_default_true()?

Comment on lines +113 to +115
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Return fragments matching the optional filter, either using the
partition_expression or internal information like Parquet's
statistics.
Return fragments matching the optional filter, using explicit
partition expressions and/or embedded information like Parquet's
statistics.

Comment on lines +144 to 147
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It produces a stream of ScanTasks which is meant to be a unit of work
to be dispatched. The tasks are not executed automatically, the user is
responsible to execute and dispatch the individual tasks, so custom
local task scheduling can be implemented.
It produces a stream of ScanTasks. Each task is meant to be a unit of work
to be dispatched by the user; they are not executed automatically. This allows
customization of local task scheduling and execution.

@kszucs
Copy link
Member

kszucs commented Apr 22, 2020

@ursabot build

@fsaintjacques
Copy link
Contributor Author

Addressed most comments and updated followup ticket with what's missing. PTAL and merge quickly so we can unblock the blocked tickets :)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, defer to @bkietz to approve the C++ side ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants