Skip to content

[C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids #18298

@asfimport

Description

@asfimport

From discussion at dask/dask#6534 (comment) (dask using the dataset API in their parquet reader), it might be useful to somehow "subset" or read a subset of a ParquetFileFragment for a specific set of row group ids.

Use cases:

  • Read only a set of row groups ids (this is similar as ParquetFile.read_row_groups), eg because you want to control the size of the resulting table by reading subsets of row groups

  • Get a ParquetFileFragment with a subset of row groups (eg based on a filter) to then eg get the statistics of only those row groups

    The first case could for example be solved by adding a row_groups keyword to ParquetFileFragment.to_table (but, this is then a keyword specific to the parquet format, and we should then probably also add it to scan et al).

    The second case is something you can in principle do yourself manually by recreating a fragment with fragment.format.make_fragment(fragment.path, ..., row_groups=[...]). However, this is a) a bit cumbersome and b) statistics might need to be parsed again?
    The statistics of a set of filtered row groups could also be obtained by using split_by_row_group(filter) (and then get the statistics of each of the fragments), but if you then want a single fragment, you need to recreate a fragment with the obtained row group ids.

    So one idea I have now (but mostly brainstorming here). Would it be useful to have a method to create a "subsetted" ParquetFileFragment, either based on a list of row group ids (fragment.subset(row_groups=[...]) or either based on a filter (fragment.subset(filter=...), which would be equivalent as split_by_row_group+recombining into a single fragment) ?

    cc @bkietz @rjzamora

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-10100. Please see the migration documentation for further details.

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions