[Python][Dataset] Improve ergonomy of the FileSystemDataset constructor #24483

Currently, to manually create a FileSystemDataset, you can do something like:

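import pyarrow as pa
import pyarrow.fs
import pyarrow.dataset as ds
# schema is a pa.Schema defined elsewhere
dataset = ds.FileSystemDataset(
    schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
    ["data_file1.parquet", "data_file2.parquet"],
    [ds.field('file') == 1, ds.field('file') == 2])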

There are some usability improvements we could make, though:

  • Allow passing the arguments by name to improve the readability of the calling code (currently they all have to be passed positionally, because of how they are declared in Cython as not None).
  • Possibly change the order of the arguments (e.g. start with the paths; we don't need to match the order of the C++ constructor).
  • Potentially make partitions optional, in which case they would be set internally to a list of ScalarExpression(True) values.
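
To make the three proposals concrete, here is a minimal pure-Python sketch of what a friendlier front end could look like. It is not the pyarrow API: `dataset_args` and `ALWAYS_TRUE` are hypothetical names introduced only for illustration, and the sketch assumes one partition expression per path, as in the example above. The helper takes the paths first, accepts everything else by keyword, fills in a default partition list, and returns the arguments in the positional order the current constructor call uses.

```python
from typing import Any, Optional, Sequence

# Hypothetical stand-in for an always-true partition expression
# (the issue refers to this as ScalarExpression(True)).
ALWAYS_TRUE = "ScalarExpression(True)"


def dataset_args(
    paths: Sequence[str],
    *,
    schema: Any,
    format: Any,
    filesystem: Any,
    partitions: Optional[Sequence[Any]] = None,
    root_partition: Any = None,
) -> tuple:
    """Map keyword arguments to the positional order used above.

    Hypothetical helper: paths come first, the rest is keyword-only.
    """
    paths = list(paths)
    if partitions is None:
        # Optional partitions: default to one always-true expression per path.
        partitions = [ALWAYS_TRUE] * len(paths)
    else:
        partitions = list(partitions)
    if len(partitions) != len(paths):
        raise ValueError("need exactly one partition expression per path")
    # Order of the existing call: schema, root partition, format,
    # filesystem, paths, partitions.
    return (schema, root_partition, format, filesystem, paths, partitions)


# Placeholder strings stand in for the real schema, format and filesystem.
args = dataset_args(
    ["data_file1.parquet", "data_file2.parquet"],
    schema="schema",
    format="parquet",
    filesystem="local",
)
```

With real objects, the resulting tuple could be unpacked into the constructor call shown earlier; the point is only that the calling code names each argument and may leave out the partitions.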

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-8290. Please see the migration documentation for further details.
