Skip to content

Conversation

@fjetter
Copy link
Contributor

@fjetter fjetter commented Feb 19, 2024

Rationale for this change

Closes #40142

I'm developing a new dask integration with pyarrow parquet reader (see dask/dask-expr#882) and want to rely on the pyarrow Filesystem more.

Right now, we are performing a list operation ourselves to get all touched files and I would like to pass the retrieved FileInfo objects directly to the dataset constructor. This API is already exposed in C++ and this PR is adding the necessary python bindings.

The benefit of this is that there is API is that it cuts the need to perform additional HEAD requests to a remote storage.

This came up in #38389 (comment) and there's been related work already with #37857

What changes are included in this PR?

Python bindings for the DatasetFactory constructor that accepts a list/vector of FileInfo objects.

Are these changes tested?

I slightly modified the minio test setup such that the prometheus endpoint is exposed. This can be used to assert that there hasn't been any HEAD requests. I ended up removing this again since parsing the response is a bit brittle.

Are there any user-facing changes?

@github-actions

This comment was marked as outdated.

@fjetter fjetter changed the title Allow FileInfo instances to be passed to dataset init GH-40142: [Python] Allow FileInfo instances to be passed to dataset init Feb 19, 2024
@github-actions
Copy link

⚠️ GitHub issue #40142 has been automatically assigned in GitHub to PR creator.

@fjetter fjetter force-pushed the fileinfo_pyarrow_dataset branch from cb03e28 to dbdc9e2 Compare February 20, 2024 13:04
@fjetter
Copy link
Contributor Author

fjetter commented Feb 20, 2024

the appveyor build seems to have timed out while compiling

@fjetter
Copy link
Contributor Author

fjetter commented Feb 26, 2024

@jorisvandenbossche can I ask you to take a look at this?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, sounds as a useful addition, and looks good!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Feb 27, 2024
@jorisvandenbossche jorisvandenbossche merged commit c57115d into apache:main Feb 27, 2024
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Feb 27, 2024
@fjetter fjetter deleted the fileinfo_pyarrow_dataset branch February 27, 2024 14:37
@fjetter
Copy link
Contributor Author

fjetter commented Feb 27, 2024

Thank you!

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit c57115d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Allow Dataset construction from FileInfo objects

2 participants