[C++][Dataset] Projection pushdown in IPC (feather) format #30060

@asfimport

Description

The datasets API uses `RecordBatchFileReader` to read Feather (IPC) files. This reader always "reads" the entire file, even when a query needs only a few columns. If the file is memory mapped, this might not be a true read; however, the datasets API never uses memory-mapped files.

This large read from RAM (or worse, disk) becomes a bottleneck for simple queries that load only a few columns from the dataset.

One fix would be to modify the reader to seek to and read only the needed data. Another would be to modify the datasets API to use memory-mapped files when possible. The former approach seems more generally applicable.
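To make the first option concrete, here is a minimal sketch of a seek-based projected read. It deliberately does not use the Arrow IPC format or any Arrow API: the file layout, `ColumnRange`, `WriteToyFile`, and `ReadProjected` are all hypothetical names invented for illustration. The toy layout stores each column's `int32` values contiguously and keeps an index of byte ranges, which stands in for the per-buffer offsets a real reader would take from the file metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy layout (NOT the Arrow IPC format): columns are written
// back to back, and an in-memory index records where each column lives.
struct ColumnRange {
  std::uint64_t offset;  // byte offset of the column's data in the file
  std::uint64_t length;  // size of the column's data in bytes
};

// Writes the columns contiguously and returns the byte-range index.
std::vector<ColumnRange> WriteToyFile(
    const std::string& path,
    const std::vector<std::vector<std::int32_t>>& columns) {
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  std::vector<ColumnRange> index;
  std::uint64_t offset = 0;
  for (const auto& col : columns) {
    const std::uint64_t length = col.size() * sizeof(std::int32_t);
    out.write(reinterpret_cast<const char*>(col.data()),
              static_cast<std::streamsize>(length));
    index.push_back({offset, length});
    offset += length;
  }
  return index;
}

// Reads only the requested columns by seeking to each byte range.
// `bytes_read` reports how much data was actually pulled from the file.
std::vector<std::vector<std::int32_t>> ReadProjected(
    const std::string& path, const std::vector<ColumnRange>& index,
    const std::vector<std::size_t>& wanted, std::uint64_t* bytes_read) {
  std::ifstream in(path, std::ios::binary);
  std::vector<std::vector<std::int32_t>> result;
  *bytes_read = 0;
  for (std::size_t i : wanted) {
    assert(i < index.size());  // caller must request existing columns
    const ColumnRange& range = index[i];
    std::vector<std::int32_t> values(range.length / sizeof(std::int32_t));
    in.seekg(static_cast<std::streamoff>(range.offset));
    in.read(reinterpret_cast<char*>(values.data()),
            static_cast<std::streamsize>(range.length));
    *bytes_read += static_cast<std::uint64_t>(in.gcount());
    result.push_back(std::move(values));
  }
  return result;
}
```

With three columns of three values each, requesting a single column touches one third of the data bytes. A real implementation would presumably also want to coalesce adjacent ranges into fewer reads, which this sketch omits.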

This is related to ARROW-8250, but that issue seems more focused on row filtering, while this issue is about column filtering.

Reporter: Weston Pace / @westonpace

Related issues:

Note: This issue was originally created as ARROW-14503. Please see the migration documentation for further details.
