
[Python] DataSet uses too much memory when filtering #7338

@lnicola

Description


I'm running this query over a 14 GB Arrow IPC file:

>>> import pyarrow.dataset as dataset
>>> ds = dataset.dataset("foo.ipc", format="ipc")
>>> t = ds.to_table(filter=dataset.field('ID') <= 1000).to_pandas()
>>> t
[snip]
[914 rows x 617 columns]

If I'm reading the documentation correctly, the dataset should scan the file and collect the matching rows without loading the whole file into memory. However, the process RSS grows to about 14 GB while the query runs, then drops back down.
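For comparison, here is a minimal sketch (not from the original report) that consumes the same filtered scan through the Scanner batch interface instead of `to_table()`. The file name "foo.ipc" and the `ID` column are just the reporter's examples; whether this actually lowers peak RSS for an IPC file depends on how the IPC reader handles the file, so treat it as a diagnostic rather than a fix.

```python
import pyarrow as pa
import pyarrow.dataset as dataset

# Same dataset and filter as in the report above (names assumed from it).
ds = dataset.dataset("foo.ipc", format="ipc")
scanner = ds.scanner(filter=dataset.field("ID") <= 1000)

# Stream record batches that survive the filter instead of materializing
# one large Table up front; peak memory should then track the matching
# rows plus the in-flight batches, if the scan itself is lazy.
batches = list(scanner.to_batches())
table = pa.Table.from_batches(batches, schema=scanner.projected_schema)
print(table.num_rows)
```

If the RSS still climbs to roughly the file size with this variant, the memory growth is presumably happening inside the scan itself rather than in the final `Table` materialization.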
