Perf: limit pushdown — avoid materializing full partitions for LIMIT queries #131

@alxmrs

Description

Problem

The partition factory always materializes the full partition regardless of the query:

df = pivot(ds.isel(block))  # always loads all 38.4M rows

For queries like SELECT * FROM ds LIMIT 100, DataFusion applies the limit only after the factory has produced its batch. The factory receives no signal to stop early.

Currently limit is passed from PrunableStreamingTable::scan() down to DataFusion's StreamingTable, which applies it at the execution-plan level — but only after the factory has already materialized the full partition.
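Schematically (a plain-Python illustration, not the actual API), the current flow materializes everything first and slices the limit off afterwards:

```python
# Plain-Python illustration (not the real API): today the factory
# materializes every row, and the LIMIT is sliced off afterwards
# at the execution-plan level.
def factory_all_rows(total_rows):
    # stands in for pivot(ds.isel(block)): builds the whole partition
    return list(range(total_rows))

rows = factory_all_rows(38_400)   # row count scaled down for illustration
limited = rows[:100]              # limit applied too late to save work
assert len(limited) == 100
```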

Proposed fix

Pass an optional row limit into the factory (similar to projection pushdown). The scan() method can thread the limit through PyArrowStreamPartition so the factory can stop generating rows once the limit is reached.

def make_stream(projection, limit=None) -> pa.RecordBatchReader:
    # `limit or total_rows` would mishandle limit=0; test against None instead
    rows_needed = total_rows if limit is None else limit
    # only materialize up to rows_needed rows

Impact

For ARCO-ERA5, LIMIT 10 currently reads 38.4M rows from the first matching partition. With this fix it would read at most one batch (batch_size rows).

Parent: #126
