Description
Problem
The partition factory always materializes the full partition regardless of the query:

```python
df = pivot(ds.isel(block))  # always loads all 38.4M rows
```

For queries like `SELECT * FROM ds LIMIT 100`, DataFusion applies the limit after the factory produces its batch, so the factory receives no signal to stop early.
Currently the limit is passed from `PrunableStreamingTable::scan()` down to DataFusion's `StreamingTable`, which applies it at the execution-plan level, but only after the factory has already materialized the full partition.
Proposed fix
Pass an optional row limit into the factory (similar to projection pushdown). The `scan()` method can thread the limit through `PyArrowStreamPartition` so the factory can stop generating rows once the limit is reached.
```python
def make_stream(projection, limit=None) -> pa.RecordBatchReader:
    # `limit or total_rows` would misread LIMIT 0; compare against None instead
    rows_needed = total_rows if limit is None else limit
    # only materialize up to rows_needed rows
    ...
```

Impact
For ARCO-ERA5, `LIMIT 10` currently reads 38.4M rows from the first matching partition. With this fix it would read at most `batch_size` rows.
Parent: #126