-
Notifications
You must be signed in to change notification settings - Fork 10
Description
Problem
The current profiling workflow in perf_tests/ requires py-spy, which needs sudo on macOS:
sudo ./profile.sh groupby_air_full.pyThis makes profiling unavailable in many CI environments and for users without root access. Additionally, cProfile only sees the Python side of query execution — the entire DataFusion execution appears as a single opaque to_arrow_table() call (0.66s for 320 MB of data in our benchmarks), making it impossible to distinguish:
- Time spent in xarray I/O (
ds.isel(block),.values) - Time spent in numpy → Arrow conversion (
iter_record_batches) - Time spent in DataFusion (aggregation, hash-grouping, row filtering)
- Time spent waiting for GIL between partitions
Proposed solution
Add a lightweight per-partition timing mode to the _iteration_callback mechanism already in read_xarray_table:
# reader.py — extend make_stream to record timing
def make_stream(projected_names=None) -> pa.RecordBatchReader:
if _timing_callback is not None:
t0 = time.perf_counter()
result = pa.RecordBatchReader.from_batches(...)
if _timing_callback is not None:
_timing_callback({"elapsed_s": time.perf_counter() - t0, "block": block})
return resultOr alternatively: add a verbose=True flag to read_xarray_table that prints per-partition timings to stderr, giving insight into whether execution is I/O-bound or compute-bound without requiring external tools.
A lower-overhead option is to expose a PartitionTimingHook via LazyArrowStreamTable that fires before/after each factory call.
Current profiling workaround
For the query_large_synthetic.py benchmark (320 MB, 8 partitions):
Full scan AVG: 0.662s (341 MB peak) ← all in Rust's to_arrow_table()
Filter pushdown: 0.005s (0.7 MB peak) ← 7/8 partitions pruned
GROUP BY lat/lon: 0.839s (341 MB peak) ← +27% vs. AVG for hash aggregation
cProfile shows 246 Python calls for the full scan, but 0.660s of those are attributed to {method 'to_arrow_table' of 'datafusion.DataFrame' objects} — a Rust builtin that cProfile cannot introspect.
Related
_iteration_callbackinXarrayRecordBatchReaderandread_xarray_table— already used for testingperf_tests/profile.sh— current py-spy based approachperf_tests/query_large_synthetic.py— new benchmark script added alongside this issue