
Perf: add in-process timing instrumentation to replace py-spy dependency #140

@alxmrs

Description

Problem

The current profiling workflow in perf_tests/ requires py-spy, which needs sudo on macOS:

sudo ./profile.sh groupby_air_full.py

This makes profiling unavailable in many CI environments and for users without root access. Additionally, cProfile only sees the Python side of query execution — the entire DataFusion execution appears as a single opaque to_arrow_table() call (0.66s for 320 MB of data in our benchmarks), making it impossible to distinguish:

  • Time spent in xarray I/O (ds.isel(block), .values)
  • Time spent in numpy → Arrow conversion (iter_record_batches)
  • Time spent in DataFusion (aggregation, hash-grouping, row filtering)
  • Time spent waiting for GIL between partitions

Proposed solution

Add a lightweight per-partition timing mode to the _iteration_callback mechanism already in read_xarray_table:

# reader.py — extend make_stream to record per-partition timing
def make_stream(projected_names=None) -> pa.RecordBatchReader:
    batches = iter_record_batches(...)
    if _timing_callback is not None:
        def timed_batches(inner=batches):
            # Wrap the generator so we time the full batch iteration,
            # not just the (cheap) reader construction.
            t0 = time.perf_counter()
            yield from inner
            _timing_callback({"elapsed_s": time.perf_counter() - t0, "block": block})
        batches = timed_batches()
    return pa.RecordBatchReader.from_batches(schema, batches)

Alternatively: add a verbose=True flag to read_xarray_table that prints per-partition timings to stderr, giving insight into whether execution is I/O-bound or compute-bound without requiring external tools.
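On the consumer side, the per-partition records could be collected and summarized with no external tooling. A minimal sketch, assuming the callback payload shape from the snippet above (`elapsed_s` plus a `block` identifier); `TimingCollector` and its `report` method are illustrative names, not existing API:

```python
import sys
import time

class TimingCollector:
    """Hypothetical consumer for the proposed _timing_callback payload."""

    def __init__(self):
        self.records = []

    def __call__(self, record):
        # record is assumed to look like {"elapsed_s": float, "block": ...}
        self.records.append(record)

    def report(self, file=sys.stderr):
        total = sum(r["elapsed_s"] for r in self.records)
        for r in self.records:
            print(f"partition {r['block']}: {r['elapsed_s']:.3f}s", file=file)
        print(f"total: {total:.3f}s over {len(self.records)} partitions", file=file)

# Simulated usage — in the real reader this callback would be passed to
# read_xarray_table and invoked once per partition:
collector = TimingCollector()
for block in range(3):
    t0 = time.perf_counter()
    # ... partition work (I/O + conversion) would happen here ...
    collector({"elapsed_s": time.perf_counter() - t0, "block": block})
collector.report()
```

The callable-object form means the same collector can be handed directly to the existing `_iteration_callback`-style hook point.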

A lower-overhead option is to expose a PartitionTimingHook via LazyArrowStreamTable that fires before/after each factory call.
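A sketch of what that hook could look like, assuming names like `PartitionTimingHook` and `run_partition` purely for illustration (neither exists in the current codebase):

```python
import time
from typing import Protocol

class PartitionTimingHook(Protocol):
    """Hypothetical hook interface fired around each stream-factory call."""
    def before_partition(self, block: int) -> None: ...
    def after_partition(self, block: int, elapsed_s: float) -> None: ...

def run_partition(factory, block, hook=None):
    """Sketch: how LazyArrowStreamTable could wrap each factory call."""
    if hook is not None:
        hook.before_partition(block)
    t0 = time.perf_counter()
    result = factory()  # the existing make_stream factory would run here
    if hook is not None:
        hook.after_partition(block, time.perf_counter() - t0)
    return result

# Usage: a hook that just records events for inspection.
class RecordingHook:
    def __init__(self):
        self.events = []
    def before_partition(self, block):
        self.events.append(("before", block))
    def after_partition(self, block, elapsed_s):
        self.events.append(("after", block, elapsed_s))

hook = RecordingHook()
run_partition(lambda: "record-batch-reader", block=0, hook=hook)
```

Because the hook fires around the factory call rather than inside the batch loop, its overhead is two function calls per partition regardless of batch count.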

Current profiling workaround

For the query_large_synthetic.py benchmark (320 MB, 8 partitions):

Full scan AVG:    0.662s  (341 MB peak)   ← all in Rust's to_arrow_table()
Filter pushdown:  0.005s  (0.7 MB peak)   ← 7/8 partitions pruned
GROUP BY lat/lon: 0.839s  (341 MB peak)   ← +27% vs. AVG for hash aggregation

cProfile shows 246 Python calls for the full scan, but 0.660s of those are attributed to {method 'to_arrow_table' of 'datafusion.DataFrame' objects} — a Rust builtin that cProfile cannot introspect.
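The opacity is easy to reproduce: any call into compiled code appears in cProfile as a single entry whose internal time cannot be broken down. A small self-contained demonstration, using `sorted` as a stand-in for the Rust `to_arrow_table` call:

```python
import cProfile
import io
import pstats

# Profile a builtin (compiled) call: cProfile records the total time spent
# inside it but cannot see anything that happens within the native code.
pr = cProfile.Profile()
pr.enable()
sorted(range(100_000))  # stand-in for an opaque native call
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).print_stats("sorted")
report = out.getvalue()
# report contains one "{built-in method builtins.sorted}" line and no callees
```

This is exactly why the proposed in-process timing hooks are needed: they record time at the Python/Rust boundary, where cProfile stops.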

Related

  • _iteration_callback in XarrayRecordBatchReader and read_xarray_table — already used for testing
  • perf_tests/profile.sh — current py-spy based approach
  • perf_tests/query_large_synthetic.py — new benchmark script added alongside this issue
