
Perf: add in-process timing instrumentation to replace py-spy dependency #140

@alxmrs

Description

Problem

The current profiling workflow in perf_tests/ requires py-spy, which needs sudo on macOS:

sudo ./profile.sh groupby_air_full.py

This makes profiling unavailable in many CI environments and for users without root access. Additionally, cProfile only sees the Python side of query execution — the entire DataFusion execution appears as a single opaque to_arrow_table() call (0.66s for 320 MB of data in our benchmarks), making it impossible to distinguish:

  • Time spent in xarray I/O (ds.isel(block), .values)
  • Time spent in numpy → Arrow conversion (iter_record_batches)
  • Time spent in DataFusion (aggregation, hash-grouping, row filtering)
  • Time spent waiting for GIL between partitions

Proposed solution

Add a lightweight per-partition timing mode to the _iteration_callback mechanism already in read_xarray_table:

# reader.py — extend make_stream to record per-partition timing
def make_stream(projected_names=None) -> pa.RecordBatchReader:
    batches = iter_record_batches(...)
    if _timing_callback is not None:
        def timed_batches(inner=batches):
            # Wrap the generator so we time the full batch iteration,
            # not just the (cheap) reader construction.
            t0 = time.perf_counter()
            yield from inner
            _timing_callback({"elapsed_s": time.perf_counter() - t0, "block": block})
        batches = timed_batches()
    return pa.RecordBatchReader.from_batches(schema, batches)

Alternatively: add a verbose=True flag to read_xarray_table that prints per-partition timings to stderr, giving insight into whether execution is I/O-bound or compute-bound without requiring external tools.
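On the consumer side, the per-partition records could be collected and summarized with no external tooling. A minimal sketch, assuming the callback payload shape from the snippet above (`elapsed_s` plus a `block` identifier); `TimingCollector` and its `report` method are illustrative names, not existing API:

```python
import sys
import time

class TimingCollector:
    """Hypothetical consumer for the proposed _timing_callback payload."""

    def __init__(self):
        self.records = []

    def __call__(self, record):
        # record is assumed to look like {"elapsed_s": float, "block": ...}
        self.records.append(record)

    def report(self, file=sys.stderr):
        total = sum(r["elapsed_s"] for r in self.records)
        for r in self.records:
            print(f"partition {r['block']}: {r['elapsed_s']:.3f}s", file=file)
        print(f"total: {total:.3f}s over {len(self.records)} partitions", file=file)

# Simulated usage — in the real reader this callback would be passed to
# read_xarray_table and invoked once per partition:
collector = TimingCollector()
for block in range(3):
    t0 = time.perf_counter()
    # ... partition work (I/O + conversion) would happen here ...
    collector({"elapsed_s": time.perf_counter() - t0, "block": block})
collector.report()
```

The callable-object form means the same collector can be handed directly to the existing `_iteration_callback`-style hook point.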

A lower-overhead option is to expose a PartitionTimingHook via LazyArrowStreamTable that fires before/after each factory call.
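A sketch of what that hook could look like, assuming names like `PartitionTimingHook` and `run_partition` purely for illustration (neither exists in the current codebase):

```python
import time
from typing import Protocol

class PartitionTimingHook(Protocol):
    """Hypothetical hook interface fired around each stream-factory call."""
    def before_partition(self, block: int) -> None: ...
    def after_partition(self, block: int, elapsed_s: float) -> None: ...

def run_partition(factory, block, hook=None):
    """Sketch: how LazyArrowStreamTable could wrap each factory call."""
    if hook is not None:
        hook.before_partition(block)
    t0 = time.perf_counter()
    result = factory()  # the existing make_stream factory would run here
    if hook is not None:
        hook.after_partition(block, time.perf_counter() - t0)
    return result

# Usage: a hook that just records events for inspection.
class RecordingHook:
    def __init__(self):
        self.events = []
    def before_partition(self, block):
        self.events.append(("before", block))
    def after_partition(self, block, elapsed_s):
        self.events.append(("after", block, elapsed_s))

hook = RecordingHook()
run_partition(lambda: "record-batch-reader", block=0, hook=hook)
```

Because the hook fires around the factory call rather than inside the batch loop, its overhead is two function calls per partition regardless of batch count.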

Current profiling workaround

For the query_large_synthetic.py benchmark (320 MB, 8 partitions):

Full scan AVG:    0.662s  (341 MB peak)   ← all in Rust's to_arrow_table()
Filter pushdown:  0.005s  (0.7 MB peak)   ← 7/8 partitions pruned
GROUP BY lat/lon: 0.839s  (341 MB peak)   ← +27% vs. AVG for hash aggregation

cProfile shows 246 Python calls for the full scan, but 0.660s of those are attributed to {method 'to_arrow_table' of 'datafusion.DataFrame' objects} — a Rust builtin that cProfile cannot introspect.
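The opacity is easy to reproduce: any call into compiled code appears in cProfile as a single entry whose internal time cannot be broken down. A small self-contained demonstration, using `sorted` as a stand-in for the Rust `to_arrow_table` call:

```python
import cProfile
import io
import pstats

# Profile a builtin (compiled) call: cProfile records the total time spent
# inside it but cannot see anything that happens within the native code.
pr = cProfile.Profile()
pr.enable()
sorted(range(100_000))  # stand-in for an opaque native call
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).print_stats("sorted")
report = out.getvalue()
# report contains one "{built-in method builtins.sorted}" line and no callees
```

This is exactly why the proposed in-process timing hooks are needed: they record time at the Python/Rust boundary, where cProfile stops.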

Related

  • _iteration_callback in XarrayRecordBatchReader and read_xarray_table — already used for testing
  • perf_tests/profile.sh — current py-spy based approach
  • perf_tests/query_large_synthetic.py — new benchmark script added alongside this issue
