Describe the bug
The dask aggregate path calls a numba kernel once per output pixel. In _agg_block_np (xrspatial/resample.py:471-498), the inner loop runs:
out[lo_y, lo_x] = func(sub, 1, 1)[0, 0]
for every output cell. func is one of _agg_mean, _agg_min, _agg_max, _agg_median, or _agg_mode. Each call dispatches into numba and allocates a fresh (1, 1) output. A 1000x1000 dask aggregate is roughly 1M kernel dispatches plus 1M tiny allocations, so the dask aggregate path is much slower than it needs to be.
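The slow pattern can be sketched as follows (a simplified pure-Python stand-in; `agg_mean` and `agg_block_slow` are hypothetical names mimicking the `func(sub, 1, 1)[0, 0]` shape, not the actual xrspatial code):

```python
import numpy as np

def agg_mean(sub, out_h, out_w):
    # stand-in for a jitted kernel like _agg_mean: collapses the whole
    # window into an (out_h, out_w) == (1, 1) mean
    out = np.empty((out_h, out_w))
    out[0, 0] = sub.mean()
    return out

def agg_block_slow(data, out_h, out_w):
    # one kernel call, and one fresh (1, 1) allocation, per output pixel
    in_h, in_w = data.shape
    out = np.empty((out_h, out_w))
    for lo_y in range(out_h):
        y0 = lo_y * in_h // out_h
        y1 = (lo_y + 1) * in_h // out_h
        for lo_x in range(out_w):
            x0 = lo_x * in_w // out_w
            x1 = (lo_x + 1) * in_w // out_w
            sub = data[y0:y1, x0:x1]
            out[lo_y, lo_x] = agg_mean(sub, 1, 1)[0, 0]
    return out
```

With a jitted `agg_mean`, every iteration of the inner loop pays numba dispatch overhead plus a tiny heap allocation, which dominates the actual aggregation work.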
Expected behavior
One numba call per chunk, walking the chunk's full output region in a single jitted loop and writing into a pre-allocated output buffer. The eager _run_numpy path already does this; the dask helper should too.
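A minimal sketch of that batched shape (pure-Python stand-in; in the real code the function would carry numba's @njit decorator and the explicit scalar loops are written that way so numba can compile them):

```python
import numpy as np

def agg_mean_block(data, out):
    # one call per chunk: walk the full output grid in a single loop,
    # writing into the pre-allocated `out` buffer -- no per-pixel
    # dispatch, no per-pixel allocation
    in_h, in_w = data.shape
    out_h, out_w = out.shape
    for lo_y in range(out_h):
        y0 = lo_y * in_h // out_h
        y1 = (lo_y + 1) * in_h // out_h
        for lo_x in range(out_w):
            x0 = lo_x * in_w // out_w
            x1 = (lo_x + 1) * in_w // out_w
            s = 0.0
            for y in range(y0, y1):
                for x in range(x0, x1):
                    s += data[y, x]
            out[lo_y, lo_x] = s / ((y1 - y0) * (x1 - x0))
    return out
```

The dask helper would then make exactly one such call per chunk, matching what the eager path already does.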
Fix
Add per-method block kernels that take the global geometry (global_in_h, global_out_h, cum_in_y, cum_out_y, in_y0, in_x0) as parameters and use int(go * global_in_h / global_out_h) - in_y0 for window bounds. Replace the inner func(sub, 1, 1)[0, 0] loop with one call into the new kernel. _agg_block_cupy already round-trips to CPU, so it picks up the speedup for free.
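The window-bound arithmetic for the y axis can be sketched like this (parameter names follow the issue text; the helper itself is illustrative, since in the jitted kernel the expression would likely be inlined):

```python
def window_y_bounds(go, global_in_h, global_out_h, in_y0):
    # map a *global* output row index `go` to input-row bounds in this
    # chunk's local coordinates, offsetting by the chunk origin in_y0
    y0 = int(go * global_in_h / global_out_h) - in_y0
    y1 = int((go + 1) * global_in_h / global_out_h) - in_y0
    return y0, y1
```

Using the global geometry keeps chunk boundaries consistent: every chunk computes the same window edges it would in the eager single-array case, then shifts them into local coordinates.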
Additional context
The eager numpy aggregate path is unchanged. Only the dask block helper is touched. Reference: xrspatial/resample.py:471-498.