
Inline dask aggregate kernel to remove per-pixel numba dispatch #1463

@brendancol

Description


Describe the bug

The dask aggregate path calls a numba kernel once per output pixel. In `_agg_block_np` (xrspatial/resample.py:471-498), the inner loop runs:

out[lo_y, lo_x] = func(sub, 1, 1)[0, 0]

for every output cell. `func` is one of `_agg_mean`, `_agg_min`, `_agg_max`, `_agg_median`, or `_agg_mode`. Each call dispatches into numba and allocates a fresh (1, 1) output array. A 1000×1000 dask aggregate therefore incurs roughly one million kernel dispatches plus one million tiny allocations, so the dask aggregate path is much slower than it needs to be.
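For illustration, a minimal numpy-only sketch of the reported anti-pattern (names like `agg_mean_1x1` are hypothetical stand-ins for the jitted `_agg_mean`; the real code is numba-compiled, which is exactly why each call pays a dispatch cost):

```python
import numpy as np

def agg_mean_1x1(sub, out_h, out_w):
    # Stand-in for a numba-jitted aggregation kernel: allocates a
    # fresh output array on every call, exactly as described above.
    out = np.empty((out_h, out_w))
    out[0, 0] = sub.mean()
    return out

def agg_block_per_pixel(data, factor):
    # The problematic shape: one kernel call + one (1, 1) allocation
    # per output pixel.
    out_h, out_w = data.shape[0] // factor, data.shape[1] // factor
    out = np.empty((out_h, out_w))
    for lo_y in range(out_h):
        for lo_x in range(out_w):
            sub = data[lo_y * factor:(lo_y + 1) * factor,
                       lo_x * factor:(lo_x + 1) * factor]
            out[lo_y, lo_x] = agg_mean_1x1(sub, 1, 1)[0, 0]
    return out
```

With numba in the loop, each `agg_mean_1x1` call also crosses the Python/JIT boundary, which dominates the runtime for small windows.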

Expected behavior

One numba call per chunk, walking the chunk's full output region in a single jitted loop and writing into a pre-allocated output buffer. The eager _run_numpy path already does this; the dask helper should too.
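The proposed shape can be sketched as follows. This is a mean-only illustration, not the actual xrspatial implementation; in the real code the loop would carry `@numba.njit`, which is omitted here so the sketch runs without numba:

```python
import numpy as np

def agg_mean_block(data, out, factor):
    # One call per chunk: walk the chunk's full output region in a
    # single loop nest and write into the pre-allocated `out` buffer.
    out_h, out_w = out.shape
    for lo_y in range(out_h):
        for lo_x in range(out_w):
            s = 0.0
            for dy in range(factor):
                for dx in range(factor):
                    s += data[lo_y * factor + dy, lo_x * factor + dx]
            out[lo_y, lo_x] = s / (factor * factor)
    return out

data = np.arange(16, dtype=np.float64).reshape(4, 4)
out = np.empty((2, 2))
agg_mean_block(data, out, 2)  # a single call for the whole chunk
```

Under numba this amortizes the dispatch to one per chunk and eliminates the per-pixel temporaries entirely.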

Fix

Add per-method block kernels that take the global geometry (global_in_h, global_out_h, cum_in_y, cum_out_y, in_y0, in_x0) as parameters and use int(go * global_in_h / global_out_h) - in_y0 for window bounds. Replace the inner func(sub, 1, 1)[0, 0] loop with one call into the new kernel. _agg_block_cupy already round-trips to CPU, so it picks up the speedup for free.
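The window-bound arithmetic described above can be sketched like this (the parameter names follow the issue; the helper itself is illustrative, not the actual kernel body):

```python
def y_window(go, global_in_h, global_out_h, in_y0):
    # Map a global output row `go` to local input-row bounds for this
    # block: scale by the global in/out ratio, then shift by the
    # block's input-row offset in_y0.
    y0 = int(go * global_in_h / global_out_h) - in_y0
    y1 = int((go + 1) * global_in_h / global_out_h) - in_y0
    return y0, y1
```

Passing the global geometry (`global_in_h`, `global_out_h`, cumulative offsets, `in_y0`, `in_x0`) as kernel parameters keeps the per-block windows consistent with the eager path's global windows, even when chunk boundaries do not align with output-cell boundaries.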

Additional context

The eager numpy aggregate path is unchanged. Only the dask block helper is touched. Reference: xrspatial/resample.py:471-498.


Labels: bug, performance
