
Zonal dask paths materialize full arrays, OOM on large inputs #1110

@brendancol

Description

Describe the bug

The zonal module's dask code paths call .compute() on full arrays, so they can't process anything larger than RAM.

Four locations:

  1. _unique_finite_zones (line 134): da.unique(arr).compute() pulls the entire zones array into memory. Used by _stats_dask_numpy and _crosstab_dask_numpy.

  2. _unique_finite_cats (line 145): same thing on the values array. Used by _find_cats for crosstab.

  3. _regions_dask (line 1823): data.compute() on the full dask array. There's a size guard (nbytes * 5 > 0.5 * available_memory), but a 30TB raster blows past it.

  4. _regions_dask_cupy (line 1858): data.compute() on the full dask+cupy array. No size guard at all. (A toy sketch of this failure mode follows the list.)
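
For the regions paths the failure mode is easy to demonstrate in isolation. The shapes below are hypothetical; the point is that calling .compute() on a dask array allocates the full result as a single in-memory ndarray, regardless of chunking:

```python
import dask.array as da

# Lazy ~32 TB array: cheap to build, nothing is materialized yet.
data = da.zeros((2_000_000, 2_000_000), chunks=(4096, 4096), dtype="float64")
print(data.nbytes / 1e12)  # 32.0 (TB)

# The pattern in _regions_dask / _regions_dask_cupy: ask dask for the
# entire array as one ndarray. Allocation alone fails long before any
# connected-component labeling happens.
arr = data.compute()  # MemoryError on any realistic machine
```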

Benchmarks (512x512 array, 19 zones)

| Backend | Function | Wall time (ms) | Peak tracemalloc (MB) | RSS delta (KB) |
|---|---|---|---|---|
| numpy | stats | 18.9 | 25.46 | 78,544 |
| dask+numpy | stats | 1,120.67 | 37.54 | 45,980 |
| numpy | regions | 16,334.96 | 5.62 | 0 |
| dask+numpy | regions | 15,436.37 | 7.23 | 11,520 |

Dask stats uses more memory than numpy (37.5 MB vs 25.5 MB) on this small array and runs 59x slower. Dask regions just calls .compute() and falls through to the numpy path.
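
The issue doesn't include the measurement harness; a sketch of one way to collect comparable numbers (wall time, tracemalloc peak, RSS delta via psutil), assuming a single unwarmed run:

```python
import time
import tracemalloc

import psutil

def measure(fn, *args, **kwargs):
    """Run fn once; report wall time (ms), tracemalloc peak (MB),
    and RSS delta (KB). One-shot and unwarmed -- indicative only."""
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    wall_ms = (time.perf_counter() - t0) * 1e3
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    rss_delta_kb = (proc.memory_info().rss - rss_before) / 1024
    return result, wall_ms, peak_bytes / 2**20, rss_delta_kb
```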

Expected behavior

Dask backends should bound memory by chunk size. stats() and crosstab() can gather unique zone/category values per-chunk and merge them. regions() should either do chunked connected-component labeling or raise MemoryError early for oversized inputs.
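
A minimal sketch of both ideas, assuming nothing about the project's internals (function names here are hypothetical):

```python
import dask.array as da
import numpy as np

def unique_finite_chunked(arr: da.Array) -> np.ndarray:
    """Gather unique finite values one chunk at a time.

    Peak memory is one materialized chunk plus the small running set
    of uniques, never the whole array.
    """
    seen = set()
    for idx in np.ndindex(*arr.numblocks):
        block = arr.blocks[idx].compute()        # one chunk in RAM at a time
        vals = np.unique(block[np.isfinite(block)])
        seen.update(vals.tolist())
    return np.array(sorted(seen))

def regions_guard(arr: da.Array, available_bytes: int) -> np.ndarray:
    """Fail fast instead of letting .compute() exhaust RAM."""
    if arr.nbytes > 0.5 * available_bytes:
        raise MemoryError(
            f"regions() would materialize {arr.nbytes / 2**30:.1f} GiB "
            f"against {available_bytes / 2**30:.1f} GiB available"
        )
    return arr.compute()
```

Computing blocks one at a time serializes the work and can recompute shared upstream tasks; a tree reduction (per-chunk np.unique via map_blocks, then a merge step) would keep it parallel at the cost of more plumbing.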

Impact

30TB dataset, 16GB machine: all four paths OOM before doing any work.


Labels: bug, high-priority, oom, zonal tools
