### Describe the bug

The zonal module's dask code paths call `.compute()` on full arrays, so they can't process anything larger than RAM.
Four locations:

- `_unique_finite_zones` (line 134): `da.unique(arr).compute()` pulls the entire zones array into memory. Used by `_stats_dask_numpy` and `_crosstab_dask_numpy`.
- `_unique_finite_cats` (line 145): same call on the values array. Used by `_find_cats` for crosstab.
- `_regions_dask` (line 1823): `data.compute()` on the full dask array. There is a size guard (`nbytes * 5 > 0.5 * available_memory`), but a 30 TB raster blows past it.
- `_regions_dask_cupy` (line 1858): `data.compute()` on the full dask+cupy array. No size guard.
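For illustration, a standalone sketch of the size guard described above. The name `check_compute_guard` and the parameterization are hypothetical; the `nbytes * 5 > 0.5 * available` inequality mirrors the existing `_regions_dask` check, and applying the same function in `_regions_dask_cupy` would close the unguarded path.

```python
import dask.array as da

def check_compute_guard(data, available_memory, factor=5, fraction=0.5):
    # Hypothetical standalone version of the _regions_dask guard:
    # refuse to materialize `data` when the estimated working set
    # (factor * nbytes) exceeds `fraction` of available memory, so the
    # caller gets an early MemoryError instead of an OOM kill.
    if data.nbytes * factor > fraction * available_memory:
        raise MemoryError(
            f"regions(): {data.nbytes} input bytes need ~{factor}x working "
            f"memory, exceeding {fraction:.0%} of {available_memory} "
            f"available bytes"
        )

# 1024x1024 float64 lazy array: 8 MiB, never materialized by the check.
arr = da.zeros((1024, 1024), dtype="float64", chunks=512)
check_compute_guard(arr, available_memory=1 << 30)  # 1 GiB available: passes
```

Because the check reads only `data.nbytes` (cheap metadata on a dask array), it costs nothing even for a 30 TB graph.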
### Benchmarks (512x512 array, 19 zones)

| Backend    | Function | Wall time (ms) | Peak tracemalloc (MB) | RSS delta (KB) |
|------------|----------|----------------|-----------------------|----------------|
| numpy      | stats    | 18.9           | 25.46                 | 78,544         |
| dask+numpy | stats    | 1,120.67       | 37.54                 | 45,980         |
| numpy      | regions  | 16,334.96      | 5.62                  | 0              |
| dask+numpy | regions  | 15,436.37     | 7.23                  | 11,520         |
Dask stats uses more memory than numpy (37.5 MB vs 25.5 MB) on this small array and runs 59x slower. Dask regions just calls `.compute()` and falls through to the numpy path.
### Expected behavior
Dask backends should bound memory by chunk size. `stats()` and `crosstab()` can gather unique zone/category values per chunk and merge them. `regions()` should either do chunked connected-component labeling or raise `MemoryError` early for oversized inputs.
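The per-chunk gather-and-merge for unique values can be sketched as follows. The names `chunked_unique` and `_chunk_unique` are hypothetical, not the module's existing helpers; the sketch assumes only numpy and dask.

```python
import numpy as np
import dask
import dask.array as da

def _chunk_unique(block):
    # Unique finite values of one in-memory chunk; the result is tiny
    # relative to the chunk itself.
    vals = np.unique(np.asarray(block))
    return vals[np.isfinite(vals)]

def chunked_unique(arr):
    # Replacement pattern for da.unique(arr).compute(): one small
    # delayed result per chunk, merged on the driver. Peak memory is
    # bounded by the chunks in flight plus the merged unique set,
    # never the full array.
    delayed_chunks = arr.to_delayed().ravel()
    partials = dask.compute(
        *(dask.delayed(_chunk_unique)(d) for d in delayed_chunks)
    )
    return np.unique(np.concatenate(partials))

zones = da.from_array(
    np.array([[1.0, 2.0, np.nan],
              [2.0, 3.0, np.inf]]),
    chunks=(1, 3),
)
chunked_unique(zones)  # array([1., 2., 3.])
```

With tens of zones and a handful of categories, the per-chunk unique sets are a few bytes each, so the merge step stays negligible even when the source array does not fit in RAM.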
### Impact

A 30 TB dataset on a 16 GB machine: all four paths OOM before doing any work.