Describe the bug
When proximity() is called on a dask array with the default max_distance=np.inf, it rechunks the entire raster into a single chunk at line 1228:
raster.data = raster.data.rechunk({0: height, 1: width})
This forces the entire dataset into one worker's memory. For a 30TB raster on a 16GB machine, this OOMs immediately.
The comment in the code says "make sure your data fit your memory" but np.inf is the default, so every dask user hits this path unless they explicitly set max_distance.
Benchmarks (512x512 array, 100 targets)
| Backend |
max_distance |
Wall time (ms) |
Peak tracemalloc (MB) |
| numpy |
50.0 |
747 |
49.27 |
| numpy |
inf |
679 |
27.31 |
| dask+numpy |
50.0 |
2,938 |
77.07 |
| dask+numpy |
inf |
64 |
16.11 |
The dask+inf path is fast on 512x512 because it rechunks into one piece and runs the line-sweep. At scale this single-chunk approach is the OOM vector.
Expected behavior
For dask inputs with max_distance=np.inf, raise a ValueError explaining that infinite max_distance requires the full array in memory and that a finite value must be set for out-of-core processing.
Impact
Every dask user who calls proximity(dask_raster, target_values=[...]) without explicitly setting max_distance hits the single-chunk rechunk and OOMs on large data.
Describe the bug
When
proximity()is called on a dask array with the defaultmax_distance=np.inf, it rechunks the entire raster into a single chunk at line 1228:This forces the entire dataset into one worker's memory. For a 30TB raster on a 16GB machine, this OOMs immediately.
The comment in the code says "make sure your data fit your memory" but
np.infis the default, so every dask user hits this path unless they explicitly setmax_distance.Benchmarks (512x512 array, 100 targets)
The dask+inf path is fast on 512x512 because it rechunks into one piece and runs the line-sweep. At scale this single-chunk approach is the OOM vector.
Expected behavior
For dask inputs with
max_distance=np.inf, raise aValueErrorexplaining that infinite max_distance requires the full array in memory and that a finite value must be set for out-of-core processing.Impact
Every dask user who calls
proximity(dask_raster, target_values=[...])without explicitly settingmax_distancehits the single-chunk rechunk and OOMs on large data.