
proximity() with default max_distance=inf rechunks dask array to single chunk, OOMs on large data #1111

@brendancol

Description

Describe the bug

When proximity() is called on a dask array with the default max_distance=np.inf, it rechunks the entire raster into a single chunk at line 1228:

```python
raster.data = raster.data.rechunk({0: height, 1: width})
```

This forces the entire dataset into one worker's memory. For a 30TB raster on a 16GB machine, this OOMs immediately.

The comment in the code says "make sure your data fit your memory", but np.inf is the default, so every dask user hits this path unless they explicitly set max_distance.
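To make the failure mode concrete, here is a minimal sketch of the rechunk pattern in isolation (the shape and chunk size are illustrative, not taken from the issue). Collapsing to a single chunk means one worker must materialize the whole array at once:

```python
import dask.array as da

# Hypothetical raster; any large dask-backed raster hits the same path
# when proximity() is called with the default max_distance=np.inf.
height, width = 8192, 8192
data = da.zeros((height, width), chunks=(1024, 1024), dtype="float32")
print(data.numblocks)  # (8, 8): 64 independent chunks

# The line at issue: rechunk to a single chunk spanning the full array.
# The graph is still lazy here, but computing it forces one worker to
# hold all height * width * 4 bytes in memory simultaneously.
single = data.rechunk({0: height, 1: width})
print(single.numblocks)  # (1, 1)
```

For the 30TB raster from the report, that single chunk is roughly 2000x larger than the 16GB machine's RAM, so the OOM is immediate and unavoidable.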

Benchmarks (512x512 array, 100 targets)

| Backend     | max_distance | Wall time (ms) | Peak tracemalloc (MB) |
|-------------|--------------|----------------|-----------------------|
| numpy       | 50.0         | 747            | 49.27                 |
| numpy       | inf          | 679            | 27.31                 |
| dask+numpy  | 50.0         | 2,938          | 77.07                 |
| dask+numpy  | inf          | 64             | 16.11                 |

The dask+inf path looks fast at 512x512 only because the rechunk collapses the array into a single chunk and the line-sweep runs once, entirely in memory. At scale, that same single-chunk step is the OOM vector.
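For reproducibility, a minimal harness matching the table's two columns (wall time in ms, peak traced allocation in MB) can be sketched with the stdlib alone. This is an assumed reconstruction, not the exact script used for the numbers above:

```python
import time
import tracemalloc


def bench(fn):
    """Run fn once, returning (wall_ms, peak_mb).

    Uses tracemalloc, which tracks Python-level allocations only;
    native buffers allocated outside the Python allocator may not
    be fully captured.
    """
    tracemalloc.start()
    t0 = time.perf_counter()
    fn()
    wall_ms = (time.perf_counter() - t0) * 1e3
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return wall_ms, peak / 1e6
```

Note that tracemalloc under-reports memory held in NumPy buffers on some versions, so the "Peak tracemalloc (MB)" column is best read as a relative, not absolute, measure.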

Expected behavior

For dask inputs with max_distance=np.inf, raise a ValueError explaining that infinite max_distance requires the full array in memory and that a finite value must be set for out-of-core processing.

Impact

Every dask user who calls proximity(dask_raster, target_values=[...]) without explicitly setting max_distance hits the single-chunk rechunk and OOMs on large data.


Labels: bug (Something isn't working), high-priority, oom (Out-of-memory risk with large datasets), proximity tools (Proximity, allocation, direction, cost distance)
