
balanced_allocation holds N full-grid cost surfaces, OOMs on large dask inputs #1114

@brendancol

Description


Describe the bug

balanced_allocation() computes N cost-distance surfaces (one per source) and holds all of them in memory simultaneously at lines 301-306. For N sources on a 30TB raster, this requires N * 30TB of RAM.
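The pattern, reduced to a minimal numpy sketch (function and variable names here are stand-ins to illustrate the memory shape, not the actual implementation):

```python
import numpy as np

def balanced_allocation_pattern(sources, shape):
    """Illustrates the problematic pattern: one full-grid cost
    surface per source is built and all are kept alive at once.
    np.full is a stand-in for the real cost-distance computation."""
    cost_stack = []
    for src in sources:
        # Each surface is a full (H, W) float64 array: H * W * 8 bytes.
        surface = np.full(shape, float(src))
        cost_stack.append(surface)
    # All N surfaces resident simultaneously: N * H * W * 8 bytes peak.
    stacked = np.stack(cost_stack)   # (N, H, W)
    return stacked.argmin(axis=0)    # winning source index per cell

alloc = balanced_allocation_pattern([1, 2, 3], (128, 128))
```

Peak memory scales linearly with the number of sources, which is what makes large dask inputs fatal.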

Three materialization points:

  1. _extract_sources (line 56) calls _to_numpy(raster.data), which calls .compute() on the full dask source raster just to find unique source IDs.

  2. _allocate_from_costs (line 125) and _allocate_biased (line 173) call da.argmin().compute(), which materializes an (H, W) index array. That output itself is fine (it's small), but da.stack(cost_stack) at lines 122 and 172 builds an (N, H, W) intermediate.

  3. fric_data.compute() at line 311 materializes the full friction surface.
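For point 2, one possible mitigation (a sketch, not the library's API: `running_argmin` and `cost_surfaces` are hypothetical names) is a pairwise running reduction that keeps only two (H, W) surfaces live instead of the full (N, H, W) stack:

```python
import numpy as np
import dask.array as da

def running_argmin(cost_surfaces):
    """Fold a running (best_cost, best_idx) pair over lazy cost
    surfaces, avoiding the (N, H, W) da.stack intermediate."""
    best_cost = None
    best_idx = None
    for i, cost in enumerate(cost_surfaces):
        if best_cost is None:
            best_cost = cost
            best_idx = da.zeros_like(cost, dtype=np.int64)
        else:
            take_new = cost < best_cost
            best_cost = da.where(take_new, cost, best_cost)
            best_idx = da.where(take_new, i, best_idx)
    return best_idx

# Three lazy surfaces; source 1 has the lowest cost everywhere.
surfaces = [da.from_array(np.full((4, 4), v), chunks=2)
            for v in (3.0, 1.0, 2.0)]
idx = running_argmin(surfaces).compute()
```

Whether this is preferable to chunk-aligned da.stack depends on how the cost surfaces are produced; it's shown only to make the (N, H, W) concern concrete.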

Benchmarks (128x128 array, 3 sources, 3 iterations)

| Backend | Wall time (ms) | Peak tracemalloc (MB) | RSS delta (KB) |
|---------|----------------|-----------------------|----------------|
| numpy   | 14.08          | 33.70                 | 113,288        |
| dask    | 284.44         | 48.65                 | 51,732         |
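For reference, a minimal probe in the spirit of these measurements (the actual benchmark script isn't shown here; the RSS-delta column would need an additional OS-level probe, omitted in this sketch):

```python
import time
import tracemalloc
import numpy as np

def measure(fn, *args):
    """Return (wall_ms, peak_heap_mib) for a single call to fn."""
    tracemalloc.start()
    t0 = time.perf_counter()
    fn(*args)
    wall_ms = (time.perf_counter() - t0) * 1e3
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return wall_ms, peak / 2**20

# Stand-in workload: stack 3 full-grid surfaces, like the 3-source case.
wall, peak_mb = measure(
    lambda: np.stack([np.random.rand(128, 128) for _ in range(3)]))
```

Note that tracemalloc only sees Python-level allocations, which is why the RSS delta can diverge from the tracemalloc peak.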

Expected behavior

Add a memory guard that estimates N_sources * array_bytes before computing any cost surfaces, and raise MemoryError with a clear message when the estimate exceeds available memory. Replace _extract_sources with a da.unique-based source-ID discovery so the full raster isn't materialized just to find unique values.
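A sketch of what both fixes could look like (function names and the `available_bytes` parameter are hypothetical; the real _extract_sources signature may differ):

```python
import numpy as np
import dask.array as da

def check_cost_surface_memory(n_sources, shape, dtype=np.float64,
                              available_bytes=None):
    """Hypothetical guard: estimate N * H * W * itemsize up front and
    fail fast with an actionable message instead of OOMing later."""
    needed = n_sources * int(np.prod(shape)) * np.dtype(dtype).itemsize
    if available_bytes is not None and needed > available_bytes:
        raise MemoryError(
            f"balanced_allocation needs ~{needed / 2**30:.1f} GiB for "
            f"{n_sources} cost surfaces of shape {shape}, but only "
            f"{available_bytes / 2**30:.1f} GiB is available. Reduce "
            "the number of sources or coarsen the raster.")
    return needed

def extract_source_ids(source_data):
    """da.unique-based replacement for _extract_sources: finds unique
    source IDs chunk-wise; only the small 1-D result is computed."""
    vals = da.unique(source_data).compute()
    return vals[(vals != 0) & ~np.isnan(vals)]
```

da.unique reduces per chunk before combining, so peak memory is bounded by the chunk size plus the (small) set of unique values, rather than the full raster.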

Impact

N sources on a 30TB raster requires N * 30TB of peak memory; even 2 sources need 60TB.


Labels

- bug (Something isn't working)
- high-priority
- oom (Out-of-memory risk with large datasets)
- proximity tools (Proximity, allocation, direction, cost distance)
