Describe the bug
balanced_allocation() computes N cost-distance surfaces (one per source) and holds all of them in memory simultaneously at lines 301-306. For N sources on a 30TB raster, this needs N * 30TB of RAM.
Three materialization points:
- _extract_sources (line 56) calls _to_numpy(raster.data) which .compute()s the full dask source raster just to find unique source IDs.
- _allocate_from_costs (line 125) and _allocate_biased (line 173) call da.argmin().compute() which materializes an (H,W) index array. This one is actually fine (small output), but da.stack(cost_stack) at lines 122 and 172 builds (N, H, W) intermediates.
- fric_data.compute() at line 311 materializes the full friction surface.
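For the first point, the source IDs can probably be found without pulling the whole raster into memory. A minimal sketch, assuming that non-zero, finite cells mark sources; unique_source_ids is a hypothetical helper for illustration, not the existing _extract_sources:

```python
import dask.array as da
import numpy as np


def unique_source_ids(data):
    """Return the sorted unique source IDs in `data`.

    Hypothetical helper. Assumption: a cell is a source when its value
    is finite and non-zero.
    """
    if isinstance(data, da.Array):
        # da.unique reduces each chunk to its unique values and then
        # combines them, so .compute() only brings back the ID list,
        # not the (H, W) raster.
        ids = da.unique(data).compute()
    else:
        ids = np.unique(data)
    # Drop the background value and any NaN / inf entries.
    return ids[np.isfinite(ids) & (ids != 0)]
```

With a dask-backed raster this keeps peak memory near one chunk plus the ID list, as long as the number of distinct IDs is small.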
Benchmarks (128x128 array, 3 sources, 3 iterations)
| Backend | Wall time (ms) | Peak tracemalloc (MB) | RSS delta (KB) |
| --- | --- | --- | --- |
| numpy | 14.08 | 33.70 | 113,288 |
| dask | 284.44 | 48.65 | 51,732 |
Expected behavior
Add a memory guard that estimates N_sources * array_bytes before computing cost surfaces, and raise MemoryError with a clear message when the estimate exceeds available memory. Change _extract_sources to use da.unique for source ID discovery so the full raster isn't materialized just to find unique values.
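The guard could look roughly like the sketch below. It assumes each cost surface has the same shape as the input raster. The function name, its parameters, and the message wording are hypothetical. The caller supplies available_bytes, for example from psutil.virtual_memory().available if psutil is installed.

```python
import math

import numpy as np


def check_cost_surface_memory(n_sources, shape, dtype, available_bytes):
    """Raise MemoryError if N cost surfaces would not fit in memory.

    Hypothetical helper. Assumption: every cost surface has the same
    shape as the raster and uses `dtype`.
    Returns the estimated number of bytes when the check passes.
    """
    # math.prod works on Python ints, so very large rasters do not overflow.
    array_bytes = math.prod(shape) * np.dtype(dtype).itemsize
    required = n_sources * array_bytes
    if required > available_bytes:
        raise MemoryError(
            f"balanced_allocation needs about {required / 2**30:.1f} GiB "
            f"for {n_sources} cost surfaces of shape {tuple(shape)}, "
            f"but only {available_bytes / 2**30:.1f} GiB is available."
        )
    return required
```

Calling this before the cost surfaces are computed turns an out-of-memory crash into an early, readable error.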
Impact
N sources on a 30TB raster: needs N * 30TB. Even 2 sources = 60TB.