Problem
natural_breaks(agg, num_sample=None) on the numpy and cupy backends passes the entire flattened raster to _run_jenks, which allocates two (n_data + 1, n_classes + 1) float64 matrices in _run_numpy_jenks_matrices. Memory cost scales as 16 * n_data * (k + 1) bytes.
For a 10000 × 10000 raster with default k=5, that's ~9.6 GB per matrix pair, before considering the O(n²) Jenks runtime.
Reproduction
import numpy as np
import xarray as xr
from xrspatial.classify import natural_breaks
# 1000x1000 = 1M points. Matrices are 1M * 6 * 8 * 2 = ~96 MB.
# 10000x10000 would be ~9.6 GB.
agg = xr.DataArray(np.random.RandomState(0).rand(1000, 1000))
natural_breaks(agg, num_sample=None, k=5) # runs, but peak memory >> input size
Root Cause
_compute_natural_break_bins in xrspatial/classify.py:635:
if num_sample is not None and num_sample < num_data:
# ... sample from data
else:
sample_data = data_flat_np # full data
The dask paths (_run_dask_natural_break, _run_dask_cupy_natural_break) already cap num_sample at num_data and sample accordingly. The numpy and cupy paths pass num_sample straight through and skip sampling entirely when it is None.
Fix
Cap num_sample before the bin computation on the numpy and cupy paths, matching the dask cap. A conservative default keeps the Jenks matrices bounded regardless of input size.
Scope
natural_breaks() numpy backend
natural_breaks() cupy backend
- Dask paths already protected
Found by
Security sweep (Cat 1 - Unbounded Allocation, MEDIUM severity).
Problem
natural_breaks(agg, num_sample=None)on the numpy and cupy backends passes the entire flattened raster to_run_jenks, which allocates two(n_data + 1, n_classes + 1)float64 matrices in_run_numpy_jenks_matrices. Memory cost scales as16 * n_data * (k + 1)bytes.For a 10000 × 10000 raster with default
k=5, that's ~9.6 GB per matrix pair, before considering the O(n²) Jenks runtime.Reproduction
Root Cause
_compute_natural_break_binsinxrspatial/classify.py:635:The dask paths (
_run_dask_natural_break,_run_dask_cupy_natural_break) already capnum_sampleatnum_dataand sample accordingly. The numpy and cupy paths passnum_samplestraight through and skip sampling entirely when it isNone.Fix
Cap
num_samplebefore the bin computation on the numpy and cupy paths, matching the dask cap. A conservative default keeps the Jenks matrices bounded regardless of input size.Scope
natural_breaks()numpy backendnatural_breaks()cupy backendFound by
Security sweep (Cat 1 - Unbounded Allocation, MEDIUM severity).