natural_breaks(num_sample=None) unbounded allocation on numpy/cupy backends #1246

@brendancol

Description

Problem

natural_breaks(agg, num_sample=None) on the numpy and cupy backends passes the entire flattened raster to _run_jenks, which allocates two (n_data + 1, n_classes + 1) float64 matrices in _run_numpy_jenks_matrices. Memory cost scales as 16 * n_data * (k + 1) bytes.

For a 10000 × 10000 raster with the default k=5, that's ~9.6 GB for the two matrices combined, before considering the O(n²) Jenks runtime.
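The cost formula above can be sanity-checked with a few lines of arithmetic. This is a sketch, not code from xrspatial; `jenks_matrix_bytes` is a hypothetical helper that just evaluates 2 × 8 × (n_data + 1) × (k + 1):

```python
def jenks_matrix_bytes(n_data: int, k: int = 5) -> int:
    """Approximate bytes for the pair of (n_data + 1, n_classes + 1)
    float64 Jenks matrices (8 bytes per element, two matrices)."""
    return 2 * 8 * (n_data + 1) * (k + 1)

# 1000 x 1000 raster (1M points), k=5: ~96 MB
print(jenks_matrix_bytes(1_000_000) / 1e6)   # ~96.0 MB
# 10000 x 10000 raster (100M points), k=5: ~9.6 GB
print(jenks_matrix_bytes(100_000_000) / 1e9)  # ~9.6 GB
```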

Reproduction

import numpy as np
import xarray as xr
from xrspatial.classify import natural_breaks

# 1000x1000 = 1M points. Matrices are 1M * 6 * 8 * 2 = ~96 MB.
# 10000x10000 would be ~9.6 GB.
agg = xr.DataArray(np.random.RandomState(0).rand(1000, 1000))
natural_breaks(agg, num_sample=None, k=5)  # runs, but peak memory >> input size

Root Cause

_compute_natural_break_bins in xrspatial/classify.py:635:

if num_sample is not None and num_sample < num_data:
    # ... sample from data
else:
    sample_data = data_flat_np  # full data

The dask paths (_run_dask_natural_break, _run_dask_cupy_natural_break) already cap num_sample at num_data and sample accordingly. The numpy and cupy paths pass num_sample straight through and skip sampling entirely when it is None.

Fix

Cap num_sample before the bin computation on the numpy and cupy paths, matching the dask cap. A conservative default keeps the Jenks matrices bounded regardless of input size.

Scope

  • natural_breaks() numpy backend
  • natural_breaks() cupy backend
  • Dask paths already protected

Found by

Security sweep (Cat 1 - Unbounded Allocation, MEDIUM severity).
