Cut head_tail_breaks and box_plot dask re-scans #1213

Merged
brendancol merged 1 commit into master from perf/classify-head-tail-and-box-plot on Apr 16, 2026
@brendancol
Contributor

Summary

  • _run_dask_head_tail_breaks: persist data_clean once, track the running mask count across iterations, and fuse the mean and head-count reductions into a single dask.compute() call per iteration. Cuts per-iteration graph traversals from 3 to 1 and eliminates the re-read on every loop pass.
  • _run_dask_box_plot (new) and _run_dask_cupy_box_plot: replace data_clean[da.isfinite(data_clean)] (which forces compute_chunk_sizes) with the same seeded _generate_sample_indices sampler that natural_breaks and quantile already use. Percentiles are then computed on the finite portion of the sample in numpy.
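
A minimal sketch of the fused-reduction loop described in the first bullet, not the code merged in this PR. The function name `head_tail_breaks_sketch`, the `max_iter` and `head_limit` parameters, and the 0.4 head-fraction stopping rule are assumptions made for illustration; only `persist()`, the running count, and the single `dask.compute()` per iteration come from the summary.

```python
import dask
import dask.array as da
import numpy as np


def head_tail_breaks_sketch(data, max_iter=50, head_limit=0.4):
    """Illustrative head/tail loop with one dask.compute() per iteration."""
    # Assumed cleaning step: mark non-finite cells as NaN, then persist so
    # later iterations reuse the in-memory chunks instead of re-reading.
    data_clean = da.where(da.isfinite(data), data, np.nan).persist()

    # One up-front pass: overall mean and number of valid cells.
    mean, count = dask.compute(
        da.nanmean(data_clean), da.isfinite(data_clean).sum()
    )
    bins = [float(mean)]
    prev_count = int(count)  # running count carried across iterations

    for _ in range(max_iter):
        mask = data_clean > bins[-1]  # NaN compares False, so it drops out
        head = da.where(mask, data_clean, np.nan)
        # Fuse the head mean and head count into a single graph traversal.
        head_mean, head_count = dask.compute(da.nanmean(head), mask.sum())
        head_count = int(head_count)
        # Assumed stopping rule: stop once the head is no longer a minority.
        if head_count < 2 or head_count / prev_count > head_limit:
            break
        bins.append(float(head_mean))
        prev_count = head_count  # no separate "old mask" count needed
    return bins
```

Because the previous head count is kept in a Python variable, the only reductions left inside the loop are the two passed to `dask.compute()` together.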

Motivation

Static analysis flagged three HIGH-severity patterns on the dask backends of classify:

  1. _run_dask_head_tail_breaks ran .compute() inside a while loop for the mean, new-mask count, and total-mask count — 3 full graph traversals per iteration, N+1 iterations typical.
  2. _run_box_plot(..., module=da) used boolean fancy indexing on a dask array, which triggers compute_chunk_sizes() and performs an extra full scan before da.percentile runs.
  3. _run_dask_cupy_box_plot had the same pattern plus a full map_blocks(cupy.asnumpy) over the dataset before sampling.
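
The cost behind items 2 and 3 can be seen on a toy array. This is an illustrative sketch, not code from xrspatial: boolean indexing on a dask array leaves the output chunk sizes unknown (NaN), and resolving them with `compute_chunk_sizes()` requires evaluating the mask over every chunk.

```python
import dask.array as da
import numpy as np

# Toy input with non-finite values, split into two chunks of three.
x = da.from_array(np.array([1.0, np.nan, 3.0, np.inf, 5.0, 6.0]), chunks=3)

# Boolean indexing: dask cannot know how many elements survive per chunk
# until the mask has actually been evaluated.
finite = x[da.isfinite(x)]
unknown_chunks = finite.chunks  # chunk sizes are NaN at this point

# Resolving the sizes runs the mask over the whole array (a full scan),
# and updates the array's chunk metadata in place.
finite.compute_chunk_sizes()
known_chunks = finite.chunks
```

On a raster-sized input that extra pass is a full read of the dataset before any percentile work starts, which is the pattern this PR removes.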

Benchmark

head_tail_breaks dask path on a 256×256 gamma-distributed float64 array, chunks=64:

| Backend | Metric | Before | After | Ratio | Verdict |
|---|---|---|---|---|---|
| dask+numpy | wall_ms (med) | 912 | 339 | 0.37 | IMPROVED |

box_plot dask path on 512×512, chunks=128:

| Backend | Metric | After | Verdict |
|---|---|---|---|
| dask+numpy | wall_ms (med) | 57 | OK (no baseline; old path scaled with full-raster scan before percentile) |

Test plan

  • pytest xrspatial/tests/test_classify.py — 85 tests pass
  • Manual smoke: head_tail_breaks dask output has the same bin count as numpy path on the same seed
  • Manual smoke: box_plot dask output uses sampled quantiles; verify output classes match numpy path within sampling tolerance

Notes

Sample size for the box_plot dask path is capped at 200,000 elements (or the full dataset if smaller). This matches the pattern used by natural_breaks and keeps the percentile computation O(sample) rather than O(dataset).
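
A rough sketch of the capped, seeded sampling approach, under stated assumptions. `sample_indices` is a hypothetical stand-in for the private `_generate_sample_indices` helper (its real signature is not shown in this PR), and `box_plot_breaks_sketch`, the `hinge` parameter, and the returned break layout are likewise illustrative.

```python
import dask
import dask.array as da
import numpy as np

MAX_SAMPLE = 200_000  # cap stated in the note above


def sample_indices(n, k, seed=0):
    """Hypothetical stand-in for the library's seeded index sampler."""
    if k >= n:
        return np.arange(n)  # small input: use every element
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n, size=k, replace=False))


def box_plot_breaks_sketch(data, hinge=1.5, seed=0):
    """Quartile-based breaks from a bounded sample of a dask array."""
    flat = data.ravel()
    idx = sample_indices(flat.size, min(MAX_SAMPLE, flat.size), seed)
    # Gather the sample and the global max in one dask.compute() call.
    sample, data_max = dask.compute(flat[idx], da.nanmax(flat))
    # Drop non-finite values in numpy, where the array is already small.
    sample = sample[np.isfinite(sample)]
    q1, q2, q3 = np.percentile(sample, [25, 50, 75])
    iqr = q3 - q1
    breaks = np.array([q1 - hinge * iqr, q1, q2, q3, q3 + hinge * iqr])
    return breaks, float(data_max)
```

Filtering for finite values after the gather keeps the dask graph free of boolean indexing, so the percentile step costs O(sample) regardless of raster size.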

head_tail_breaks (dask) called .compute() three times per iteration of
its while-loop (mean, new-mask count, old-mask count) and rebuilt the
same data_clean graph every time. For N iterations that was 3N+1 full
graph traversals. Persist data_clean once, track the running mask count
across iterations, and fuse the mean+head-count reductions into a single
dask.compute() per iteration. Wall time drops from ~910 ms to ~340 ms
on 256x256 chunks=64.

box_plot (dask and dask+cupy) did data_clean[da.isfinite(data_clean)]
which is boolean fancy indexing on a dask array. That forces
compute_chunk_sizes, materializing a full scan just to know the output
chunk layout before percentile can run. Swap in the same seeded
_generate_sample_indices sampler that natural_breaks/quantile already
use: gather 200k indices on the dask array, compute the sample and the
global nanmax in one dask.compute() call, and take percentiles on the
finite portion of the sample in numpy.

@github-actions (bot) added the performance label (PR touches performance-sensitive code) on Apr 16, 2026
@brendancol merged commit 7fa9e04 into master on Apr 16, 2026
11 checks passed
@brendancol deleted the perf/classify-head-tail-and-box-plot branch on May 4, 2026 13:05