Skip to content

geotiff: in-place nodata mask on GPU (#1934)#1937

Merged
brendancol merged 3 commits into
mainfrom
rockout-performance-low-geotiff-2026-05-15
May 15, 2026
Merged

geotiff: in-place nodata mask on GPU (#1934)#1937
brendancol merged 3 commits into
mainfrom
rockout-performance-low-geotiff-2026-05-15

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • _apply_nodata_mask_gpu ran cupy.where(arr == sentinel, nan, arr) for both the float path and the post-astype(float64) integer path. cupy.where allocates a fresh output buffer the same shape as the input, so each call paid one chunk-sized device allocation plus the corresponding device-to-device copy.
  • All three call sites in _backends/gpu.py pass a freshly decoded GPU buffer that no caller-visible state aliases:
    • gpu.py:387 (stripped read), gpu.py:684 (tiled GPU decode), gpu.py:1072 (per-chunk delayed task).
  • Replaced the cupy.where step with cupy.putmask so the existing buffer is mutated in place. Drops one chunk-sized allocation per call; matters most on the dask+cupy path where the chunk task runs once per chunk.

Test plan

  • pytest xrspatial/geotiff/tests/test_apply_nodata_mask_gpu_inplace_1934.py -x -q -- 7 new tests pass (correctness on float + int paths, in-place pointer guarantee, pool used_bytes ceiling on both paths, NaN-sentinel no-op, nodata=None passthrough).
  • pytest xrspatial/geotiff/tests/test_nodata_nan_int_1774.py xrspatial/geotiff/tests/test_nodata_out_of_range_1581.py xrspatial/geotiff/tests/test_gpu_nodata_1542.py xrspatial/geotiff/tests/test_miniswhite_nodata_1809.py xrspatial/geotiff/tests/test_gds_chunked_gpu_parity_1896.py -q -- 52 existing nodata-related tests pass.

Closes #1934.

`_apply_nodata_mask_gpu` ran `cupy.where(arr == sentinel, nan, arr)` for
both the float and the post-`astype(float64)` integer paths, which
allocates a fresh output buffer the same shape as the input. Every call
site passes a freshly decoded GPU buffer that no caller-visible state
aliases, so writing NaN into the existing buffer with `cupy.putmask`
drops one chunk-sized device allocation per call.

Adds `test_apply_nodata_mask_gpu_inplace_1934.py` covering the float
correctness path, the in-place pointer guarantee, a pool `used_bytes`
ceiling for both the float and integer paths, the NaN-sentinel no-op,
and the `nodata=None` passthrough.
Note that #1934 (`_apply_nodata_mask_gpu` in-place mutation) was filed
and fixed during the 2026-05-15 rockout pass on geotiff.
Copilot AI review requested due to automatic review settings May 15, 2026 14:34
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes GeoTIFF GPU nodata masking by avoiding an extra device-sized allocation: _apply_nodata_mask_gpu now mutates the already-owned CuPy buffer in place (using cupy.putmask) instead of producing a new array via cupy.where. This reduces allocator pressure especially on dask+cupy per-chunk execution.

Changes:

  • Replace cupy.where(...) with in-place cupy.putmask(...) in _apply_nodata_mask_gpu for both float and integer→float64 paths.
  • Add new GPU regression tests covering correctness plus in-place/pool-allocation expectations.
  • Update internal performance sweep tracking state.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
xrspatial/geotiff/_backends/_gpu_helpers.py Switch nodata sentinel replacement to cupy.putmask to avoid allocating a fresh output array.
xrspatial/geotiff/tests/test_apply_nodata_mask_gpu_inplace_1934.py New regression tests for correctness and allocation/in-place behavior around _apply_nodata_mask_gpu.
.claude/sweep-performance-state.csv Record the performance sweep note for the change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +29 to +43
def _gpu_available() -> bool:
if importlib.util.find_spec("cupy") is None:
return False
try:
import cupy
return bool(cupy.cuda.is_available())
except Exception:
return False


_HAS_GPU = _gpu_available()
_gpu_only = pytest.mark.skipif(
not _HAS_GPU,
reason="cupy + CUDA required",
)
Comment on lines +156 to +179
arr_gpu = cupy.full((512, 512), 3, dtype=cupy.uint16)
arr_gpu[0, 0] = 1 # ensure non-sentinel pixel exists

pool = cupy.get_default_memory_pool()
cupy.cuda.Stream.null.synchronize()
used_before = pool.used_bytes()

out = _apply_nodata_mask_gpu(arr_gpu, 3)
cupy.cuda.Stream.null.synchronize()
used_after = pool.used_bytes()

# Required: one float64 buffer (512*512*8 = 2 MiB) from astype.
# Pre-fix would have allocated a second float64 buffer for cupy.where
# (another 2 MiB) on top of that.
float64_bytes = out.nbytes
growth = used_after - used_before
# Allow some slack for the bool mask + .any() scalar (well under
# one float64 buffer of slack).
assert growth < 2 * float64_bytes, (
f"unexpected allocation growth {growth} bytes >= "
f"2 * float64_bytes {2 * float64_bytes}; pre-fix double-alloc"
)


- Reuse the shared ``requires_gpu`` marker from
  ``xrspatial/geotiff/tests/conftest.py`` instead of redefining a
  local ``_HAS_GPU`` import-and-runtime check. The conftest helper
  already validates both ``import cupy`` and
  ``cupy.cuda.is_available()``.
- Run the two memory-pool allocation tests under an isolated
  ``MemoryPool`` allocator and switch the measurement from
  ``used_bytes`` to ``total_bytes`` (called after
  ``free_all_blocks``) so the assertion cannot be masked by the
  input buffer being refcount-freed back to the pool before
  ``used_after`` is sampled.
@brendancol brendancol merged commit 6bff87e into main May 15, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: _apply_nodata_mask_gpu allocates instead of mutating in-place

2 participants