perf(geotiff): batch _try_nvjpeg2k_batch_decode allocations + sync (#2107) by brendancol · Pull Request #2110 · xarray-contrib/xarray-spatial

brendancol · 2026-05-19T04:28:35Z

Summary

Fixes #2107. The nvJPEG2000 batch decode helper in _gpu_decode.py ran a per-tile loop that allocated one fresh cupy.empty(tile_height * pitch) buffer per component (samples buffers per tile) and called cupy.cuda.Device().synchronize() once per tile. The per-tile sync forced default-stream serialisation, which defeated nvJPEG2000's internal pipelining, and each per-tile allocation round-tripped through cupy's memory pool.

Pre-allocate a single d_comp_pool sized n_tiles * samples * tile_height * pitch (guarded by _check_gpu_memory) and derive per-tile / per-component output buffers as slab views into the pool. Replace the per-tile sync with a single batch-end sync.

Same shape as the prior fixes for sibling helpers in this module:

_try_nvcomp_from_device_bufs (perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path) #1659) -- single buffer + pointer offsets
_try_kvikio_read_tiles (perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads #1688) -- single buffer + batched submit
_nvcomp_batch_compress (perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations #1712) -- single pool + concat + single get
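
For reviewers who want the slab math spelled out, here is a minimal sketch of the single-pool layout described above. It is not the code in _gpu_decode.py: a bytearray and memoryview stand in for the cupy pool so the sketch runs on a CPU-only host, slab_views is a hypothetical helper, and the definition of per_tile_comp_bytes as samples * tile_height * pitch is an assumption based on the stated pool size.

```python
def slab_views(pool, n_tiles, samples, tile_height, pitch):
    """Return views[tile][component], each a zero-copy slice of pool.

    Hypothetical helper: `pool` is a bytearray standing in for the
    single device allocation (d_comp_pool in the PR).
    """
    comp_bytes = tile_height * pitch              # one component plane
    per_tile_comp_bytes = samples * comp_bytes    # assumed meaning of the name
    assert len(pool) == n_tiles * per_tile_comp_bytes
    view = memoryview(pool)
    views = []
    for t in range(n_tiles):
        base = t * per_tile_comp_bytes
        # Slicing a memoryview never copies, so no allocation happens here.
        views.append([
            view[base + c * comp_bytes: base + (c + 1) * comp_bytes]
            for c in range(samples)
        ])
    return views


n_tiles, samples, tile_height, pitch = 4, 3, 8, 16
# One allocation for the whole batch instead of n_tiles * samples of them.
d_comp_pool = bytearray(n_tiles * samples * tile_height * pitch)
views = slab_views(d_comp_pool, n_tiles, samples, tile_height, pitch)

# A write through a slab view lands in the shared pool.
views[2][1][0] = 0xFF
```

With cupy, slicing a 1-D array returns a view in the same way, so the same offsets would apply to a device buffer.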

Test plan

xrspatial/geotiff/tests/test_nvjpeg2k_single_alloc_2107.py (7 tests):

AST-level: no cupy.empty inside the per-tile loop
AST-level: no Device().synchronize() inside the per-tile loop
Source contains d_comp_pool and per_tile_comp_bytes identifiers
_check_gpu_memory guard present before the pool allocation
Lib-absent short-circuit returns None without touching cupy
Unsupported-dtype branch cleans up handles in expected order
Cupy-only slab-non-overlap check (gated on gpu mark; libnvjpeg2k.so is not part of cuda-toolkit's default install so end-to-end coverage stays on RAPIDS-enabled hosts)
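
The AST-level checks in the list above could look roughly like this. This is a sketch, not the contents of test_nvjpeg2k_single_alloc_2107.py: calls_inside_for_loops is a hypothetical helper, and the BEFORE / AFTER snippets are made-up stand-ins for the real function source. It needs Python 3.9+ for ast.unparse.

```python
import ast


def calls_inside_for_loops(source, dotted_name):
    """Return sorted line numbers of calls to dotted_name in any for-loop body."""
    hits = set()
    for loop in ast.walk(ast.parse(source)):
        if not isinstance(loop, ast.For):
            continue
        for stmt in loop.body:
            for node in ast.walk(stmt):
                # ast.unparse renders the callee, e.g. "cupy.empty".
                if isinstance(node, ast.Call) and ast.unparse(node.func) == dotted_name:
                    hits.add(node.lineno)
    return sorted(hits)


# Illustrative source only; the names below are never executed, just parsed.
BEFORE = """
for tile in tiles:
    buf = cupy.empty(tile_height * pitch)
    decode(tile, buf)
"""

AFTER = """
pool = cupy.empty(n_tiles * tile_height * pitch)
for i, tile in enumerate(tiles):
    decode(tile, pool[i * tile_height * pitch:(i + 1) * tile_height * pitch])
"""

before_hits = calls_inside_for_loops(BEFORE, "cupy.empty")
after_hits = calls_inside_for_loops(AFTER, "cupy.empty")
```

Because the check only parses source text, it runs on hosts with neither cupy nor libnvjpeg2k.so installed.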

Existing tests:

test_jpeg2000.py + test_compression.py (30 tests) -- pass

Dask graph probe (4096x4096 deflate-tiled, chunks=512): 256 tasks for 64 chunks = 4 tasks/chunk, well under the 50K cap. OOM verdict stays SAFE/IO-bound; this change has no effect on graph cost.

State CSV

.claude/sweep-performance-state.csv -- Pass 12 note added to the geotiff row.

The nvJPEG2000 batch decode helper allocated `samples` fresh `cupy.empty(tile_height * pitch)` buffers inside its per-tile loop and called `cupy.cuda.Device().synchronize()` once per tile. The per-tile sync forced default-stream serialisation that defeated nvJPEG2000's internal pipelining, and the per-tile allocations each round-tripped through cupy's memory pool.

Pre-allocate a single `d_comp_pool` sized `n_tiles * samples * tile_height * pitch` (guarded by `_check_gpu_memory`) and derive per-tile / per-component output buffers as slab views into the pool. Replace the per-tile sync with a single batch-end sync so successive tiles can pipeline through nvJPEG2000.

Mirrors the prior fixes for sibling helpers in the same module: `_try_nvcomp_from_device_bufs` (#1659), `_try_kvikio_read_tiles` (#1688), and `_nvcomp_batch_compress` (#1712).

Test coverage in `test_nvjpeg2k_single_alloc_2107.py`:

- AST-level structural assertions: no `cupy.empty` and no `Device().synchronize()` inside the per-tile for-loop; pool buffer + slab-math identifiers present; `_check_gpu_memory` guard in place.
- Lib-absent short-circuit returns `None` without touching cupy.
- Unsupported-dtype branch cleans up handles in the expected order.
- Cupy-only test confirms per-tile slabs cover the pool exactly with no overlap (gated on `gpu` mark; libnvjpeg2k.so is not part of the default cuda-toolkit install so end-to-end coverage stays on RAPIDS-enabled hosts).

State CSV updated: Pass 12 note added to the geotiff row, verdict stays SAFE/IO-bound; issue set to 2107.

``_try_nvjpeg2k_batch_decode`` imported cupy at the top of the function, before the dtype-guard branch that short-circuits on unsupported dtypes. CI hosts without cupy hit ``ModuleNotFoundError`` on ``test_returns_none_for_unsupported_dtype``, which monkeypatches ``_get_nvjpeg2k`` to a fake C library and never reaches a GPU allocation. Move the cupy import to just before the first ``cupy.empty`` call so the early-return branches (lib missing or unsupported dtype) run on a CPU-only host. The dtype guard already destroys the four nvjpeg2k handles before returning ``None``; deferring the cupy import does not change that cleanup ordering.
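
The import move in this commit follows a common deferred-import pattern, sketched below under stated assumptions. try_batch_decode, SUPPORTED_DTYPES, FakeLib, and destroy_handles are all hypothetical names for illustration. The real helper's signature, dtype set, and handle cleanup are not shown in this PR text.

```python
SUPPORTED_DTYPES = {"uint8", "uint16"}  # hypothetical set, for illustration


class FakeLib:
    """Stand-in for a monkeypatched C library handle (hypothetical)."""

    def __init__(self):
        self.destroyed = False

    def destroy_handles(self):
        self.destroyed = True


def try_batch_decode(tiles, dtype, lib):
    # Early return 1: library missing. No GPU module has been imported yet.
    if lib is None:
        return None
    # Early return 2: unsupported dtype. Clean up, then bail out, still
    # without importing cupy.
    if dtype not in SUPPORTED_DTYPES:
        lib.destroy_handles()
        return None
    # Deferred import: only reached when a GPU allocation is actually needed,
    # so CPU-only hosts never raise ModuleNotFoundError on the paths above.
    import cupy

    # Placeholder allocation; the real pool size comes from the tile geometry.
    return cupy.empty(len(tiles) * 1024, dtype=cupy.uint8)


fake = FakeLib()
unsupported_result = try_batch_decode([b""], "float64", fake)
missing_lib_result = try_batch_decode([], "uint8", None)
```

Both early-return calls complete whether or not cupy is installed, which is the property the CI fix relies on.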

github-actions bot added the performance label (PR touches performance-sensitive code) on May 19, 2026

brendancol merged commit 0c4632d into main May 19, 2026
5 checks passed

brendancol merged 2 commits into main from deep-sweep-performance-geotiff-2026-05-18-r2
