perf(geotiff): batch _try_nvjpeg2k_batch_decode allocations + sync (#2107) #2110

Merged
brendancol merged 2 commits into main from deep-sweep-performance-geotiff-2026-05-18-r2 on May 19, 2026

@brendancol
Contributor

Summary

Fixes #2107. The nvJPEG2000 batch decode helper in _gpu_decode.py ran a per-tile loop that allocated one fresh cupy.empty(tile_height * pitch) buffer per component (samples buffers per tile) and called cupy.cuda.Device().synchronize() once per tile. The per-tile sync forced default-stream serialisation that defeated nvJPEG2000's internal pipelining, and each per-tile allocation round-tripped through cupy's memory pool.

Pre-allocate a single d_comp_pool sized n_tiles * samples * tile_height * pitch (guarded by _check_gpu_memory) and derive per-tile / per-component output buffers as slab views into the pool. Replace the per-tile sync with a single batch-end sync.
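The slab layout described above can be sketched roughly as follows. This is an illustration only, not the code in _gpu_decode.py: it uses a stdlib bytearray and memoryview as a CPU stand-in for the cupy pool so that it runs without a GPU, the function name carve_slabs is hypothetical, and the tile-major / component-minor ordering and the exact meaning of per_tile_comp_bytes are assumptions.

```python
def carve_slabs(n_tiles, samples, tile_height, pitch):
    """Allocate one pool and return per-tile, per-component views into it.

    Hypothetical helper: a bytearray stands in for the single
    cupy.empty(...) pool (note that bytearray zero-fills, cupy.empty
    does not). The slab ordering is an assumption.
    """
    per_comp_bytes = tile_height * pitch
    per_tile_comp_bytes = samples * per_comp_bytes  # assumed definition
    pool = bytearray(n_tiles * per_tile_comp_bytes)  # one allocation
    view = memoryview(pool)

    slabs = []
    for t in range(n_tiles):
        base = t * per_tile_comp_bytes
        # Slicing a memoryview is zero-copy, like slicing a cupy array.
        slabs.append([
            view[base + c * per_comp_bytes: base + (c + 1) * per_comp_bytes]
            for c in range(samples)
        ])
    return pool, slabs


pool, slabs = carve_slabs(n_tiles=4, samples=3, tile_height=8, pitch=16)
```

With cupy the same slicing would yield views into device memory, so the loop body performs no allocation. In this layout the single device synchronize would sit after the loop rather than inside it.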

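Same shape as the prior fixes for sibling helpers in this module: _try_nvcomp_from_device_bufs (#1659), _try_kvikio_read_tiles (#1688), and _nvcomp_batch_compress (#1712).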

Test plan

xrspatial/geotiff/tests/test_nvjpeg2k_single_alloc_2107.py (7 tests):

  • AST-level: no cupy.empty inside the per-tile loop
  • AST-level: no Device().synchronize() inside the per-tile loop
  • Source contains d_comp_pool and per_tile_comp_bytes identifiers
  • _check_gpu_memory guard present before the pool allocation
  • Lib-absent short-circuit returns None without touching cupy
  • Unsupported-dtype branch cleans up handles in expected order
  • Cupy-only slab-non-overlap check (gated on gpu mark; libnvjpeg2k.so is not part of cuda-toolkit's default install so end-to-end coverage stays on RAPIDS-enabled hosts)
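As an illustration of what an AST-level assertion of this kind might look like, here is a minimal sketch using only the stdlib ast module. The helper name calls_inside_for_loops and the two source snippets are hypothetical and are not taken from test_nvjpeg2k_single_alloc_2107.py; ast.unparse requires Python 3.9 or later.

```python
import ast


def calls_inside_for_loops(source, dotted_name):
    """Return sorted line numbers of calls to dotted_name nested in a for-loop."""
    hits = set()
    for loop in ast.walk(ast.parse(source)):
        if not isinstance(loop, ast.For):
            continue
        for node in ast.walk(loop):
            # ast.unparse turns the callee expression back into text,
            # e.g. "cupy.empty" or "cupy.cuda.Device().synchronize".
            if isinstance(node, ast.Call) and ast.unparse(node.func) == dotted_name:
                hits.add(node.lineno)
    return sorted(hits)


# Hypothetical "before" shape: allocation inside the per-tile loop.
BEFORE = """
def decode(tiles, n):
    for t in tiles:
        buf = cupy.empty(n)
"""

# Hypothetical "after" shape: one pool, views derived inside the loop.
AFTER = """
def decode(tiles, n):
    pool = cupy.empty(len(tiles) * n)
    for i, t in enumerate(tiles):
        buf = pool[i * n:(i + 1) * n]
"""
```

A structural test can then assert that the list is empty for the refactored source. Because it only parses text, it runs on hosts with neither cupy nor a GPU.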

Existing tests:

  • test_jpeg2000.py + test_compression.py (30 tests) -- pass

Dask graph probe (4096x4096 deflate-tiled, chunks=512): 256 tasks for 64 chunks = 4 tasks/chunk, well under the 50K cap. OOM verdict stays SAFE/IO-bound; this change has no effect on graph cost.
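The chunk arithmetic behind those figures can be checked with a few lines. The array shape, chunk size, and task count come from the probe above; the helper name chunk_grid is hypothetical.

```python
import math


def chunk_grid(height, width, chunk):
    """Number of square chunks needed to tile a height x width array."""
    return math.ceil(height / chunk) * math.ceil(width / chunk)


n_chunks = chunk_grid(4096, 4096, 512)  # 8 chunks per axis
tasks_per_chunk = 256 / n_chunks        # 256 tasks reported by the probe
```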

State CSV

.claude/sweep-performance-state.csv -- Pass 12 note added to the geotiff row.

The nvJPEG2000 batch decode helper allocated `samples` fresh
`cupy.empty(tile_height * pitch)` buffers inside its per-tile loop and
called `cupy.cuda.Device().synchronize()` once per tile. The per-tile
sync forced default-stream serialisation that defeated nvJPEG2000's
internal pipelining, and the per-tile allocations each round-tripped
through cupy's memory pool.

Pre-allocate a single `d_comp_pool` sized
`n_tiles * samples * tile_height * pitch` (guarded by
`_check_gpu_memory`) and derive per-tile / per-component output buffers
as slab views into the pool. Replace the per-tile sync with a single
batch-end sync so successive tiles can pipeline through nvJPEG2000.

Mirrors the prior fixes for sibling helpers in the same module:
`_try_nvcomp_from_device_bufs` (#1659), `_try_kvikio_read_tiles`
(#1688), and `_nvcomp_batch_compress` (#1712).

Test coverage in `test_nvjpeg2k_single_alloc_2107.py`:

- AST-level structural assertions: no `cupy.empty` and no
  `Device().synchronize()` inside the per-tile for-loop; pool
  buffer + slab-math identifiers present; `_check_gpu_memory` guard
  in place.
- Lib-absent short-circuit returns `None` without touching cupy.
- Unsupported-dtype branch cleans up handles in the expected order.
- Cupy-only test confirms per-tile slabs cover the pool exactly with
  no overlap (gated on `gpu` mark; libnvjpeg2k.so is not part of the
  default cuda-toolkit install so end-to-end coverage stays on
  RAPIDS-enabled hosts).
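
The "cover the pool exactly with no overlap" property in the last bullet amounts to a check on byte ranges. A minimal sketch of that check, in pure Python on offsets rather than cupy arrays; the helper names and the tile-major slab ordering are assumptions, not the contents of the actual test:

```python
def slab_ranges(n_tiles, samples, comp_bytes):
    """Half-open (start, stop) byte ranges, one per tile and component."""
    return [
        ((t * samples + c) * comp_bytes, (t * samples + c + 1) * comp_bytes)
        for t in range(n_tiles)
        for c in range(samples)
    ]


def covers_exactly(ranges, pool_bytes):
    """True if the ranges tile [0, pool_bytes) with no gap and no overlap."""
    cursor = 0
    for start, stop in sorted(ranges):
        if start != cursor:  # a gap (start > cursor) or overlap (start < cursor)
            return False
        cursor = stop
    return cursor == pool_bytes


ranges = slab_ranges(n_tiles=4, samples=3, comp_bytes=128)
```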

State CSV updated: Pass 12 note added to the geotiff row, verdict stays
SAFE/IO-bound; issue set to 2107.

@github-actions (bot) added the performance label (PR touches performance-sensitive code) on May 19, 2026

``_try_nvjpeg2k_batch_decode`` imported cupy at the top of the
function, before the dtype-guard branch that short-circuits on
unsupported dtypes. CI hosts without cupy hit ``ModuleNotFoundError``
on ``test_returns_none_for_unsupported_dtype``, which monkeypatches
``_get_nvjpeg2k`` to a fake C library and never reaches a GPU
allocation. Move the cupy import to just before the first
``cupy.empty`` call so the early-return branches (lib missing or
unsupported dtype) run on a CPU-only host.

The dtype guard already destroys the four nvjpeg2k handles before
returning ``None``; deferring the cupy import does not change that
cleanup ordering.
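
A minimal sketch of the deferred-import pattern described above. The function name, the dtype set, and the signature are hypothetical and much simpler than the real helper, and the handle cleanup that the dtype guard performs is omitted here.

```python
SUPPORTED_DTYPES = {"uint8", "uint16"}  # hypothetical set for illustration


def decode_batch_sketch(lib, dtype_name, n_bytes):
    """Return None on the early-exit branches without importing cupy."""
    if lib is None:
        # Library missing: short-circuit before any GPU dependency.
        return None
    if dtype_name not in SUPPORTED_DTYPES:
        # The real helper destroys its nvjpeg2k handles here (omitted).
        return None

    # Deferred import: only reached when a GPU allocation is about to happen.
    import cupy

    return cupy.empty(n_bytes, dtype=cupy.uint8)
```

On a CPU-only host both early-return branches complete normally; only the final branch would raise ``ModuleNotFoundError``.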

@brendancol merged commit 0c4632d into main on May 19, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

perf(geotiff): batch _try_nvjpeg2k_batch_decode per-tile allocations