perf(geotiff): batch _try_nvjpeg2k_batch_decode allocations + sync (#2107)#2110
Merged
Merged
Conversation
The nvJPEG2000 batch decode helper allocated `samples` fresh `cupy.empty(tile_height * pitch)` buffers inside its per-tile loop and called `cupy.cuda.Device().synchronize()` once per tile. The per-tile sync forced default-stream serialisation that defeated nvJPEG2000's internal pipelining, and the per-tile allocations each round-tripped through cupy's memory pool. Pre-allocate a single `d_comp_pool` sized `n_tiles * samples * tile_height * pitch` (guarded by `_check_gpu_memory`) and derive per-tile / per-component output buffers as slab views into the pool. Replace the per-tile sync with a single batch-end sync so successive tiles can pipeline through nvJPEG2000. Mirrors the prior fixes for sibling helpers in the same module: `_try_nvcomp_from_device_bufs` (#1659), `_try_kvikio_read_tiles` (#1688), and `_nvcomp_batch_compress` (#1712). Test coverage in `test_nvjpeg2k_single_alloc_2107.py`: - AST-level structural assertions: no `cupy.empty` and no `Device().synchronize()` inside the per-tile for-loop; pool buffer + slab-math identifiers present; `_check_gpu_memory` guard in place. - Lib-absent short-circuit returns `None` without touching cupy. - Unsupported-dtype branch cleans up handles in the expected order. - Cupy-only test confirms per-tile slabs cover the pool exactly with no overlap (gated on `gpu` mark; libnvjpeg2k.so is not part of the default cuda-toolkit install so end-to-end coverage stays on RAPIDS-enabled hosts). State CSV updated: Pass 12 note added to the geotiff row, verdict stays SAFE/IO-bound; issue set to 2107.
``_try_nvjpeg2k_batch_decode`` imported cupy at the top of the function, before the dtype-guard branch that short-circuits on unsupported dtypes. CI hosts without cupy hit ``ModuleNotFoundError`` on ``test_returns_none_for_unsupported_dtype``, which monkeypatches ``_get_nvjpeg2k`` to a fake C library and never reaches a GPU allocation. Move the cupy import to just before the first ``cupy.empty`` call so the early-return branches (lib missing or unsupported dtype) run on a CPU-only host. The dtype guard already destroys the four nvjpeg2k handles before returning ``None``; deferring the cupy import does not change that cleanup ordering.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #2107. The nvJPEG2000 batch decode helper in
_gpu_decode.pyran a per-tile loop that allocatedsamplesfreshcupy.empty(tile_height * pitch)buffers and calledcupy.cuda.Device().synchronize()once per tile. The per-tile sync forced default-stream serialisation that defeated nvJPEG2000's internal pipelining; the per-tile allocations each round-tripped through cupy's memory pool.Pre-allocate a single
d_comp_poolsizedn_tiles * samples * tile_height * pitch(guarded by_check_gpu_memory) and derive per-tile / per-component output buffers as slab views into the pool. Replace the per-tile sync with a single batch-end sync.Same shape as the prior fixes for sibling helpers in this module:
_try_nvcomp_from_device_bufs(perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path) #1659) -- single buffer + pointer offsets_try_kvikio_read_tiles(perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads #1688) -- single buffer + batched submit_nvcomp_batch_compress(perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations #1712) -- single pool + concat + single getTest plan
xrspatial/geotiff/tests/test_nvjpeg2k_single_alloc_2107.py(7 tests):cupy.emptyinside the per-tile loopDevice().synchronize()inside the per-tile loopd_comp_poolandper_tile_comp_bytesidentifiers_check_gpu_memoryguard present before the pool allocationNonewithout touching cupygpumark; libnvjpeg2k.so is not part of cuda-toolkit's default install so end-to-end coverage stays on RAPIDS-enabled hosts)Existing tests:
test_jpeg2000.py+test_compression.py(30 tests) -- passDask graph probe (4096x4096 deflate-tiled,
chunks=512): 256 tasks for 64 chunks = 4 tasks/chunk, well under the 50K cap. OOM verdict stays SAFE/IO-bound; this change has no effect on graph cost.State CSV
.claude/sweep-performance-state.csv-- Pass 12 note added to the geotiff row.