Summary
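_try_nvjpeg2k_batch_decode in xrspatial/geotiff/_gpu_decode.py (around L2725-2778) allocates per-component output buffers via cupy.empty(tile_height * pitch) inside its per-tile decode loop, and calls cupy.cuda.Device().synchronize(), a device-wide sync, after every single tile. This mirrors the anti-pattern previously fixed in sibling helpers:
- _try_nvcomp_from_device_bufs (#1659): per-tile alloc + trailing concat -> single contiguous buffer
- _try_kvikio_read_tiles (#1688): per-tile cupy.empty + blocking IOFuture.get -> single buffer + batched submit
- _nvcomp_batch_compress (#1712): per-tile cupy.get -> concat + single get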
For an N-tile JPEG 2000 read with S samples per pixel, the current pattern is:
    for i, tile_data in enumerate(compressed_tiles):
        ...
        comp_bufs = []
        pitch = tile_width * dtype.itemsize
        for c in range(samples):
            buf = cupy.empty(tile_height * pitch, dtype=cupy.uint8)  # N*S allocs
            comp_bufs.append(buf)
        ...
        cupy.cuda.Device().synchronize()  # serialises ALL streams, once per tile
The per-tile sync alone forces serial decode (no opportunity for nvJPEG2000's internal pipelining), even when the encode/decode path is otherwise stream-friendly. Each cupy.empty allocation round-trips through cupy's memory pool, adding tens of microseconds per call.
Anti-Pattern Reference
The other batched-codec helpers in the same module use the canonical fix:
- Allocate one contiguous output buffer of size n_tiles * tile_bytes.
- Compute per-tile/per-component pointer offsets on host (np.cumsum or np.arange * stride).
- Pass per-element device pointers as base_ptr + offsets.
- Synchronize ONCE after the batch.
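The steps above reduce to a small amount of host-side arithmetic. A minimal sketch follows, in plain Python so it runs without a GPU. The names plane_offsets and plane_pointers are hypothetical, and base_ptr stands in for whatever device address the single allocation returns (for a CuPy array, arr.data.ptr).

```python
def plane_offsets(n_tiles, samples, plane_bytes):
    """Byte offset of every (tile, component) plane in one contiguous pool.

    Planes are laid out tile-major: all components of tile 0, then tile 1, ...
    """
    return [
        [(tile * samples + comp) * plane_bytes for comp in range(samples)]
        for tile in range(n_tiles)
    ]


def plane_pointers(base_ptr, offsets):
    """Turn byte offsets into absolute addresses (base_ptr + offset)."""
    return [[base_ptr + off for off in row] for row in offsets]


# Example: 2 tiles, 3 components, planes of 4 rows x 8 bytes of pitch.
offsets = plane_offsets(n_tiles=2, samples=3, plane_bytes=4 * 8)
pointers = plane_pointers(base_ptr=0x1000, offsets=offsets)
```

With a uniform plane size the offsets are a plain multiple of the stride, which is the np.arange * stride case; np.cumsum would only be needed if plane sizes varied.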
Applied to nvJPEG2000, components for tile i should live in slices of the existing d_all_tiles buffer (or one fresh d_comp_pool = cupy.empty(n_tiles * samples * tile_height * pitch)); the loop body just writes pointers into the nvjpeg2kImage_t struct.
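One way to picture the slicing is sketched below, with a bytearray and memoryview standing in for the device pool so the example runs on any host. The assumption is that the CuPy version would replace the bytearray with one cupy.empty(total, dtype=cupy.uint8) and take the same slices, which are views rather than copies. carve_planes is a hypothetical helper, not a function in the module.

```python
def carve_planes(pool, n_tiles, samples, tile_height, pitch):
    """Split one flat byte pool into per-tile lists of per-component views."""
    plane_bytes = tile_height * pitch
    needed = n_tiles * samples * plane_bytes
    if len(pool) < needed:
        raise ValueError(f"pool too small: {len(pool)} < {needed} bytes")
    view = memoryview(pool)
    planes = []
    for tile in range(n_tiles):
        start = tile * samples * plane_bytes
        planes.append([
            view[start + c * plane_bytes:start + (c + 1) * plane_bytes]
            for c in range(samples)
        ])
    return planes


n_tiles, samples, tile_height, pitch = 2, 3, 4, 8
pool = bytearray(n_tiles * samples * tile_height * pitch)  # the single allocation
planes = carve_planes(pool, n_tiles, samples, tile_height, pitch)

# Writing through a view lands in the shared pool: no per-plane buffer exists.
planes[1][0][0] = 0xFF
```

The size check mirrors the idea of guarding the total allocation up front, since one large buffer fails all at once rather than partway through the loop.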
Impact
The function is gated behind _get_nvjpeg2k() returning the loaded shared library, so the impact is bounded to environments with nvJPEG2000 installed (CUDA toolkit + RAPIDS conda env). On those hosts, large J2K-tiled COG reads (hundreds to thousands of tiles) currently pay one device-wide sync per tile plus N*S small allocations. The worst case is multi-thousand-tile satellite imagery reads, where the per-tile sync cost adds up to seconds of unnecessary wall time.
Plan
- Pre-allocate d_comp_pool = cupy.empty(n_tiles * samples * tile_height * pitch, dtype=cupy.uint8) once outside the loop with a _check_gpu_memory guard for the total size.
- Inside the per-tile loop, derive per-component device pointers as views into d_comp_pool.
- Drop the per-tile cupy.cuda.Device().synchronize(); synchronize once after the loop terminates.
- Add a structural test that asserts the function calls cupy.empty at most once for the output pool (matches the test_nvcomp_from_device_bufs_single_alloc_1659.py pattern).
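The structural test in the last step could look roughly like the sketch below. It is an illustration only: the referenced test file is not quoted here, so its actual shape is an assumption. decode_tiles is a hypothetical stand-in for the refactored loop, and alloc and sync are injected callables standing in for cupy.empty and cupy.cuda.Device().synchronize(). A real test would more likely patch cupy.empty in the module under test (for example with unittest.mock.patch.object).

```python
from unittest import mock


def decode_tiles(alloc, sync, n_tiles, samples, tile_height, pitch):
    """Stand-in for the refactored loop: one pool allocation, one sync."""
    plane_bytes = tile_height * pitch
    pool = alloc(n_tiles * samples * plane_bytes)  # single allocation
    offsets = []
    for tile in range(n_tiles):
        for comp in range(samples):
            # Real code would hand (pool address + offset) to the decoder here.
            offsets.append((tile * samples + comp) * plane_bytes)
    sync()  # once, after the whole batch
    return pool, offsets


def test_single_alloc_and_single_sync():
    alloc = mock.Mock(name="alloc")
    sync = mock.Mock(name="sync")
    decode_tiles(alloc, sync, n_tiles=16, samples=3, tile_height=4, pitch=8)
    # The pool is allocated exactly once, sized for every plane of every tile.
    alloc.assert_called_once_with(16 * 3 * 4 * 8)
    # Synchronization no longer scales with the tile count.
    assert sync.call_count == 1


test_single_alloc_and_single_sync()
```

Counting calls rather than timing them keeps the test deterministic and lets it run on hosts without a GPU.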
Files
- xrspatial/geotiff/_gpu_decode.py around L2740-2778
- New test under xrspatial/geotiff/tests/
Related
This is a continuation of the same "single contiguous device buffer" refactor pattern.
Earlier fixes in this series:
- #1659 perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path)
- #1688 perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads
- #1712 perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations