perf(geotiff): batch _try_nvjpeg2k_batch_decode per-tile allocations #2107

@brendancol

Summary

_try_nvjpeg2k_batch_decode in xrspatial/geotiff/_gpu_decode.py (around L2725-2778) allocates per-component output buffers via cupy.empty(tile_height * pitch) inside its per-tile decode loop, and calls cupy.cuda.Device().synchronize() after every single tile. That call is a device-wide sync that waits on all streams, not just the default one. This mirrors the anti-pattern previously fixed in sibling helpers.

For an N-tile JPEG 2000 read with S samples per pixel, the current pattern is:

for i, tile_data in enumerate(compressed_tiles):
    ...
    comp_bufs = []
    pitch = tile_width * dtype.itemsize
    for c in range(samples):
        buf = cupy.empty(tile_height * pitch, dtype=cupy.uint8)  # N*S allocs
        comp_bufs.append(buf)
    ...
    cupy.cuda.Device().synchronize()  # serialises ALL streams, once per tile

The per-tile sync alone forces serial decode, leaving no opportunity for nvJPEG2000's internal pipelining, even when the rest of the decode path is stream-friendly. Each cupy.empty call also round-trips through CuPy's memory pool, adding tens of microseconds per call.
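To make the scaling concrete, the allocation and sync counts can be written out as functions of the tile count. This is only a counting sketch: per_tile_costs and batched_costs are hypothetical helper names, and the 1024-tile, 3-sample input is an illustrative example, not a measurement.

```python
def per_tile_costs(n_tiles: int, samples: int) -> dict:
    """Counts for the current pattern: S allocations and one sync per tile."""
    return {"allocs": n_tiles * samples, "syncs": n_tiles}


def batched_costs(n_tiles: int, samples: int) -> dict:
    """Counts after pooling the output buffers and syncing once per batch."""
    return {"allocs": 1, "syncs": 1}


# Illustrative input only: 1024 tiles, 3 samples per pixel.
current = per_tile_costs(1024, 3)
proposed = batched_costs(1024, 3)
```

The current pattern grows linearly in both counts, while the batched form stays constant regardless of tile count.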

Anti-Pattern Reference

The other batched-codec helpers in the same module use the canonical fix:

  • Allocate one contiguous output buffer of size n_tiles * tile_bytes.
  • Compute per-tile/per-component pointer offsets on host (np.cumsum or np.arange * stride).
  • Pass per-element device pointers as base_ptr + offsets.
  • Synchronize ONCE after the batch.
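The host-side pointer arithmetic in the steps above can be sketched with NumPy alone. This assumes uniform tile sizes (the np.arange * stride case); element_pointers is a hypothetical helper name, and the base address is a made-up integer standing in for what would be the device address of the pooled buffer (for a CuPy array, its data.ptr attribute).

```python
import numpy as np


def element_pointers(base_ptr: int, n_tiles: int, tile_bytes: int) -> np.ndarray:
    """Return one device address per tile inside a single contiguous buffer."""
    # Byte offset of each tile from the start of the buffer, computed on host.
    offsets = np.arange(n_tiles, dtype=np.uint64) * np.uint64(tile_bytes)
    # base_ptr + offsets gives the per-element pointers passed to the codec.
    return np.uint64(base_ptr) + offsets


# Made-up base address, for illustration only.
base = 0x7F0000000000
ptrs = element_pointers(base, n_tiles=4, tile_bytes=65536)
```

For variable-size elements the offsets would instead come from a cumulative sum of the per-element sizes, as the list notes.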

Applied to nvJPEG2000, components for tile i should live in slices of the existing d_all_tiles buffer (or one fresh d_comp_pool = cupy.empty(n_tiles * samples * tile_height * pitch)); the loop body just writes pointers into the nvjpeg2kImage_t struct.
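A minimal sketch of that layout, using NumPy as a CPU stand-in for CuPy so it runs without a GPU. component_views is a hypothetical helper name, and the tile dimensions and uint16 sample type are illustrative assumptions. Component c of tile i starts at byte offset (i * samples + c) * tile_height * pitch.

```python
import numpy as np  # CPU stand-in for cupy in this sketch


def component_views(pool, n_tiles, samples, tile_height, pitch):
    """View a flat byte pool as [tile, component, bytes] without copying."""
    comp_bytes = tile_height * pitch
    if pool.size != n_tiles * samples * comp_bytes:
        raise ValueError("pool size does not match the tile layout")
    # Reshaping a contiguous 1-D array returns a view, so no new allocation.
    return pool.reshape(n_tiles, samples, comp_bytes)


# Illustrative dimensions only.
n_tiles, samples, tile_height, tile_width = 4, 3, 8, 8
pitch = tile_width * np.dtype(np.uint16).itemsize
pool = np.empty(n_tiles * samples * tile_height * pitch, dtype=np.uint8)

views = component_views(pool, n_tiles, samples, tile_height, pitch)
buf = views[2, 1]  # component 1 of tile 2
byte_offset = buf.ctypes.data - pool.ctypes.data
```

In the real function the per-component address written into the image struct would be the device address of each view rather than a host address.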

Impact

The function is gated behind _get_nvjpeg2k() returning the loaded shared library, so the impact is bounded to environments with nvJPEG2000 installed (CUDA toolkit + RAPIDS conda env). On those hosts, large J2K-tiled COG reads (hundreds to thousands of tiles) currently pay one device-wide sync per tile plus N*S small allocations. The worst case is multi-thousand-tile satellite imagery reads, where the per-tile sync cost adds up to seconds of unnecessary wall time.

Plan

  1. Pre-allocate d_comp_pool = cupy.empty(n_tiles * samples * tile_height * pitch, dtype=cupy.uint8) once outside the loop with a _check_gpu_memory guard for the total size.
  2. Inside the per-tile loop, derive per-component device pointers as views into d_comp_pool.
  3. Drop the per-tile cupy.cuda.Device().synchronize(); synchronize once after the loop terminates.
  4. Add a structural test that asserts the function calls cupy.empty at most once for the output pool (matches the test_nvcomp_from_device_bufs_single_alloc_1659.py pattern).
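One way the structural test in step 4 could look is sketched below. It does not reproduce the referenced test file. decode_batch is a hypothetical stand-in for the refactored function, and fake_xp is a stand-in namespace for the cupy module so the sketch runs without a GPU. A real test would patch cupy.empty in the module under test.

```python
from types import SimpleNamespace
from unittest import mock

import numpy as np

# Stand-in for the cupy module (assumption: only empty and uint8 are needed).
fake_xp = SimpleNamespace(empty=np.empty, uint8=np.uint8)


def decode_batch(xp, n_tiles, samples, tile_height, pitch):
    """Hypothetical refactored shape: one pool allocation, slices per component."""
    comp_bytes = tile_height * pitch
    pool = xp.empty(n_tiles * samples * comp_bytes, dtype=xp.uint8)
    # Per-component buffers are views into the pool, not fresh allocations.
    comps = [pool[k * comp_bytes:(k + 1) * comp_bytes]
             for k in range(n_tiles * samples)]
    return pool, comps


def count_empty_calls(n_tiles, samples):
    """Run the decode with a spy on xp.empty and report how often it was hit."""
    with mock.patch.object(fake_xp, "empty", wraps=np.empty) as spy:
        decode_batch(fake_xp, n_tiles, samples, tile_height=4, pitch=8)
    return spy.call_count


calls = count_empty_calls(n_tiles=16, samples=3)
```

Asserting on the call count rather than on timing keeps the test deterministic and guards against the per-tile allocation being reintroduced.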

Files

  • xrspatial/geotiff/_gpu_decode.py around L2740-2778
  • New test under xrspatial/geotiff/tests/

Related

This is a continuation of the same "single contiguous device buffer" refactor pattern.

Labels: enhancement, gpu, performance