Summary
_try_nvcomp_from_device_bufs in xrspatial/geotiff/_gpu_decode.py allocates N separate CuPy buffers (one per tile) for nvCOMP decompression output, then cupy.concatenates them after the decompress kernel returns. The other nvCOMP paths in this module (LZW at L1847, deflate at L1878, host-buffer at L1114) already allocate a single contiguous cupy.empty(n_tiles * tile_bytes) up front and derive per-tile device pointers as base_ptr + i * tile_bytes. The per-tile + concat pattern keeps two copies of the decompressed data on device simultaneously, and the trailing concatenate is pure overhead.
Where it triggers
This path runs when:
- kvikio is importable (GPUDirect Storage available)
- the nvCOMP shared library loads via ctypes
- the COG is ZSTD-compressed (compression tag 50000)
That is the fastest read path the GPU pipeline offers: NVMe to GPU via DMA, nvCOMP decompresses on device, no host involvement.
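The three conditions above can be sketched as a single predicate. This is an illustrative helper, not code from the module: the name `gds_zstd_path_eligible` and the `nvcomp_loaded` flag are assumptions, and the real function may check these in a different order or place.

```python
import importlib.util

ZSTD_COMPRESSION_TAG = 50000  # TIFF compression tag for ZSTD, as listed above


def gds_zstd_path_eligible(compression_tag: int, nvcomp_loaded: bool) -> bool:
    """Hypothetical helper: True only when all three trigger conditions hold."""
    # find_spec returns None when the top-level package is not installed.
    kvikio_importable = importlib.util.find_spec("kvikio") is not None
    return (
        kvikio_importable
        and nvcomp_loaded
        and compression_tag == ZSTD_COMPRESSION_TAG
    )
```

If any one condition fails, the reader takes a different path and the per-tile allocation discussed here is never reached.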
Measurement (overhead only)
The numbers below isolate the cost of the alloc + concat steps themselves. They do NOT include nvCOMP decompression latency, kvikio I/O, or any other work the function does -- they cover only what would be deleted by the refactor. CUDA 12, RTX-class GPU:
n= 64 tile_bytes= 65536: per-tile alloc+concat= 20.84 ms; single buf+ptrs= 0.56 ms (-20.3 ms)
n= 256 tile_bytes= 65536: per-tile alloc+concat= 3.66 ms; single buf+ptrs= 0.69 ms ( -3.0 ms)
n=1024 tile_bytes= 65536: per-tile alloc+concat= 11.91 ms; single buf+ptrs= 2.39 ms ( -9.5 ms)
n= 256 tile_bytes=262144: per-tile alloc+concat= 8.18 ms; single buf+ptrs= 0.13 ms ( -8.1 ms)
The proportion of end-to-end read_geotiff_gpu time that this overhead represents depends on how fast the actual nvCOMP+GDS work is on the target host. The local environment lacks kvikio so an end-to-end measurement of this exact path is not possible here; the table above is the isolated overhead.
Peak GPU memory for the decompressed output also drops by half: d_decomp_bufs (N x tile_bytes) and the result of cupy.concatenate(d_decomp_bufs) are both live for the duration of the concatenate, whereas a single contiguous buffer keeps only one copy on device.
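As a rough worked example of the memory claim, the arithmetic can be written out as below. The function name is made up for illustration, and the figures count only the decompressed output; they ignore the compressed input buffers, pointer arrays, and any rounding done by the CuPy memory pool.

```python
def peak_output_bytes(n_tiles: int, tile_bytes: int) -> dict:
    """Bytes of decompressed output live at peak under each strategy."""
    total = n_tiles * tile_bytes
    return {
        # N per-tile buffers plus the concatenated copy are live together.
        "per_tile_plus_concat": 2 * total,
        # One flat buffer is the only copy.
        "single_buffer": total,
    }


usage = peak_output_bytes(1024, 65536)
print(usage["per_tile_plus_concat"] // 2**20, "MiB vs",
      usage["single_buffer"] // 2**20, "MiB")  # prints: 128 MiB vs 64 MiB
```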
Proposed fix
Replace lines 1577 / 1580 / 1624 with the pattern the rest of the module already uses:
d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8)
base = int(d_decomp.data.ptr)
d_decomp_ptrs = cupy.asarray(
    base + np.arange(n, dtype=np.uint64) * np.uint64(tile_bytes))
# pass d_decomp_ptrs.data.ptr into nvcompBatchedZstdDecompressAsync
return d_decomp
The return contract (single flat cupy.uint8 buffer of length n * tile_bytes) does not change, so _apply_predictor_and_assemble keeps working unchanged.
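For anyone who wants to check the pointer arithmetic without a GPU, here is a host-side analogue using only ctypes. It is a sketch under assumptions: a ctypes byte array stands in for the cupy.empty buffer, `ctypes.memset` stands in for the batched decompress call writing through each pointer, and both function names are invented for this example.

```python
import ctypes


def make_tile_pointers(base_ptr: int, n_tiles: int, tile_bytes: int) -> list:
    """Start address of each tile inside one contiguous buffer."""
    return [base_ptr + i * tile_bytes for i in range(n_tiles)]


def fill_tiles_via_pointers(n_tiles: int, tile_bytes: int) -> bytes:
    # One contiguous host allocation stands in for cupy.empty(n * tile_bytes).
    flat = (ctypes.c_uint8 * (n_tiles * tile_bytes))()
    base = ctypes.addressof(flat)
    ptrs = make_tile_pointers(base, n_tiles, tile_bytes)
    # Stand-in for the decompress kernel: write tile i through ptrs[i].
    for i, ptr in enumerate(ptrs):
        ctypes.memset(ptr, i % 256, tile_bytes)
    # The flat buffer already holds every tile in order; no concatenate needed.
    return bytes(flat)
```

Because every write lands directly in its slot of the flat buffer, the result is already in the layout the consumer expects once the writes finish.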
Affected backends
GPU only. CPU and dask paths do not exercise this code.
Tests
A direct unit test for _try_nvcomp_from_device_bufs needs kvikio plus a GDS-capable filesystem, so the regression target is the end-to-end read_geotiff_gpu call on a ZSTD COG. The existing GDS-fallback tests cover the path where this function is skipped; a new test can exercise the single-buffer code path on the bytes-based fallback, which goes through the same _apply_predictor_and_assemble consumer.
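The property a regression test needs to pin down is that the single-buffer layout is byte-identical to what the per-tile + concatenate layout produced. A minimal host-only sketch of that check follows, with plain bytes standing in for device buffers. The helper names are hypothetical, and a real test would compare read_geotiff_gpu output against a CPU reference rather than call these.

```python
def assemble_per_tile(tiles):
    """Old layout: one buffer per tile, joined afterwards (extra copy)."""
    return b"".join(tiles)


def assemble_single_buffer(tiles, tile_bytes):
    """New layout: each tile is written into its slot of one flat buffer."""
    flat = bytearray(len(tiles) * tile_bytes)
    view = memoryview(flat)
    for i, tile in enumerate(tiles):
        view[i * tile_bytes:(i + 1) * tile_bytes] = tile
    return bytes(flat)


def test_layouts_match():
    tile_bytes = 16
    tiles = [bytes([i]) * tile_bytes for i in range(5)]
    assert assemble_single_buffer(tiles, tile_bytes) == assemble_per_tile(tiles)
```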