perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path) #1659

@brendancol

Summary

_try_nvcomp_from_device_bufs in xrspatial/geotiff/_gpu_decode.py allocates N separate CuPy buffers (one per tile) for the nvCOMP decompression output, then calls cupy.concatenate on them after the decompress kernel returns. The other nvCOMP paths in this module (LZW at L1847, deflate at L1878, host-buffer at L1114) already allocate a single contiguous cupy.empty(n_tiles * tile_bytes) up front and derive per-tile device pointers as base_ptr + i * tile_bytes. While the concatenate runs, the per-tile + concat pattern holds two copies of the decompressed data on device, and the trailing concatenate itself is pure overhead.
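
The contrast between the two patterns can be sketched on the host with NumPy standing in for CuPy. This is an illustration only: the function names are hypothetical, and a plain slice assignment stands in for the nvCOMP decompress call.

```python
import numpy as np

def decode_per_tile_then_concat(tiles, tile_bytes):
    """Current pattern: one output buffer per tile, then a trailing concatenate."""
    outs = [np.empty(tile_bytes, dtype=np.uint8) for _ in tiles]
    for out, tile in zip(outs, tiles):
        out[:] = tile  # stand-in for the decompressor writing one tile
    return np.concatenate(outs)  # allocates and fills a second full-size copy

def decode_into_single_buffer(tiles, tile_bytes):
    """Proposed pattern: one flat buffer, tile i written at offset i * tile_bytes."""
    flat = np.empty(len(tiles) * tile_bytes, dtype=np.uint8)
    for i, tile in enumerate(tiles):
        flat[i * tile_bytes:(i + 1) * tile_bytes] = tile  # written in place
    return flat

rng = np.random.default_rng(0)
n, tile_bytes = 4, 16
tiles = [rng.integers(0, 256, tile_bytes, dtype=np.uint8) for _ in range(n)]
a = decode_per_tile_then_concat(tiles, tile_bytes)
b = decode_into_single_buffer(tiles, tile_bytes)
```

Both functions return the same flat byte layout; only the first one needs the extra allocation and copy at the end.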

Where it triggers

This path runs when:

  • kvikio is importable (GPUDirect Storage available)
  • the nvCOMP shared library loads via ctypes
  • the COG is ZSTD-compressed (compression tag 50000)

That is the fastest read path the GPU pipeline offers: NVMe to GPU via DMA, nvCOMP decompresses on device, no host involvement.
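
As a rough sketch, the three conditions above amount to a gate like the following. The helper name and the default library name "libnvcomp.so" are assumptions made for illustration; the actual checks in _gpu_decode.py may be structured differently.

```python
import ctypes
import importlib.util

ZSTD_COMPRESSION_TAG = 50000  # TIFF compression tag value named in this issue

def gds_nvcomp_zstd_eligible(compression_tag, nvcomp_lib="libnvcomp.so"):
    """Hypothetical helper: True only when all three conditions hold."""
    if compression_tag != ZSTD_COMPRESSION_TAG:
        return False  # not a ZSTD-compressed COG
    if importlib.util.find_spec("kvikio") is None:
        return False  # kvikio is not importable
    try:
        ctypes.CDLL(nvcomp_lib)  # library name is an assumption
    except OSError:
        return False  # nvCOMP shared library did not load
    return True
```

If any check fails, the reader has to fall back to one of the other decode paths, so this fast path is skipped entirely on hosts without kvikio.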

Measurement (overhead only)

The numbers below isolate the cost of the alloc + concat steps themselves. They do NOT include nvCOMP decompression latency, kvikio I/O, or any other work the function does -- they cover only what would be deleted by the refactor. CUDA 12, RTX-class GPU:

n=  64 tile_bytes= 65536: per-tile alloc+concat= 20.84 ms; single buf+ptrs=  0.56 ms  (-20.3 ms)
n= 256 tile_bytes= 65536: per-tile alloc+concat=  3.66 ms; single buf+ptrs=  0.69 ms  ( -3.0 ms)
n=1024 tile_bytes= 65536: per-tile alloc+concat= 11.91 ms; single buf+ptrs=  2.39 ms  ( -9.5 ms)
n= 256 tile_bytes=262144: per-tile alloc+concat=  8.18 ms; single buf+ptrs=  0.13 ms  ( -8.1 ms)

The share of end-to-end read_geotiff_gpu time that this overhead represents depends on how fast the actual nvCOMP + GDS work is on the target host. The local environment lacks kvikio, so an end-to-end measurement of this exact path is not possible here; the table above is the isolated overhead only.

Peak GPU memory for the decompressed output also drops by half: d_decomp_bufs (N x tile_bytes) and the result of cupy.concatenate(d_decomp_bufs) are both live while the concatenate runs, whereas a single contiguous buffer keeps only one copy on device.
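
The halving follows from simple arithmetic, sketched below for the sizes in the table. The function is hypothetical and counts only the decompressed output; it ignores allocator rounding, memory pooling, and the small pointer table.

```python
def peak_output_bytes(n_tiles, tile_bytes, single_buffer):
    """Bytes of decompressed output live at the peak, for either pattern."""
    payload = n_tiles * tile_bytes
    # per-tile + concat: the N tile buffers and the concatenated copy coexist
    return payload if single_buffer else 2 * payload

MIB = 1024 * 1024
sizes = [(64, 65536), (256, 65536), (1024, 65536), (256, 262144)]
report = [
    (n, tb,
     peak_output_bytes(n, tb, single_buffer=False) // MIB,
     peak_output_bytes(n, tb, single_buffer=True) // MIB)
    for n, tb in sizes
]
for n, tb, concat_mib, single_mib in report:
    print(f"n={n:5d} tile_bytes={tb:7d}: concat={concat_mib} MiB, single={single_mib} MiB")
```

For the two largest cases (1024 x 65536 and 256 x 262144) that is 128 MiB against 64 MiB of output memory.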

Proposed fix

Replace lines 1577 / 1580 / 1624 with the pattern the rest of the module already uses:

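# One contiguous output buffer for all n tiles (needs cupy and numpy as np).
d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8)
base = int(d_decomp.data.ptr)
# Device address of tile i is base + i * tile_bytes; build the table on host, then upload.
d_decomp_ptrs = cupy.asarray(
    base + np.arange(n, dtype=np.uint64) * np.uint64(tile_bytes))
# pass d_decomp_ptrs.data.ptr into nvcompBatchedZstdDecompressAsync
return d_decomp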

The return contract (single flat cupy.uint8 buffer of length n * tile_bytes) does not change, so _apply_predictor_and_assemble keeps working unchanged.
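
The pointer arithmetic can be checked on the host without a GPU. In the sketch below, a NumPy buffer and ctypes.memmove stand in for the device buffer and the nvCOMP kernel. Treating flat.ctypes.data as the equivalent of d_decomp.data.ptr is an assumption made purely for illustration.

```python
import ctypes
import numpy as np

n, tile_bytes = 4, 8
flat = np.zeros(n * tile_bytes, dtype=np.uint8)

# Host address of the buffer; d_decomp.data.ptr plays this role on device.
base = flat.ctypes.data
ptrs = np.uint64(base) + np.arange(n, dtype=np.uint64) * np.uint64(tile_bytes)

# Simulate a batched decompressor that only sees raw output addresses:
# tile i is filled with the byte value i + 1.
for i, p in enumerate(ptrs):
    src = bytes([i + 1]) * tile_bytes
    ctypes.memmove(int(p), src, tile_bytes)

tiles_view = flat.reshape(n, tile_bytes)  # one row per tile, no copy
```

Each write lands in its own tile_bytes-sized slot of the flat buffer, so the caller receives the same layout that cupy.concatenate would have produced.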

Affected backends

GPU only. CPU and dask paths do not exercise this code.

Tests

A direct unit test for _try_nvcomp_from_device_bufs needs kvikio plus a GDS-capable filesystem, so the regression target is the end-to-end read_geotiff_gpu call on a ZSTD COG. The existing GDS-fallback tests cover the path where this function is skipped; a new test can exercise the single-buffer code path on the bytes-based fallback, which goes through the same _apply_predictor_and_assemble consumer.


Labels: enhancement, gpu, performance
