perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path)

## Summary

`_try_nvcomp_from_device_bufs` in `xrspatial/geotiff/_gpu_decode.py` allocates N separate CuPy buffers (one per tile) for nvCOMP decompression output, then `cupy.concatenate`s them after the decompress kernel returns. The other nvCOMP paths in this module (LZW at L1847, deflate at L1878, host-buffer at L1114) already allocate a single contiguous `cupy.empty(n_tiles * tile_bytes)` up front and derive per-tile device pointers as `base_ptr + i * tile_bytes`. The per-tile + concat pattern keeps two copies of the decompressed data on device simultaneously, and the trailing concatenate is pure overhead.

## Where it triggers

This path runs when:
- `kvikio` is importable (GPUDirect Storage available)
- the `nvCOMP` shared library loads via ctypes
- the COG is ZSTD-compressed (compression tag 50000)

That is the fastest read path the GPU pipeline offers: NVMe to GPU via DMA, nvCOMP decompresses on device, no host involvement.
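The three conditions above amount to a simple gate. A minimal sketch, assuming hypothetical helper names (`kvikio_importable`, `takes_gds_nvcomp_zstd_path`) that are not taken from `_gpu_decode.py`; only the tag value 50000 and the three conditions come from this issue:

```python
import importlib.util

ZSTD_COMPRESSION_TAG = 50000  # compression tag for ZSTD, as stated in this issue


def kvikio_importable() -> bool:
    """Return True if the kvikio package can be located, without importing it."""
    return importlib.util.find_spec("kvikio") is not None


def takes_gds_nvcomp_zstd_path(compression_tag: int,
                               have_kvikio: bool,
                               have_nvcomp: bool) -> bool:
    """Hypothetical gate: all three conditions must hold for the fast path."""
    return (have_kvikio
            and have_nvcomp
            and compression_tag == ZSTD_COMPRESSION_TAG)
```

How the module actually detects the nvCOMP library is not shown here, so `have_nvcomp` is left as a caller-supplied flag.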

## Measurement (overhead only)

The numbers below isolate the cost of the alloc + concat steps themselves. They do NOT include nvCOMP decompression latency, kvikio I/O, or any other work the function does; they cover only what the refactor would delete. Measured on CUDA 12 with an RTX-class GPU:

```
n=  64 tile_bytes= 65536: per-tile alloc+concat= 20.84 ms; single buf+ptrs=  0.56 ms  (-20.3 ms)
n= 256 tile_bytes= 65536: per-tile alloc+concat=  3.66 ms; single buf+ptrs=  0.69 ms  ( -3.0 ms)
n=1024 tile_bytes= 65536: per-tile alloc+concat= 11.91 ms; single buf+ptrs=  2.39 ms  ( -9.5 ms)
n= 256 tile_bytes=262144: per-tile alloc+concat=  8.18 ms; single buf+ptrs=  0.13 ms  ( -8.1 ms)
```

The share of end-to-end `read_geotiff_gpu` time that this overhead represents depends on how fast the actual nvCOMP+GDS work is on the target host. The local environment lacks kvikio, so an end-to-end measurement of this exact path is not possible here; the table above shows only the isolated overhead.

Peak GPU memory held for the decompressed output also drops by roughly half. With the current code, `d_decomp_bufs` (N x tile_bytes) and the result of `cupy.concatenate(d_decomp_bufs)` are both live while the concatenate runs, whereas a single contiguous buffer keeps only one copy on device.
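The memory claim is plain arithmetic and can be checked for the sizes in the table. A small sketch, assuming a hypothetical helper `peak_output_bytes` that counts only the decompressed output (it ignores the compressed input, nvCOMP scratch space, and CuPy's memory pool behaviour):

```python
def peak_output_bytes(n_tiles: int, tile_bytes: int, per_tile_concat: bool) -> int:
    """Peak bytes held for decompressed output only.

    per_tile_concat=True models N per-tile buffers plus the concatenated copy;
    per_tile_concat=False models one contiguous buffer.
    """
    one_copy = n_tiles * tile_bytes
    return 2 * one_copy if per_tile_concat else one_copy


MIB = 1024 * 1024
for n, tb in [(64, 65536), (256, 65536), (1024, 65536), (256, 262144)]:
    before = peak_output_bytes(n, tb, per_tile_concat=True) // MIB
    after = peak_output_bytes(n, tb, per_tile_concat=False) // MIB
    print(f"n={n:4d} tile_bytes={tb:6d}: {before:4d} MiB -> {after:4d} MiB")
```

For the two largest cases (64 MiB of decompressed output each), this works out to 128 MiB versus 64 MiB.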

## Proposed fix

Replace lines 1577 / 1580 / 1624 with the pattern the rest of the module already uses:

```python
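# Fragment from inside the function: n, tile_bytes, cupy and np (numpy) are in scope.
# One contiguous output buffer for all tiles instead of n separate ones.
d_decomp = cupy.empty(n * tile_bytes, dtype=cupy.uint8)
base = int(d_decomp.data.ptr)
# Per-tile output pointers: base + i * tile_bytes, kept as uint64 throughout.
d_decomp_ptrs = cupy.asarray(
    np.uint64(base) + np.arange(n, dtype=np.uint64) * np.uint64(tile_bytes))
# pass d_decomp_ptrs.data.ptr into nvcompBatchedZstdDecompressAsync
return d_decomp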
```

The return contract (single flat `cupy.uint8` buffer of length `n * tile_bytes`) does not change, so `_apply_predictor_and_assemble` keeps working unchanged.
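The pointer-table idea can be illustrated without a GPU. The sketch below is a host-memory stand-in using `ctypes`, not the actual device code: `ctypes.memset` plays the role of the batched decompressor writing through each per-tile pointer, and the sizes are tiny illustrative values.

```python
import ctypes

n, tile_bytes = 4, 16  # tiny illustrative sizes, not real tile dimensions

# One contiguous buffer for all tiles (host stand-in for cupy.empty).
flat = (ctypes.c_uint8 * (n * tile_bytes))()
base = ctypes.addressof(flat)

# Pointer table: tile i starts at base + i * tile_bytes.
tile_ptrs = [base + i * tile_bytes for i in range(n)]

# Stand-in for the batched decompressor: fill tile i with the byte value i + 1.
for i, ptr in enumerate(tile_ptrs):
    ctypes.memset(ptr, i + 1, tile_bytes)

# The flat buffer now holds every tile in order, with no concatenate step.
result = bytes(flat)
expected = b"".join(bytes([i + 1]) * tile_bytes for i in range(n))
assert result == expected
```

Because every write lands directly in its final position, the buffer handed back to the caller is already the flat, tile-ordered layout that the concatenate used to produce.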

## Affected backends

GPU only. CPU and dask paths do not exercise this code.

## Tests

A direct unit test for `_try_nvcomp_from_device_bufs` needs kvikio plus a GDS-capable filesystem, so the regression target is the end-to-end `read_geotiff_gpu` call on a ZSTD COG. The existing GDS-fallback tests cover the path where this function is skipped; a new test can exercise the single-buffer code path on the bytes-based fallback, which goes through the same `_apply_predictor_and_assemble` consumer.
