perf(geotiff): batch _try_nvjpeg2k_batch_decode per-tile allocations #2107

## Summary

`_try_nvjpeg2k_batch_decode` in `xrspatial/geotiff/_gpu_decode.py` (around L2725-2778) allocates per-component output buffers via `cupy.empty(tile_height * pitch)` inside its per-tile decode loop, and issues a device-wide `cupy.cuda.Device().synchronize()` after every single tile. This mirrors the anti-pattern previously fixed in sibling helpers:

- `_try_nvcomp_from_device_bufs` (#1659): per-tile alloc + trailing concat -> single contiguous buffer
- `_try_kvikio_read_tiles` (#1688): per-tile `cupy.empty` + blocking `IOFuture.get` -> single buffer + batched submit
- `_nvcomp_batch_compress` (#1712): per-tile `.get()` -> concat + single `.get()`

For an N-tile JPEG 2000 read with S samples per pixel, the current pattern is:

```python
for i, tile_data in enumerate(compressed_tiles):
    ...
    comp_bufs = []
    pitch = tile_width * dtype.itemsize
    for c in range(samples):
        buf = cupy.empty(tile_height * pitch, dtype=cupy.uint8)  # N*S allocs
        comp_bufs.append(buf)
    ...
    cupy.cuda.Device().synchronize()  # serialises ALL streams, once per tile
```

The per-tile sync alone forces serial decode (no opportunity for nvJPEG2000's internal pipelining), even when the decode path is otherwise stream-friendly. In addition, each `cupy.empty` call round-trips through cupy's memory pool, adding tens of microseconds per call, and there are N*S such calls per read.

## Anti-Pattern Reference

The other batched-codec helpers in the same module use the canonical fix:

- Allocate one contiguous output buffer of size `n_tiles * tile_bytes`.
- Compute per-tile/per-component pointer offsets on host (`np.cumsum` or `np.arange(n) * stride`).
- Pass per-element device pointers as `base_ptr + offsets`.
- Synchronize ONCE after the batch.

Applied to nvJPEG2000, the components for tile `i` should live in slices of the existing `d_all_tiles` buffer (or of one fresh `d_comp_pool = cupy.empty(n_tiles * samples * tile_height * pitch, dtype=cupy.uint8)`); the loop body then only writes pointers into the `nvjpeg2kImage_t` struct.
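
The offset arithmetic can be sketched on the host with NumPy alone. This is a minimal illustration, not the module's code: `component_offsets` is a hypothetical helper name, the tile dimensions are made up, and `base_ptr` is a placeholder integer standing in for the device address of the pool (in cupy that would come from the pool array's `data.ptr`).

```python
import numpy as np


def component_offsets(n_tiles, samples, tile_height, pitch):
    """Byte offsets into one contiguous pool, shape (n_tiles, samples).

    Hypothetical helper: component c of tile i starts at
    (i * samples + c) * tile_height * pitch.
    """
    comp_bytes = tile_height * pitch
    flat = np.arange(n_tiles * samples, dtype=np.int64) * comp_bytes
    return flat.reshape(n_tiles, samples)


# Illustrative sizes only (assumed, not taken from the module).
n_tiles, samples, tile_height, pitch = 3, 2, 4, 8
offsets = component_offsets(n_tiles, samples, tile_height, pitch)

# Total size of the single allocation that replaces the N*S small ones.
pool_bytes = n_tiles * samples * tile_height * pitch

# Placeholder for the pool's device address.
base_ptr = 0x7000_0000
ptrs = base_ptr + offsets  # per-tile, per-component pointers
```

Each row of `ptrs` holds the per-component addresses that would be written into the image struct for one tile, so the loop body does no allocation.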

## Impact

The function is gated behind `_get_nvjpeg2k()` returning the loaded shared library, so the impact is bounded to environments with nvJPEG2000 installed (CUDA toolkit + RAPIDS conda env). On those hosts, large J2K-tiled COG reads (hundreds to thousands of tiles) currently pay one device-wide sync per tile plus N*S small allocations. The worst case is multi-thousand-tile satellite imagery reads, where the per-tile sync cost adds up to seconds of unnecessary wall time.

## Plan

1. Pre-allocate `d_comp_pool = cupy.empty(n_tiles * samples * tile_height * pitch, dtype=cupy.uint8)` once outside the loop with a `_check_gpu_memory` guard for the total size.
2. Inside the per-tile loop, derive per-component device pointers as views into `d_comp_pool`.
3. Drop the per-tile `cupy.cuda.Device().synchronize()`; synchronize once after the loop terminates.
4. Add a structural test that asserts the function calls `cupy.empty` at most once for the output pool (matches the `test_nvcomp_from_device_bufs_single_alloc_1659.py` pattern).
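
One possible shape for the structural test in step 4, sketched without a GPU. Everything here is an assumption for illustration: `fake_cupy` is a NumPy-backed stand-in for the `cupy` module, and `decode_with_pool` is a hypothetical reduction of the refactored function to its allocation behaviour. The real test would patch `cupy.empty` in `_gpu_decode` and call `_try_nvjpeg2k_batch_decode` itself.

```python
from types import SimpleNamespace
from unittest import mock

import numpy as np

# Stand-in for the cupy module (assumption: only `empty` matters here).
fake_cupy = SimpleNamespace(
    empty=lambda n, dtype=np.uint8: np.empty(n, dtype=dtype)
)


def decode_with_pool(xp, n_tiles, samples, tile_height, pitch):
    """Hypothetical refactored allocation path: one pool, then views."""
    comp_bytes = tile_height * pitch
    pool = xp.empty(n_tiles * samples * comp_bytes, dtype=np.uint8)
    views = [
        [
            pool[(i * samples + c) * comp_bytes:(i * samples + c + 1) * comp_bytes]
            for c in range(samples)
        ]
        for i in range(n_tiles)
    ]
    return pool, views


def count_empty_calls(n_tiles=4, samples=3):
    """Spy on `empty` and report how many times it was called."""
    with mock.patch.object(fake_cupy, "empty", wraps=fake_cupy.empty) as spy:
        decode_with_pool(fake_cupy, n_tiles, samples, tile_height=2, pitch=8)
        return spy.call_count
```

Asserting `count_empty_calls() == 1` pins the single-allocation property regardless of tile count, which is the regression the per-tile pattern would reintroduce.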

## Files

- `xrspatial/geotiff/_gpu_decode.py` around L2740-2778
- New test under `xrspatial/geotiff/tests/`

## Related

- #1659 (closed) - per-tile alloc + concat in `_try_nvcomp_from_device_bufs`
- #1688 (closed) - per-tile `cupy.empty` + serial pread in `_try_kvikio_read_tiles`
- #1712 (closed) - per-tile `.get()` and per-tile `cupy.empty` in `_nvcomp_batch_compress`
- #1950 (closed) - per-tile prefix-sum loop replaced with `np.cumsum`

This is a continuation of the same "single contiguous device buffer" refactor pattern.
