perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations

## Reason or Problem

`_nvcomp_batch_compress` in `xrspatial/geotiff/_gpu_decode.py` is the GPU write-side compress path (deflate/zstd via nvCOMP). Two related anti-patterns remain in this function that mirror the ones already fixed elsewhere by #1552 and #1659:

**1. Per-tile cupy.empty allocations (line 2457):**

```python
d_comp_bufs = [cupy.empty(max_cs, dtype=cupy.uint8) for _ in range(n_tiles)]
```

For an N-tile compress, this issues N separate cupy allocations, each a memory-pool request plus Python-side array construction (`cupy.empty` does not initialize the memory, so no fill kernel is involved). The decode side at `_gpu_decode.py:2004` and the LZW kernel decode at `_gpu_decode.py:1973` already use a single contiguous allocation plus offsets/views.

Microbench (1024 tiles, 16 KiB each):
- N separate `cupy.empty`: 4.74 ms
- 1 `cupy.empty` + N slice views: 1.02 ms
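The single-allocation layout can be sketched as follows. This is a minimal illustration, not the code in `_gpu_decode.py`: the helper name `alloc_tile_slabs` and the `xp` parameter are hypothetical, and NumPy stands in for cupy so the snippet runs without a GPU (on the GPU path one would pass `xp=cupy`, assuming `empty` and basic slicing behave the same way there).

```python
import numpy as np


def alloc_tile_slabs(n_tiles, max_cs, xp=np):
    """Allocate one contiguous uint8 buffer and return per-tile views into it.

    `xp` is the array module (hypothetical parameter): NumPy here so the
    sketch runs on CPU, `cupy` on the GPU path.
    """
    backing = xp.empty(n_tiles * max_cs, dtype=xp.uint8)
    # Basic slicing returns views, so no further allocations happen here.
    slabs = [backing[i * max_cs:(i + 1) * max_cs] for i in range(n_tiles)]
    # Byte offset of each slab from the start of `backing`. With cupy, adding
    # `backing.data.ptr` would give per-buffer device pointers.
    offsets = np.arange(n_tiles, dtype=np.uint64) * np.uint64(max_cs)
    return backing, slabs, offsets


backing, slabs, offsets = alloc_tile_slabs(n_tiles=4, max_cs=16)
slabs[2][:] = 7  # writing through a view lands in the backing buffer
```

Keeping the `backing` array alive for as long as the views are in use matters here, since the slabs do not own their memory.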

**2. Per-tile `.get().tobytes()` D2H readback (lines 2522-2526):**

```python
comp_sizes = d_comp_sizes.get().astype(int)
result = []
for i in range(n_tiles):
    cs = int(comp_sizes[i])
    raw = d_comp_bufs[i][:cs].get().tobytes()
    ...
```

Each `.get()` is a separate D2H DMA on the default stream, so they serialise. The comment block at lines 2503-2505 even calls out the same problem for the uncompressed-side adler32 path:

> Batch the GPU->CPU transfer so all tiles move in a single DMA instead of one .get() per tile (which serializes on the default stream and is the dominant cost on the deflate path).

But the compressed side at lines 2522-2526 still does per-tile `.get()`. The decode-side fix (#1552) batched the same pattern in `gpu_decode_tiles_from_file` (lines 870-913 region).

Microbench (1024 tiles, 16 KiB each, filled, on a single Ampere-class GPU):
- per-tile `.get().tobytes()`: 35.19 ms
- 1 concat + 1 `.get()` + Python slice: 19.31 ms (~45% reduction)

## Proposal

Mirror the existing batched D2H pattern used by `gpu_decode_tiles_from_file` (#1552) and `_try_nvcomp_from_device_bufs` (#1659):

1. Allocate one contiguous device buffer of size `n_tiles * max_cs`, view per-tile slabs into it, and pass per-buffer pointer offsets to the nvCOMP API.
2. After nvCOMP returns, read `d_comp_sizes` once, allocate one host buffer the size of the *actual* compressed total (sum of `comp_sizes`), do one concatenation + single `.get()`, then slice the host buffer into per-tile bytes.

Variable-size compressed output makes step 2 slightly more involved than on the decode side. Two options:
- Issue `cupy.concatenate` on a list of slab views shrunk to `comp_sizes[i]`, then a single `.get()`.
- Or compute `comp_offsets = np.concatenate(([0], np.cumsum(comp_sizes)))`, build the contiguous buffer via a scatter copy on device (one kernel), then one `.get()`.

The first option is closer to the LZW-fallback fix in #1552 and should be enough for the common deflate/zstd path. The second is a refinement if profiling shows the `cupy.concatenate` overhead dominates.
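A rough sketch of the first option, under stated assumptions: `read_back_compressed` and its `xp` parameter are hypothetical names, and NumPy stands in for cupy so the snippet runs on CPU. With cupy arrays the `.get()` branch would perform the single device-to-host copy. The offsets are computed with the same `np.cumsum` expression mentioned for the second option.

```python
import numpy as np


def read_back_compressed(slabs, comp_sizes, xp=np):
    """Return per-tile compressed bytes using a single device-to-host copy."""
    comp_sizes = np.asarray(comp_sizes, dtype=np.int64)
    if len(slabs) == 0:
        return []  # concatenate() rejects an empty list
    # Trim each slab view to its actual compressed length, then pack them
    # back to back in one array (on device when xp is cupy).
    packed = xp.concatenate(
        [s[:cs] for s, cs in zip(slabs, comp_sizes.tolist())]
    )
    # cupy arrays expose .get(); NumPy arrays are already on the host.
    host = packed.get() if hasattr(packed, "get") else packed
    buf = host.tobytes()
    # Exclusive prefix sum gives the start of each tile in the packed buffer.
    offsets = np.concatenate(([0], np.cumsum(comp_sizes)))
    return [buf[offsets[i]:offsets[i + 1]] for i in range(len(comp_sizes))]


# Three fake "compressed" slabs of capacity 8 with different used lengths.
slabs = [np.full(8, i, dtype=np.uint8) for i in range(3)]
tiles = read_back_compressed(slabs, comp_sizes=[3, 0, 5])
```

The per-tile slicing then happens on host `bytes`, so the only transfer cost is the one packed copy.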

The adler32-handling logic for deflate (lines 2507-2519) already does the concat+`.get()` pattern for the **uncompressed** tiles. Reuse the same shape on the **compressed** side.

## Acceptance criteria

- The per-tile `.get()` loop at lines 2522-2526 is replaced with one batched D2H DMA.
- The per-tile `cupy.empty` loop at line 2457 is replaced with one contiguous allocation + views.
- Microbench (n=1024 tiles, 16 KiB each) shows a D2H wall-time reduction comparable to the ~45% measured above (35.19 ms to 19.31 ms).
- Existing zstd / deflate write tests in `xrspatial/geotiff/tests/test_gpu_writer_compression_modes_*.py` continue to pass.
- Deflate output remains zlib-framed correctly (adler32 wrap on the host side preserved).
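The zlib-framing criterion can be checked with the standard library alone. The sketch below assumes the GPU codec emits a raw deflate stream and that the host wraps it with a two-byte zlib header and a big-endian adler32 of the uncompressed tile, as RFC 1950 describes. The helper `wrap_raw_deflate` and the specific header bytes are illustrative, not taken from `_gpu_decode.py`, and CPU `zlib` stands in for the nvCOMP output.

```python
import struct
import zlib


def wrap_raw_deflate(raw, uncompressed):
    """Frame a raw deflate stream as zlib (RFC 1950): header, data, adler32."""
    header = b"\x78\x9c"  # CM=8 (deflate), 32 KiB window, default-level hint
    trailer = struct.pack(">I", zlib.adler32(uncompressed) & 0xFFFFFFFF)
    return header + raw + trailer


tile = bytes(range(256)) * 64  # stand-in for one uncompressed 16 KiB tile
co = zlib.compressobj(wbits=-15)  # negative wbits: raw deflate, no header
raw = co.compress(tile) + co.flush()
framed = wrap_raw_deflate(raw, tile)
# zlib.decompress verifies the adler32 trailer and raises zlib.error
# if the checksum does not match the decompressed data.
roundtrip = zlib.decompress(framed)
```

A round trip through `zlib.decompress` on each written tile is a cheap way to confirm that the batched readback did not misalign tile boundaries, because a shifted slice would fail either the deflate decode or the checksum.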

## Context

Found via deep-sweep performance audit on 2026-05-12. The codec-decode side (#1552) and the LZW-and-friends device-pointer side (#1659) have both already received this treatment; the encode-into-nvCOMP path is the last one with the per-tile pattern.
