perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads

## Reason or Problem

`_try_kvikio_read_tiles` in `xrspatial/geotiff/_gpu_decode.py` (line 941) reads each compressed tile into GPU memory with a per-tile allocate, read, and wait sequence that serialises:

```python
d_tiles = []
with kvikio.CuFile(file_path, 'r') as f:
    for off, bc in zip(tile_offsets, tile_byte_counts):
        buf = cupy.empty(bc, dtype=cupy.uint8)
        nbytes = f.pread(buf, file_offset=off)
        actual = nbytes.get() if hasattr(nbytes, 'get') else int(nbytes)
        if actual != bc:
            return None
        d_tiles.append(buf)
```

Three problems compound in that loop on a typical multi-tile COG read:

1. `cupy.empty(bc, ...)` allocates one device buffer per tile. Each `cupy.empty` call costs tens of microseconds of setup independent of `bc`, so a 256-tile COG pays that overhead 256 times. This is the same per-tile allocation pattern PR #1659 just fixed in `_try_nvcomp_from_device_bufs`.
2. `f.pread(...)` returns an `IOFuture` so the call itself is non-blocking. The very next line calls `nbytes.get()`, which blocks until that one pread finishes. Every tile waits before the next pread is submitted, so kvikio's internal thread pool never sees more than one outstanding request and the parallel-IO design of `pread` collapses to serial.
3. No `_check_gpu_memory(sum(tile_byte_counts), ...)` guard runs before the allocations. A crafted COG with large `TileByteCounts` can OOM the device one tile at a time, since no individual allocation trips a guard. The sibling paths (`_try_nvcomp_from_device_bufs` at L1641, `_batched_d2h_to_bytes` at L930) both check the total bytes up front.
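The effect described in item 2 can be reproduced without a GPU. The sketch below is only an analogy: it uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for kvikio's worker pool and a `time.sleep` as a stand-in for the read, and the `InFlightCounter`, `serial_get`, and `batched_get` names are hypothetical. It records the peak number of simulated reads in flight for each submission pattern.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InFlightCounter:
    """Tracks how many simulated reads are running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def read(self, delay):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        time.sleep(delay)  # stand-in for the actual I/O
        with self._lock:
            self.current -= 1
        return delay


def serial_get(pool, n, delay):
    """Submit one read and wait on it before submitting the next."""
    counter = InFlightCounter()
    for _ in range(n):
        pool.submit(counter.read, delay).result()
    return counter.peak


def batched_get(pool, n, delay):
    """Submit every read first, then wait on all of them."""
    counter = InFlightCounter()
    futures = [pool.submit(counter.read, delay) for _ in range(n)]
    for fut in futures:
        fut.result()
    return counter.peak


with ThreadPoolExecutor(max_workers=4) as pool:
    serial_peak = serial_get(pool, 8, 0.05)    # never more than one in flight
    batched_peak = batched_get(pool, 8, 0.05)  # bounded by max_workers

print("serial peak:", serial_peak, "batched peak:", batched_peak)
```

Waiting on each future immediately keeps the pool at one outstanding request no matter how many workers it has, which is the behaviour the current loop imposes on `pread`.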

## Proposal

Move `_try_kvikio_read_tiles` to the same single-buffer + batched-future pattern the other GDS/nvCOMP paths in this file already use:

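```python
# Assumes `numpy as np`, `cupy` and `kvikio` are already imported by the module.
sizes = [int(bc) for bc in tile_byte_counts]
# Exclusive prefix sum: tile i starts at offsets[i] in the combined buffer.
offsets = np.zeros(len(sizes), dtype=np.int64)
if len(sizes) > 1:
    np.cumsum(sizes[:-1], dtype=np.int64, out=offsets[1:])
total_bytes = sum(sizes)

# Check the whole read against the memory budget before allocating anything.
_check_gpu_memory(total_bytes, what="kvikio tile read buffer")
combined = cupy.empty(total_bytes, dtype=cupy.uint8)

futures = []
with kvikio.CuFile(file_path, 'r') as f:
    # Submit every pread before waiting on any of them.
    for src_off, dst_off, bc in zip(tile_offsets, offsets, sizes):
        if bc == 0:
            continue  # nothing to read for an empty tile
        view = combined[dst_off:dst_off + bc]
        futures.append((f.pread(view, file_offset=src_off), bc))

    # Drain every future before leaving the `with` block, so no read is
    # still in flight when the file closes or an early return drops `combined`.
    ok = True
    for future, expected_bc in futures:
        actual = future.get() if hasattr(future, 'get') else int(future)
        if actual != expected_bc:
            ok = False

if not ok:
    return None

cupy.cuda.Device().synchronize()

# Hand back one 1-D uint8 view per tile; the views keep `combined` alive.
return [combined[dst_off:dst_off + bc] for dst_off, bc in zip(offsets, sizes)]
```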

**Design:** One `cupy.empty` replaces N. All `pread` calls are submitted before the first `.get()` so kvikio's worker pool can overlap them. The returned `d_tiles` is still a list of `cupy.uint8` 1-D views, so the nvCOMP / `_batched_d2h_to_bytes` consumers downstream stay unchanged. The base buffer is kept alive by the views, so the slices remain valid for the lifetime of the result.
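The offset arithmetic and the view semantics can be checked on the host. This is a minimal sketch that assumes a `bytearray` plus `memoryview` slices behave enough like a CuPy buffer and its slices for illustration; `pack_offsets` is a hypothetical helper, not a function in the module.

```python
from itertools import accumulate


def pack_offsets(sizes):
    """Return (offsets, total), where offsets[i] is the start of tile i."""
    sizes = [int(s) for s in sizes]
    if not sizes:
        return [], 0
    # Exclusive prefix sum: each tile starts where the previous one ended.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return offsets, sum(sizes)


sizes = [3, 0, 5, 2]              # a zero-length tile is allowed
offsets, total = pack_offsets(sizes)

combined = bytearray(total)       # one allocation for every tile
base = memoryview(combined)
views = [base[o:o + n] for o, n in zip(offsets, sizes)]

# Writing through a view lands in the shared buffer, not in a copy.
views[2][:] = b"hello"
print(offsets, total, bytes(combined[3:8]))
```

A zero-length tile gets an empty view at the same offset as its successor, so the list of views stays aligned one-to-one with the tile list.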

**Usage:** Internal helper. No public API change.

**Value:** Recovers the parallelism the `IOFuture` design was meant to provide and removes the per-tile allocation overhead PR #1659 already addressed in the symmetric nvCOMP path. Adds the missing memory guard so the GDS path fails fast under crafted inputs like the other GPU paths already do.
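To illustrate why a guard on the total matters, here is a hedged sketch of an up-front budget check. The internals of `_check_gpu_memory` are not shown in this issue, so `check_budget`, `plan_read`, and the `headroom` factor are hypothetical stand-ins. In real code the free-memory figure might come from something like `cupy.cuda.runtime.memGetInfo()`; here it is passed in so the example runs without a GPU.

```python
def check_budget(requested_bytes, free_bytes, what, headroom=0.9):
    """Raise MemoryError if the request exceeds a fraction of free memory.

    Hypothetical stand-in for a guard such as `_check_gpu_memory`.
    """
    limit = int(free_bytes * headroom)
    if requested_bytes > limit:
        raise MemoryError(
            f"{what}: need {requested_bytes} bytes, limit is {limit}"
        )


def plan_read(tile_byte_counts, free_bytes):
    """Validate the total size of a read before any buffer is allocated."""
    total = sum(int(bc) for bc in tile_byte_counts)
    check_budget(total, free_bytes, what="tile read buffer")
    return total


GIB = 1 << 30
MIB = 1 << 20

# 256 tiles of 64 MiB each: every tile fits on its own, the total does not.
crafted = [64 * MIB] * 256
try:
    plan_read(crafted, free_bytes=8 * GIB)
    rejected = False
except MemoryError:
    rejected = True

small_total = plan_read([1024] * 4, free_bytes=1 * MIB)
print(rejected, small_total)
```

Checking the sum once rejects the crafted input before the first allocation, instead of failing partway through the tile loop.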

## Stakeholders and Impacts

Users reading COGs from NVMe with kvikio + GDS installed. The non-GDS path (ImportError fallback) is untouched. Downstream consumers (`_try_nvcomp_from_device_bufs`, `_batched_d2h_to_bytes`) take the same `list[cupy.ndarray]` shape and require no changes.

## Drawbacks

The single contiguous buffer is `sum(tile_byte_counts)` bytes, the same total the per-tile buffers occupy today, so peak VRAM is unchanged even on a sparse window read where only a few tiles are requested. Because every view keeps the base buffer alive, the whole buffer stays resident until the last tile view is released. The memory guard runs once, up front, and covers the whole read rather than each tile separately.

## Alternatives

A `concurrent.futures.ThreadPoolExecutor` wrapping the per-tile loop would also unblock parallelism but would not address the per-tile allocations or the missing memory guard. The single-buffer pattern is what the rest of this file already uses; keeping it consistent makes future audits simpler.
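For comparison, the thread-pool alternative might look like the following host-side sketch. It is an analogy, not the GDS path: it uses `os.pread` (available on Unix-like systems) in place of `CuFile.pread`, and `read_tiles_threaded` is a hypothetical name.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_tiles_threaded(path, tile_offsets, tile_byte_counts, max_workers=4):
    """Read each tile on a worker thread; return None on any short read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Each os.pread returns its own bytes object: still N allocations.
            tiles = list(pool.map(
                lambda job: os.pread(fd, job[1], job[0]),
                zip(tile_offsets, tile_byte_counts),
            ))
    finally:
        os.close(fd)
    if any(len(t) != bc for t, bc in zip(tiles, tile_byte_counts)):
        return None
    return tiles


# Usage: three "tiles" packed back to back in a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AAAABBBBBBCC")
    path = tmp.name
try:
    tiles = read_tiles_threaded(path, [0, 4, 10], [4, 6, 2])
    short = read_tiles_threaded(path, [0, 10], [4, 8])  # second read runs past EOF
finally:
    os.unlink(path)

print(tiles, short)
```

The reads overlap, but each tile still gets its own buffer and nothing checks the total size before reading, which is why the single-buffer pattern is preferred here.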



perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads #1688
