perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads #1688

@brendancol

Reason or Problem

_try_kvikio_read_tiles in xrspatial/geotiff/_gpu_decode.py (line 941) pulls each compressed tile into GPU memory through three per-tile operations that serialise:

d_tiles = []
with kvikio.CuFile(file_path, 'r') as f:
    for off, bc in zip(tile_offsets, tile_byte_counts):
        buf = cupy.empty(bc, dtype=cupy.uint8)
        nbytes = f.pread(buf, file_offset=off)
        actual = nbytes.get() if hasattr(nbytes, 'get') else int(nbytes)
        if actual != bc:
            return None
        d_tiles.append(buf)

Three problems compound in that loop on a typical multi-tile COG read:

  1. cupy.empty(bc, ...) allocates one device buffer per tile. Each cupy.empty call costs tens of microseconds of setup independent of bc, so a 256-tile COG pays that overhead 256 times. This is the same per-tile allocation pattern that PR #1659 (perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path)) just fixed in _try_nvcomp_from_device_bufs.
  2. f.pread(...) returns an IOFuture so the call itself is non-blocking. The very next line calls nbytes.get(), which blocks until that one pread finishes. Every tile waits before the next pread is submitted, so kvikio's internal thread pool never sees more than one outstanding request and the parallel-IO design of pread collapses to serial.
  3. No _check_gpu_memory(sum(tile_byte_counts), ...) guard runs before the allocations. A crafted COG with large TileByteCounts can OOM the device one tile at a time, since nothing checks the running total. The sibling paths (_try_nvcomp_from_device_bufs at L1641, _batched_d2h_to_bytes at L930) both check the total bytes up front.
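
The serialisation in item 2 can be reproduced without a GPU. The sketch below is an illustration only: it uses concurrent.futures.ThreadPoolExecutor as a stand-in for kvikio's thread pool, and FakeFile, read_serial, read_batched, and peak_in_flight are hypothetical names that exist in neither kvikio nor xrspatial. It records the peak number of simulated reads in flight under each submission order.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeFile:
    """Hypothetical stand-in for an async file handle backed by a thread pool."""

    def __init__(self, pool):
        self._pool = pool
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0

    def _read(self, nbytes):
        with self._lock:
            self._in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
        time.sleep(0.02)  # pretend IO latency
        with self._lock:
            self._in_flight -= 1
        return nbytes

    def pread(self, nbytes):
        # Returns immediately with a future, like a non-blocking read.
        return self._pool.submit(self._read, nbytes)


def read_serial(f, sizes):
    # Submit one read, wait for it, then submit the next.
    return [f.pread(n).result() for n in sizes]


def read_batched(f, sizes):
    # Submit every read first, then wait for all of them.
    futures = [f.pread(n) for n in sizes]
    return [fut.result() for fut in futures]


def peak_in_flight(reader, sizes, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        f = FakeFile(pool)
        reader(f, sizes)
        return f.peak_in_flight


sizes = [1024] * 8
serial_peak = peak_in_flight(read_serial, sizes)
batched_peak = peak_in_flight(read_batched, sizes)
print(serial_peak, batched_peak)
```

With the submit-then-wait order the pool never holds more than one request, whatever its worker count. With the batched order the peak rises toward the number of workers, which is the overlap the proposal aims to recover.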

Proposal

Move _try_kvikio_read_tiles to the same single-buffer + batched-future pattern the other GDS/nvCOMP paths in this file already use:

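# Assumes numpy (as np), cupy, and kvikio are already imported in this scope.
sizes = [int(bc) for bc in tile_byte_counts]
# Exclusive prefix sum: offsets[i] is where tile i starts in the combined buffer.
offsets = np.zeros(len(sizes), dtype=np.int64)
if len(sizes) > 1:
    np.cumsum(sizes[:-1], out=offsets[1:])
total_bytes = sum(sizes)

# Check the total once, before touching device memory.
_check_gpu_memory(total_bytes, what="kvikio tile read buffer")
combined = cupy.empty(total_bytes, dtype=cupy.uint8)

futures = []
with kvikio.CuFile(file_path, 'r') as f:
    # Submit every pread before waiting on any of them.
    for src_off, dst_off, bc in zip(tile_offsets, offsets, sizes):
        if bc == 0:
            continue  # nothing to read for an empty tile
        view = combined[dst_off:dst_off + bc]
        futures.append((f.pread(view, file_offset=src_off), bc))

    # Drain every future, even after a short read, so no pread is still
    # writing into `combined` when the file closes or the buffer is freed.
    short_read = False
    for future, expected_bc in futures:
        actual = future.get() if hasattr(future, 'get') else int(future)
        if actual != expected_bc:
            short_read = True

if short_read:
    return None

cupy.cuda.Device().synchronize()

# Zero-copy views into the combined buffer, one per tile (empty tiles included).
return [combined[dst_off:dst_off + bc] for dst_off, bc in zip(offsets, sizes)]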

Design: One cupy.empty replaces N. All pread calls are submitted before the first .get() so kvikio's worker pool can overlap them. The returned d_tiles is still a list of cupy.uint8 1-D views, so the nvCOMP / _batched_d2h_to_bytes consumers downstream stay unchanged. The base buffer is kept alive by the views, so the slices remain valid for the lifetime of the result.
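
The offset bookkeeping and the view-lifetime claim can be checked on the host. This is a sketch that substitutes NumPy for CuPy, on the assumption that basic slicing produces views with the same semantics in both libraries; pack_offsets is a hypothetical helper name, not a function in this file.

```python
import numpy as np


def pack_offsets(byte_counts):
    """Return (offsets, total), where offsets[i] is the start of tile i
    in one packed buffer (an exclusive prefix sum of the sizes)."""
    sizes = np.asarray(byte_counts, dtype=np.int64)
    offsets = np.zeros(len(sizes), dtype=np.int64)
    if len(sizes) > 1:
        np.cumsum(sizes[:-1], out=offsets[1:])
    return offsets, int(sizes.sum())


sizes = [3, 0, 5]
offsets, total = pack_offsets(sizes)

# One allocation for all tiles (NumPy standing in for cupy.empty).
combined = np.zeros(total, dtype=np.uint8)

# Each tile is a slice of the shared buffer, not a copy.
tiles = [combined[o:o + n] for o, n in zip(offsets, sizes)]

# Writing through a view lands in the shared buffer.
tiles[2][:] = 7
print(offsets.tolist(), total, combined.tolist())
```

A zero-length tile gets a zero-length view at the offset of its successor, so the output list keeps one entry per input tile.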

Usage: Internal helper. No public API change.

Value: Recovers the parallelism the IOFuture design was meant to provide and removes the per-tile allocation overhead PR #1659 already addressed in the symmetric nvCOMP path. Adds the missing memory guard so the GDS path fails fast under crafted inputs like the other GPU paths already do.
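
A fail-fast guard of the kind described might look like the sketch below. The internals of the real _check_gpu_memory are not shown in this issue, so check_total_bytes, its free_bytes parameter, and the headroom factor are all assumptions made for illustration; on a real device the free-byte figure would have to come from the CUDA runtime.

```python
def check_total_bytes(byte_counts, free_bytes, what="buffer", headroom=0.9):
    """Hypothetical guard: validate the summed size before any allocation.

    Raises ValueError on a negative count and MemoryError when the total
    exceeds `headroom` times `free_bytes`. Returns the total otherwise.
    """
    total = 0
    for bc in byte_counts:
        bc = int(bc)
        if bc < 0:
            raise ValueError(f"negative byte count in {what}: {bc}")
        total += bc
    budget = int(free_bytes * headroom)
    if total > budget:
        raise MemoryError(
            f"{what} needs {total} bytes, budget is {budget} bytes"
        )
    return total


# A modest request passes and reports its total.
ok_total = check_total_bytes([10, 20], free_bytes=100, what="tile read")

# A crafted set of oversized counts is rejected before anything is allocated.
try:
    check_total_bytes([1 << 40] * 4, free_bytes=1 << 30, what="tile read")
    rejected = False
except MemoryError:
    rejected = True
```

Checking the sum rather than each element is what closes the gap: many tiles that are individually acceptable can still exceed the budget together.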

Stakeholders and Impacts

Users reading COGs from NVMe with kvikio + GDS installed. The non-GDS path (ImportError fallback) is untouched. Downstream consumers (_try_nvcomp_from_device_bufs, _batched_d2h_to_bytes) take the same list[cupy.ndarray] shape and require no changes.

Drawbacks

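The single contiguous buffer is sum(tile_byte_counts) bytes, the same total the per-tile buffers occupy today, so peak VRAM is unchanged. On a sparse window read where only a few tiles are requested, the buffer still matches the existing per-tile total exactly. Because every returned view keeps the base buffer alive, none of that memory is released until all tile views are dropped. The memory guard runs once, before any read is submitted.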

Alternatives

A concurrent.futures.ThreadPoolExecutor wrapping the per-tile loop would also unblock parallelism but would not address the per-tile allocations or the missing memory guard. The single-buffer pattern is what the rest of this file already uses; keeping it consistent makes future audits simpler.

Metadata

Labels: enhancement (New feature or request), gpu (CuPy / CUDA GPU support), performance (PR touches performance-sensitive code)
