Reason or Problem
_try_kvikio_read_tiles in xrspatial/geotiff/_gpu_decode.py (line 941) pulls each compressed tile into GPU memory through three per-tile operations that serialise. Three problems compound in that loop on a typical multi-tile COG read:
cupy.empty(bc, ...) allocates one device buffer per tile. Each cupy.empty call costs tens of microseconds of setup independent of bc. A 256-tile COG pays that overhead 256 times, the same per-tile allocation pattern PR #1659 just fixed in _try_nvcomp_from_device_bufs.
f.pread(...) returns an IOFuture so the call itself is non-blocking. The very next line calls nbytes.get(), which blocks until that one pread finishes. Every tile waits before the next pread is submitted, so kvikio's internal thread pool never sees more than one outstanding request and the parallel-IO design of pread collapses to serial.
No _check_gpu_memory(sum(tile_byte_counts), ...) guard runs before the allocations. A crafted COG with large TileByteCounts can OOM the device one tile at a time before any single allocation hits a guard. The sibling paths (_try_nvcomp_from_device_bufs at L1641, _batched_d2h_to_bytes at L930) all check the total bytes up front.
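The submit-then-block behaviour described above can be illustrated without a GPU. What follows is a minimal sketch, not the code in _gpu_decode.py: a standard-library ThreadPoolExecutor stands in for kvikio's thread pool, pool.submit stands in for f.pread(...), and Future.result() stands in for nbytes.get(). The helper name peak_outstanding is hypothetical. It counts how many requests have been submitted but not yet waited on, from the caller's point of view.

```python
from concurrent.futures import ThreadPoolExecutor


def peak_outstanding(n_tiles, wait_immediately):
    """Peak count of requests submitted but not yet waited on.

    CPU-only stand-in (an assumption for illustration): pool.submit plays
    the role of f.pread(...), Future.result() the role of nbytes.get().
    """
    outstanding = 0
    peak = 0
    deferred = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        for tile in range(n_tiles):
            fut = pool.submit(int, tile)      # non-blocking submit
            outstanding += 1
            peak = max(peak, outstanding)
            if wait_immediately:
                fut.result()                  # blocks before the next submit
                outstanding -= 1
            else:
                deferred.append(fut)          # wait later, after all submits
        for fut in deferred:
            fut.result()
            outstanding -= 1
    return peak


print(peak_outstanding(256, wait_immediately=True))   # 1
print(peak_outstanding(256, wait_immediately=False))  # 256
```

With an immediate wait, the pool is never offered more than one request at a time, however many workers it has.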
Proposal
Move _try_kvikio_read_tiles to the same single-buffer + batched-future pattern the other GDS/nvCOMP paths in this file already use:
Design: One cupy.empty replaces N. All pread calls are submitted before the first .get() so kvikio's worker pool can overlap them. The returned d_tiles is still a list of cupy.uint8 1-D views, so the nvCOMP / _batched_d2h_to_bytes consumers downstream stay unchanged. The base buffer is kept alive by the views, so the slices remain valid for the lifetime of the result.
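The design above can be sketched on the CPU to show its shape. This is an illustration under stated assumptions, not the proposed patch: a bytearray stands in for the single cupy.empty buffer, memoryview slices stand in for the cupy.uint8 views, and os.pread on a thread pool stands in for kvikio's pread futures. The function name read_tiles_single_buffer and its parameters are hypothetical, and os.pread is Unix-only.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_tiles_single_buffer(fd, offsets, byte_counts, pool):
    """Read every tile into one shared buffer; return per-tile views."""
    total = sum(byte_counts)
    base = memoryview(bytearray(total))   # one allocation replaces N

    # Carve the base buffer into per-tile views (no copies are made).
    tiles = []
    pos = 0
    for bc in byte_counts:
        tiles.append(base[pos:pos + bc])
        pos += bc

    def read_into(dst, file_offset):
        data = os.pread(fd, len(dst), file_offset)
        dst[:len(data)] = data
        return len(data)

    # Submit every read before waiting on any of them ...
    futures = [pool.submit(read_into, dst, off)
               for dst, off in zip(tiles, offsets)]
    # ... and only then collect results, checking for short reads.
    for fut, bc in zip(futures, byte_counts):
        if fut.result() != bc:
            raise OSError("short read")
    return tiles


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiles.bin")
    with open(path, "wb") as fh:
        fh.write(b"AAAA" + b"BB" + b"CCCCCC")
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            tiles = read_tiles_single_buffer(fd, [0, 4, 6], [4, 2, 6], pool)
    finally:
        os.close(fd)

print([bytes(t) for t in tiles])  # [b'AAAA', b'BB', b'CCCCCC']
```

Each returned view holds a reference to the shared buffer, which is the same lifetime argument the design makes for the cupy views.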
Usage: Internal helper. No public API change.
Value: Recovers the parallelism the IOFuture design was meant to provide and removes the per-tile allocation overhead that PR #1659 already addressed in the symmetric nvCOMP path. Adds the missing memory guard so the GDS path fails fast under crafted inputs, as the other GPU paths already do.
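To make the fail-fast point concrete, here is a minimal sketch of an up-front guard on the summed byte counts. The real _check_gpu_memory signature is not shown in this issue, so check_total_bytes, free_bytes, and headroom are hypothetical names chosen for illustration, and the budget is passed in rather than queried from a device.

```python
def check_total_bytes(byte_counts, free_bytes, headroom=0.9):
    """Raise MemoryError if the summed request exceeds the budget.

    Hypothetical stand-in for an up-front guard; it runs once, before
    any buffer is allocated.
    """
    total = sum(byte_counts)
    budget = int(free_bytes * headroom)
    if total > budget:
        raise MemoryError(
            f"tile data needs {total} bytes, budget is {budget} bytes"
        )
    return total


MiB = 1 << 20
GiB = 1 << 30

# Every tile fits on its own, so a per-allocation check would pass each one.
crafted = [64 * MiB] * 100
assert all(bc <= GiB for bc in crafted)

# The summed check rejects the whole request before anything is allocated.
try:
    check_total_bytes(crafted, free_bytes=GiB)
except MemoryError as exc:
    print("rejected before any allocation:", exc)
```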
Stakeholders and Impacts
Users reading COGs from NVMe with kvikio + GDS installed. The non-GDS path (ImportError fallback) is untouched. Downstream consumers (_try_nvcomp_from_device_bufs, _batched_d2h_to_bytes) take the same list[cupy.ndarray] shape and require no changes.
Drawbacks
The single contiguous buffer is sum(tile_byte_counts) bytes. On a sparse window read where only a few tiles are requested, the buffer size matches the existing per-tile total exactly, so peak VRAM is unchanged. The memory guard runs once at submit time rather than once per tile.
Alternatives
A concurrent.futures.ThreadPoolExecutor wrapping the per-tile loop would also unblock parallelism but would not address the per-tile allocations or the missing memory guard. The single-buffer pattern is what the rest of this file already uses; keeping it consistent makes future audits simpler.
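For comparison, the thread-pool alternative might look roughly like the sketch below. It is again a CPU-only illustration with hypothetical names (read_tiles_threadpool, read_one), using os.pread and bytearray in place of kvikio and cupy. The reads overlap, but each tile still gets its own allocation and nothing checks the total size first.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_tiles_threadpool(fd, offsets, byte_counts, max_workers=8):
    """Per-tile loop wrapped in a thread pool (the rejected alternative)."""

    def read_one(request):
        file_offset, bc = request
        buf = bytearray(bc)                   # still one allocation per tile
        data = os.pread(fd, bc, file_offset)
        buf[:len(data)] = data
        return buf

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so tiles come back in request order.
        return list(pool.map(read_one, zip(offsets, byte_counts)))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiles.bin")
    with open(path, "wb") as fh:
        fh.write(b"AAAA" + b"BB" + b"CCCCCC")
    fd = os.open(path, os.O_RDONLY)
    try:
        bufs = read_tiles_threadpool(fd, [0, 4, 6], [4, 2, 6])
    finally:
        os.close(fd)

print([bytes(b) for b in bufs])  # [b'AAAA', b'BB', b'CCCCCC']
```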