perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688) by brendancol · Pull Request #1693 · xarray-contrib/xarray-spatial

brendancol · 2026-05-12T17:57:59Z

Summary

Closes #1688.

Batch all pread submissions in _try_kvikio_read_tiles before waiting on any of the resulting IOFuture.get() calls so kvikio's worker pool can actually overlap the GDS reads.
Replace the N per-tile cupy.empty(bc, ...) allocations with one contiguous cupy.empty(sum(tile_byte_counts)) buffer, returning per-tile views into it. Matches the single-buffer pattern PR perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path) #1659 just landed for _try_nvcomp_from_device_bufs and the host-buffer / LZW / deflate nvCOMP paths already in this file.
Add a _check_gpu_memory(sum(tile_byte_counts), ...) guard before the allocation so a crafted COG with oversized TileByteCounts fails fast like the sibling GPU paths.

Microbench

8-worker pool simulation, 256 tiles @ 1ms per-IO latency:

old: 256 ms (serial)
new: 38.7 ms (~6.6x)

Single-thread submission simulation: 28.5 ms (~9x).

Test plan

xrspatial/geotiff/tests/test_kvikio_batched_pread_1688.py (9 new tests)
- empty-list short-circuit (kvikio import not attempted)
- kvikio-absent ImportError fallback returns None
- single contiguous allocation verified via per-tile data pointer arithmetic
- all preads submit before any .get() (structural ordering check)
- _check_gpu_memory runs with sum(tile_byte_counts) and a useful label
- partial-read short-read returns None for caller fallback
- end-to-end data round-trip through a recording fake CuFile
- zero-size single tile keeps a 0-length view in its original index
- all-sparse-tile total-bytes-zero short-circuit
Full geotiff suite: 1577 passed (3 unrelated matplotlib/py3.14 skips)

Copilot

Pull request overview

Improves the GeoTIFF GPU decode fast path by refactoring _try_kvikio_read_tiles to better leverage KvikIO’s asynchronous pread behavior and reduce GPU allocation overhead, with accompanying regression tests.

Changes:

Batch all KvikIO pread submissions before waiting on any futures to restore parallel I/O overlap.
Replace per-tile device allocations with a single contiguous cupy.empty(total_bytes) buffer and return per-tile views into it.
Add regression tests covering ordering, single-buffer behavior, memory-guard call, sparse/zero-size tiles, and fallback semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`xrspatial/geotiff/_gpu_decode.py`	Refactors `_try_kvikio_read_tiles` to single-buffer + batched-future pattern and adds total-bytes GPU memory guard.
`xrspatial/geotiff/tests/test_kvikio_batched_pread_1688.py`	Adds regression tests for batched `pread` submission ordering, single allocation/view slicing, guard call, and edge cases.
`.claude/sweep-performance-state.csv`	Updates internal performance audit state entry to reference #1688.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # Each tile got submitted exactly once. Submissions monotonically
+    # precede waits: the first ``.get()`` may not run until every
+    # submission already happened.
+    assert submission_log == [0, 1, 2, 3]
+    assert get_log == [0, 1, 2, 3]


+    import sys
+    fake_mod_obj = monkeypatch.setitem
+    fake_mod_obj  # silence unused


+        # Surface OOM unchanged. The caller can switch to the CPU-mmap
+        # path which does not pre-allocate the full compressed payload.


…uracy * test_all_preads_submitted_before_any_get now records both submit and get events into a single ordered timeline and asserts every submit occurs before the first get. The prior version asserted on per-event lists ([0,1,2,3] each), which the legacy interleaved submit->get->submit->get loop also satisfies, so the test could not catch a regression to that pattern. Verified by temporarily reverting _try_kvikio_read_tiles to the interleaved pattern: new assertion fails with a clear "preads and gets are interleaved" message showing the [submit,get,submit,get,...] timeline. * Removed the unused ``import sys`` and the no-op ``fake_mod_obj`` lines from test_all_zero_size_tiles_returns_zero_length_views. flake8 now reports no F401/F841 on the test file. * Reworded the MemoryError comment in _try_kvikio_read_tiles. The previous wording claimed the CPU-mmap fallback "does not pre-allocate the full compressed payload", but gpu_decode_tiles still calls ``d_comp = cupy.asarray(comp_buf_host)`` over ``total_comp`` bytes. The new wording explains the fallback skips the GDS-specific contiguous read buffer but still pays the bulk device allocation.

Replaces the per-tile cupy.empty + blocking IOFuture.get() inside the kvikio GDS path with a single contiguous device buffer, batched pread submissions, and a _check_gpu_memory guard up front. The old loop alternated submit -> wait -> submit -> wait, so the kvikio worker pool only saw one outstanding pread at a time and the per-tile cupy.empty() setup cost compounded across all tiles. The new pattern allocates once, submits every pread before the first .get(), and lets the worker pool overlap the reads. Microbench with 8-worker pool simulation, 256 tiles @ 1ms IO latency: old 256ms vs new 38.7ms (~6.6x). Single-thread simulation: 28.5ms (9x). Adds 9 unit tests covering the kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard contract, partial- read fallback, end-to-end data round-trip, and zero-size / all-sparse tile edge cases. The fake CuFile lets the structural checks run on hosts without a real GDS install.

…uracy * test_all_preads_submitted_before_any_get now records both submit and get events into a single ordered timeline and asserts every submit occurs before the first get. The prior version asserted on per-event lists ([0,1,2,3] each), which the legacy interleaved submit->get->submit->get loop also satisfies, so the test could not catch a regression to that pattern. Verified by temporarily reverting _try_kvikio_read_tiles to the interleaved pattern: new assertion fails with a clear "preads and gets are interleaved" message showing the [submit,get,submit,get,...] timeline. * Removed the unused ``import sys`` and the no-op ``fake_mod_obj`` lines from test_all_zero_size_tiles_returns_zero_length_views. flake8 now reports no F401/F841 on the test file. * Reworded the MemoryError comment in _try_kvikio_read_tiles. The previous wording claimed the CPU-mmap fallback "does not pre-allocate the full compressed payload", but gpu_decode_tiles still calls ``d_comp = cupy.asarray(comp_buf_host)`` over ``total_comp`` bytes. The new wording explains the fallback skips the GDS-specific contiguous read buffer but still pays the bulk device allocation.

github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026

brendancol requested a review from Copilot May 12, 2026 18:03

Copilot started reviewing on behalf of brendancol May 12, 2026 18:04 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

brendancol added 2 commits May 12, 2026 11:27

brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-12-104351 branch from d8d0e3c to cc157dd Compare May 12, 2026 18:28

brendancol merged commit 33a8b7d into main May 12, 2026
10 checks passed

brendancol deleted the deep-sweep-performance-geotiff-2026-05-12-104351 branch May 15, 2026 04:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688)#1693

perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688)#1693
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-12-104351

brendancol commented May 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# Surface OOM unchanged. The caller can switch to the CPU-mmap
		# path which does not pre-allocate the full compressed payload.

Conversation

brendancol commented May 12, 2026

Summary

Microbench

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants