perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688) #1693
Merged
brendancol merged 2 commits on May 12, 2026
Pull request overview
Improves the GeoTIFF GPU decode fast path by refactoring _try_kvikio_read_tiles to better leverage KvikIO’s asynchronous pread behavior and reduce GPU allocation overhead, with accompanying regression tests.
Changes:
- Batch all KvikIO `pread` submissions before waiting on any futures to restore parallel I/O overlap.
- Replace per-tile device allocations with a single contiguous `cupy.empty(total_bytes)` buffer and return per-tile views into it.
- Add regression tests covering ordering, single-buffer behavior, memory-guard call, sparse/zero-size tiles, and fallback semantics.
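The first two changes can be illustrated without a GPU. The sketch below is not the PR's code: it swaps in a `ThreadPoolExecutor` for KvikIO's worker pool and a `bytearray` with `memoryview` slices for the CuPy buffer. The names `read_tiles_batched` and `read_into` are hypothetical, and returning `None` on a short read is an assumption about what "fallback semantics" means here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def read_tiles_batched(read_into, tile_offsets, tile_byte_counts, max_workers=8):
    """Read every tile into one shared buffer and return one view per tile.

    `read_into(view, file_offset)` is a hypothetical stand-in for an async
    pread: it fills `view` and returns the number of bytes written.
    """
    total = sum(tile_byte_counts)
    buf = bytearray(total)  # one allocation for all tiles
    whole = memoryview(buf)
    # Start of each tile inside the shared buffer (exclusive prefix sum).
    starts = [0, *accumulate(tile_byte_counts)][:-1]
    views = [whole[s:s + n] for s, n in zip(starts, tile_byte_counts)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: submit every read; nothing blocks here.
        futures = [
            pool.submit(read_into, v, off) if len(v) else None
            for v, off in zip(views, tile_offsets)
        ]
        # Phase 2: only now wait on the results.
        for fut, n in zip(futures, tile_byte_counts):
            if fut is not None and fut.result() != n:
                return None  # short read: assumed signal for the caller to fall back
    return views


# Usage with an in-memory "file"; the third tile is zero-size (sparse).
data = bytes(range(256)) * 4


def fake_pread(view, file_offset):
    chunk = data[file_offset:file_offset + len(view)]
    view[:len(chunk)] = chunk
    return len(chunk)


views = read_tiles_batched(fake_pread, [0, 100, 0, 300], [16, 32, 0, 8])
short = read_tiles_batched(fake_pread, [len(data) - 4], [16])
```

Zero-size tiles get a zero-length view and no read is submitted for them, so the returned list keeps one entry per tile.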
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `xrspatial/geotiff/_gpu_decode.py` | Refactors `_try_kvikio_read_tiles` to single-buffer + batched-future pattern and adds total-bytes GPU memory guard. |
| `xrspatial/geotiff/tests/test_kvikio_batched_pread_1688.py` | Adds regression tests for batched pread submission ordering, single allocation/view slicing, guard call, and edge cases. |
| `.claude/sweep-performance-state.csv` | Updates internal performance audit state entry to reference #1688. |
Comment on lines +250 to +254

    # Each tile got submitted exactly once. Submissions monotonically
    # precede waits: the first ``.get()`` may not run until every
    # submission already happened.
    assert submission_log == [0, 1, 2, 3]
    assert get_log == [0, 1, 2, 3]
Comment on lines +393 to +395

    import sys
    fake_mod_obj = monkeypatch.setitem
    fake_mod_obj  # silence unused
Comment on lines +1020 to +1021

    # Surface OOM unchanged. The caller can switch to the CPU-mmap
    # path which does not pre-allocate the full compressed payload.
brendancol added a commit that referenced this pull request on May 12, 2026
…uracy

* test_all_preads_submitted_before_any_get now records both submit and get events into a single ordered timeline and asserts every submit occurs before the first get. The prior version asserted on per-event lists ([0,1,2,3] each), which the legacy interleaved submit->get->submit->get loop also satisfies, so the test could not catch a regression to that pattern. Verified by temporarily reverting _try_kvikio_read_tiles to the interleaved pattern: new assertion fails with a clear "preads and gets are interleaved" message showing the [submit,get,submit,get,...] timeline.
* Removed the unused ``import sys`` and the no-op ``fake_mod_obj`` lines from test_all_zero_size_tiles_returns_zero_length_views. flake8 now reports no F401/F841 on the test file.
* Reworded the MemoryError comment in _try_kvikio_read_tiles. The previous wording claimed the CPU-mmap fallback "does not pre-allocate the full compressed payload", but gpu_decode_tiles still calls ``d_comp = cupy.asarray(comp_buf_host)`` over ``total_comp`` bytes. The new wording explains the fallback skips the GDS-specific contiguous read buffer but still pays the bulk device allocation.
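The first bullet is easier to see with concrete event logs. The following is a minimal sketch, not the test from the PR: the helper names and the `("submit" | "get", tile_index)` event shape are assumptions made for illustration. It shows that per-event lists cannot tell the two loops apart, while a single merged timeline can.

```python
def assert_all_submits_before_first_get(timeline):
    """`timeline` is a list of ("submit" | "get", tile_index) tuples in call order."""
    kinds = [kind for kind, _ in timeline]
    if "get" not in kinds:
        return
    first_get = kinds.index("get")
    if "submit" in kinds[first_get:]:
        raise AssertionError(f"preads and gets are interleaved: {kinds}")


def per_event_lists(timeline):
    """Split the timeline into separate submit and get logs (the older check)."""
    submits = [i for kind, i in timeline if kind == "submit"]
    gets = [i for kind, i in timeline if kind == "get"]
    return submits, gets


batched = [("submit", i) for i in range(4)] + [("get", i) for i in range(4)]
interleaved = [event for i in range(4) for event in (("submit", i), ("get", i))]

# Both patterns produce identical per-event lists, so asserting on them
# separately cannot detect a regression to the interleaved loop.
same_lists = per_event_lists(batched) == per_event_lists(interleaved)

# Only the merged timeline distinguishes them.
assert_all_submits_before_first_get(batched)
try:
    assert_all_submits_before_first_get(interleaved)
except AssertionError as exc:
    caught = str(exc)
else:
    caught = None
```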
Replaces the per-tile cupy.empty + blocking IOFuture.get() inside the kvikio GDS path with a single contiguous device buffer, batched pread submissions, and a _check_gpu_memory guard up front.

The old loop alternated submit -> wait -> submit -> wait, so the kvikio worker pool only saw one outstanding pread at a time and the per-tile cupy.empty() setup cost compounded across all tiles. The new pattern allocates once, submits every pread before the first .get(), and lets the worker pool overlap the reads.

Microbench with 8-worker pool simulation, 256 tiles @ 1ms IO latency: old 256ms vs new 38.7ms (~6.6x). Single-thread simulation: 28.5ms (9x).

Adds 9 unit tests covering the kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard contract, partial-read fallback, end-to-end data round-trip, and zero-size / all-sparse tile edge cases. The fake CuFile lets the structural checks run on hosts without a real GDS install.
d8d0e3c to cc157dd
Summary
Closes #1688.
- Batch the `pread` submissions in `_try_kvikio_read_tiles` before waiting on any of the resulting `IOFuture.get()` calls so kvikio's worker pool can actually overlap the GDS reads.
- Replace the per-tile `cupy.empty(bc, ...)` allocations with one contiguous `cupy.empty(sum(tile_byte_counts))` buffer, returning per-tile views into it. Matches the single-buffer pattern that PR #1659 (perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path)) just landed for `_try_nvcomp_from_device_bufs` and the host-buffer / LZW / deflate nvCOMP paths already in this file.
- Add a `_check_gpu_memory(sum(tile_byte_counts), ...)` guard before the allocation so a crafted COG with oversized `TileByteCounts` fails fast like the sibling GPU paths.

Microbench
8-worker pool simulation, 256 tiles @ 1 ms per-IO latency: old 256 ms vs. new 38.7 ms (~6.6x).
Single-thread submission simulation: 28.5 ms (~9x).
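For readers who want to see the effect locally, here is a rough sketch of this kind of simulation. It is not the microbench from the PR: `time.sleep` stands in for a GDS read, a `ThreadPoolExecutor` stands in for kvikio's worker pool, and the tile count is reduced to keep the run short. Absolute timings will differ from the figures quoted above and depend on the host.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_pread(latency_s):
    """Stand-in for one read: just sleeps for the simulated I/O latency."""
    time.sleep(latency_s)


def interleaved(pool, n_tiles, latency_s):
    # Old pattern: submit -> wait -> submit -> wait, one read in flight at a time.
    for _ in range(n_tiles):
        pool.submit(fake_pread, latency_s).result()


def batched(pool, n_tiles, latency_s):
    # New pattern: submit everything first, then wait on the futures.
    futures = [pool.submit(fake_pread, latency_s) for _ in range(n_tiles)]
    for fut in futures:
        fut.result()


def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    t0 = time.perf_counter()
    fn(*args)
    return time.perf_counter() - t0


with ThreadPoolExecutor(max_workers=8) as pool:
    t_old = timed(interleaved, pool, 64, 0.001)
    t_new = timed(batched, pool, 64, 0.001)

print(f"interleaved: {t_old * 1e3:.1f} ms, batched: {t_new * 1e3:.1f} ms")
```

The interleaved loop cannot finish faster than the tile count times the per-read latency, whereas the batched loop lets up to eight sleeps run concurrently.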
Test plan
- `xrspatial/geotiff/tests/test_kvikio_batched_pread_1688.py` (9 new tests)
- Every pread is submitted before the first `.get()` (structural ordering check)
- `_check_gpu_memory` runs with `sum(tile_byte_counts)` and a useful label
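The guard check can be sketched with `unittest.mock`. This is an illustration under stated assumptions, not the PR's test: `read_tiles`, `limit_guard`, the `(nbytes, label)` argument order, and the label string are all hypothetical, since the real `_check_gpu_memory` signature is not shown on this page.

```python
from unittest import mock


def read_tiles(tile_byte_counts, guard, allocate):
    """Hypothetical reader: run the guard on the total size, then allocate once."""
    total = sum(tile_byte_counts)
    guard(total, "kvikio tile read")  # assumed (nbytes, label) call shape
    return allocate(total)


# 1. The guard sees the summed byte count and a label, before the allocation.
guard = mock.Mock()
allocate = mock.Mock(side_effect=bytearray)
buf = read_tiles([16, 0, 32], guard, allocate)
guard.assert_called_once_with(48, "kvikio tile read")
allocate.assert_called_once_with(48)


# 2. An oversized total fails fast: the allocator is never reached.
def limit_guard(nbytes, label, limit=1 << 20):
    if nbytes > limit:
        raise MemoryError(f"{label}: {nbytes} bytes exceeds limit of {limit}")


allocate.reset_mock()
try:
    read_tiles([1 << 30], limit_guard, allocate)
except MemoryError:
    failed_fast = True
else:
    failed_fast = False
```

Passing the guard and allocator in as parameters keeps the sketch free of monkeypatching; the actual tests may patch module attributes instead.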