Skip to content

perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688)#1693

Merged
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-12-104351
May 12, 2026
Merged

perf(geotiff): batch _try_kvikio_read_tiles preads + single buffer (#1688)#1693
brendancol merged 2 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-12-104351

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Closes #1688.

  • Batch all pread submissions in _try_kvikio_read_tiles before waiting on any of the resulting IOFuture.get() calls so kvikio's worker pool can actually overlap the GDS reads.
  • Replace the N per-tile cupy.empty(bc, ...) allocations with one contiguous cupy.empty(sum(tile_byte_counts)) buffer, returning per-tile views into it. Matches the single-buffer pattern PR perf: replace per-tile alloc + concat in _try_nvcomp_from_device_bufs (GDS+nvCOMP fast path) #1659 just landed for _try_nvcomp_from_device_bufs and the host-buffer / LZW / deflate nvCOMP paths already in this file.
  • Add a _check_gpu_memory(sum(tile_byte_counts), ...) guard before the allocation so a crafted COG with oversized TileByteCounts fails fast like the sibling GPU paths.

Microbench

8-worker pool simulation, 256 tiles @ 1ms per-IO latency:

  • old: 256 ms (serial)
  • new: 38.7 ms (~6.6x)

Single-thread submission simulation: 28.5 ms (~9x).

Test plan

  • xrspatial/geotiff/tests/test_kvikio_batched_pread_1688.py (9 new tests)
    • empty-list short-circuit (kvikio import not attempted)
    • kvikio-absent ImportError fallback returns None
    • single contiguous allocation verified via per-tile data pointer arithmetic
    • all preads submit before any .get() (structural ordering check)
    • _check_gpu_memory runs with sum(tile_byte_counts) and a useful label
    • partial-read short-read returns None for caller fallback
    • end-to-end data round-trip through a recording fake CuFile
    • zero-size single tile keeps a 0-length view in its original index
    • all-sparse-tile total-bytes-zero short-circuit
  • Full geotiff suite: 1577 passed (3 unrelated matplotlib/py3.14 skips)

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 18:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the GeoTIFF GPU decode fast path by refactoring _try_kvikio_read_tiles to better leverage KvikIO’s asynchronous pread behavior and reduce GPU allocation overhead, with accompanying regression tests.

Changes:

  • Batch all KvikIO pread submissions before waiting on any futures to restore parallel I/O overlap.
  • Replace per-tile device allocations with a single contiguous cupy.empty(total_bytes) buffer and return per-tile views into it.
  • Add regression tests covering ordering, single-buffer behavior, memory-guard call, sparse/zero-size tiles, and fallback semantics.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
xrspatial/geotiff/_gpu_decode.py Refactors _try_kvikio_read_tiles to single-buffer + batched-future pattern and adds total-bytes GPU memory guard.
xrspatial/geotiff/tests/test_kvikio_batched_pread_1688.py Adds regression tests for batched pread submission ordering, single allocation/view slicing, guard call, and edge cases.
.claude/sweep-performance-state.csv Updates internal performance audit state entry to reference #1688.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +250 to +254
# Each tile got submitted exactly once. Submissions monotonically
# precede waits: the first ``.get()`` may not run until every
# submission already happened.
assert submission_log == [0, 1, 2, 3]
assert get_log == [0, 1, 2, 3]
Comment on lines +393 to +395
import sys
fake_mod_obj = monkeypatch.setitem
fake_mod_obj # silence unused
Comment thread xrspatial/geotiff/_gpu_decode.py Outdated
Comment on lines +1020 to +1021
# Surface OOM unchanged. The caller can switch to the CPU-mmap
# path which does not pre-allocate the full compressed payload.
brendancol added a commit that referenced this pull request May 12, 2026
…uracy

* test_all_preads_submitted_before_any_get now records both submit and
  get events into a single ordered timeline and asserts every submit
  occurs before the first get. The prior version asserted on per-event
  lists ([0,1,2,3] each), which the legacy interleaved
  submit->get->submit->get loop also satisfies, so the test could not
  catch a regression to that pattern. Verified by temporarily reverting
  _try_kvikio_read_tiles to the interleaved pattern: new assertion
  fails with a clear "preads and gets are interleaved" message showing
  the [submit,get,submit,get,...] timeline.
* Removed the unused ``import sys`` and the no-op ``fake_mod_obj``
  lines from test_all_zero_size_tiles_returns_zero_length_views.
  flake8 now reports no F401/F841 on the test file.
* Reworded the MemoryError comment in _try_kvikio_read_tiles. The
  previous wording claimed the CPU-mmap fallback "does not pre-allocate
  the full compressed payload", but gpu_decode_tiles still calls
  ``d_comp = cupy.asarray(comp_buf_host)`` over ``total_comp`` bytes.
  The new wording explains the fallback skips the GDS-specific
  contiguous read buffer but still pays the bulk device allocation.
Replaces the per-tile cupy.empty + blocking IOFuture.get() inside the
kvikio GDS path with a single contiguous device buffer, batched pread
submissions, and a _check_gpu_memory guard up front.

The old loop alternated submit -> wait -> submit -> wait, so the kvikio
worker pool only saw one outstanding pread at a time and the per-tile
cupy.empty() setup cost compounded across all tiles. The new pattern
allocates once, submits every pread before the first .get(), and lets
the worker pool overlap the reads.

Microbench with 8-worker pool simulation, 256 tiles @ 1ms IO latency:
old 256ms vs new 38.7ms (~6.6x). Single-thread simulation: 28.5ms (9x).

Adds 9 unit tests covering the kvikio-absent path, single-buffer pointer
arithmetic, submit-before-get ordering, memory guard contract, partial-
read fallback, end-to-end data round-trip, and zero-size / all-sparse
tile edge cases. The fake CuFile lets the structural checks run on
hosts without a real GDS install.
…uracy

* test_all_preads_submitted_before_any_get now records both submit and
  get events into a single ordered timeline and asserts every submit
  occurs before the first get. The prior version asserted on per-event
  lists ([0,1,2,3] each), which the legacy interleaved
  submit->get->submit->get loop also satisfies, so the test could not
  catch a regression to that pattern. Verified by temporarily reverting
  _try_kvikio_read_tiles to the interleaved pattern: new assertion
  fails with a clear "preads and gets are interleaved" message showing
  the [submit,get,submit,get,...] timeline.
* Removed the unused ``import sys`` and the no-op ``fake_mod_obj``
  lines from test_all_zero_size_tiles_returns_zero_length_views.
  flake8 now reports no F401/F841 on the test file.
* Reworded the MemoryError comment in _try_kvikio_read_tiles. The
  previous wording claimed the CPU-mmap fallback "does not pre-allocate
  the full compressed payload", but gpu_decode_tiles still calls
  ``d_comp = cupy.asarray(comp_buf_host)`` over ``total_comp`` bytes.
  The new wording explains the fallback skips the GDS-specific
  contiguous read buffer but still pays the bulk device allocation.
@brendancol brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-12-104351 branch from d8d0e3c to cc157dd Compare May 12, 2026 18:28
@brendancol brendancol merged commit 33a8b7d into main May 12, 2026
10 checks passed
@brendancol brendancol deleted the deep-sweep-performance-geotiff-2026-05-12-104351 branch May 15, 2026 04:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(geotiff): batch _try_kvikio_read_tiles to one alloc + parallel preads

2 participants