Reason or Problem
_nvcomp_batch_compress in xrspatial/geotiff/_gpu_decode.py is the GPU write-side compress path (deflate/zstd via nvCOMP). Two related anti-patterns remain in this function that mirror the ones already fixed elsewhere by #1552 and #1659:
1. Per-tile cupy.empty allocations (line 2457):
d_comp_bufs = [cupy.empty(max_cs, dtype=cupy.uint8) for _ in range(n_tiles)]
For an N-tile compress, this issues N separate cupy allocations (each potentially a memory-pool query plus a kernel launch for zero-init under some allocator configs). The decode side at _gpu_decode.py:2004 and the LZW kernel decode at _gpu_decode.py:1973 already use a single contiguous allocation + offsets/views.
Microbench (1024 tiles, 16 KiB each):
- N separate cupy.empty: 4.74 ms
- 1 cupy.empty + N slice views: 1.02 ms
2. Per-tile .get().tobytes() D2H readback (lines 2522-2526):
comp_sizes = d_comp_sizes.get().astype(int)
result = []
for i in range(n_tiles):
cs = int(comp_sizes[i])
raw = d_comp_bufs[i][:cs].get().tobytes()
...
Each .get() is a separate D2H DMA on the default stream, so they serialise. The comment block at lines 2503-2505 even calls out the same problem for the uncompressed-side adler32 path:
Batch the GPU->CPU transfer so all tiles move in a single DMA instead of one .get() per tile (which serializes on the default stream and is the dominant cost on the deflate path).
But the compressed side at lines 2522-2526 still does per-tile .get(). The decode-side fix (#1552) batched the same pattern in gpu_decode_tiles_from_file (lines 870-913 region).
Microbench (1024 tiles, 16 KiB each, filled, on a single Ampere-class GPU):
- per-tile .get().tobytes(): 35.19 ms
- 1 concat + 1 .get() + Python slice: 19.31 ms (~45% reduction)
Proposal
Mirror the existing batched D2H pattern used by gpu_decode_tiles_from_file (#1552) and _try_nvcomp_from_device_bufs (#1659):
- Allocate one contiguous device buffer of size n_tiles * max_cs, view per-tile slabs into it, and pass per-buffer pointer offsets to the nvCOMP API.
- After nvCOMP returns, read d_comp_sizes once, allocate one host buffer the size of the actual compressed total (sum of comp_sizes), do one concatenation + single .get(), then slice the host buffer into per-tile bytes.
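The first bullet can be sketched as follows. This is a minimal illustration, not the actual change: alloc_tile_slabs and its xp parameter are hypothetical names, and NumPy stands in for CuPy so the snippet runs without a GPU. With xp=cupy, the per-tile device pointers for nvCOMP would presumably be buf.data.ptr plus the returned offsets.

```python
import numpy as np


def alloc_tile_slabs(n_tiles, max_cs, xp=np):
    """Allocate one buffer; return (buffer, per-tile views, byte offsets).

    Hypothetical helper. `xp` is the array module: numpy here,
    cupy on the real GPU path.
    """
    # One allocation for all tiles instead of n_tiles separate ones.
    buf = xp.empty(n_tiles * max_cs, dtype=xp.uint8)
    # Tile i owns bytes [i * max_cs, (i + 1) * max_cs) of the buffer.
    offsets = np.arange(n_tiles, dtype=np.uint64) * np.uint64(max_cs)
    # Basic slicing returns views, so no extra memory is allocated here.
    views = [buf[i * max_cs:(i + 1) * max_cs] for i in range(n_tiles)]
    return buf, views, offsets


buf, views, offsets = alloc_tile_slabs(n_tiles=4, max_cs=16)
views[2][:] = 7  # writing through a view lands in the shared buffer
```

Keeping the slabs at a fixed stride of max_cs means the pointer table is a single arange, at the cost of leaving unused tail bytes in each slab after compression.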
Variable-size compressed output makes the second step slightly more involved than on the decode side. Two options:
- Issue cupy.concatenate on a list of slab views shrunk to comp_sizes[i], then a single .get().
- Or compute comp_offsets = np.concatenate(([0], np.cumsum(comp_sizes))), build the contiguous buffer via a scatter copy on device (one kernel), then one .get().
The first option is closer to the LZW-fallback fix in #1552 and should be enough for the common deflate/zstd path. The second is a refinement if profiling shows the cupy.concatenate overhead dominates.
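A rough sketch of the first option, under stated assumptions: gather_compressed, xp, and to_host are hypothetical names, and NumPy again stands in for CuPy so the example runs on the CPU. On the GPU path one would pass xp=cupy and to_host=cupy.asnumpy, which makes the to_host call the only D2H transfer.

```python
import numpy as np


def gather_compressed(slabs, comp_sizes, xp=np, to_host=np.asarray):
    """Pack trimmed slabs, copy once to the host, split into bytes.

    Hypothetical helper illustrating the concat + single-copy idea.
    """
    sizes = np.asarray(comp_sizes, dtype=np.int64)
    if sizes.size == 0:
        return []
    # Exclusive prefix sum: tile i occupies [offsets[i], offsets[i + 1]).
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    # Trim each slab to its real compressed size, then pack back to back.
    packed = xp.concatenate([s[:n] for s, n in zip(slabs, sizes)])
    # The only device-to-host copy when xp is cupy.
    blob = to_host(packed).tobytes()
    # Host-side slicing of the single blob into per-tile bytes.
    return [blob[offsets[i]:offsets[i + 1]] for i in range(sizes.size)]


slabs = [np.full(8, i, dtype=np.uint8) for i in range(3)]
tiles = gather_compressed(slabs, comp_sizes=[3, 0, 5])
```

The same comp_offsets array would drive the second option as well; only the packing step changes from cupy.concatenate to a single scatter kernel.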
The adler32-handling logic for deflate (lines 2507-2519) already does the concat+.get() pattern for the uncompressed tiles. Reuse the same shape on the compressed side.
Acceptance criteria
- The per-tile .get() loop at lines 2522-2526 is replaced with one batched D2H DMA.
- The per-tile cupy.empty loop at line 2457 is replaced with one contiguous allocation + views.
- Microbench (n=1024 tiles, 16 KiB each) shows D2H wall time reduction comparable to the 45% measured.
- Existing zstd / deflate write tests in xrspatial/geotiff/tests/test_gpu_writer_compression_modes_*.py continue to pass.
- Deflate output remains zlib-framed correctly (adler32 wrap on the host side preserved).
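One way to sanity-check the zlib-framing criterion is sketched below. It assumes the codec emits a raw deflate stream and the host adds the 2-byte zlib header and the big-endian Adler-32 trailer; wrap_raw_deflate is a hypothetical helper, and the standard-library zlib module stands in for nvCOMP to produce the raw stream.

```python
import struct
import zlib


def wrap_raw_deflate(raw_deflate, uncompressed):
    """Frame a raw deflate stream as zlib (RFC 1950). Hypothetical helper."""
    # CMF/FLG pair for deflate with a 32 KiB window, default level hint.
    header = b"\x78\x9c"
    # Adler-32 of the *uncompressed* bytes, stored big-endian.
    trailer = struct.pack(">I", zlib.adler32(uncompressed) & 0xFFFFFFFF)
    return header + raw_deflate + trailer


payload = b"geotiff tile bytes " * 64
# Negative wbits asks zlib for a raw deflate stream (no header/trailer),
# standing in for the codec output in this sketch.
co = zlib.compressobj(wbits=-15)
raw = co.compress(payload) + co.flush()
framed = wrap_raw_deflate(raw, payload)

# zlib.decompress verifies the trailer, so a round trip checks the framing.
roundtrip = zlib.decompress(framed)
```

Because the checksum covers the uncompressed tile, the batched readback on the compressed side does not change what is fed to adler32; only the source of raw_deflate changes.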
Context
Found via deep-sweep performance audit on 2026-05-12. The codec-decode side (#1552) and the LZW-and-friends device-pointer side (#1659) have both already received this treatment; the encode-into-nvCOMP path is the last one with the per-tile pattern.