perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations #1712

@brendancol

Reason or Problem

_nvcomp_batch_compress in xrspatial/geotiff/_gpu_decode.py is the GPU write-side compress path (deflate/zstd via nvCOMP). Two related anti-patterns remain in this function that mirror the ones already fixed elsewhere by #1552 and #1659:

1. Per-tile cupy.empty allocations (line 2457):

d_comp_bufs = [cupy.empty(max_cs, dtype=cupy.uint8) for _ in range(n_tiles)]

For an N-tile compress, this issues N separate cupy allocations (each potentially a memory-pool query plus a kernel launch for zero-init under some allocator configs). The decode side at _gpu_decode.py:2004 and the LZW kernel decode at _gpu_decode.py:1973 already use a single contiguous allocation + offsets/views.

Microbench (1024 tiles, 16 KiB each):

  • N separate cupy.empty: 4.74 ms
  • 1 cupy.empty + N slice views: 1.02 ms

2. Per-tile .get().tobytes() D2H readback (lines 2522-2526):

comp_sizes = d_comp_sizes.get().astype(int)
result = []
for i in range(n_tiles):
    cs = int(comp_sizes[i])
    raw = d_comp_bufs[i][:cs].get().tobytes()
    ...

Each .get() is a separate D2H DMA on the default stream, so they serialise. The comment block at lines 2503-2505 even calls out the same problem for the uncompressed-side adler32 path:

Batch the GPU->CPU transfer so all tiles move in a single DMA instead of one .get() per tile (which serializes on the default stream and is the dominant cost on the deflate path).

The compressed side at lines 2522-2526, however, still does a per-tile .get(). The decode-side fix (#1552) batched the same pattern in gpu_decode_tiles_from_file (around lines 870-913).

Microbench (1024 tiles, 16 KiB each, filled, on a single Ampere-class GPU):

  • per-tile .get().tobytes(): 35.19 ms
  • 1 concat + 1 .get() + Python slice: 19.31 ms (~45% reduction)

Proposal

Mirror the existing batched D2H pattern used by gpu_decode_tiles_from_file (#1552) and _try_nvcomp_from_device_bufs (#1659):

  1. Allocate one contiguous device buffer of size n_tiles * max_cs, view per-tile slabs into it, and pass per-buffer pointer offsets to the nvCOMP API.
  2. After nvCOMP returns, read d_comp_sizes once, allocate one host buffer the size of the actual compressed total (sum of comp_sizes), do one concatenation + single .get(), then slice the host buffer into per-tile bytes.
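
Step 1 could look roughly like the sketch below. It is not the actual patch: alloc_tile_slabs and its xp parameter are hypothetical names, and xp stands for the array module (cupy on the real path, numpy here so the layout can be checked without a GPU).

```python
import numpy as np


def alloc_tile_slabs(xp, n_tiles, max_cs):
    """Allocate one backing buffer and return it with per-tile slab views.

    xp is the array module: cupy on the GPU path, numpy for a host-only dry run.
    (Hypothetical helper, not part of xrspatial.)
    """
    backing = xp.empty(n_tiles * max_cs, dtype=xp.uint8)
    # reshape on a contiguous 1-D buffer is a view; row i is tile i's slab.
    slabs = backing.reshape(n_tiles, max_cs)
    views = [slabs[i] for i in range(n_tiles)]
    # Byte offset of each slab from the start of the backing buffer.
    offsets = np.arange(n_tiles, dtype=np.int64) * max_cs
    return backing, views, offsets


backing, d_comp_bufs, slab_offsets = alloc_tile_slabs(np, n_tiles=4, max_cs=16)
```

With CuPy, the device pointer for tile i would then be backing.data.ptr + slab_offsets[i], which is the form a pointer-array API such as nvCOMP's batched interface expects. How those pointers are assembled in _nvcomp_batch_compress today is not shown here.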

Variable-size compressed output makes step 2 slightly more involved than on the decode side, where per-tile output sizes are known up front. Two options:

  • Issue cupy.concatenate on a list of slab views shrunk to comp_sizes[i], then a single .get().
  • Or compute comp_offsets = np.concatenate(([0], np.cumsum(comp_sizes))), build the contiguous buffer via a scatter copy on device (one kernel), then one .get().

The first option is closer to the LZW-fallback fix in #1552 and should be enough for the common deflate/zstd path. The second is a refinement if profiling shows the cupy.concatenate overhead dominates.
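
A minimal sketch of the first option, assuming the slab views and comp_sizes are already available. read_back_tiles and to_host are hypothetical helpers, and numpy stands in for cupy so the slicing logic can be exercised on the host:

```python
import numpy as np


def to_host(arr):
    """Copy to host: CuPy arrays expose .get(); NumPy arrays are already there."""
    return arr.get() if hasattr(arr, "get") else np.asarray(arr)


def read_back_tiles(xp, d_comp_bufs, comp_sizes):
    """Return per-tile compressed bytes using a single device-to-host copy."""
    comp_sizes = np.asarray(comp_sizes, dtype=np.int64)
    # comp_offsets[i] is where tile i starts in the packed buffer.
    comp_offsets = np.concatenate(([0], np.cumsum(comp_sizes)))
    # Shrink each slab view to its real compressed size and pack on device.
    packed = xp.concatenate([buf[:cs] for buf, cs in zip(d_comp_bufs, comp_sizes)])
    host = to_host(packed).tobytes()  # the only D2H transfer
    return [host[comp_offsets[i]:comp_offsets[i + 1]]
            for i in range(len(comp_sizes))]


# Host-only dry run: three fake "compressed" slabs of capacity 8.
bufs = [np.full(8, i, dtype=np.uint8) for i in range(3)]
tiles = read_back_tiles(np, bufs, [2, 0, 5])
```

The same comp_offsets array would drive the second option as well: the scatter kernel writes tile i at comp_offsets[i] instead of going through concatenate. The sketch assumes n_tiles > 0, since concatenate rejects an empty list.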

The adler32-handling logic for deflate (lines 2507-2519) already does the concat+.get() pattern for the uncompressed tiles. Reuse the same shape on the compressed side.

Acceptance criteria

  • The per-tile .get() loop at lines 2522-2526 is replaced with one batched D2H DMA.
  • The per-tile cupy.empty loop at line 2457 is replaced with one contiguous allocation + views.
  • Microbench (n=1024 tiles, 16 KiB each) shows D2H wall time reduction comparable to the 45% measured.
  • Existing zstd / deflate write tests in xrspatial/geotiff/tests/test_gpu_writer_compression_modes_*.py continue to pass.
  • Deflate output remains zlib-framed correctly (adler32 wrap on the host side preserved).
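
For the last criterion, a round-trip check along these lines may be useful. It assumes the GPU path yields a raw deflate stream that the host frames as zlib; wrap_zlib is a hypothetical name, and zlib.compressobj(wbits=-15) stands in for the nvCOMP output:

```python
import struct
import zlib


def wrap_zlib(raw_deflate, uncompressed):
    """Frame a raw deflate stream as zlib: 2-byte header + payload + adler32."""
    header = b"\x78\x9c"  # deflate, 32 KiB window, default-compression hint
    # The trailer is the big-endian adler32 of the *uncompressed* bytes.
    trailer = struct.pack(">I", zlib.adler32(uncompressed) & 0xFFFFFFFF)
    return header + raw_deflate + trailer


data = b"geotiff tile payload " * 64
comp = zlib.compressobj(wbits=-15)  # raw deflate, stand-in for nvCOMP output
raw = comp.compress(data) + comp.flush()
framed = wrap_zlib(raw, data)
```

If zlib.decompress(framed) returns the original tile bytes, both the header and the adler32 trailer are intact, so a test of this shape should catch a batched readback that mis-slices tiles.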

Context

Found via deep-sweep performance audit on 2026-05-12. The codec-decode side (#1552) and the LZW-and-friends device-pointer side (#1659) have both already received this treatment; the encode-into-nvCOMP path is the last one with the per-tile pattern.

Labels: enhancement (New feature or request), gpu (CuPy / CUDA GPU support), performance (PR touches performance-sensitive code)
