diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
index 3d67664ee..b65db425d 100644
--- a/.claude/sweep-performance-state.csv
+++ b/.claude/sweep-performance-state.csv
@@ -18,7 +18,7 @@ fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 focal,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
-geotiff,2026-05-18,SAFE,IO-bound,0,2100,"Pass 11 (2026-05-18): 1 MEDIUM found and fixed. _read_strips (_reader.py:~L1972) and _fetch_decode_cog_http_strips (_reader.py:~L2670) decoded strips sequentially in a Python for-loop while the tile counterparts (_read_tiles L2146, _fetch_decode_cog_http_tiles L2898) gated parallel decode on _PARALLEL_DECODE_PIXEL_THRESHOLD via ThreadPoolExecutor. Filed #2100 and fixed: both strip paths now collect jobs, parallel-decode when n_strips > 1 and strip_pixels >= 64K, then place sequentially. Measured (uint16, 4-core): 4096x4096 deflate 130ms->34ms (3.82x), 8192x8192 deflate 531ms->146ms (3.63x), 8192x8192 zstd 211ms->85ms (2.48x), uncompressed 25ms->22ms (1.14x). 5 new tests in test_parallel_strip_decode_2100.py (parallel/serial parity, pool-engaged on multi-strip, serial-path for single-strip, windowed cross-strip read, HTTP COG strip parity). 3998 tests pass; 8 pre-existing failures predating this change (predictor2 BE + size_param_validation_gpu_vrt reference now-private read_to_array attr). SAFE/IO-bound verdict holds. | Pass 10 (2026-05-15): 1 new MEDIUM found and fixed; 2 LOW noted. MEDIUM (_reader.py:2737): _fetch_decode_cog_http_tiles decoded tiles sequentially in a Python for-loop after the concurrent fetch landed (issue #1480). Local _read_tiles parallelises decode whenever tile_pixels >= 64K via ThreadPoolExecutor (_reader.py:2017); the HTTP path was structurally similar but never picked up the same gate, so wide windowed reads of multi-tile COGs left deflate/zstd decode single-threaded. Mirrored the local-path threshold + pool. 5 new tests in test_cog_http_parallel_decode_2026_05_15.py (parallel + serial round-trip correctness, pool-instantiation branch selection above the threshold, single-tile path skips the pool, structural _decode_strip_or_tile call count == n_tiles). All 262 COG/HTTP tests pass; 3162 of 3164 selected geotiff tests pass overall (2 pre-existing failures predating Pass 9 per prior notes -- test_predictor2_big_endian_gpu_1517 references the now-private read_to_array attr, and the test_size_param_validation_gpu_vrt_1776 tile_size=4 validator failure). LOW deferred (no fix in this PR): (1) _block_reduce_2d_gpu (_gpu_decode.py:3142/3163/3189) does bool(mask.any().item()) per overview level when nodata is set, paying one device sync per level; the alternative (unconditional cupy.putmask) always pays the work cost and the short-circuit is correct under the current API. (2) _nvcomp_batch_compress adler32 staging (_gpu_decode.py:2543-2546) issues n_tiles slice-assign kernels into a fresh contig buffer despite all callers passing slices of a single underlying d_tile_buf; an API refactor to accept the source buffer directly would skip the rebuild. SAFE/IO-bound verdict holds. Dask probe: 2560x2560 chunks=256 yields 400 tasks (4 per chunk), well under the 50000 cap. GPU probe: 1024x1024 float32 zstd read returns CuPy-backed in 236 ms with no host round-trip. | Rockout 2026-05-15: LOW filed #1934 -- _apply_nodata_mask_gpu used cupy.where (allocating); switched to cupy.putmask on the already-owned buffer (float path) and on the post-astype float64 buffer (int path). Saves one chunk-sized device allocation per call. 7 new tests in test_apply_nodata_mask_gpu_inplace_1934.py; 52 related nodata tests pass. | Pass 8 (2026-05-12): 1 new MEDIUM found and fixed. _assemble_standard_layout/_assemble_cog_layout returned bytes(bytearray), doubling peak memory transiently during eager writes. Filed #1756, fixed by returning the bytearray directly. Measured: 95 MB uint8 raster peak drops 202 MB -> 107 MB. _write_bytes / parse_header already accepted the buffer protocol so the change is transparent to callers. 6 new tests in test_assemble_layout_no_bytes_copy_1756.py. 2123 existing geotiff tests pass; the 10 unrelated failures (test_no_georef_windowed_coords_1710, test_predictor2_big_endian_gpu_1517) reference the now-private read_to_array attribute (commit 8adb749, issue #1708) and predate this change. SAFE/IO-bound verdict holds. | Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
+geotiff,2026-05-18,SAFE,IO-bound,0,2107,"Pass 12 (2026-05-18): 1 MEDIUM found and fixed. _try_nvjpeg2k_batch_decode at _gpu_decode.py:~L2725-2778 allocated per-tile per-component cupy.empty buffers (N*S round-trips through the cupy memory pool) and called cupy.cuda.Device().synchronize() once per tile, forcing default-stream serialisation that defeats nvJPEG2000's internal pipelining. Filed #2107 and fixed: pre-allocate a single d_comp_pool sized n_tiles*samples*tile_height*pitch under a _check_gpu_memory guard, derive per-tile/per-component views as slab offsets, and replace the per-tile sync with a single batch-end sync. Same pattern as #1659 (_try_nvcomp_from_device_bufs), #1688 (_try_kvikio_read_tiles), #1712 (_nvcomp_batch_compress). 7 new tests in test_nvjpeg2k_single_alloc_2107.py: AST-level structural assertions confirm no cupy.empty inside the for-loop and no Device().synchronize() inside the loop, plus pool/per_tile_comp_bytes presence and _check_gpu_memory guard checks; lib-absent short-circuit; unsupported-dtype cleanup contract; cupy-only pool slab-non-overlap test (gpu-marked). libnvjpeg2k.so not present on this host so the end-to-end nvJPEG2000 decode is gated -- note added to re-validate on a host with the RAPIDS conda env. All 30 jpeg2000/compression tests + 7 new tests pass. SAFE/IO-bound verdict holds (no change in dask graph cost). Dask probe: 4096x4096 deflate-tiled file via read_geotiff_dask(chunks=512) yields 256 tasks for 64 chunks (4 tasks/chunk), well under the 50K cap. | Pass 11 (2026-05-18): 1 MEDIUM found and fixed. _read_strips (_reader.py:~L1972) and _fetch_decode_cog_http_strips (_reader.py:~L2670) decoded strips sequentially in a Python for-loop while the tile counterparts (_read_tiles L2146, _fetch_decode_cog_http_tiles L2898) gated parallel decode on _PARALLEL_DECODE_PIXEL_THRESHOLD via ThreadPoolExecutor. Filed #2100 and fixed: both strip paths now collect jobs, parallel-decode when n_strips > 1 and strip_pixels >= 64K, then place sequentially. Measured (uint16, 4-core): 4096x4096 deflate 130ms->34ms (3.82x), 8192x8192 deflate 531ms->146ms (3.63x), 8192x8192 zstd 211ms->85ms (2.48x), uncompressed 25ms->22ms (1.14x). 5 new tests in test_parallel_strip_decode_2100.py (parallel/serial parity, pool-engaged on multi-strip, serial-path for single-strip, windowed cross-strip read, HTTP COG strip parity). 3998 tests pass; 8 pre-existing failures predating this change (predictor2 BE + size_param_validation_gpu_vrt reference now-private read_to_array attr). SAFE/IO-bound verdict holds. | Pass 10 (2026-05-15): 1 new MEDIUM found and fixed; 2 LOW noted. MEDIUM (_reader.py:2737): _fetch_decode_cog_http_tiles decoded tiles sequentially in a Python for-loop after the concurrent fetch landed (issue #1480). Local _read_tiles parallelises decode whenever tile_pixels >= 64K via ThreadPoolExecutor (_reader.py:2017); the HTTP path was structurally similar but never picked up the same gate, so wide windowed reads of multi-tile COGs left deflate/zstd decode single-threaded. Mirrored the local-path threshold + pool. 5 new tests in test_cog_http_parallel_decode_2026_05_15.py (parallel + serial round-trip correctness, pool-instantiation branch selection above the threshold, single-tile path skips the pool, structural _decode_strip_or_tile call count == n_tiles). All 262 COG/HTTP tests pass; 3162 of 3164 selected geotiff tests pass overall (2 pre-existing failures predating Pass 9 per prior notes -- test_predictor2_big_endian_gpu_1517 references the now-private read_to_array attr, and the test_size_param_validation_gpu_vrt_1776 tile_size=4 validator failure). LOW deferred (no fix in this PR): (1) _block_reduce_2d_gpu (_gpu_decode.py:3142/3163/3189) does bool(mask.any().item()) per overview level when nodata is set, paying one device sync per level; the alternative (unconditional cupy.putmask) always pays the work cost and the short-circuit is correct under the current API. (2) _nvcomp_batch_compress adler32 staging (_gpu_decode.py:2543-2546) issues n_tiles slice-assign kernels into a fresh contig buffer despite all callers passing slices of a single underlying d_tile_buf; an API refactor to accept the source buffer directly would skip the rebuild. SAFE/IO-bound verdict holds. Dask probe: 2560x2560 chunks=256 yields 400 tasks (4 per chunk), well under the 50000 cap. GPU probe: 1024x1024 float32 zstd read returns CuPy-backed in 236 ms with no host round-trip. | Rockout 2026-05-15: LOW filed #1934 -- _apply_nodata_mask_gpu used cupy.where (allocating); switched to cupy.putmask on the already-owned buffer (float path) and on the post-astype float64 buffer (int path). Saves one chunk-sized device allocation per call. 7 new tests in test_apply_nodata_mask_gpu_inplace_1934.py; 52 related nodata tests pass. | Pass 8 (2026-05-12): 1 new MEDIUM found and fixed. _assemble_standard_layout/_assemble_cog_layout returned bytes(bytearray), doubling peak memory transiently during eager writes. Filed #1756, fixed by returning the bytearray directly. Measured: 95 MB uint8 raster peak drops 202 MB -> 107 MB. _write_bytes / parse_header already accepted the buffer protocol so the change is transparent to callers. 6 new tests in test_assemble_layout_no_bytes_copy_1756.py. 2123 existing geotiff tests pass; the 10 unrelated failures (test_no_georef_windowed_coords_1710, test_predictor2_big_endian_gpu_1517) reference the now-private read_to_array attribute (commit 8adb749, issue #1708) and predate this change. SAFE/IO-bound verdict holds. | Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
 glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
 hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
 hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
diff --git a/xrspatial/geotiff/_gpu_decode.py b/xrspatial/geotiff/_gpu_decode.py
index c1e274057..4ccc23756 100644
--- a/xrspatial/geotiff/_gpu_decode.py
+++ b/xrspatial/geotiff/_gpu_decode.py
@@ -2658,7 +2658,6 @@ def _try_nvjpeg2k_batch_decode(compressed_tiles, tile_width, tile_height,
         return None
 
     import ctypes
-    import cupy
 
     n_tiles = len(compressed_tiles)
     bytes_per_pixel = dtype.itemsize * samples
@@ -2719,7 +2718,36 @@ class _NvJpeg2kImage(ctypes.Structure):
             lib.nvjpeg2kDestroy(handle)
             return None
 
-        # Decode each tile
+        # Decode each tile. Pre-allocate a single contiguous device buffer
+        # sized for every component of every tile, then derive per-tile /
+        # per-component pointers as views into it. Sibling helpers in this
+        # module already adopted the same pattern:
+        #
+        #   * ``_try_nvcomp_from_device_bufs`` (#1659) -- per-tile alloc +
+        #     trailing concat -> single contiguous buffer + pointer offsets.
+        #   * ``_try_kvikio_read_tiles`` (#1688) -- per-tile cupy.empty +
+        #     serial pread -> single buffer + batched submit.
+        #   * ``_nvcomp_batch_compress`` (#1712) -- per-tile cupy.empty +
+        #     per-tile cupy.get -> single pool + concat + single get.
+        #
+        # The previous nvJPEG2000 path allocated ``samples`` fresh
+        # ``cupy.empty`` buffers per tile (each round-trip through the
+        # memory pool ~tens of microseconds) and called
+        # ``cupy.cuda.Device().synchronize()`` once per tile, which forces
+        # default-stream serialisation that defeats nvJPEG2000's internal
+        # pipelining. One pool allocation + one trailing sync removes both
+        # costs without changing the output layout. See issue #2107.
+        # Defer the cupy import until past the dtype guard so a CPU-only
+        # host that exercises the early-return branches (lib missing or
+        # unsupported dtype) does not need cupy installed (#2110 CI fix).
+        import cupy
+
+        pitch = tile_width * dtype.itemsize
+        per_tile_comp_bytes = samples * tile_height * pitch
+        _check_gpu_memory(n_tiles * per_tile_comp_bytes,
+                          what="nvJPEG2000 per-component decode pool")
+        d_comp_pool = cupy.empty(
+            n_tiles * per_tile_comp_bytes, dtype=cupy.uint8)
         d_all_tiles = cupy.empty(n_tiles * tile_bytes, dtype=cupy.uint8)
 
         for i, tile_data in enumerate(compressed_tiles):
@@ -2736,12 +2764,17 @@ class _NvJpeg2kImage(ctypes.Structure):
             if s != 0:
                 continue
 
-            # Allocate per-component output buffers on GPU
-            comp_bufs = []
-            pitch = tile_width * dtype.itemsize
-            for c in range(samples):
-                buf = cupy.empty(tile_height * pitch, dtype=cupy.uint8)
-                comp_bufs.append(buf)
+            # Derive per-component output views into the shared pool. The
+            # slab for tile ``i`` starts at ``i * per_tile_comp_bytes`` and
+            # holds ``samples`` consecutive ``tile_height * pitch`` blocks.
+            tile_pool_start = i * per_tile_comp_bytes
+            comp_bufs = [
+                d_comp_pool[
+                    tile_pool_start + c * tile_height * pitch:
+                    tile_pool_start + (c + 1) * tile_height * pitch
+                ]
+                for c in range(samples)
+            ]
 
             # Build nvjpeg2kImage_t
             img = _NvJpeg2kImage()
@@ -2751,13 +2784,15 @@ class _NvJpeg2kImage(ctypes.Structure):
                 img.pixel_data[c] = comp_bufs[c].data.ptr
                 img.pitch_in_bytes[c] = pitch
 
-            # Decode
+            # Decode. The per-tile ``Device().synchronize()`` that used to
+            # live here forced default-stream serialisation across every
+            # tile; defer the wait to the batch-end sync below so successive
+            # tiles can pipeline through nvJPEG2000.
             s = lib.nvjpeg2kDecode(
                 handle, state, stream, params,
                 ctypes.byref(img),
                 ctypes.c_void_p(0),  # default CUDA stream
             )
-            cupy.cuda.Device().synchronize()
 
             if s != 0:
                 continue
@@ -2777,6 +2812,13 @@ class _NvJpeg2kImage(ctypes.Structure):
                 d_all_tiles[tile_offset:tile_offset + tile_bytes] = \
                     interleaved.view(cupy.uint8).ravel()
 
+        # Single batch-end sync: nvjpeg2kDecode is asynchronous on the
+        # default stream, so we have to wait once before returning so the
+        # decoded bytes are visible to the caller's downstream kernels.
+        # Replacing the per-tile sync with this single sync is the main
+        # performance win on multi-tile reads (#2107).
+        cupy.cuda.Device().synchronize()
+
         # Cleanup
         lib.nvjpeg2kDecodeParamsDestroy(params)
         lib.nvjpeg2kStreamDestroy(stream)
diff --git a/xrspatial/geotiff/tests/test_nvjpeg2k_single_alloc_2107.py b/xrspatial/geotiff/tests/test_nvjpeg2k_single_alloc_2107.py
new file mode 100644
index 000000000..b542039ba
--- /dev/null
+++ b/xrspatial/geotiff/tests/test_nvjpeg2k_single_alloc_2107.py
@@ -0,0 +1,343 @@
+"""Tests for the nvJPEG2000 batch-decode allocation refactor (#2107).
+
+The fix replaces per-tile / per-component ``cupy.empty`` allocations and
+per-tile ``cupy.cuda.Device().synchronize()`` inside the decode loop with
+a single contiguous device pool and a single batch-end sync. The pattern
+matches the prior fixes for ``_try_nvcomp_from_device_bufs`` (#1659),
+``_try_kvikio_read_tiles`` (#1688), and ``_nvcomp_batch_compress`` (#1712).
+
+Since nvJPEG2000 is rarely available on test hosts (libnvjpeg2k.so is
+not part of cuda-toolkit's default install), these tests focus on
+structural properties of the implementation rather than running a real
+decode:
+
+* ``_try_nvjpeg2k_batch_decode`` early-returns ``None`` when the shared
+  library is missing -- no allocations or syncs happen on the common
+  test host.
+* Source inspection asserts the new contract:
+    - Exactly two ``cupy.empty`` calls in the decode loop region
+      (``d_comp_pool`` + ``d_all_tiles``), zero inside the per-tile loop.
+    - Exactly one ``Device().synchronize()`` after the loop, zero inside.
+    - The slab indexing math hits the per-tile-component slab.
+
+Structural tests like these mirror the approach taken by
+``test_nvcomp_from_device_bufs_single_alloc_1659.py`` and
+``test_kvikio_batched_pread_1688.py`` so a regression that re-introduces
+the per-tile alloc pattern fails CI without needing a GPU.
+"""
+from __future__ import annotations
+
+import ast
+import inspect
+
+import numpy as np
+import pytest
+
+
+def _function_source(func):
+    """Return the function's source plus its line range in the source file."""
+    src = inspect.getsource(func)
+    start_line = func.__code__.co_firstlineno
+    return src, start_line
+
+
+def _count_cupy_empty_calls(tree):
+    """Count ``cupy.empty(...)`` Call nodes inside the AST."""
+    n = 0
+    for node in ast.walk(tree):
+        if not isinstance(node, ast.Call):
+            continue
+        func = node.func
+        if not isinstance(func, ast.Attribute):
+            continue
+        if func.attr != 'empty':
+            continue
+        # ``cupy.empty`` matches both bare ``cupy`` and aliased ``cp``;
+        # we only care about the ``empty`` method name.
+        if not isinstance(func.value, ast.Name):
+            continue
+        if func.value.id not in ('cupy', 'cp'):
+            continue
+        n += 1
+    return n
+
+
+def _count_device_synchronize_calls(tree):
+    """Count ``cupy.cuda.Device(...).synchronize()`` Call nodes."""
+    n = 0
+    for node in ast.walk(tree):
+        if not isinstance(node, ast.Call):
+            continue
+        func = node.func
+        if not isinstance(func, ast.Attribute):
+            continue
+        if func.attr != 'synchronize':
+            continue
+        # We walk back through the chain: synchronize -> Device(...) ->
+        # cuda -> cupy. Allow any chain that ends in a ``Device`` call.
+        parent_call = func.value
+        if not isinstance(parent_call, ast.Call):
+            continue
+        if not isinstance(parent_call.func, ast.Attribute):
+            continue
+        if parent_call.func.attr != 'Device':
+            continue
+        n += 1
+    return n
+
+
+def _inside_for_loop(node: ast.AST, parents: dict) -> bool:
+    """Return True when ``node`` sits anywhere under a ``for`` statement."""
+    cur = parents.get(id(node))
+    while cur is not None:
+        if isinstance(cur, ast.For):
+            return True
+        cur = parents.get(id(cur))
+    return False
+
+
+def _parent_map(tree: ast.AST) -> dict:
+    """Build a ``{id(child): parent}`` lookup map for ``_inside_for_loop``."""
+    mapping: dict = {}
+    for parent in ast.walk(tree):
+        for child in ast.iter_child_nodes(parent):
+            mapping[id(child)] = parent
+    return mapping
+
+
+class TestNvjpeg2kSingleAllocStructural:
+    """Structural assertions on the refactored helper (no GPU required)."""
+
+    def setup_method(self):
+        from xrspatial.geotiff import _gpu_decode
+
+        self._fn = _gpu_decode._try_nvjpeg2k_batch_decode
+        src, start = _function_source(self._fn)
+        self._src = src
+        self._start_line = start
+        self._tree = ast.parse(src)
+        self._parents = _parent_map(self._tree)
+
+    def test_no_cupy_empty_inside_decode_loop(self):
+        """``cupy.empty`` must NOT appear inside the per-tile ``for`` loop.
+
+        The refactor moves the pool allocation outside the loop. A
+        regression that re-introduces a per-tile ``cupy.empty`` would
+        bring back the memory-pool round-trip the fix removed.
+        """
+        offending = []
+        for node in ast.walk(self._tree):
+            if not isinstance(node, ast.Call):
+                continue
+            func = node.func
+            if not isinstance(func, ast.Attribute):
+                continue
+            if func.attr != 'empty':
+                continue
+            if (not isinstance(func.value, ast.Name)
+                    or func.value.id not in ('cupy', 'cp')):
+                continue
+            if _inside_for_loop(node, self._parents):
+                offending.append(self._start_line + node.lineno - 1)
+        assert offending == [], (
+            f"_try_nvjpeg2k_batch_decode contains cupy.empty(...) calls "
+            f"inside a for-loop at file lines {offending}. The refactor "
+            f"in #2107 moved every output allocation outside the per-tile "
+            f"loop; reverting that defeats the pooling optimisation."
+        )
+
+    def test_no_device_synchronize_inside_decode_loop(self):
+        """``Device().synchronize()`` must NOT live inside the decode loop.
+
+        The previous implementation called it once per tile, forcing
+        default-stream serialisation. The refactor leaves exactly one
+        synchronize call after the loop body.
+        """
+        offending = []
+        for node in ast.walk(self._tree):
+            if not isinstance(node, ast.Call):
+                continue
+            func = node.func
+            if not isinstance(func, ast.Attribute):
+                continue
+            if func.attr != 'synchronize':
+                continue
+            parent_call = func.value
+            if (not isinstance(parent_call, ast.Call)
+                    or not isinstance(parent_call.func, ast.Attribute)
+                    or parent_call.func.attr != 'Device'):
+                continue
+            if _inside_for_loop(node, self._parents):
+                offending.append(self._start_line + node.lineno - 1)
+        assert offending == [], (
+            f"_try_nvjpeg2k_batch_decode contains Device().synchronize() "
+            f"calls inside a for-loop at file lines {offending}. The "
+            f"refactor in #2107 keeps exactly one batch-end sync outside "
+            f"the loop so successive tiles can pipeline through "
+            f"nvJPEG2000."
+        )
+
+    def test_pool_allocation_present(self):
+        """Source contains the expected pool buffer name and slab math.
+
+        The refactor introduces ``d_comp_pool`` and
+        ``per_tile_comp_bytes``; if either disappears, the test fails so
+        the reviewer notices the layout drift.
+        """
+        assert 'd_comp_pool' in self._src, (
+            "_try_nvjpeg2k_batch_decode no longer references the shared "
+            "d_comp_pool buffer; the refactor in #2107 is missing or "
+            "reverted."
+        )
+        assert 'per_tile_comp_bytes' in self._src, (
+            "_try_nvjpeg2k_batch_decode no longer references "
+            "per_tile_comp_bytes; the per-tile slab math from #2107 "
+            "is missing or renamed without an audit."
+        )
+
+    def test_check_gpu_memory_guard_present(self):
+        """The pool allocation must be guarded by ``_check_gpu_memory``.
+
+        Sibling helpers (``_try_nvcomp_from_device_bufs``,
+        ``_try_kvikio_read_tiles``, ``_nvcomp_batch_compress``) all guard
+        their pool allocations the same way; refusing the allocation
+        before cupy raises an opaque CUDA OOM keeps the failure mode
+        consistent (#2107 follows that pattern).
+        """
+        assert '_check_gpu_memory(' in self._src, (
+            "_try_nvjpeg2k_batch_decode no longer calls _check_gpu_memory "
+            "before allocating the per-tile component pool. The fail-fast "
+            "OOM contract from #2107 is missing."
+        )
+
+
+class TestNvjpeg2kLibAbsentShortCircuit:
+    """When the shared library is missing, the function returns None
+    without touching cupy / allocating any device memory."""
+
+    def test_returns_none_when_lib_missing(self, monkeypatch):
+        """The early-return is the path most test hosts take. Verify
+        nothing reaches the refactored allocation code on that path so
+        the refactor does not regress the lib-missing host behaviour.
+        """
+        from xrspatial.geotiff import _gpu_decode
+
+        monkeypatch.setattr(_gpu_decode, '_get_nvjpeg2k', lambda: None)
+
+        result = _gpu_decode._try_nvjpeg2k_batch_decode(
+            compressed_tiles=[b''],
+            tile_width=8,
+            tile_height=8,
+            dtype=np.dtype('uint8'),
+            samples=1,
+        )
+        assert result is None
+
+    def test_returns_none_for_unsupported_dtype(self, monkeypatch):
+        """Unsupported dtypes (e.g. float32) short-circuit before any
+        device allocation. The refactor moved the pool alloc below the
+        dtype guard, so this exercises the dtype branch that must still
+        clean up the handles without touching the pool.
+        """
+        from xrspatial.geotiff import _gpu_decode
+
+        # Fake a lib that succeeds for handle/state/stream/params and
+        # exposes the destroy entry points so the dtype-guard cleanup
+        # path runs without crashing. We do not pretend to support the
+        # actual nvjpeg2k C API beyond what the early-return code uses.
+        class _FakeLib:
+            def __init__(self):
+                self.calls = []
+
+            def nvjpeg2kCreateSimple(self, *_args):
+                return 0
+
+            def nvjpeg2kDecodeStateCreate(self, *_args):
+                return 0
+
+            def nvjpeg2kStreamCreate(self, *_args):
+                return 0
+
+            def nvjpeg2kDecodeParamsCreate(self, *_args):
+                return 0
+
+            def nvjpeg2kDecodeParamsDestroy(self, *_args):
+                self.calls.append('params_destroy')
+
+            def nvjpeg2kStreamDestroy(self, *_args):
+                self.calls.append('stream_destroy')
+
+            def nvjpeg2kDecodeStateDestroy(self, *_args):
+                self.calls.append('state_destroy')
+
+            def nvjpeg2kDestroy(self, *_args):
+                self.calls.append('handle_destroy')
+
+        fake = _FakeLib()
+        monkeypatch.setattr(_gpu_decode, '_get_nvjpeg2k', lambda: fake)
+
+        # float32 is not in {uint8, uint16, int16} so the helper exits
+        # before any pool allocation -- and the host needs no cupy
+        # for this branch to run.
+        result = _gpu_decode._try_nvjpeg2k_batch_decode(
+            compressed_tiles=[b''],
+            tile_width=8,
+            tile_height=8,
+            dtype=np.dtype('float32'),
+            samples=1,
+        )
+        assert result is None
+        # The dtype-guard branch should have called all four destroy
+        # functions, mirroring the success-path cleanup.
+        assert fake.calls == [
+            'params_destroy',
+            'stream_destroy',
+            'state_destroy',
+            'handle_destroy',
+        ]
+
+
+@pytest.mark.gpu
+class TestNvjpeg2kPoolWithCupy:
+    """Lightweight cupy-only smoke tests for the pool layout.
+
+    These tests do not exercise the real nvJPEG2000 decode (the host
+    typically has no libnvjpeg2k.so) but they confirm the pool sizing
+    math and the per-tile slab indexing produce non-overlapping views
+    into the shared buffer for representative tile sizes.
+    """
+
+    def test_pool_slabs_are_non_overlapping(self):
+        """Tile-component slabs into the pool must not overlap.
+
+        We recompute the slab boundaries the helper uses and verify the
+        sequence covers exactly the pool size with no gaps and no
+        double-coverage. A regression that miscomputes
+        ``per_tile_comp_bytes`` would either OOB the pool or fold tile
+        N onto tile N-1's bytes; this test catches both.
+        """
+        cupy = pytest.importorskip('cupy')
+
+        n_tiles = 4
+        tile_width = 32
+        tile_height = 32
+        samples = 3
+        dtype = np.dtype('uint16')
+        pitch = tile_width * dtype.itemsize
+        per_tile_comp_bytes = samples * tile_height * pitch
+        pool = cupy.empty(n_tiles * per_tile_comp_bytes, dtype=cupy.uint8)
+
+        seen = set()
+        for i in range(n_tiles):
+            tile_pool_start = i * per_tile_comp_bytes
+            for c in range(samples):
+                start = tile_pool_start + c * tile_height * pitch
+                end = start + tile_height * pitch
+                for byte in range(start, end):
+                    assert byte not in seen, (
+                        f"pool byte {byte} appears in two slabs "
+                        f"(tile={i}, comp={c}); per-tile slab math is "
+                        f"wrong."
+                    )
+                    seen.add(byte)
+        assert len(seen) == int(pool.nbytes)