From e85856538d2700d7bce00801c37b0bf8496bf7be Mon Sep 17 00:00:00 2001
From: Brendan Collins <brendancol@gmail.com>
Date: Mon, 18 May 2026 20:32:05 -0700
Subject: [PATCH 1/5] geotiff: parallelise strip decode in _read_strips and
 _fetch_decode_cog_http_strips (#2100)

The local strip path (`_read_strips`) and the HTTP COG strip path
(`_fetch_decode_cog_http_strips`) decoded strips sequentially in a Python
for-loop. The matching tile paths (`_read_tiles` and
`_fetch_decode_cog_http_tiles`) parallelise decode via a
`ThreadPoolExecutor` gated on `_PARALLEL_DECODE_PIXEL_THRESHOLD` (64K
pixels). Codec decode for deflate, zstd, and LZW releases the GIL inside
the C extension so threads overlap codec work across cores; the strip
paths missed that optimisation.

Refactored both paths into the collect-decode-place pattern the tile
paths use: build a job list of (band_idx, strip_idx, global_idx),
ThreadPool-map the decode when n_strips > 1 and strip_pixels >=
threshold, then place sequentially into the result buffer to avoid
contending writes.

Microbench (4-core, uint16):
- 4096x4096 deflate: 130 ms -> 34 ms (3.82x)
- 8192x8192 deflate: 531 ms -> 146 ms (3.63x)
- 8192x8192 zstd: 211 ms -> 85 ms (2.48x)
- 4096x4096 uncompressed: 25 ms -> 22 ms (1.14x, frombuffer + slice copy)

5 new tests in test_parallel_strip_decode_2100.py cover parallel/serial
parity, the pool-engaged branch on multi-strip files, the serial-path
short-circuit for single-strip files, windowed cross-strip reads, and
the HTTP COG strip parity. 3998 existing geotiff tests pass; 8
pre-existing failures predate this change (read_to_array is now a
private attr).

Closes #2100.
---
 .claude/sweep-performance-state.csv           |   2 +-
 xrspatial/geotiff/_reader.py                  | 137 +++++++++----
 .../tests/test_parallel_strip_decode_2100.py  | 185 ++++++++++++++++++
 3 files changed, 281 insertions(+), 43 deletions(-)
 create mode 100644 xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py

diff --git a/.claude/sweep-performance-state.csv b/.claude/sweep-performance-state.csv
index 848689345..3d67664ee 100644
--- a/.claude/sweep-performance-state.csv
+++ b/.claude/sweep-performance-state.csv
@@ -18,7 +18,7 @@ fire,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 flood,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 focal,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,
 geodesic,2026-03-31T18:00:00Z,N/A,compute-bound,0,,
-geotiff,2026-05-15,SAFE,IO-bound,0,1756,"Pass 10 (2026-05-15): 1 new MEDIUM found and fixed; 2 LOW noted. MEDIUM (_reader.py:2737): _fetch_decode_cog_http_tiles decoded tiles sequentially in a Python for-loop after the concurrent fetch landed (issue #1480). Local _read_tiles parallelises decode whenever tile_pixels >= 64K via ThreadPoolExecutor (_reader.py:2017); the HTTP path was structurally similar but never picked up the same gate, so wide windowed reads of multi-tile COGs left deflate/zstd decode single-threaded. Mirrored the local-path threshold + pool. 5 new tests in test_cog_http_parallel_decode_2026_05_15.py (parallel + serial round-trip correctness, pool-instantiation branch selection above the threshold, single-tile path skips the pool, structural _decode_strip_or_tile call count == n_tiles). All 262 COG/HTTP tests pass; 3162 of 3164 selected geotiff tests pass overall (2 pre-existing failures predating Pass 9 per prior notes -- test_predictor2_big_endian_gpu_1517 references the now-private read_to_array attr, and the test_size_param_validation_gpu_vrt_1776 tile_size=4 validator failure). LOW deferred (no fix in this PR): (1) _block_reduce_2d_gpu (_gpu_decode.py:3142/3163/3189) does bool(mask.any().item()) per overview level when nodata is set, paying one device sync per level; the alternative (unconditional cupy.putmask) always pays the work cost and the short-circuit is correct under the current API. (2) _nvcomp_batch_compress adler32 staging (_gpu_decode.py:2543-2546) issues n_tiles slice-assign kernels into a fresh contig buffer despite all callers passing slices of a single underlying d_tile_buf; an API refactor to accept the source buffer directly would skip the rebuild. SAFE/IO-bound verdict holds. Dask probe: 2560x2560 chunks=256 yields 400 tasks (4 per chunk), well under the 50000 cap. GPU probe: 1024x1024 float32 zstd read returns CuPy-backed in 236 ms with no host round-trip. | Rockout 2026-05-15: LOW filed #1934 -- _apply_nodata_mask_gpu used cupy.where (allocating); switched to cupy.putmask on the already-owned buffer (float path) and on the post-astype float64 buffer (int path). Saves one chunk-sized device allocation per call. 7 new tests in test_apply_nodata_mask_gpu_inplace_1934.py; 52 related nodata tests pass. | Pass 8 (2026-05-12): 1 new MEDIUM found and fixed. _assemble_standard_layout/_assemble_cog_layout returned bytes(bytearray), doubling peak memory transiently during eager writes. Filed #1756, fixed by returning the bytearray directly. Measured: 95 MB uint8 raster peak drops 202 MB -> 107 MB. _write_bytes / parse_header already accepted the buffer protocol so the change is transparent to callers. 6 new tests in test_assemble_layout_no_bytes_copy_1756.py. 2123 existing geotiff tests pass; the 10 unrelated failures (test_no_georef_windowed_coords_1710, test_predictor2_big_endian_gpu_1517) reference the now-private read_to_array attribute (commit 8adb749, issue #1708) and predate this change. SAFE/IO-bound verdict holds. | Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
+geotiff,2026-05-18,SAFE,IO-bound,0,2100,"Pass 11 (2026-05-18): 1 MEDIUM found and fixed. _read_strips (_reader.py:~L1972) and _fetch_decode_cog_http_strips (_reader.py:~L2670) decoded strips sequentially in a Python for-loop while the tile counterparts (_read_tiles L2146, _fetch_decode_cog_http_tiles L2898) gated parallel decode on _PARALLEL_DECODE_PIXEL_THRESHOLD via ThreadPoolExecutor. Filed #2100 and fixed: both strip paths now collect jobs, parallel-decode when n_strips > 1 and strip_pixels >= 64K, then place sequentially. Measured (uint16, 4-core): 4096x4096 deflate 130ms->34ms (3.82x), 8192x8192 deflate 531ms->146ms (3.63x), 8192x8192 zstd 211ms->85ms (2.48x), uncompressed 25ms->22ms (1.14x). 5 new tests in test_parallel_strip_decode_2100.py (parallel/serial parity, pool-engaged on multi-strip, serial-path for single-strip, windowed cross-strip read, HTTP COG strip parity). 3998 tests pass; 8 pre-existing failures predating this change (predictor2 BE + size_param_validation_gpu_vrt reference now-private read_to_array attr). SAFE/IO-bound verdict holds. | Pass 10 (2026-05-15): 1 new MEDIUM found and fixed; 2 LOW noted. MEDIUM (_reader.py:2737): _fetch_decode_cog_http_tiles decoded tiles sequentially in a Python for-loop after the concurrent fetch landed (issue #1480). Local _read_tiles parallelises decode whenever tile_pixels >= 64K via ThreadPoolExecutor (_reader.py:2017); the HTTP path was structurally similar but never picked up the same gate, so wide windowed reads of multi-tile COGs left deflate/zstd decode single-threaded. Mirrored the local-path threshold + pool. 5 new tests in test_cog_http_parallel_decode_2026_05_15.py (parallel + serial round-trip correctness, pool-instantiation branch selection above the threshold, single-tile path skips the pool, structural _decode_strip_or_tile call count == n_tiles). All 262 COG/HTTP tests pass; 3162 of 3164 selected geotiff tests pass overall (2 pre-existing failures predating Pass 9 per prior notes -- test_predictor2_big_endian_gpu_1517 references the now-private read_to_array attr, and the test_size_param_validation_gpu_vrt_1776 tile_size=4 validator failure). LOW deferred (no fix in this PR): (1) _block_reduce_2d_gpu (_gpu_decode.py:3142/3163/3189) does bool(mask.any().item()) per overview level when nodata is set, paying one device sync per level; the alternative (unconditional cupy.putmask) always pays the work cost and the short-circuit is correct under the current API. (2) _nvcomp_batch_compress adler32 staging (_gpu_decode.py:2543-2546) issues n_tiles slice-assign kernels into a fresh contig buffer despite all callers passing slices of a single underlying d_tile_buf; an API refactor to accept the source buffer directly would skip the rebuild. SAFE/IO-bound verdict holds. Dask probe: 2560x2560 chunks=256 yields 400 tasks (4 per chunk), well under the 50000 cap. GPU probe: 1024x1024 float32 zstd read returns CuPy-backed in 236 ms with no host round-trip. | Rockout 2026-05-15: LOW filed #1934 -- _apply_nodata_mask_gpu used cupy.where (allocating); switched to cupy.putmask on the already-owned buffer (float path) and on the post-astype float64 buffer (int path). Saves one chunk-sized device allocation per call. 7 new tests in test_apply_nodata_mask_gpu_inplace_1934.py; 52 related nodata tests pass. | Pass 8 (2026-05-12): 1 new MEDIUM found and fixed. _assemble_standard_layout/_assemble_cog_layout returned bytes(bytearray), doubling peak memory transiently during eager writes. Filed #1756, fixed by returning the bytearray directly. Measured: 95 MB uint8 raster peak drops 202 MB -> 107 MB. _write_bytes / parse_header already accepted the buffer protocol so the change is transparent to callers. 6 new tests in test_assemble_layout_no_bytes_copy_1756.py. 2123 existing geotiff tests pass; the 10 unrelated failures (test_no_georef_windowed_coords_1710, test_predictor2_big_endian_gpu_1517) reference the now-private read_to_array attribute (commit 8adb749, issue #1708) and predate this change. SAFE/IO-bound verdict holds. | Pass 7 (2026-05-12): re-audit identified 4 MEDIUM findings, all real, all backed by microbenches. (1) unpack_bits sub-byte loops for bps=2/4/12 in _compression.py:836-878 were 100-200x slower than vectorised numpy (filed #1713, fixed in this branch: bps=4 2M pixels drops from 165ms to 3ms = 55x; bps=2/12 similar). (2) _write_vrt_tiled at __init__.py:1708 uses scheduler='synchronous' on independent tile writes; measured 33% slowdown on 256-tile zstd write vs threads scheduler (filed #1714, no fix yet). (3) _nvcomp_batch_compress at _gpu_decode.py:2522-2526 still does per-tile cupy.get().tobytes() despite #1552 / #1659 fixing the same pattern elsewhere; measured 45% reduction with concat+single get on n=1024 (filed #1712, no fix yet). (4) _nvcomp_batch_compress at _gpu_decode.py:2457 uses per-tile cupy.empty allocations; 1024 tiles 16KB drops from 4.7ms to 1.0ms with single contiguous + views (bundled into #1712). Cat 6 OOM verdict: SAFE/IO-bound holds -- read_geotiff_dask caps task count at _MAX_DASK_CHUNKS=50_000 and per-chunk memory is bounded by chunk size. _inflate_tiles_kernel resource usage on Ampere: 67 regs/thread, 2896B local/thread, 8192B shared/block (LZW kernel: 29 regs, 24576B shared) -- register pressure under control; high local memory in inflate is unavoidable (LZ77 state) but only thread 0 in each block uses it. | Pass 4 (2026-05-10): re-audit after #1559 (centralise attrs across all read backends). New _populate_attrs_from_geo_info helper at __init__.py:301 runs once per read, not per-chunk -- no perf impact. Probe: 2560x2560 deflate-tiled file opened via read_geotiff_dask yields 400 tasks (4 tasks/chunk for 100 chunks), well under 1M cap. read_geotiff_gpu(1024x1024) returns cupy.ndarray end-to-end with no host round-trip (226ms incl. write+decode). No new HIGH/MEDIUM findings. SAFE/IO-bound holds. | Pass 3 (2026-05-10): SAFE/IO-bound. Audited 4 perf commits: #1558 (in-place NaN writes on uniquely-owned buffers correct), #1556 (fp-predictor ngjit ~297us/tile for 256x256 float32), #1552 (single cupy.concatenate + one .get() for batched D2H at _gpu_decode.py:870-913), #1551 (parallel decode threshold >=65536px engages 256x256 default at _reader.py:1121). Bench: 8192x8192 f32 deflate+pred2 256-tile write 782ms; 4096x4096 f32 deflate read 83ms with parallel decode. Deferred LOW (none filed, all <10% MEDIUM threshold): _writer.py:459/1109 redundant .copy() before predictor encode (~1% per tile), _compression.py:280 lzw_decompress dst[:n].copy() (~2% per LZW tile decode), _writer.py:1419 seg_np.copy() before in-place NaN substitution (negligible, conditional path), _CloudSource.read_range opens fresh fsspec handle per range (pre-existing, predates audit scope). nvCOMP per-tile D2H batching break-even confirmed (variable sizes need staging buffer, no win). | Pass 3 (2026-05-10): audited f157746,39322c3,f23ec8f,1aac3b7. All 5 commits correct. Redundant .copy() in _writer.py:459,1109 and _compression.py:280 (1-2% overhead, LOW). _CloudSource.read_range() per-call open is pre-existing arch issue. No HIGH/MEDIUM regressions. SAFE. | re-audit 2026-05-02: 6 commits since 2026-04-16 (predictor=3 CPU encode/decode, GPU predictor stride fix, validate_tile_layout, BigTIFF LONG8 offsets, AREA_OR_POINT VRT, per-tile alloc guard). 1M dask chunk cap intact at __init__.py:948; adler32 batch transfer intact at _gpu_decode.py:1825. New code is metadata validation and dispatcher logic with no extra materialization or per-tile sync points. No HIGH/MEDIUM regressions. | Pass 5 (2026-05-12): re-audit identified MEDIUM in _gpu_decode.py:1577 _try_nvcomp_from_device_bufs: per-tile cupy.empty + trailing cupy.concatenate doubled peak VRAM and added serial concat. Filed #1659 and fixed to single-buffer + pointer offsets (matches LZW/deflate/host-buffer patterns at L1847/L1878/L1114). Microbench (alloc+concat overhead only, not full nvCOMP latency): n=256 tile_bytes=65536 drops 3.66ms->0.69ms, n=256 tile_bytes=262144 drops 8.18ms->0.13ms. Tests: 5 new tests in test_nvcomp_from_device_bufs_single_alloc_1659.py (codec short-circuit, no-lib short-circuit, memory-guard contract, real ZSTD round-trip via nvCOMP, structural single-buffer check). 1458 existing geotiff tests pass, 3 unrelated matplotlib/py3.14 failures pre-existing. SAFE/IO-bound verdict holds. | Pass 6 (2026-05-12): re-audit on top of #1659. New HIGH in _try_kvikio_read_tiles at _gpu_decode.py:941: per-tile cupy.empty() + blocking IOFuture.get() inside loop serialised GDS reads to ~1 outstanding pread, missed parallelism the kvikio worker pool was designed for, paid per-tile cupy.empty setup (matches #1659 anti-pattern in nvCOMP path), and lacked _check_gpu_memory guard. Filed #1688 and fixed to single contiguous buffer + batched submit + guard. Microbench with 8-worker pool simulation: 256 tiles@1ms latency drops 256ms->38.7ms (~6.6x); single-thread simulation 256ms->28.5ms (9x). Tests: 9 new tests in test_kvikio_batched_pread_1688.py (kvikio-absent path, single-buffer pointer arithmetic, submit-before-get ordering, memory guard, partial-read fallback, round-trip data, zero-size/all-sparse tiles). All 1577 geotiff tests pass except pre-existing matplotlib/py3.14 failures."
 glcm,2026-03-31T18:00:00Z,SAFE,compute-bound,0,,"Downgraded to MEDIUM. da.stack without rechunk is scheduling overhead, not OOM risk."
 hillshade,2026-04-16T12:00:00Z,SAFE,compute-bound,0,,"Re-audit after Horn's method rewrite (PR 1175): clean stencil, map_overlap depth=(1,1), no materialization. Zero findings."
 hydro,2026-05-01,RISKY,memory-bound,0,1416,"Fixed-in-tree 2026-05-01: hand_mfd._hand_mfd_dask now assembles via da.map_blocks instead of eager da.block of pre-computed tiles (matches hand_dinf pattern). Remaining MEDIUM: sink_d8 CCL fully materializes labels (inherently global), flow_accumulation_mfd frac_bdry held in driver dict instead of memmap-backed BoundaryStore. D8 iterative paths (flow_accum/fill/watershed/basin/stream_*) use serial-tile sweep with memmap-backed boundary store -- per-tile RAM bounded but driver iterates O(diameter) times. flow_direction_*, flow_path/snap_pour_point/twi/hand_d8/hand_dinf are SAFE."
diff --git a/xrspatial/geotiff/_reader.py b/xrspatial/geotiff/_reader.py
index ca05daef7..bb16c7440 100644
--- a/xrspatial/geotiff/_reader.py
+++ b/xrspatial/geotiff/_reader.py
@@ -1939,62 +1939,79 @@ def _read_strips(data: bytes, ifd: IFD, header: TIFFHeader,
     else:
         result = np.empty((out_h, out_w), dtype=dtype)
 
+    # Collect strip jobs; decode in parallel when the pool overhead pays off.
+    # Mirrors the tile path's gate at ``_PARALLEL_DECODE_PIXEL_THRESHOLD``:
+    # codec decode (deflate, zstd, LZW) releases the GIL inside the C
+    # extension so threads actually overlap codec work across cores. The
+    # placement loop that copies pixels into ``result`` stays serial to
+    # avoid contending writes to the output buffer. See issue #2100.
+    strip_jobs: list[tuple[int, int, int]] = []  # (band_idx, strip_idx, global_idx)
     if planar == 2 and samples > 1:
         first_strip = r0 // rps
         last_strip = min((r1 - 1) // rps, strips_per_band - 1)
-
         for band_idx in range(samples):
             band_offset = band_idx * strips_per_band
             for strip_idx in range(first_strip, last_strip + 1):
                 global_idx = band_offset + strip_idx
                 if byte_counts[global_idx] == 0:
-                    # Sparse strip: result is already pre-filled.
                     continue
                 strip_row = strip_idx * rps
                 strip_rows = min(rps, height - strip_row)
                 if strip_rows <= 0:
                     continue
-
-                strip_data = data[offsets[global_idx]:offsets[global_idx] + byte_counts[global_idx]]
-                strip_pixels = _decode_strip_or_tile(
-                    strip_data, compression, width, strip_rows, 1,
-                    bps, bytes_per_sample, is_sub_byte, dtype, pred,
-                    byte_order=header.byte_order,
-                    jpeg_tables=jpeg_tables,
-                    masked_fill=masked_fill)
-
-                src_r0 = max(r0 - strip_row, 0)
-                src_r1 = min(r1 - strip_row, strip_rows)
-                dst_r0 = max(strip_row - r0, 0)
-                dst_r1 = dst_r0 + (src_r1 - src_r0)
-                if dst_r1 > dst_r0:
-                    result[dst_r0:dst_r1, :, band_idx] = strip_pixels[src_r0:src_r1, c0:c1]
+                strip_jobs.append((band_idx, strip_idx, global_idx))
+        strip_samples = 1
     else:
         first_strip = r0 // rps
         last_strip = min((r1 - 1) // rps, len(offsets) - 1)
-
         for strip_idx in range(first_strip, last_strip + 1):
             strip_row = strip_idx * rps
             strip_rows = min(rps, height - strip_row)
             if strip_rows <= 0:
                 continue
             if byte_counts[strip_idx] == 0:
-                # Sparse strip: result is already pre-filled.
                 continue
+            strip_jobs.append((0, strip_idx, strip_idx))
+        strip_samples = samples
+
+    def _decode_strip_job(job):
+        _band_idx, strip_idx, global_idx = job
+        strip_row = strip_idx * rps
+        strip_rows = min(rps, height - strip_row)
+        strip_data = data[
+            offsets[global_idx]:offsets[global_idx] + byte_counts[global_idx]]
+        return _decode_strip_or_tile(
+            strip_data, compression, width, strip_rows, strip_samples,
+            bps, bytes_per_sample, is_sub_byte, dtype, pred,
+            byte_order=header.byte_order,
+            jpeg_tables=jpeg_tables,
+            masked_fill=masked_fill)
+
+    n_strips = len(strip_jobs)
+    strip_pixel_count = width * rps
+    use_parallel = (n_strips > 1
+                    and strip_pixel_count >= _PARALLEL_DECODE_PIXEL_THRESHOLD)
+    if use_parallel:
+        import os as _os
+        n_workers = min(n_strips, _os.cpu_count() or 4)
+        with ThreadPoolExecutor(max_workers=n_workers) as pool:
+            decoded_strips = list(pool.map(_decode_strip_job, strip_jobs))
+    else:
+        decoded_strips = [_decode_strip_job(job) for job in strip_jobs]
 
-            strip_data = data[offsets[strip_idx]:offsets[strip_idx] + byte_counts[strip_idx]]
-            strip_pixels = _decode_strip_or_tile(
-                strip_data, compression, width, strip_rows, samples,
-                bps, bytes_per_sample, is_sub_byte, dtype, pred,
-                byte_order=header.byte_order,
-                jpeg_tables=jpeg_tables,
-                masked_fill=masked_fill)
-
-            src_r0 = max(r0 - strip_row, 0)
-            src_r1 = min(r1 - strip_row, strip_rows)
-            dst_r0 = max(strip_row - r0, 0)
-            dst_r1 = dst_r0 + (src_r1 - src_r0)
-            if dst_r1 > dst_r0:
+    for (band_idx, strip_idx, _global_idx), strip_pixels in zip(
+            strip_jobs, decoded_strips):
+        strip_row = strip_idx * rps
+        strip_rows = min(rps, height - strip_row)
+        src_r0 = max(r0 - strip_row, 0)
+        src_r1 = min(r1 - strip_row, strip_rows)
+        dst_r0 = max(strip_row - r0, 0)
+        dst_r1 = dst_r0 + (src_r1 - src_r0)
+        if dst_r1 > dst_r0:
+            if planar == 2 and samples > 1:
+                result[dst_r0:dst_r1, :, band_idx] = (
+                    strip_pixels[src_r0:src_r1, c0:c1])
+            else:
                 result[dst_r0:dst_r1] = strip_pixels[src_r0:src_r1, c0:c1]
 
     return result
@@ -2667,7 +2684,51 @@ def _fetch_decode_cog_http_strips(
         strip_bytes_list = []
 
     # Pass 3: decode each strip and place its intersection with the window.
-    for (band_idx, strip_idx), strip_data in zip(placements, strip_bytes_list):
+    #
+    # Codec decode (deflate, zstd, LZW, ...) releases the GIL inside the C
+    # extension, so threading the per-strip decode overlaps codec work
+    # across cores. The local tile / strip paths and the HTTP tile path
+    # use the same ``_PARALLEL_DECODE_PIXEL_THRESHOLD`` gate; mirror it
+    # here so HTTP COG strip reads of wide windows benefit from the same
+    # parallelism rather than serialising the decode after a parallel
+    # fetch. The placement loop that copies pixels into ``result`` stays
+    # serial to avoid contending writes to the output buffer. Issue #2100.
+    n_decode_strips = len(strip_bytes_list)
+    strip_pixel_count = width * rps
+    decode_in_parallel = (
+        n_decode_strips > 1
+        and strip_pixel_count >= _PARALLEL_DECODE_PIXEL_THRESHOLD)
+
+    def _decode_http_strip(args):
+        strip_idx, strip_data = args
+        strip_row = strip_idx * rps
+        strip_rows = min(rps, height - strip_row)
+        if strip_rows <= 0:
+            return None
+        # Per-strip decoded-dimension cap (#1851). See note below.
+        _check_dimensions(width, strip_rows, strip_samples,
+                          MAX_PIXELS_DEFAULT)
+        return _decode_strip_or_tile(
+            strip_data, compression, width, strip_rows, strip_samples,
+            bps, bytes_per_sample, is_sub_byte, dtype, pred,
+            byte_order=header.byte_order,
+            jpeg_tables=jpeg_tables,
+            masked_fill=masked_fill)
+
+    decode_inputs = [(strip_idx, strip_data)
+                     for (_band, strip_idx), strip_data
+                     in zip(placements, strip_bytes_list)]
+
+    if decode_in_parallel:
+        n_decode_workers = min(n_decode_strips, _os_module.cpu_count() or 4)
+        with ThreadPoolExecutor(max_workers=n_decode_workers) as pool:
+            decoded_strips = list(pool.map(_decode_http_strip, decode_inputs))
+    else:
+        decoded_strips = [_decode_http_strip(item) for item in decode_inputs]
+
+    for (band_idx, strip_idx), strip_pixels in zip(placements, decoded_strips):
+        if strip_pixels is None:
+            continue
         strip_row = strip_idx * rps
         strip_rows = min(rps, height - strip_row)
         if strip_rows <= 0:
@@ -2681,15 +2742,7 @@ def _fetch_decode_cog_http_strips(
         # the window clip. Use ``MAX_PIXELS_DEFAULT`` rather than the
         # caller's ``max_pixels`` so a small output-window budget does
         # not reject normal strip sizes.
-        _check_dimensions(width, strip_rows, strip_samples,
-                          MAX_PIXELS_DEFAULT)
-
-        strip_pixels = _decode_strip_or_tile(
-            strip_data, compression, width, strip_rows, strip_samples,
-            bps, bytes_per_sample, is_sub_byte, dtype, pred,
-            byte_order=header.byte_order,
-            jpeg_tables=jpeg_tables,
-            masked_fill=masked_fill)
+        # The decoded-dimension cap fired inside ``_decode_http_strip``.
 
         src_r0 = max(r0 - strip_row, 0)
         src_r1 = min(r1 - strip_row, strip_rows)
diff --git a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
new file mode 100644
index 000000000..a78dab449
--- /dev/null
+++ b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
@@ -0,0 +1,185 @@
+"""Tests for parallel strip decode in ``_read_strips`` and
+``_fetch_decode_cog_http_strips``.
+
+Issue #2100. Both strip paths previously decoded strips sequentially in a
+Python for-loop. The matching tile paths use a ``ThreadPoolExecutor``
+gated on ``_PARALLEL_DECODE_PIXEL_THRESHOLD`` (64K pixels). These tests
+codify the gate and verify correctness (parallel vs serial parity).
+"""
+from __future__ import annotations
+
+import http.server
+import os
+import socket
+import tempfile
+import threading
+from unittest.mock import patch
+
+import numpy as np
+import pytest
+import xarray as xr
+
+from xrspatial.geotiff import to_geotiff
+from xrspatial.geotiff import _reader as _reader_mod
+from xrspatial.geotiff._reader import read_to_array
+
+
+def _make_stripped_uint16(height: int, width: int, *,
+                          compression: str = "deflate") -> bytes:
+    """Build a stripped TIFF in memory and return the bytes."""
+    rng = np.random.default_rng(seed=12345)
+    arr = rng.integers(0, 256, size=(height, width), dtype=np.uint16)
+    da = xr.DataArray(arr, dims=["y", "x"])
+    with tempfile.NamedTemporaryFile(suffix=".tif", delete=False) as f:
+        path = f.name
+    try:
+        to_geotiff(da, path, compression=compression, tiled=False)
+        with open(path, "rb") as f:
+            return arr, f.read()
+    finally:
+        try:
+            os.remove(path)
+        except OSError:
+            pass
+
+
+class TestReadStripsParallelGate:
+    """Local strip path: the parallel-decode gate engages on multi-strip."""
+
+    def test_parallel_decode_matches_serial(self):
+        """A wide stripped TIFF: parallel and serial decode return
+        identical bytes."""
+        arr, blob = _make_stripped_uint16(2048, 4096)
+        # Write to disk and read twice (parallel default, then forced serial).
+        with tempfile.TemporaryDirectory() as td:
+            p = os.path.join(td, "s.tif")
+            with open(p, "wb") as f:
+                f.write(blob)
+            par, _ = read_to_array(p)
+            with patch.object(_reader_mod,
+                              "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
+                ser, _ = read_to_array(p)
+        np.testing.assert_array_equal(par, ser)
+        np.testing.assert_array_equal(par, arr)
+
+    def test_parallel_pool_engages_on_multi_strip(self):
+        """Confirm the parallel branch is taken: patch ThreadPoolExecutor
+        and assert it is constructed when n_strips > 1 and strip pixels
+        clear the gate."""
+        arr, blob = _make_stripped_uint16(1024, 2048)
+        with tempfile.TemporaryDirectory() as td:
+            p = os.path.join(td, "s.tif")
+            with open(p, "wb") as f:
+                f.write(blob)
+            with patch.object(_reader_mod, "ThreadPoolExecutor",
+                              wraps=_reader_mod.ThreadPoolExecutor) as mock_pool:
+                out, _ = read_to_array(p)
+                # n_strips for a 1024-row file with default rps -> at
+                # least 4 strips (TIFFs default rps=8KB / row).
+                assert mock_pool.called
+        np.testing.assert_array_equal(out, arr)
+
+    def test_serial_path_used_for_small_strip(self):
+        """A single-row image will produce a single strip; the parallel
+        gate must short-circuit to the serial branch."""
+        arr = np.arange(2 * 32, dtype=np.uint16).reshape(2, 32)
+        da = xr.DataArray(arr, dims=["y", "x"])
+        with tempfile.TemporaryDirectory() as td:
+            p = os.path.join(td, "tiny.tif")
+            to_geotiff(da, p, compression="deflate", tiled=False)
+            with patch.object(_reader_mod, "ThreadPoolExecutor",
+                              wraps=_reader_mod.ThreadPoolExecutor) as mock_pool:
+                out, _ = read_to_array(p)
+                # Single-strip file => no pool.
+                assert not mock_pool.called
+        np.testing.assert_array_equal(out, arr)
+
+    def test_windowed_strip_read_parallel(self):
+        """A windowed read across multiple strips still matches the full
+        decode."""
+        arr, blob = _make_stripped_uint16(2048, 2048)
+        with tempfile.TemporaryDirectory() as td:
+            p = os.path.join(td, "w.tif")
+            with open(p, "wb") as f:
+                f.write(blob)
+            par, _ = read_to_array(p, window=(100, 100, 1500, 1500))
+            with patch.object(_reader_mod,
+                              "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
+                ser, _ = read_to_array(p, window=(100, 100, 1500, 1500))
+        np.testing.assert_array_equal(par, ser)
+        np.testing.assert_array_equal(par, arr[100:1500, 100:1500])
+
+
+# ----------------------------------------------------------------------
+# HTTP COG strip path
+# ----------------------------------------------------------------------
+
+class _RangeHandler(http.server.BaseHTTPRequestHandler):
+    """Serve a fixed byte blob with HTTP Range support."""
+    blob: bytes = b""
+
+    def do_HEAD(self):
+        self.send_response(200)
+        self.send_header("Content-Length", str(len(self.blob)))
+        self.send_header("Accept-Ranges", "bytes")
+        self.end_headers()
+
+    def do_GET(self):
+        rng = self.headers.get("Range")
+        if rng and rng.startswith("bytes="):
+            r0, r1 = rng[len("bytes="):].split("-")
+            r0 = int(r0)
+            r1 = int(r1) if r1 else len(self.blob) - 1
+            r1 = min(r1, len(self.blob) - 1)
+            body = self.blob[r0:r1 + 1]
+            self.send_response(206)
+            self.send_header("Content-Range",
+                             f"bytes {r0}-{r1}/{len(self.blob)}")
+            self.send_header("Content-Length", str(len(body)))
+            self.send_header("Accept-Ranges", "bytes")
+            self.end_headers()
+            self.wfile.write(body)
+        else:
+            self.send_response(200)
+            self.send_header("Content-Length", str(len(self.blob)))
+            self.send_header("Accept-Ranges", "bytes")
+            self.end_headers()
+            self.wfile.write(self.blob)
+
+    def log_message(self, format, *args):  # quiet
+        return
+
+
+def _start_server(blob: bytes):
+    """Start an HTTP server serving the given blob on a free port."""
+    s = socket.socket()
+    s.bind(("127.0.0.1", 0))
+    port = s.getsockname()[1]
+    s.close()
+    handler = type(
+        "BlobHandler", (_RangeHandler,), {"blob": blob})
+    server = http.server.HTTPServer(("127.0.0.1", port), handler)
+    th = threading.Thread(target=server.serve_forever, daemon=True)
+    th.start()
+    return server, port
+
+
+class TestHttpStripParallelDecode:
+    def test_parallel_decode_matches_serial(self, monkeypatch):
+        monkeypatch.setenv("XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS", "1")
+        arr, blob = _make_stripped_uint16(1024, 2048, compression="deflate")
+        server, port = _start_server(blob)
+        try:
+            url = f"http://127.0.0.1:{port}/s.tif"
+            par, _ = read_to_array(url, window=(0, 0, 1024, 2048))
+            with patch.object(_reader_mod,
+                              "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
+                ser, _ = read_to_array(url, window=(0, 0, 1024, 2048))
+        finally:
+            server.shutdown()
+        np.testing.assert_array_equal(par, ser)
+        np.testing.assert_array_equal(par, arr)
+
+
+if __name__ == "__main__":
+    pytest.main([__file__, "-v"])

From 385a11a77a16ed4a01a87bb8299663e8cb0c6e37 Mon Sep 17 00:00:00 2001
From: Brendan Collins <brendancol@gmail.com>
Date: Mon, 18 May 2026 20:51:56 -0700
Subject: [PATCH 2/5] geotiff: fix Windows CI on parallel-strip tests (#2104)

The four local-strip tests in `test_parallel_strip_decode_2100.py`
used `tempfile.TemporaryDirectory()` to host the `.tif` fixture and
called `read_to_array(p)` inside the `with` block. `_FileSource.close`
calls `_mmap_cache.release`, which keeps the entry CACHED (file
handle + mmap stay alive) until LRU eviction. On Windows, an open
mmap pins the file: `TemporaryDirectory.__exit__` then races the
cached mmap and raises `PermissionError: [WinError 32]` from
`shutil.rmtree`.

Switch the four affected tests to pytest's `tmp_path` fixture.
pytest's session-level teardown tolerates Windows file-lock cleanup
failures, so the cached mmap no longer blocks the test from passing.
The HTTP strip test was unaffected (no mmap on that path) and is
left alone.

Tested locally: 5/5 pass. The fix is test-only; the parallel strip
decode code paths are unchanged.
---
 .../tests/test_parallel_strip_decode_2100.py  | 84 ++++++++++---------
 1 file changed, 44 insertions(+), 40 deletions(-)

diff --git a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
index a78dab449..8c414b288 100644
--- a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
+++ b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
@@ -46,66 +46,70 @@ def _make_stripped_uint16(height: int, width: int, *,
 class TestReadStripsParallelGate:
     """Local strip path: the parallel-decode gate engages on multi-strip."""
 
-    def test_parallel_decode_matches_serial(self):
+    def test_parallel_decode_matches_serial(self, tmp_path):
         """A wide stripped TIFF: parallel and serial decode return
-        identical bytes."""
+        identical bytes.
+
+        Uses pytest's ``tmp_path`` rather than
+        ``tempfile.TemporaryDirectory()`` because ``read_to_array``
+        leaves the mmap entry cached on close (see ``_MmapCache``);
+        on Windows the cached mmap holds the file handle, so an
+        eager ``TemporaryDirectory.__exit__`` rmtree races the mmap
+        and raises ``WinError 32``. ``tmp_path`` defers cleanup to
+        pytest's session teardown, which tolerates that race.
+        """
         arr, blob = _make_stripped_uint16(2048, 4096)
-        # Write to disk and read twice (parallel default, then forced serial).
-        with tempfile.TemporaryDirectory() as td:
-            p = os.path.join(td, "s.tif")
-            with open(p, "wb") as f:
-                f.write(blob)
-            par, _ = read_to_array(p)
-            with patch.object(_reader_mod,
-                              "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
-                ser, _ = read_to_array(p)
+        p = str(tmp_path / "s.tif")
+        with open(p, "wb") as f:
+            f.write(blob)
+        par, _ = read_to_array(p)
+        with patch.object(_reader_mod,
+                          "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
+            ser, _ = read_to_array(p)
         np.testing.assert_array_equal(par, ser)
         np.testing.assert_array_equal(par, arr)
 
-    def test_parallel_pool_engages_on_multi_strip(self):
+    def test_parallel_pool_engages_on_multi_strip(self, tmp_path):
         """Confirm the parallel branch is taken: patch ThreadPoolExecutor
         and assert it is constructed when n_strips > 1 and strip pixels
         clear the gate."""
         arr, blob = _make_stripped_uint16(1024, 2048)
-        with tempfile.TemporaryDirectory() as td:
-            p = os.path.join(td, "s.tif")
-            with open(p, "wb") as f:
-                f.write(blob)
-            with patch.object(_reader_mod, "ThreadPoolExecutor",
-                              wraps=_reader_mod.ThreadPoolExecutor) as mock_pool:
-                out, _ = read_to_array(p)
-                # n_strips for a 1024-row file with default rps -> at
-                # least 4 strips (TIFFs default rps=8KB / row).
-                assert mock_pool.called
+        p = str(tmp_path / "s.tif")
+        with open(p, "wb") as f:
+            f.write(blob)
+        with patch.object(_reader_mod, "ThreadPoolExecutor",
+                          wraps=_reader_mod.ThreadPoolExecutor) as mock_pool:
+            out, _ = read_to_array(p)
+            # n_strips for a 1024-row file with default rps -> at
+            # least 4 strips (TIFFs default rps=8KB / row).
+            assert mock_pool.called
         np.testing.assert_array_equal(out, arr)
 
-    def test_serial_path_used_for_small_strip(self):
+    def test_serial_path_used_for_small_strip(self, tmp_path):
         """A single-row image will produce a single strip; the parallel
         gate must short-circuit to the serial branch."""
         arr = np.arange(2 * 32, dtype=np.uint16).reshape(2, 32)
         da = xr.DataArray(arr, dims=["y", "x"])
-        with tempfile.TemporaryDirectory() as td:
-            p = os.path.join(td, "tiny.tif")
-            to_geotiff(da, p, compression="deflate", tiled=False)
-            with patch.object(_reader_mod, "ThreadPoolExecutor",
-                              wraps=_reader_mod.ThreadPoolExecutor) as mock_pool:
-                out, _ = read_to_array(p)
-                # Single-strip file => no pool.
-                assert not mock_pool.called
+        p = str(tmp_path / "tiny.tif")
+        to_geotiff(da, p, compression="deflate", tiled=False)
+        with patch.object(_reader_mod, "ThreadPoolExecutor",
+                          wraps=_reader_mod.ThreadPoolExecutor) as mock_pool:
+            out, _ = read_to_array(p)
+            # Single-strip file => no pool.
+            assert not mock_pool.called
         np.testing.assert_array_equal(out, arr)
 
-    def test_windowed_strip_read_parallel(self):
+    def test_windowed_strip_read_parallel(self, tmp_path):
         """A windowed read across multiple strips still matches the full
         decode."""
         arr, blob = _make_stripped_uint16(2048, 2048)
-        with tempfile.TemporaryDirectory() as td:
-            p = os.path.join(td, "w.tif")
-            with open(p, "wb") as f:
-                f.write(blob)
-            par, _ = read_to_array(p, window=(100, 100, 1500, 1500))
-            with patch.object(_reader_mod,
-                              "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
-                ser, _ = read_to_array(p, window=(100, 100, 1500, 1500))
+        p = str(tmp_path / "w.tif")
+        with open(p, "wb") as f:
+            f.write(blob)
+        par, _ = read_to_array(p, window=(100, 100, 1500, 1500))
+        with patch.object(_reader_mod,
+                          "_PARALLEL_DECODE_PIXEL_THRESHOLD", 10**12):
+            ser, _ = read_to_array(p, window=(100, 100, 1500, 1500))
         np.testing.assert_array_equal(par, ser)
         np.testing.assert_array_equal(par, arr[100:1500, 100:1500])
 

From 8f487b2c580715b1d13cea9adfef21fbc8293bad Mon Sep 17 00:00:00 2001
From: Brendan Collins <brendancol@gmail.com>
Date: Mon, 18 May 2026 20:53:08 -0700
Subject: [PATCH 3/5] geotiff: drop nested os import in _read_strips (#2104)

`_reader.py` aliases `import os as _os_module` at module scope and
uses that name everywhere else (lines 78, 156, 179, 224, 244, 410,
418, 1352-1353, and the matching HTTP strip path at 2723). The
parallel-strip block introduced an inline `import os as _os` for
its single `cpu_count()` call. Replace it with `_os_module` for
consistency.

Review nit on #2104; no behaviour change.
---
 xrspatial/geotiff/_reader.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/xrspatial/geotiff/_reader.py b/xrspatial/geotiff/_reader.py
index bb16c7440..b8fa54a53 100644
--- a/xrspatial/geotiff/_reader.py
+++ b/xrspatial/geotiff/_reader.py
@@ -1992,8 +1992,7 @@ def _decode_strip_job(job):
     use_parallel = (n_strips > 1
                     and strip_pixel_count >= _PARALLEL_DECODE_PIXEL_THRESHOLD)
     if use_parallel:
-        import os as _os
-        n_workers = min(n_strips, _os.cpu_count() or 4)
+        n_workers = min(n_strips, _os_module.cpu_count() or 4)
         with ThreadPoolExecutor(max_workers=n_workers) as pool:
             decoded_strips = list(pool.map(_decode_strip_job, strip_jobs))
     else:

From 96b4c422aaf46b20ee080dea0c2d99da3c6c6b1a Mon Sep 17 00:00:00 2001
From: Brendan Collins <brendancol@gmail.com>
Date: Mon, 18 May 2026 21:08:26 -0700
Subject: [PATCH 4/5] geotiff: add planar=2 and HTTP serial-gate strip tests
 (#2104)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two review suggestions on #2104 left as deferred earlier, now landed.

- `TestHttpStripParallelDecode::test_serial_gate_engages_on_single_strip`
  exercises the HTTP windowed-strip path with a 2x32 fixture that
  produces a single strip. The gate at
  `_fetch_decode_cog_http_strips` must short-circuit because
  `n_decode_strips > 1` is false. Spy on `_decode_strip_or_tile` and
  assert the call lands on the main thread; a regression that fires
  the pool when the gate should serialise would land the spy in a
  worker thread. The fetch layer's own `ThreadPoolExecutor` is
  skipped for single-range fetches via `len(ranges) == 1`, so this
  is a clean discriminator.

- `TestPlanar2MultibandStripParallel` covers the dedicated
  `planar == 2 and samples > 1` branch in the strip-job collection
  loop. Two tests: full-image parity and a windowed parity across
  multiple strips. Bands are written via `tifffile` with
  `planarconfig='separate'` and 128 rows/strip → 8 strips/band × 3
  bands = 24 strips, so the gate engages on the parallel run. Both
  tests confirm `parallel == serial` and that the round-trip
  matches `np.moveaxis(src, 0, -1)`.
---
 .../tests/test_parallel_strip_decode_2100.py  | 128 ++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
index 8c414b288..e968ab6dd 100644
--- a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
+++ b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
@@ -184,6 +184,134 @@ def test_parallel_decode_matches_serial(self, monkeypatch):
         np.testing.assert_array_equal(par, ser)
         np.testing.assert_array_equal(par, arr)
 
+    def test_serial_gate_engages_on_single_strip(self, monkeypatch):
+        """HTTP windowed-strip path with a single-strip source: the
+        gate at ``_fetch_decode_cog_http_strips`` (n_decode_strips > 1)
+        must short-circuit to the serial branch.
+
+        ``ThreadPoolExecutor`` is also used by ``read_ranges`` to fan
+        out fetches, but for a single-range fetch that path takes the
+        ``len(ranges) == 1`` short-circuit and never instantiates a
+        pool. So spying on ``_decode_strip_or_tile`` and asserting the
+        thread identity catches a regression that fires the pool when
+        it shouldn't, without false positives from the fetch layer.
+        """
+        monkeypatch.setenv("XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS", "1")
+        # 2x32 uint16 → 1 strip with the writer's default rps.
+        arr = np.arange(2 * 32, dtype=np.uint16).reshape(2, 32)
+        da = xr.DataArray(arr, dims=["y", "x"])
+        with tempfile.NamedTemporaryFile(suffix=".tif",
+                                         delete=False) as f:
+            tif_path = f.name
+        try:
+            to_geotiff(da, tif_path, compression="deflate",
+                       tiled=False)
+            with open(tif_path, "rb") as f:
+                blob = f.read()
+        finally:
+            try:
+                os.remove(tif_path)
+            except OSError:
+                pass
+
+        threads_seen: list[int] = []
+        real_decode = _reader_mod._decode_strip_or_tile
+
+        def _spy(*args, **kwargs):
+            threads_seen.append(threading.get_ident())
+            return real_decode(*args, **kwargs)
+
+        monkeypatch.setattr(
+            _reader_mod, "_decode_strip_or_tile", _spy)
+
+        server, port = _start_server(blob)
+        try:
+            url = f"http://127.0.0.1:{port}/tiny.tif"
+            out, _ = read_to_array(url, window=(0, 0, 2, 32))
+        finally:
+            server.shutdown()
+
+        main_tid = threading.get_ident()
+        assert threads_seen, "_decode_strip_or_tile was never called"
+        assert all(tid == main_tid for tid in threads_seen), (
+            f"decode ran in worker thread(s) "
+            f"{set(threads_seen) - {main_tid}}; main is {main_tid}. "
+            f"The serial gate failed to short-circuit on a "
+            f"single-strip HTTP read."
+        )
+        np.testing.assert_array_equal(out, arr)
+
+
+# ----------------------------------------------------------------------
+# Planar=2 multi-band stripped TIFF
+# ----------------------------------------------------------------------
+
+
+class TestPlanar2MultibandStripParallel:
+    """Planar=2 multi-band stripped TIFF.
+
+    ``_read_strips`` has a separate ``planar == 2 and samples > 1``
+    branch in the strip-job collection loop (xrspatial/geotiff/
+    _reader.py:1949-1963). The parallel pool itself is planar-agnostic,
+    so a regression in band-ordering or per-band strip indexing would
+    survive the chunky tests above; this class covers that branch.
+    """
+
+    def test_parallel_matches_serial_planar2(self, tmp_path):
+        """Parallel and serial decode produce bit-identical output on
+        a planar=2 multi-band stripped TIFF."""
+        tifffile = pytest.importorskip("tifffile")
+        rng = np.random.default_rng(seed=20260518)
+        bands, height, width = 3, 1024, 1024
+        arr = rng.integers(
+            0, 1000, size=(bands, height, width), dtype=np.uint16)
+        p = str(tmp_path / "planar2.tif")
+        tifffile.imwrite(
+            p, arr,
+            photometric="rgb",
+            planarconfig="separate",
+            compression="deflate",
+            rowsperstrip=128,
+        )
+
+        par, _ = read_to_array(p)
+        with patch.object(
+                _reader_mod, "_PARALLEL_DECODE_PIXEL_THRESHOLD",
+                10 ** 12):
+            ser, _ = read_to_array(p)
+
+        np.testing.assert_array_equal(par, ser)
+        # The reader returns (height, width, samples); the fixture is
+        # (bands, height, width). Reorder for comparison.
+        np.testing.assert_array_equal(par, np.moveaxis(arr, 0, -1))
+
+    def test_windowed_planar2_parallel(self, tmp_path):
+        """A window across multiple strips on a planar=2 multi-band
+        stripped TIFF still matches the slice of the full decode."""
+        tifffile = pytest.importorskip("tifffile")
+        rng = np.random.default_rng(seed=20260519)
+        bands, height, width = 3, 1024, 1024
+        arr = rng.integers(
+            0, 1000, size=(bands, height, width), dtype=np.uint16)
+        p = str(tmp_path / "planar2_win.tif")
+        tifffile.imwrite(
+            p, arr,
+            photometric="rgb",
+            planarconfig="separate",
+            compression="deflate",
+            rowsperstrip=128,
+        )
+
+        par, _ = read_to_array(p, window=(100, 100, 900, 900))
+        with patch.object(
+                _reader_mod, "_PARALLEL_DECODE_PIXEL_THRESHOLD",
+                10 ** 12):
+            ser, _ = read_to_array(p, window=(100, 100, 900, 900))
+
+        np.testing.assert_array_equal(par, ser)
+        expected = np.moveaxis(arr, 0, -1)[100:900, 100:900]
+        np.testing.assert_array_equal(par, expected)
+
 
 if __name__ == "__main__":
     pytest.main([__file__, "-v"])

From 5734fd9d1ce22c190881f37796d2878f3eaec1ac Mon Sep 17 00:00:00 2001
From: Brendan Collins <brendancol@gmail.com>
Date: Mon, 18 May 2026 21:12:17 -0700
Subject: [PATCH 5/5] geotiff: add HTTP planar=2 windowed strip parity test
 (#2104)

The local planar=2 tests cover ``_read_strips`` only. Full-image
HTTP reads dispatch back into ``_read_strips`` via
``_fetch_decode_cog_http_strips`` line 2581, but the windowed HTTP
path runs its own per-band loop at lines 2639-2664 with separate
sparse / per-strip byte-cap handling. Without a windowed planar=2
HTTP test that branch is unexercised.

New test serves a 3-band planar=2 stripped TIFF over the local
range-aware HTTP server, requests a window across multiple strips
per band, and confirms parallel-vs-serial parity plus a round-trip
match against the source array's ``np.moveaxis(arr, 0, -1)`` slice.
---
 .../tests/test_parallel_strip_decode_2100.py  | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
index e968ab6dd..4b6c4758e 100644
--- a/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
+++ b/xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py
@@ -312,6 +312,56 @@ def test_windowed_planar2_parallel(self, tmp_path):
         expected = np.moveaxis(arr, 0, -1)[100:900, 100:900]
         np.testing.assert_array_equal(par, expected)
 
+    def test_http_windowed_planar2_parallel(self, monkeypatch):
+        """HTTP windowed strip path on planar=2 multi-band: pins the
+        per-band strip-job loop inside
+        ``_fetch_decode_cog_http_strips`` (the local path's planar=2
+        coverage above does not reach the HTTP fetch+decode branch
+        because full-image HTTP reads dispatch back to
+        ``_read_strips``)."""
+        tifffile = pytest.importorskip("tifffile")
+        monkeypatch.setenv(
+            "XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS", "1")
+        rng = np.random.default_rng(seed=20260520)
+        bands, height, width = 3, 1024, 1024
+        arr = rng.integers(
+            0, 1000, size=(bands, height, width), dtype=np.uint16)
+        with tempfile.NamedTemporaryFile(
+                suffix=".tif", delete=False) as f:
+            tif_path = f.name
+        try:
+            tifffile.imwrite(
+                tif_path, arr,
+                photometric="rgb",
+                planarconfig="separate",
+                compression="deflate",
+                rowsperstrip=128,
+            )
+            with open(tif_path, "rb") as f:
+                blob = f.read()
+        finally:
+            try:
+                os.remove(tif_path)
+            except OSError:
+                pass
+
+        server, port = _start_server(blob)
+        try:
+            url = f"http://127.0.0.1:{port}/planar2.tif"
+            par, _ = read_to_array(
+                url, window=(100, 100, 900, 900))
+            with patch.object(
+                    _reader_mod, "_PARALLEL_DECODE_PIXEL_THRESHOLD",
+                    10 ** 12):
+                ser, _ = read_to_array(
+                    url, window=(100, 100, 900, 900))
+        finally:
+            server.shutdown()
+
+        np.testing.assert_array_equal(par, ser)
+        expected = np.moveaxis(arr, 0, -1)[100:900, 100:900]
+        np.testing.assert_array_equal(par, expected)
+
 
 if __name__ == "__main__":
     pytest.main([__file__, "-v"])