Skip to content

geotiff: parallelise strip decode (#2100)#2104

Merged
brendancol merged 5 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-18
May 19, 2026
Merged

geotiff: parallelise strip decode (#2100)#2104
brendancol merged 5 commits into
mainfrom
deep-sweep-performance-geotiff-2026-05-18

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • _read_strips (local) and _fetch_decode_cog_http_strips (HTTP COG) decoded strips sequentially in a Python for-loop.
  • The matching tile paths already gated on _PARALLEL_DECODE_PIXEL_THRESHOLD (64K pixels) via ThreadPoolExecutor. Codec decode (deflate / zstd / LZW) releases the GIL, so threads actually overlap C-level work.
  • Both strip paths now use the same collect-decode-place pattern: build a job list, ThreadPool-map decode when n_strips > 1 and strip_pixels >= threshold, place sequentially into the result.

Microbench (4-core, uint16)

shape codec serial parallel speedup
4096x4096 deflate 130 ms 34 ms 3.82x
8192x8192 deflate 531 ms 146 ms 3.63x
8192x8192 zstd 211 ms 85 ms 2.48x
4096x4096 none 25 ms 22 ms 1.14x

Test plan

  • 5 new tests in test_parallel_strip_decode_2100.py (parallel/serial parity, multi-strip pool engaged, single-strip serial short-circuit, windowed cross-strip read, HTTP COG strip parity)
  • pytest xrspatial/geotiff/tests/ -q: 3998 pass, 8 pre-existing failures predate this change (read_to_array is now a private attr)

Closes #2100.

…og_http_strips (#2100)

The local strip path (`_read_strips`) and the HTTP COG strip path
(`_fetch_decode_cog_http_strips`) decoded strips sequentially in a Python
for-loop. The matching tile paths (`_read_tiles` and
`_fetch_decode_cog_http_tiles`) parallelise decode via a
`ThreadPoolExecutor` gated on `_PARALLEL_DECODE_PIXEL_THRESHOLD` (64K
pixels). Codec decode for deflate, zstd, and LZW releases the GIL inside
the C extension so threads overlap codec work across cores; the strip
paths missed that optimisation.

Refactored both paths into the collect-decode-place pattern the tile
paths use: build a job list of (band_idx, strip_idx, global_idx),
ThreadPool-map the decode when n_strips > 1 and strip_pixels >=
threshold, then place sequentially into the result buffer to avoid
contending writes.

Microbench (4-core, uint16):
- 4096x4096 deflate: 130 ms -> 34 ms (3.82x)
- 8192x8192 deflate: 531 ms -> 146 ms (3.63x)
- 8192x8192 zstd: 211 ms -> 85 ms (2.48x)
- 4096x4096 uncompressed: 25 ms -> 22 ms (1.14x, frombuffer + slice copy)

5 new tests in test_parallel_strip_decode_2100.py cover parallel/serial
parity, the pool-engaged branch on multi-strip files, the serial-path
short-circuit for single-strip files, windowed cross-strip reads, and
the HTTP COG strip parity. 3998 existing geotiff tests pass; 8
pre-existing failures predate this change (read_to_array is now a
private attr).

Closes #2100.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 19, 2026
The four local-strip tests in `test_parallel_strip_decode_2100.py`
used `tempfile.TemporaryDirectory()` to host the `.tif` fixture and
called `read_to_array(p)` inside the `with` block. `_FileSource.close`
calls `_mmap_cache.release`, which keeps the entry CACHED (file
handle + mmap stay alive) until LRU eviction. On Windows, an open
mmap pins the file: `TemporaryDirectory.__exit__` then races the
cached mmap and raises `PermissionError: [WinError 32]` from
`shutil.rmtree`.

Switch the four affected tests to pytest's `tmp_path` fixture.
pytest's session-level teardown tolerates Windows file-lock cleanup
failures, so the cached mmap no longer blocks the test from passing.
The HTTP strip test was unaffected (no mmap on that path) and is
left alone.

Tested locally: 5/5 pass. The fix is test-only; the parallel strip
decode code paths are unchanged.
`_reader.py` aliases `import os as _os_module` at module scope and
uses that name everywhere else (lines 78, 156, 179, 224, 244, 410,
418, 1352-1353, and the matching HTTP strip path at 2723). The
parallel-strip block introduced an inline `import os as _os` for
its single `cpu_count()` call. Replace it with `_os_module` for
consistency.

Review nit on #2104; no behaviour change.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: geotiff: parallelise strip decode (#2100)

Blockers

None.

Suggestions

  • HTTP COG strip path has only the parallel-vs-serial parity test (TestHttpStripParallelDecode::test_parallel_decode_matches_serial). The local path has four (parity, pool engagement, serial-gate short-circuit, windowed). The HTTP path's decode_in_parallel gate at xrspatial/geotiff/_reader.py:2698-2700 mirrors the local gate exactly, but a small-strip / single-strip short-circuit test for the HTTP path (analogue of test_serial_path_used_for_small_strip) would lock the gate down on that side too. Not blocking — the local-path tests already pin the gate semantics and the HTTP and local code paths share the same threshold constant.
  • Planar=2 multi-band stripped TIFF is not covered by the new tests. The strip-jobs loop has a dedicated branch for planar == 2 and samples > 1 at xrspatial/geotiff/_reader.py:1949-1963 with its own band-loop iteration. The parallel pool is agnostic to planar mode, so a regression in the planar=2 ordering would still produce a wrong-shape output rather than a runtime error. A planar=2 parity test would close this gap. Not blocking — the production callers that hit _read_strips with planar=2 are exercised by sibling tests under different fixtures, so the dispatch is not bypassed.

Nits

  • xrspatial/geotiff/_reader.py:1995 introduced import os as _os inline. The module already aliases import os as _os_module at line 6 and uses that name throughout (lines 78, 156, 179, 224, 244, 410, 418, 1352-1353, and the matching HTTP strip path at line 2723). [Fixed in commit 8f487b2.]
  • The four local-strip tests used tempfile.TemporaryDirectory() to host the .tif fixture, which raced the cached mmap on Windows and produced PermissionError: [WinError 32] at cleanup. _FileSource.close releases the entry into _mmap_cache but keeps the mmap alive until LRU eviction, so the file handle outlives the test on Windows. [Fixed in commit 385a11a — switched to pytest's tmp_path fixture, which defers cleanup to session teardown and tolerates Windows file locks.]

What looks good

  • Mirrors the existing tile-path gate one-for-one: same _PARALLEL_DECODE_PIXEL_THRESHOLD constant, same n > 1 AND pixels >= threshold rule, same min(n, cpu_count()) worker cap. A reader who learned the tile pattern reads the strip change with no extra cognitive load.
  • Job-list + pool.map + serial placement keeps the output buffer race-free without locks. pool.map preserves order, so the placement loop's zip(strip_jobs, decoded_strips) (xrspatial/geotiff/_reader.py:2002-2003) and HTTP analogue (line 2729) are correct by construction.
  • Sparse-strip handling (byte_counts == 0) is preserved on both paths: the job-collection loops skip them (lines 1956-57, 1972-73), and the pre-filled result array carries the sparse fill through to the output.
  • Codec decode (deflate / zstd / LZW) releases the GIL inside the C extension, so the threading actually overlaps work. The microbench table (3.8x on deflate, 2.5x on zstd, 1.1x on uncompressed) is consistent with that: uncompressed barely moves because there is no GIL-released work to spread.
  • The local and HTTP paths share the same threshold constant, which means a single tuning knob covers all four strip / tile combinations.
  • HTTP path leaves the per-strip dimension cap (_check_dimensions) inside the decode function rather than outside the gate, so the safety check still runs whether or not the decode goes parallel. Easy thing to forget when extracting a function into a thread worker.

Checklist

  • Algorithm matches reference/paper — N/A (decode kernel unchanged, only scheduling).
  • All implemented backends produce consistent results — parallel-vs-serial parity tests pin this.
  • NaN handling is correct — N/A (codec decode unaffected).
  • Edge cases are covered by tests — single-strip serial short-circuit, multi-strip pool engagement, windowed cross-strip read, HTTP parity.
  • Dask chunk boundaries handled correctly — N/A.
  • No premature materialization or unnecessary copies — decoded_strips = list(pool.map(...)) is necessary to keep the placement loop deterministic.
  • Benchmark exists or is not needed — microbench in PR body suffices; no benchmarks/benchmarks/ entry needed for an internal scheduling change.
  • README feature matrix updated (if applicable) — not needed.
  • Docstrings present and accurate — the inline comments at the gate sites cite issue #2100 and explain the GIL-release reasoning.

Two review suggestions on #2104 left as deferred earlier, now landed.

- `TestHttpStripParallelDecode::test_serial_gate_engages_on_single_strip`
  exercises the HTTP windowed-strip path with a 2x32 fixture that
  produces a single strip. The gate at
  `_fetch_decode_cog_http_strips` must short-circuit because
  `n_decode_strips > 1` is false. Spy on `_decode_strip_or_tile` and
  assert the call lands on the main thread; a regression that fires
  the pool when the gate should serialise would land the spy in a
  worker thread. The fetch layer's own `ThreadPoolExecutor` is
  skipped for single-range fetches via `len(ranges) == 1`, so this
  is a clean discriminator.

- `TestPlanar2MultibandStripParallel` covers the dedicated
  `planar == 2 and samples > 1` branch in the strip-job collection
  loop. Two tests: full-image parity and a windowed parity across
  multiple strips. Bands are written via `tifffile` with
  `planarconfig='separate'` and 128 rows/strip → 8 strips/band × 3
  bands = 24 strips, so the gate engages on the parallel run. Both
  tests confirm `parallel == serial` and that the round-trip
  matches `np.moveaxis(src, 0, -1)`.
The local planar=2 tests cover ``_read_strips`` only. Full-image
HTTP reads dispatch back into ``_read_strips`` via
``_fetch_decode_cog_http_strips`` line 2581, but the windowed HTTP
path runs its own per-band loop at lines 2639-2664 with separate
sparse / per-strip byte-cap handling. Without a windowed planar=2
HTTP test that branch is unexercised.

New test serves a 3-band planar=2 stripped TIFF over the local
range-aware HTTP server, requests a window across multiple strips
per band, and confirms parallel-vs-serial parity plus a round-trip
match against the source array's ``np.moveaxis(arr, 0, -1)`` slice.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review on PR #2104

Re-review of commits since the previous pass: 385a11a, 8f487b2, 96b4c42, 5734fd9.

Blockers

None.

Suggestions

None.

Nits

  • xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py:264-275 and :292-303 and :330-341 (and the analogous setup in :201-215 and :202-215) repeat the planar=2 fixture-write and the single-strip fixture-write. A small _write_planar2_uint16 helper would dedupe. Trivial; reads fine as-is.
  • pytest.importorskip("tifffile") is called inside each planar=2 test body. The sibling test_predictor2_big_endian_gpu_1517.py gates on tifffile at module level via importlib.util.find_spec. Both patterns work; the inline form keeps the dependency local to the tests that need it.

What looks good

  • The Windows CI fix targets the real cause: _FileSource.close releases the entry into _mmap_cache but keeps the mmap alive until LRU eviction, so a tempfile.TemporaryDirectory.__exit__ race against the cached mmap raises PermissionError: [WinError 32] on Windows. Switching the four affected tests to tmp_path defers cleanup to pytest's session teardown, which tolerates that race. The docstring at test_parallel_decode_matches_serial:53-60 documents the rationale so a future reader does not unwind the change without understanding why.
  • The _os_os_module consistency fix at _reader.py:1995 is exactly the right scope: a one-line cleanup that aligns with the file's nine other uses of the same alias.
  • test_serial_gate_engages_on_single_strip uses thread-identity spying instead of ThreadPoolExecutor patching. That is the right move on the HTTP path because read_ranges also constructs a pool when there is more than one range. For a single-strip fixture the fetch layer short-circuits at len(ranges) == 1 (_reader.py:1174-1176), so the only _decode_strip_or_tile calls that would land in a worker thread are decode-pool calls — exactly what the test wants to detect.
  • TestPlanar2MultibandStripParallel adds three tests covering: full-image local parity, windowed local parity, and windowed HTTP parity. The HTTP windowed branch (_fetch_decode_cog_http_strips:2639-2664) has its own per-band loop separate from the local one, so the HTTP test is not redundant with the local windowed test.
  • The np.moveaxis(arr, 0, -1) round-trip check is the right shape-aware comparison: the reader returns (H, W, samples) while tifffile.imwrite accepts (samples, H, W) for planarconfig="separate". A naive assert_array_equal(par, arr) would have masked a band-order bug.
  • rowsperstrip=128 on the planar=2 fixtures forces 8 strips per band × 3 bands = 24 strips on the parallel path, enough to engage the gate without dragging out CI.
  • The HTTP planar=2 test reuses the existing _start_server helper, so the new coverage adds no extra infra.

Checklist

  • Algorithm matches reference/paper — N/A (scheduling change only).
  • All implemented backends produce consistent results — parity tests pin local serial vs parallel and local vs HTTP across chunky and planar=2 layouts.
  • NaN handling is correct — N/A.
  • Edge cases are covered by tests — single-strip serial short-circuit (local + HTTP), windowed cross-strip, planar=2 full + windowed, HTTP planar=2 windowed.
  • Dask chunk boundaries handled correctly — N/A (eager-only change).
  • No premature materialization or unnecessary copies — pool.map result is materialised exactly once to make the placement loop deterministic.
  • Benchmark exists or is not needed — microbench table in PR body suffices.
  • README feature matrix updated (if applicable) — not needed.
  • Docstrings present and accurate — per-test docstrings explain branch coverage and Windows-cleanup rationale where it matters.

@brendancol brendancol merged commit 0a2754b into main May 19, 2026
4 of 5 checks passed
brendancol added a commit that referenced this pull request May 19, 2026
…path (#2100) (#2109)

* geotiff: cover sparse-strip parallel decode in _read_strips and HTTP path (#2100)

The strip-decode parallelisation in #2100 / #2104 added a
collect-decode-place pipeline in both _read_strips and
_fetch_decode_cog_http_strips. The job-collection loop filters sparse
strips (byte_counts[idx] == 0) before they reach the
ThreadPoolExecutor. The existing test_parallel_strip_decode_2100.py
covers parallel/serial parity, the pool-engaged branch, the
single-strip serial short-circuit, windowed strip reads, and planar=2
multi-band, but every fixture is fully populated. The 128x128 sparse
fixture in test_sparse_cog.py is below the 64K-pixel parallel gate, so
the sparse-strip filter inside the parallel branch is untested.

A regression that lost the byte_counts==0 guard would silently ship:
the decoder would receive an empty data slice and either raise
'Decompressed tile/strip size mismatch' or return corrupt pixels.

Seven tests cover:
- local-strip full-image parallel/serial parity with sparse strips
- parallel-pool-engaged on a multi-strip sparse image
- windowed read across the sparse boundary
- all-sparse degenerate (zero filled rows -> empty job list ->
  short-circuit gate)
- planar=2 sparse parity (dedicated samples>1 branch)
- HTTP windowed read on a non-sparse strict subset
- HTTP windowed read across the sparse boundary

Mutation against the strip-job collection sparse guard (delete the
byte_counts == 0 continue) flips 5 of 5 local tests red with
'Decompressed tile/strip size mismatch: expected ... got 0'; mutation
against the HTTP path sparse guard at line 2646 flips the boundary
HTTP test red. Source untouched; clean restore verified via md5sum.

deep-sweep-test-coverage pass 18; categories 3 + 4 (HIGH).

* test: skip parallel strip decode sparse tests when rasterio missing

Module-level ``import rasterio`` raised ``ModuleNotFoundError`` during
pytest collection on CI hosts without rasterio, marking the whole run
red even though every other test was fine. Use ``pytest.importorskip``
at the top of the module so the file skips cleanly on those hosts. The
fixtures genuinely need rasterio (they exercise the GDAL ``SPARSE_OK``
writer option); the reader code under test does not.

Mirrors the pattern other golden-corpus tests already use.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(geotiff): parallelise strip decode in _read_strips and _fetch_decode_cog_http_strips

1 participant