geotiff: parallelise strip decode (#2100)#2104
Merged
Merged
Conversation
…og_http_strips (#2100) The local strip path (`_read_strips`) and the HTTP COG strip path (`_fetch_decode_cog_http_strips`) decoded strips sequentially in a Python for-loop. The matching tile paths (`_read_tiles` and `_fetch_decode_cog_http_tiles`) parallelise decode via a `ThreadPoolExecutor` gated on `_PARALLEL_DECODE_PIXEL_THRESHOLD` (64K pixels). Codec decode for deflate, zstd, and LZW releases the GIL inside the C extension so threads overlap codec work across cores; the strip paths missed that optimisation. Refactored both paths into the collect-decode-place pattern the tile paths use: build a job list of (band_idx, strip_idx, global_idx), ThreadPool-map the decode when n_strips > 1 and strip_pixels >= threshold, then place sequentially into the result buffer to avoid contending writes. Microbench (4-core, uint16): - 4096x4096 deflate: 130 ms -> 34 ms (3.82x) - 8192x8192 deflate: 531 ms -> 146 ms (3.63x) - 8192x8192 zstd: 211 ms -> 85 ms (2.48x) - 4096x4096 uncompressed: 25 ms -> 22 ms (1.14x, frombuffer + slice copy) 5 new tests in test_parallel_strip_decode_2100.py cover parallel/serial parity, the pool-engaged branch on multi-strip files, the serial-path short-circuit for single-strip files, windowed cross-strip reads, and the HTTP COG strip parity. 3998 existing geotiff tests pass; 8 pre-existing failures predate this change (read_to_array is now a private attr). Closes #2100.
The four local-strip tests in `test_parallel_strip_decode_2100.py` used `tempfile.TemporaryDirectory()` to host the `.tif` fixture and called `read_to_array(p)` inside the `with` block. `_FileSource.close` calls `_mmap_cache.release`, which keeps the entry CACHED (file handle + mmap stay alive) until LRU eviction. On Windows, an open mmap pins the file: `TemporaryDirectory.__exit__` then races the cached mmap and raises `PermissionError: [WinError 32]` from `shutil.rmtree`. Switch the four affected tests to pytest's `tmp_path` fixture. pytest's session-level teardown tolerates Windows file-lock cleanup failures, so the cached mmap no longer blocks the test from passing. The HTTP strip test was unaffected (no mmap on that path) and is left alone. Tested locally: 5/5 pass. The fix is test-only; the parallel strip decode code paths are unchanged.
`_reader.py` aliases `import os as _os_module` at module scope and uses that name everywhere else (lines 78, 156, 179, 224, 244, 410, 418, 1352-1353, and the matching HTTP strip path at 2723). The parallel-strip block introduced an inline `import os as _os` for its single `cpu_count()` call. Replace it with `_os_module` for consistency. Review nit on #2104; no behaviour change.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: geotiff: parallelise strip decode (#2100)
Blockers
None.
Suggestions
- HTTP COG strip path has only the parallel-vs-serial parity test (
TestHttpStripParallelDecode::test_parallel_decode_matches_serial). The local path has four (parity, pool engagement, serial-gate short-circuit, windowed). The HTTP path'sdecode_in_parallelgate at xrspatial/geotiff/_reader.py:2698-2700 mirrors the local gate exactly, but a small-strip / single-strip short-circuit test for the HTTP path (analogue oftest_serial_path_used_for_small_strip) would lock the gate down on that side too. Not blocking — the local-path tests already pin the gate semantics and the HTTP and local code paths share the same threshold constant. - Planar=2 multi-band stripped TIFF is not covered by the new tests. The strip-jobs loop has a dedicated branch for
planar == 2 and samples > 1at xrspatial/geotiff/_reader.py:1949-1963 with its own band-loop iteration. The parallel pool is agnostic to planar mode, so a regression in the planar=2 ordering would still produce a wrong-shape output rather than a runtime error. A planar=2 parity test would close this gap. Not blocking — the production callers that hit_read_stripswith planar=2 are exercised by sibling tests under different fixtures, so the dispatch is not bypassed.
Nits
- xrspatial/geotiff/_reader.py:1995 introduced
import os as _osinline. The module already aliasesimport os as _os_moduleat line 6 and uses that name throughout (lines 78, 156, 179, 224, 244, 410, 418, 1352-1353, and the matching HTTP strip path at line 2723). [Fixed in commit 8f487b2.] - The four local-strip tests used
tempfile.TemporaryDirectory()to host the.tiffixture, which raced the cached mmap on Windows and producedPermissionError: [WinError 32]at cleanup._FileSource.closereleases the entry into_mmap_cachebut keeps the mmap alive until LRU eviction, so the file handle outlives the test on Windows. [Fixed in commit 385a11a — switched to pytest'stmp_pathfixture, which defers cleanup to session teardown and tolerates Windows file locks.]
What looks good
- Mirrors the existing tile-path gate one-for-one: same
_PARALLEL_DECODE_PIXEL_THRESHOLDconstant, samen > 1 AND pixels >= thresholdrule, samemin(n, cpu_count())worker cap. A reader who learned the tile pattern reads the strip change with no extra cognitive load. - Job-list +
pool.map+ serial placement keeps the output buffer race-free without locks.pool.mappreserves order, so the placement loop'szip(strip_jobs, decoded_strips)(xrspatial/geotiff/_reader.py:2002-2003) and HTTP analogue (line 2729) are correct by construction. - Sparse-strip handling (
byte_counts == 0) is preserved on both paths: the job-collection loops skip them (lines 1956-57, 1972-73), and the pre-filledresultarray carries the sparse fill through to the output. - Codec decode (deflate / zstd / LZW) releases the GIL inside the C extension, so the threading actually overlaps work. The microbench table (3.8x on deflate, 2.5x on zstd, 1.1x on uncompressed) is consistent with that: uncompressed barely moves because there is no GIL-released work to spread.
- The local and HTTP paths share the same threshold constant, which means a single tuning knob covers all four strip / tile combinations.
- HTTP path leaves the per-strip dimension cap (
_check_dimensions) inside the decode function rather than outside the gate, so the safety check still runs whether or not the decode goes parallel. Easy thing to forget when extracting a function into a thread worker.
Checklist
- Algorithm matches reference/paper — N/A (decode kernel unchanged, only scheduling).
- All implemented backends produce consistent results — parallel-vs-serial parity tests pin this.
- NaN handling is correct — N/A (codec decode unaffected).
- Edge cases are covered by tests — single-strip serial short-circuit, multi-strip pool engagement, windowed cross-strip read, HTTP parity.
- Dask chunk boundaries handled correctly — N/A.
- No premature materialization or unnecessary copies —
decoded_strips = list(pool.map(...))is necessary to keep the placement loop deterministic. - Benchmark exists or is not needed — microbench in PR body suffices; no
benchmarks/benchmarks/entry needed for an internal scheduling change. - README feature matrix updated (if applicable) — not needed.
- Docstrings present and accurate — the inline comments at the gate sites cite issue #2100 and explain the GIL-release reasoning.
Two review suggestions on #2104 left as deferred earlier, now landed. - `TestHttpStripParallelDecode::test_serial_gate_engages_on_single_strip` exercises the HTTP windowed-strip path with a 2x32 fixture that produces a single strip. The gate at `_fetch_decode_cog_http_strips` must short-circuit because `n_decode_strips > 1` is false. Spy on `_decode_strip_or_tile` and assert the call lands on the main thread; a regression that fires the pool when the gate should serialise would land the spy in a worker thread. The fetch layer's own `ThreadPoolExecutor` is skipped for single-range fetches via `len(ranges) == 1`, so this is a clean discriminator. - `TestPlanar2MultibandStripParallel` covers the dedicated `planar == 2 and samples > 1` branch in the strip-job collection loop. Two tests: full-image parity and a windowed parity across multiple strips. Bands are written via `tifffile` with `planarconfig='separate'` and 128 rows/strip → 8 strips/band × 3 bands = 24 strips, so the gate engages on the parallel run. Both tests confirm `parallel == serial` and that the round-trip matches `np.moveaxis(src, 0, -1)`.
The local planar=2 tests cover ``_read_strips`` only. Full-image HTTP reads dispatch back into ``_read_strips`` via ``_fetch_decode_cog_http_strips`` line 2581, but the windowed HTTP path runs its own per-band loop at lines 2639-2664 with separate sparse / per-strip byte-cap handling. Without a windowed planar=2 HTTP test that branch is unexercised. New test serves a 3-band planar=2 stripped TIFF over the local range-aware HTTP server, requests a window across multiple strips per band, and confirms parallel-vs-serial parity plus a round-trip match against the source array's ``np.moveaxis(arr, 0, -1)`` slice.
brendancol
commented
May 19, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review on PR #2104
Re-review of commits since the previous pass: 385a11a, 8f487b2, 96b4c42, 5734fd9.
Blockers
None.
Suggestions
None.
Nits
xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py:264-275and:292-303and:330-341(and the analogous setup in:201-215and:202-215) repeat the planar=2 fixture-write and the single-strip fixture-write. A small_write_planar2_uint16helper would dedupe. Trivial; reads fine as-is.pytest.importorskip("tifffile")is called inside each planar=2 test body. The siblingtest_predictor2_big_endian_gpu_1517.pygates on tifffile at module level viaimportlib.util.find_spec. Both patterns work; the inline form keeps the dependency local to the tests that need it.
What looks good
- The Windows CI fix targets the real cause:
_FileSource.closereleases the entry into_mmap_cachebut keeps the mmap alive until LRU eviction, so atempfile.TemporaryDirectory.__exit__race against the cached mmap raisesPermissionError: [WinError 32]on Windows. Switching the four affected tests totmp_pathdefers cleanup to pytest's session teardown, which tolerates that race. The docstring attest_parallel_decode_matches_serial:53-60documents the rationale so a future reader does not unwind the change without understanding why. - The
_os→_os_moduleconsistency fix at_reader.py:1995is exactly the right scope: a one-line cleanup that aligns with the file's nine other uses of the same alias. test_serial_gate_engages_on_single_stripuses thread-identity spying instead ofThreadPoolExecutorpatching. That is the right move on the HTTP path becauseread_rangesalso constructs a pool when there is more than one range. For a single-strip fixture the fetch layer short-circuits atlen(ranges) == 1(_reader.py:1174-1176), so the only_decode_strip_or_tilecalls that would land in a worker thread are decode-pool calls — exactly what the test wants to detect.TestPlanar2MultibandStripParalleladds three tests covering: full-image local parity, windowed local parity, and windowed HTTP parity. The HTTP windowed branch (_fetch_decode_cog_http_strips:2639-2664) has its own per-band loop separate from the local one, so the HTTP test is not redundant with the local windowed test.- The
np.moveaxis(arr, 0, -1)round-trip check is the right shape-aware comparison: the reader returns(H, W, samples)whiletifffile.imwriteaccepts(samples, H, W)forplanarconfig="separate". A naiveassert_array_equal(par, arr)would have masked a band-order bug. rowsperstrip=128on the planar=2 fixtures forces 8 strips per band × 3 bands = 24 strips on the parallel path, enough to engage the gate without dragging out CI.- The HTTP planar=2 test reuses the existing
_start_serverhelper, so the new coverage adds no extra infra.
Checklist
- Algorithm matches reference/paper — N/A (scheduling change only).
- All implemented backends produce consistent results — parity tests pin local serial vs parallel and local vs HTTP across chunky and planar=2 layouts.
- NaN handling is correct — N/A.
- Edge cases are covered by tests — single-strip serial short-circuit (local + HTTP), windowed cross-strip, planar=2 full + windowed, HTTP planar=2 windowed.
- Dask chunk boundaries handled correctly — N/A (eager-only change).
- No premature materialization or unnecessary copies —
pool.mapresult is materialised exactly once to make the placement loop deterministic. - Benchmark exists or is not needed — microbench table in PR body suffices.
- README feature matrix updated (if applicable) — not needed.
- Docstrings present and accurate — per-test docstrings explain branch coverage and Windows-cleanup rationale where it matters.
4 tasks
brendancol
added a commit
that referenced
this pull request
May 19, 2026
…path (#2100) (#2109) * geotiff: cover sparse-strip parallel decode in _read_strips and HTTP path (#2100) The strip-decode parallelisation in #2100 / #2104 added a collect-decode-place pipeline in both _read_strips and _fetch_decode_cog_http_strips. The job-collection loop filters sparse strips (byte_counts[idx] == 0) before they reach the ThreadPoolExecutor. The existing test_parallel_strip_decode_2100.py covers parallel/serial parity, the pool-engaged branch, the single-strip serial short-circuit, windowed strip reads, and planar=2 multi-band, but every fixture is fully populated. The 128x128 sparse fixture in test_sparse_cog.py is below the 64K-pixel parallel gate, so the sparse-strip filter inside the parallel branch is untested. A regression that lost the byte_counts==0 guard would silently ship: the decoder would receive an empty data slice and either raise 'Decompressed tile/strip size mismatch' or return corrupt pixels. Seven tests cover: - local-strip full-image parallel/serial parity with sparse strips - parallel-pool-engaged on a multi-strip sparse image - windowed read across the sparse boundary - all-sparse degenerate (zero filled rows -> empty job list -> short-circuit gate) - planar=2 sparse parity (dedicated samples>1 branch) - HTTP windowed read on a non-sparse strict subset - HTTP windowed read across the sparse boundary Mutation against the strip-job collection sparse guard (delete the byte_counts == 0 continue) flips 5 of 5 local tests red with 'Decompressed tile/strip size mismatch: expected ... got 0'; mutation against the HTTP path sparse guard at line 2646 flips the boundary HTTP test red. Source untouched; clean restore verified via md5sum. deep-sweep-test-coverage pass 18; categories 3 + 4 (HIGH). * test: skip parallel strip decode sparse tests when rasterio missing Module-level ``import rasterio`` raised ``ModuleNotFoundError`` during pytest collection on CI hosts without rasterio, marking the whole run red even though every other test was fine. Use ``pytest.importorskip`` at the top of the module so the file skips cleanly on those hosts. The fixtures genuinely need rasterio (they exercise the GDAL ``SPARSE_OK`` writer option); the reader code under test does not. Mirrors the pattern other golden-corpus tests already use.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_read_strips(local) and_fetch_decode_cog_http_strips(HTTP COG) decoded strips sequentially in a Python for-loop._PARALLEL_DECODE_PIXEL_THRESHOLD(64K pixels) viaThreadPoolExecutor. Codec decode (deflate / zstd / LZW) releases the GIL, so threads actually overlap C-level work.Microbench (4-core, uint16)
Test plan
test_parallel_strip_decode_2100.py(parallel/serial parity, multi-strip pool engaged, single-strip serial short-circuit, windowed cross-strip read, HTTP COG strip parity)pytest xrspatial/geotiff/tests/ -q: 3998 pass, 8 pre-existing failures predate this change (read_to_arrayis now a private attr)Closes #2100.