geotiff: parallelize strip writer, adaptive tile threshold, optional libdeflate (closes #1800) #1801

Merged: brendancol merged 2 commits into main from issue-1800 on May 13, 2026

@brendancol
Contributor

Closes #1800.

Summary

  • Parallelizes _write_stripped with the same ThreadPoolExecutor pattern that _write_tiled has used since the start. zlib, zstd, lz4, and the Numba LZW kernel all release the GIL, so this is a direct speedup with no new code paths to maintain.
  • Replaces the tile writer's n_tiles <= 4 sequential cutoff with a bytes-based threshold (_PARALLEL_MIN_BYTES = 4 MiB). The old cutoff turned tile_size=1024 on a 2048x2048 image (n_tiles=4, 16 MiB) into the slow path, ~8x slower than the parallel path.
  • Adds an optional libdeflate backend in _compression.deflate_compress with a stdlib zlib.compress fallback. Compressors are cached per thread via threading.local; output is wire-compatible (zlib format) either way.
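To make the second bullet concrete, the sketch below contrasts the old count-based cutoff with a bytes-based one on the 2048x2048 case. Only the `n_tiles <= 4` rule and the `_PARALLEL_MIN_BYTES = 4 MiB` value come from this PR; the helper names, the use of uncompressed payload size, and the behavior at exactly 4 MiB are assumptions made for illustration.

```python
# Hypothetical helpers; only the constant's value and the old cutoff come from the PR.
_PARALLEL_MIN_BYTES = 4 * 1024 * 1024  # 4 MiB


def old_use_parallel(n_tiles: int) -> bool:
    """Old rule: four tiles or fewer always ran sequentially."""
    return n_tiles > 4


def new_use_parallel(total_bytes: int) -> bool:
    """Assumed form of the new rule: parallelize once the payload is large enough."""
    return total_bytes >= _PARALLEL_MIN_BYTES


# 2048x2048 float32 image written with tile_size=1024.
width = height = 2048
tile_size = 1024
itemsize = 4  # bytes per float32 sample

n_tiles = (width // tile_size) * (height // tile_size)
total_bytes = width * height * itemsize

print("tiles:", n_tiles, "payload MiB:", total_bytes // (1024 * 1024))
print("old rule parallel:", old_use_parallel(n_tiles))
print("new rule parallel:", new_use_parallel(total_bytes))
```

The tile count stays small while the payload is well above the threshold, which is why a count-based cutoff misclassifies large tiles.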

Numbers (20-core box)

| Workload | Before | After | rioxarray/GDAL |
|---|---|---|---|
| 2048x2048 float32 random, deflate strip | 405 ms | 70 ms | 102 ms |
| 3600x3600 float32 terrain, deflate+predictor=3, strip | 948 ms | 121 ms | 356 ms |
| 2048x2048 deflate tile_size=1024 | 395 ms | 109 ms | n/a |

After the change, deflate strip writes are 1.4-2.9x faster than rioxarray/GDAL. The tile writer's fast path is unchanged.
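As a quick arithmetic check, the ratios behind these statements can be recomputed from the reported timings. The snippet only divides the published millisecond figures; the dictionary layout and key names are illustrative.

```python
# Timings in milliseconds, copied from the Numbers table in this PR description.
timings = {
    "2048x2048 deflate strip": {"before": 405, "after": 70, "gdal": 102},
    "3600x3600 deflate+predictor=3 strip": {"before": 948, "after": 121, "gdal": 356},
    "2048x2048 deflate tile_size=1024": {"before": 395, "after": 109, "gdal": None},
}

ratios = {}
for name, t in timings.items():
    speedup = t["before"] / t["after"]  # improvement over the old writer
    vs_gdal = t["gdal"] / t["after"] if t["gdal"] is not None else None
    ratios[name] = (speedup, vs_gdal)
    gdal_text = f"{vs_gdal:.2f}x" if vs_gdal is not None else "n/a"
    print(f"{name}: {speedup:.2f}x vs before, {gdal_text} vs rioxarray/GDAL")
```

The two strip rows come out at roughly 1.46x and 2.94x against rioxarray/GDAL, consistent with the quoted 1.4-2.9x range.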

Compatibility

Round-trip data is bit-identical to the current serial path; verified across float32, float64, uint8, uint16, int16, int32 and across compression={none, deflate, lzw, zstd} with and without predictor. Files produced by the new path open in rioxarray/GDAL and decode through stdlib zlib.

No public-API change. libdeflate is an optional install; absent it, the code calls zlib.compress directly with no extra overhead.

Test plan

  • xrspatial/geotiff/tests/test_parallel_writer_1800.py (21 new tests): round-trip parity across dtypes and compression codecs, threshold-based path-selection probes (ThreadPoolExecutor is/isn't constructed), libdeflate fallback equivalence to zlib.compress, end-to-end writes through write().
  • Full xrspatial/geotiff/tests/ suite passes. 8 pre-existing GPU test failures (test_predictor2_big_endian_gpu_1517, test_size_param_validation_gpu_vrt_1776) and 1 pre-existing matplotlib deepcopy failure (test_features.py::TestPalette) are reproducible on main and unrelated to this change.
  • Cross-library round-trip with rioxarray/GDAL on synthetic and real (Copernicus DSM, USGS) GeoTIFFs.
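One way to write a path-selection probe like the ones in the first bullet is to count how often a pool is constructed. The sketch below injects a counting factory rather than patching the module, which is an assumption about technique and may differ from how the actual tests do it; `compress_chunks` and `CountingFactory` are hypothetical names, and only the 4 MiB threshold comes from the PR.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

_PARALLEL_MIN_BYTES = 4 * 1024 * 1024  # value taken from the PR


def compress_chunks(chunks, executor_factory=ThreadPoolExecutor):
    """Compress chunks, building a pool only when the payload is large enough."""
    total = sum(len(c) for c in chunks)
    if total < _PARALLEL_MIN_BYTES:
        return [zlib.compress(c) for c in chunks]
    with executor_factory(max_workers=4) as pool:
        return list(pool.map(zlib.compress, chunks))


class CountingFactory:
    """Wraps ThreadPoolExecutor and records how many pools were constructed."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return ThreadPoolExecutor(*args, **kwargs)


small = [bytes(1024)] * 4             # 4 KiB total, below the threshold
large = [bytes(2 * 1024 * 1024)] * 4  # 8 MiB total, above the threshold

probe_small = CountingFactory()
out_small = compress_chunks(small, probe_small)
probe_large = CountingFactory()
out_large = compress_chunks(large, probe_large)

print("pools built:", probe_small.calls, probe_large.calls)
```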

…#1800)

The deflate strip-write path was 3.7x slower than rioxarray/GDAL because
`_write_stripped` ran zlib.compress serially while the tile writer
already parallelized via a thread pool. Three changes:

1. Mirror `_write_tiled`'s ThreadPoolExecutor pattern in
   `_write_stripped`. Strip preparation is hoisted into a new
   `_prepare_strip` helper so the same code drives both the serial and
   parallel paths. A 2048x2048 deflate strip write drops from 405 ms to
   70 ms (5.8x speedup, beats rioxarray's 102 ms).

2. Replace the tile writer's `n_tiles <= 4` sequential cutoff with a
   bytes-based threshold (`_PARALLEL_MIN_BYTES = 4 MiB`). Pre-fix,
   `tile_size=1024` on a 2048x2048 image produced n_tiles=4 and forced
   the slow path; now those writes parallelize too.

3. Route `deflate_compress` through the optional `libdeflate` package
   when installed (1.5-2x faster than stdlib zlib at the same level;
   GDAL >= 3.7 already uses it). Output is wire-compatible -- decoded
   streams round-trip through `zlib.decompress` unchanged. Compressors
   are cached per thread via `threading.local`.
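The first change can be sketched as follows: one per-strip helper drives both the serial and the parallel path, so the two can only differ in scheduling. This is a minimal stdlib-only illustration, not the code in `_writer.py`; the `prepare_strip` and `compress_strips` signatures and the raw-bytes raster layout are assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def prepare_strip(raw: bytes, row_bytes: int, rows_per_strip: int, index: int) -> bytes:
    """Slice strip `index` out of the raster bytes and deflate it (hypothetical helper)."""
    strip_bytes = row_bytes * rows_per_strip
    start = index * strip_bytes
    # Slicing past the end is safe, so the last strip may be shorter.
    return zlib.compress(raw[start:start + strip_bytes], 6)


def compress_strips(raw: bytes, row_bytes: int, rows_per_strip: int, parallel: bool):
    """Compress every strip, serially or on a thread pool, using the same helper."""
    n_rows = len(raw) // row_bytes
    n_strips = -(-n_rows // rows_per_strip)  # ceiling division
    indices = range(n_strips)
    if not parallel:
        return [prepare_strip(raw, row_bytes, rows_per_strip, i) for i in indices]
    with ThreadPoolExecutor() as pool:
        # map() returns results in submission order, so strip order is preserved.
        return list(pool.map(
            lambda i: prepare_strip(raw, row_bytes, rows_per_strip, i), indices))


row_bytes, n_rows, rows_per_strip = 256, 100, 16
raw = bytes((i * 7) % 251 for i in range(row_bytes * n_rows))

serial = compress_strips(raw, row_bytes, rows_per_strip, parallel=False)
threaded = compress_strips(raw, row_bytes, rows_per_strip, parallel=True)

print(len(serial), serial == threaded)
print(b"".join(zlib.decompress(s) for s in threaded) == raw)
```

Since `zlib.compress` releases the GIL while it works, the threaded path can overlap compression of different strips.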
@brendancol added the enhancement and performance labels on May 13, 2026
@brendancol requested a review from Copilot on May 13, 2026 14:36
Contributor

Copilot AI left a comment

Pull request overview

This PR improves GeoTIFF write performance by extending the existing parallel tile-compression approach to strip writes, adding a payload-size-based heuristic for when to parallelize, and introducing an optional libdeflate backend for faster deflate compression while maintaining zlib wire compatibility.

Changes:

  • Parallelize _write_stripped using ThreadPoolExecutor for sufficiently large compressed payloads.
  • Replace the tile writer’s n_tiles <= 4 sequential cutoff with a bytes-based threshold (_PARALLEL_MIN_BYTES = 4 MiB).
  • Add an optional libdeflate path in deflate_compress() with per-thread compressor caching and zlib.compress fallback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| xrspatial/geotiff/_writer.py | Adds _PARALLEL_MIN_BYTES, introduces parallel strip compression, and updates tile sequential/parallel decision logic to be payload-size-based. |
| xrspatial/geotiff/_compression.py | Adds optional libdeflate backend for deflate compression with thread-local compressor caching and zlib fallback. |
| xrspatial/geotiff/tests/test_parallel_writer_1800.py | Adds round-trip and path-selection tests for strip/tile parallelization and libdeflate fallback behavior. |

Comment on lines +216 to +234
def test_libdeflate_compressor_cache_is_thread_local():
    """The cache lives in threading.local, so two threads see distinct dicts."""
    import xrspatial.geotiff._compression as comp_mod

    if not comp_mod._HAVE_LIBDEFLATE:
        pytest.skip('libdeflate not installed')

    seen = {}

    def grab(tag):
        # First call populates the cache; we grab its id().
        comp_mod._libdeflate_compressor(6)
        seen[tag] = id(comp_mod._libdeflate_thread_local.cache)

    t1 = ThreadPoolExecutor(max_workers=2)
    list(t1.map(grab, ['a', 'b']))
    t1.shutdown(wait=True)
    # Two workers populated two distinct local caches.
    assert len(set(seen.values())) == 2
PR #1801's review flagged that
`test_libdeflate_compressor_cache_is_thread_local` could pass with a
single observed cache id: `ThreadPoolExecutor(max_workers=2).map(...)`
is free to run both submissions on the same worker if the first
returns quickly. Force both tasks to occupy a worker at the same time
with a `threading.Barrier`, record `threading.get_ident()` so the
assertion fails loudly if only one thread actually ran, and use the
executor as a context manager so the pool is shut down on assertion
failure.
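The approach described in this commit message can be sketched with the standard library alone. The cache below is a generic `threading.local` dict rather than the real `_libdeflate_thread_local`, and the function names are hypothetical; the barrier, the `threading.get_ident()` bookkeeping, and the context-managed executor follow the description above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def _get_cache() -> dict:
    """Return this thread's cache dict, creating it on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    return cache


def check_cache_is_thread_local() -> dict:
    barrier = threading.Barrier(2, timeout=10)
    seen = {}

    def grab(tag):
        cache = _get_cache()
        # Hold this worker until the other task is also running, so the pool
        # cannot serve both submissions from a single thread.
        barrier.wait()
        seen[tag] = (threading.get_ident(), id(cache))

    # The context manager shuts the pool down even if an assertion fails.
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(grab, ["a", "b"]))

    thread_ids = {ident for ident, _ in seen.values()}
    cache_ids = {cache_id for _, cache_id in seen.values()}
    assert len(thread_ids) == 2, "both tasks ran on the same worker"
    assert len(cache_ids) == 2, "workers shared one cache"
    return seen


result = check_cache_is_thread_local()
print(len(result), "tasks observed on distinct threads")
```

The barrier timeout keeps the check from hanging forever if the pool ever fails to start a second worker; in that case `barrier.wait()` raises instead.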
@brendancol merged commit 56cc261 into main on May 13, 2026
2 of 11 checks passed
brendancol added a commit that referenced this pull request on May 13, 2026
Resolves conflict in xrspatial/geotiff/__init__.py: keeps the
`_read_vrt_dask` dispatch hook from the PR branch. All other
geotiff changes from main (#1791, #1793, #1801, #1802, #1803, #1804,
#1805, #1806) were already integrated into the working tree by the
prior 7329dd9 commit; this merge just records the parent so git
recognises the reconciliation.

Development

Successfully merging this pull request may close these issues.

Speed up GeoTIFF deflate writes: parallelize strip writer, optional libdeflate, adaptive tile-parallel threshold