## Reason or Problem
The deflate write path in `xrspatial.geotiff` is 3.7x slower than rioxarray/GDAL when writing in strip mode (the default for `tiled=False`). Profiling shows 99% of the time is in `zlib.compress`, running serially. The tile-mode path already parallelizes via a `ThreadPoolExecutor` in `_write_tiled` (`_writer.py:863`), but `_write_stripped` never got that treatment.
Local timings (20-core box, 2048x2048 float32 random):
| Path | Time |
| --- | --- |
| xrspatial deflate strip (current) | 405 ms |
| rioxarray DEFLATE strip | 102 ms |
| xrspatial deflate tile (current, parallel) | 46 ms |
Real terrain (Copernicus DSM 3600x3600 float32, deflate + predictor=3):
| Path | Time |
| --- | --- |
| xrspatial strip (current) | 948 ms |
| rioxarray (predictor=3) | 356 ms |
| xrspatial tile (current) | 119 ms |
## Proposal
Four changes:

1. Parallelize `_write_stripped` using the same `ThreadPoolExecutor` pattern as `_write_tiled`. zlib, zstd, lz4, and the Numba LZW kernel all release the GIL, so the win is direct. A prototype reaches 70 ms on the 2048 case (5.8x faster than current, beating GDAL's 102 ms) with a bit-identical round trip.
2. Optional libdeflate backend in `_compression.deflate_compress`. libdeflate is typically 1.5-2x faster than zlib at the same compression level, and GDAL >= 3.7 already uses it when available. Detect it at import time and fall back to `zlib.compress` when the package is missing.
3. Adaptive sequential threshold in `_write_tiled`. The current `n_tiles <= 4` branch is a footgun: `tile_size=1024` on a 2048x2048 image produces `n_tiles=4` and forces the sequential path, which takes 395 ms instead of the parallel path's 46 ms. Switch to a bytes-based threshold (e.g. total uncompressed payload <= 4 MiB) so large-tiled writes still parallelize.
4. Default to `tiled=True` for compressed writes. `tiled=False` is the only reason the current bench lands in the slow path. Uncompressed writes can keep the strip default, since write stride matters there.
Design:
For (1), mirror `_write_tiled`: short-circuit to sequential for `compression == COMPRESSION_NONE` or `num_strips <= 2`, otherwise build the strip-encode closure and dispatch with `pool.map`. JPEG, JPEG2000, LERC, and predictor paths share the same dispatcher.
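A minimal sketch of that dispatch, using plain zlib for every strip. `write_strips`, the byte-payload split, and the `COMPRESSION_NONE` value are illustrative assumptions, not the actual `_writer.py` internals:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

COMPRESSION_NONE = 1  # TIFF "no compression" tag value (assumed constant)


def write_strips(data: bytes, num_strips: int, compression: int, level: int = 6):
    """Split a raster payload into strips and compress them in parallel."""
    strip_size = -(-len(data) // num_strips)  # ceil division
    strips = [data[i:i + strip_size] for i in range(0, len(data), strip_size)]

    # Short-circuit: no compression, or too few strips to amortize threads.
    if compression == COMPRESSION_NONE or num_strips <= 2:
        return [s if compression == COMPRESSION_NONE else zlib.compress(s, level)
                for s in strips]

    def encode_strip(strip: bytes) -> bytes:
        # zlib releases the GIL while compressing, so threads scale.
        return zlib.compress(strip, level)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(encode_strip, strips))
```

The real dispatcher would pick `encode_strip` per compression scheme (deflate, zstd, lz4, LZW, JPEG, ...) but keep the same `pool.map` shape.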
For (2), add a module-level `_HAVE_LIBDEFLATE` flag and cache one `Compressor` per (thread, level) pair: libdeflate compressors are not thread-safe, so each worker gets its own via a thread-local. For `level=None` (zlib's default of 6), reuse the cached compressor.
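A sketch of the detection-and-fallback shape, assuming the function-style `deflate` binding of libdeflate (the binding name and `zlib_compress` signature are assumptions; the per-thread `Compressor` cache above applies to object-based bindings, which this simplified version does not need):

```python
import zlib

try:
    import deflate  # hypothetical/optional libdeflate binding
    _HAVE_LIBDEFLATE = hasattr(deflate, "zlib_compress")
except ImportError:
    _HAVE_LIBDEFLATE = False


def deflate_compress(data: bytes, level=None) -> bytes:
    """Deflate in a zlib container; uses libdeflate when available."""
    lvl = 6 if level is None else level  # 6 is zlib's default level
    if _HAVE_LIBDEFLATE:
        # libdeflate emits a valid zlib stream, so readers are unaffected.
        return deflate.zlib_compress(data, lvl)
    return zlib.compress(data, lvl)
```

Either branch round-trips through `zlib.decompress`, which is what keeps the swap transparent to TIFF readers.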
For (3), replace `if n_tiles <= 4:` with `if n_tiles <= 2 or total_bytes <= 4 * 1024 * 1024:` to keep the small-payload skip while letting wide tiles still parallelize.
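Pulled out as a predicate, the proposed rule looks like this (the function name is illustrative, not an existing helper):

```python
_SEQUENTIAL_PAYLOAD_BYTES = 4 * 1024 * 1024  # 4 MiB cutoff from the proposal


def should_run_sequential(n_tiles: int, total_bytes: int) -> bool:
    # Tiny tile counts never amortize thread startup; beyond that, decide
    # by uncompressed payload size instead of tile count alone.
    return n_tiles <= 2 or total_bytes <= _SEQUENTIAL_PAYLOAD_BYTES
```

The motivating case: a 2048x2048 float32 raster is 16 MiB, so `tile_size=1024` gives four tiles but now takes the parallel path, where the old `n_tiles <= 4` rule forced it sequential.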
For (4), change the `to_geotiff` default. Existing `tiled=False` callers keep working; only the default changes.
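One way to implement the flip is a `tiled=None` sentinel that resolves per compression mode; this helper is a sketch of that idea, not the real `to_geotiff` signature:

```python
def resolve_tiled(tiled, compression: str) -> bool:
    """Resolve the effective layout for a write.

    tiled=None means "unspecified": compressed writes default to tiled
    layout (fast parallel path), uncompressed writes keep strips.
    """
    if tiled is not None:
        return tiled  # explicit caller choice always wins
    return compression != "none"
```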
Usage:
No API change. Users who want libdeflate install it; everyone else sees a transparent zlib speedup from (1) and the default flip from (4).
Value:
Brings deflate strip writes from roughly 4x slower than GDAL to 1.5-3x faster than GDAL on typical raster sizes. Removes the `n_tiles <= 4` performance cliff. The optional libdeflate path lets installs with libdeflate match the throughput of GDAL builds that use it.
## Stakeholders and Impacts
Anyone writing compressed GeoTIFFs through `to_geotiff` benefits. No public-API surface changes beyond the `tiled` default.
## Drawbacks
(4) is a behavior change: existing users relying on the strip-mode default for compressed writes will silently get tiled output. Tile-mode files are still readable by every TIFF reader, so this is a layout change, not a compatibility break.
## Alternatives
- Drop in zlib-ng instead of libdeflate. libdeflate is a hard requirement of newer GDAL builds and has better single-threaded throughput, so it's the better default.
- Skip (3) and just document the cliff. The adaptive threshold is a one-line change and removes the footgun, so fixing it is cheaper than documenting it.
- Skip (4) and just fix the strip writer. Doing both means the bench fix doesn't depend on user opt-in.
## Unresolved Questions
- Should libdeflate compression levels >= 10 (the "slow" tier) be exposed? Keep the current `1..9` range to match zlib and avoid surprises.
## Additional Notes or Context
Working prototype at `/tmp/proto_parallel_strip.py` (parallel strip writer, monkey-patched). Round-trip data is bit-identical to the current serial path.