Speed up GeoTIFF deflate writes: parallelize strip writer, optional libdeflate, adaptive tile-parallel threshold #1800

@brendancol

Reason or Problem

The deflate write path in xrspatial.geotiff is roughly 4x slower than rioxarray/GDAL when writing in strip mode (tiled=False, the current default): 405 ms vs. 102 ms in the local timings. Profiling shows 99% of the time is spent in zlib.compress, running serially. The tile-mode path already parallelizes via a ThreadPoolExecutor in _write_tiled (_writer.py:863), but _write_stripped never got that treatment.

Local timings (20-core box, 2048x2048 float32 random):

| Path | Time |
| --- | --- |
| xrspatial deflate strip (current) | 405 ms |
| rioxarray DEFLATE strip | 102 ms |
| xrspatial deflate tile (current, parallel) | 46 ms |

Real terrain (Copernicus DSM 3600x3600 float32, deflate + predictor=3):

| Path | Time |
| --- | --- |
| xrspatial strip (current) | 948 ms |
| rioxarray (predictor=3) | 356 ms |
| xrspatial tile (current) | 119 ms |

Proposal

Four changes:

  1. Parallelize _write_stripped using the same ThreadPoolExecutor pattern as _write_tiled. zlib, zstd, lz4, and the Numba LZW kernel all release the GIL, so the win is direct. A prototype reaches 70 ms on the 2048x2048 case (5.8x faster than the current 405 ms, and ahead of GDAL's 102 ms) with a bit-identical round-trip.

  2. Optional libdeflate backend in _compression.deflate_compress. libdeflate is typically 1.5-2x faster than zlib for the same compression level, and GDAL >= 3.7 already uses it when available. Detect at import time and fall back to zlib.compress when the package is missing.

  3. Adaptive sequential threshold in _write_tiled. The current n_tiles <= 4 branch is a footgun: tile_size=1024 on a 2048x2048 image produces n_tiles=4 and forces the sequential path, which takes 395 ms instead of the parallel path's 46 ms. Switch to a bytes-based threshold (e.g. total uncompressed payload <= 4 MiB) so large-tiled writes still parallelize.

  4. Default to tiled=True for compressed writes. tiled=False is the only reason the current bench lands in the slow path. Uncompressed writes can keep strip-default since stride matters there.

Design:

For (1), mirror _write_tiled: short-circuit to sequential for compression == COMPRESSION_NONE or num_strips <= 2, otherwise build the strip-encode closure and dispatch with pool.map. JPEG, JPEG2000, LERC, and predictor paths share the same dispatcher.
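
The dispatch described for (1) could look roughly like the sketch below. It is not the code in _writer.py: compress_strips and its parameters are hypothetical names, only deflate is handled, and the strips are assumed to be already-serialized byte buffers. pool.map returns results in input order, so strip offsets can be computed exactly as in the serial path.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

COMPRESSION_NONE = 1     # TIFF Compression tag value: no compression
COMPRESSION_DEFLATE = 8  # TIFF Compression tag value: Adobe deflate


def compress_strips(strips, compression=COMPRESSION_DEFLATE, level=6,
                    max_workers=None):
    """Compress each strip and return the results in strip order.

    Hypothetical stand-in for the encode loop inside _write_stripped.
    """
    # Nothing to compress: skip the pool entirely.
    if compression == COMPRESSION_NONE:
        return [bytes(s) for s in strips]

    def encode(strip):
        # zlib.compress releases the GIL, so threads run concurrently here.
        return zlib.compress(strip, level)

    # Tiny jobs are not worth the thread start-up cost.
    if len(strips) <= 2:
        return [encode(s) for s in strips]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, keeping output identical to serial.
        return list(pool.map(encode, strips))
```

Because each strip is compressed independently with the same level, the parallel output is byte-for-byte the same as a serial loop over zlib.compress.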

For (2), add a module-level _HAVE_LIBDEFLATE flag and cache one Compressor per level per worker thread (libdeflate compressors are not thread-safe, so the cache lives in a thread-local). For level=None, use level 6 (the zlib default) and reuse that level's cached compressor.
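
A minimal sketch of the detection and fallback in (2), under stated assumptions: the issue does not name a Python binding, so the PyPI `deflate` package and its `zlib_compress` function are used here purely as a stand-in, and per_thread_compressor is a hypothetical helper showing the thread-local cache for bindings that expose a reusable compressor object.

```python
import threading
import zlib

# Assumption: the PyPI "deflate" bindings stand in for whichever libdeflate
# wrapper the project adopts. getattr guards against versions lacking the call.
try:
    import deflate as _libdeflate
    _fast_zlib_compress = getattr(_libdeflate, "zlib_compress", None)
except ImportError:
    _fast_zlib_compress = None

_HAVE_LIBDEFLATE = _fast_zlib_compress is not None
_tls = threading.local()


def per_thread_compressor(level, factory):
    """Return this thread's compressor for `level`, creating it on first use.

    `factory` is a hypothetical callable that builds a compressor object.
    """
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}
    if level not in cache:
        cache[level] = factory(level)
    return cache[level]


def deflate_compress(data, level=None):
    """Return a zlib-format stream, using libdeflate when it is importable."""
    level = 6 if level is None else level  # 6 is zlib's default level
    if _HAVE_LIBDEFLATE:
        return bytes(_fast_zlib_compress(data, level))
    return zlib.compress(data, level)
```

Either branch produces a standard zlib stream, so readers do not need to know which backend wrote the file.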

For (3), replace if n_tiles <= 4: with if n_tiles <= 2 or total_bytes <= 4 * 1024 * 1024: to keep the small-payload skip while letting wide tiles still parallelize.
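
To make the effect of the new condition in (3) concrete, here is a small sketch. The function and constant names are illustrative only and do not come from _writer.py; the 4 MiB cut-off is the example value proposed above.

```python
SEQUENTIAL_MAX_TILES = 2                # below this, a pool cannot help much
SEQUENTIAL_MAX_BYTES = 4 * 1024 * 1024  # 4 MiB of uncompressed payload


def uncompressed_bytes(height, width, itemsize, bands=1):
    """Total uncompressed payload for a raster, in bytes."""
    return height * width * itemsize * bands


def use_sequential_path(n_tiles, total_bytes):
    """True when the write should skip the thread pool."""
    return n_tiles <= SEQUENTIAL_MAX_TILES or total_bytes <= SEQUENTIAL_MAX_BYTES


# 2048x2048 float32 with tile_size=1024: 4 tiles, 16 MiB payload.
big = uncompressed_bytes(2048, 2048, 4)
print(use_sequential_path(4, big))    # False: now takes the parallel path

# 512x512 float32 with tile_size=256: 4 tiles, 1 MiB payload.
small = uncompressed_bytes(512, 512, 4)
print(use_sequential_path(4, small))  # True: still skips the pool
```

Under the old n_tiles <= 4 rule both cases would have run sequentially; the bytes-based rule separates them.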

For (4), change the to_geotiff default. Existing tiled=False callers keep working; only the default changes.
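
One way to flip the default in (4) without touching explicit callers is a None sentinel resolved at call time. This is only a sketch: the real to_geotiff signature is not shown in this issue, so resolve_tiled and the string compression names are assumptions.

```python
def resolve_tiled(tiled=None, compression="deflate"):
    """Pick the layout: honor an explicit choice, else decide by compression.

    Hypothetical helper; `tiled=None` means "caller did not specify".
    """
    if tiled is not None:
        return bool(tiled)  # explicit tiled=True/False always wins
    # Compressed writes default to tiles; uncompressed writes keep strips.
    return compression not in (None, "none")


print(resolve_tiled(False, "deflate"))  # False: explicit strip request kept
print(resolve_tiled(None, "deflate"))   # True: new default for compressed
print(resolve_tiled(None, "none"))      # False: uncompressed stays on strips
```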

Usage:

No API change beyond the tiled default in (4). Users who want libdeflate install it; everyone else sees a transparent zlib speedup from (1) and the default flip from (4).

Value:

Brings deflate strip writes from 4x slower than GDAL to 1.5-3x faster than GDAL on typical raster sizes. Removes the n_tiles <= 4 performance cliff. The optional libdeflate path lets users who have libdeflate installed match the throughput of libdeflate-enabled GDAL builds.

Stakeholders and Impacts

Anyone writing compressed GeoTIFFs through to_geotiff benefits. No public-API surface changes beyond the tiled default.

Drawbacks

(4) is a behavior change: existing users relying on the strip-mode default for compressed writes will silently get tiled output. Tiled files are readable by GDAL and other libtiff-based readers, so for most users this is a layout change, not a compatibility break.

Alternatives

  • Drop in zlib-ng instead of libdeflate. libdeflate is already what newer GDAL builds use when available and has better single-threaded throughput, so it's the better default.
  • Skip (3) and document the cliff. Adaptive is one line and removes the footgun.
  • Skip (4) and just fix the strip writer. Doing both means the bench fix doesn't depend on user opt-in.

Unresolved Questions

  • Should libdeflate compression levels >= 10 (the "slow" tier) be exposed? Tentative answer: no, keep the current 1..9 range to match zlib and avoid surprises.

Additional Notes or Context

Working prototype at /tmp/proto_parallel_strip.py (parallel strip writer, monkey-patched). Round-trip data is bit-identical to the current serial path.

Labels: enhancement, performance