## Reason or Problem
The deflate write path in `xrspatial.geotiff` is 3.7x slower than rioxarray/GDAL when writing in strip mode (the default for `tiled=False`). Profiling shows 99% of the time is in `zlib.compress`, running serially. The tile-mode path already parallelizes via a `ThreadPoolExecutor` in `_write_tiled` (`_writer.py:863`), but `_write_stripped` never got that treatment.
Local timings (20-core box, 2048x2048 float32 random):
| Path | Time |
| --- | --- |
| xrspatial deflate strip (current) | 405 ms |
| rioxarray DEFLATE strip | 102 ms |
| xrspatial deflate tile (current, parallel) | 46 ms |
Real terrain (Copernicus DSM 3600x3600 float32, deflate + predictor=3):
| Path | Time |
| --- | --- |
| xrspatial strip (current) | 948 ms |
| rioxarray (predictor=3) | 356 ms |
| xrspatial tile (current) | 119 ms |
## Proposal
Four changes:

1. Parallelize `_write_stripped` using the same `ThreadPoolExecutor` pattern as `_write_tiled`. zlib, zstd, lz4, and the Numba LZW kernel all release the GIL, so the win is direct. A prototype reaches 70 ms on the 2048 case (5.8x faster than current, beating GDAL's 102 ms) with a bit-identical round trip.
2. Optional libdeflate backend in `_compression.deflate_compress`. libdeflate is typically 1.5-2x faster than zlib at the same compression level, and GDAL >= 3.7 already uses it when available. Detect it at import time and fall back to `zlib.compress` when the package is missing.
3. Adaptive sequential threshold in `_write_tiled`. The current `n_tiles <= 4` branch is a footgun: `tile_size=1024` on a 2048x2048 image produces `n_tiles=4` and forces the sequential path, which takes 395 ms instead of the parallel path's 46 ms. Switch to a bytes-based threshold (e.g. total uncompressed payload <= 4 MiB) so large-tiled writes still parallelize.
4. Default to `tiled=True` for compressed writes. `tiled=False` is the only reason the current bench lands in the slow path. Uncompressed writes can keep the strip default, since write stride matters there.
Design:
For (1), mirror `_write_tiled`: short-circuit to sequential for `compression == COMPRESSION_NONE` or `num_strips <= 2`, otherwise build the strip-encode closure and dispatch with `pool.map`. JPEG, JPEG2000, LERC, and predictor paths share the same dispatcher.
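A minimal sketch of that dispatch, using plain zlib for every strip. `write_strips`, the byte-payload split, and the `COMPRESSION_NONE` value are illustrative assumptions, not the actual `_writer.py` internals:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

COMPRESSION_NONE = 1  # TIFF "no compression" tag value (assumed constant)


def write_strips(data: bytes, num_strips: int, compression: int, level: int = 6):
    """Split a raster payload into strips and compress them in parallel."""
    strip_size = -(-len(data) // num_strips)  # ceil division
    strips = [data[i:i + strip_size] for i in range(0, len(data), strip_size)]

    # Short-circuit: no compression, or too few strips to amortize threads.
    if compression == COMPRESSION_NONE or num_strips <= 2:
        return [s if compression == COMPRESSION_NONE else zlib.compress(s, level)
                for s in strips]

    def encode_strip(strip: bytes) -> bytes:
        # zlib releases the GIL while compressing, so threads scale.
        return zlib.compress(strip, level)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(encode_strip, strips))
```

The real dispatcher would pick `encode_strip` per compression scheme (deflate, zstd, lz4, LZW, JPEG, ...) but keep the same `pool.map` shape.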
For (2), add a module-level `_HAVE_LIBDEFLATE` flag and cache one `Compressor` per (thread, level) pair: libdeflate compressors are not thread-safe, so each worker gets its own via a thread-local. For `level=None` (zlib's default of 6), reuse the cached compressor.
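A sketch of the detection-and-fallback shape, assuming the function-style `deflate` binding of libdeflate (the binding name and `zlib_compress` signature are assumptions; the per-thread `Compressor` cache above applies to object-based bindings, which this simplified version does not need):

```python
import zlib

try:
    import deflate  # hypothetical/optional libdeflate binding
    _HAVE_LIBDEFLATE = hasattr(deflate, "zlib_compress")
except ImportError:
    _HAVE_LIBDEFLATE = False


def deflate_compress(data: bytes, level=None) -> bytes:
    """Deflate in a zlib container; uses libdeflate when available."""
    lvl = 6 if level is None else level  # 6 is zlib's default level
    if _HAVE_LIBDEFLATE:
        # libdeflate emits a valid zlib stream, so readers are unaffected.
        return deflate.zlib_compress(data, lvl)
    return zlib.compress(data, lvl)
```

Either branch round-trips through `zlib.decompress`, which is what keeps the swap transparent to TIFF readers.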
For (3), replace `if n_tiles <= 4:` with `if n_tiles <= 2 or total_bytes <= 4 * 1024 * 1024:` to keep the small-payload skip while letting wide tiles still parallelize.
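Pulled out as a predicate, the proposed rule looks like this (the function name is illustrative, not an existing helper):

```python
_SEQUENTIAL_PAYLOAD_BYTES = 4 * 1024 * 1024  # 4 MiB cutoff from the proposal


def should_run_sequential(n_tiles: int, total_bytes: int) -> bool:
    # Tiny tile counts never amortize thread startup; beyond that, decide
    # by uncompressed payload size instead of tile count alone.
    return n_tiles <= 2 or total_bytes <= _SEQUENTIAL_PAYLOAD_BYTES
```

The motivating case: a 2048x2048 float32 raster is 16 MiB, so `tile_size=1024` gives four tiles but now takes the parallel path, where the old `n_tiles <= 4` rule forced it sequential.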
For (4), change the `to_geotiff` default. Existing `tiled=False` callers keep working; only the default changes.
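One way to implement the flip is a `tiled=None` sentinel that resolves per compression mode; this helper is a sketch of that idea, not the real `to_geotiff` signature:

```python
def resolve_tiled(tiled, compression: str) -> bool:
    """Resolve the effective layout for a write.

    tiled=None means "unspecified": compressed writes default to tiled
    layout (fast parallel path), uncompressed writes keep strips.
    """
    if tiled is not None:
        return tiled  # explicit caller choice always wins
    return compression != "none"
```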
Usage:
No API change. Users who want libdeflate install it; everyone else sees a transparent zlib speedup from (1) and the default flip from (4).
Value:
Brings deflate strip writes from roughly 4x slower than GDAL to 1.5-3x faster than GDAL on typical raster sizes. Removes the `n_tiles <= 4` performance cliff. The optional libdeflate path lets installs with libdeflate match the throughput of GDAL builds that use it.
## Stakeholders and Impacts
Anyone writing compressed GeoTIFFs through `to_geotiff` benefits. No public-API surface changes beyond the `tiled` default.
## Drawbacks
(4) is a behavior change: existing users relying on the strip-mode default for compressed writes will silently get tiled output. Tile-mode files are still readable by every TIFF reader, so this is a layout change, not a compatibility break.
## Alternatives
- Drop in zlib-ng instead of libdeflate. libdeflate is a hard requirement of newer GDAL builds and has better single-threaded throughput, so it's the better default.
- Skip (3) and just document the cliff. The adaptive threshold is a one-line change and removes the footgun, so fixing it is cheaper than documenting it.
- Skip (4) and just fix the strip writer. Doing both means the bench fix doesn't depend on user opt-in.
## Unresolved Questions
- Should libdeflate compression levels >= 10 (the "slow" tier) be exposed? Keep the current `1..9` range to match zlib and avoid surprises.
## Additional Notes or Context
Working prototype at `/tmp/proto_parallel_strip.py` (parallel strip writer, monkey-patched). Round-trip data is bit-identical to the current serial path.