Skip to content

geotiff: fix libdeflate import, hybrid codec for parallel writer#1826

Open
brendancol wants to merge 4 commits into
mainfrom
geotiff-deflate-hybrid
Open

geotiff: fix libdeflate import, hybrid codec for parallel writer#1826
brendancol wants to merge 4 commits into
mainfrom
geotiff-deflate-hybrid

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • The optional libdeflate dispatch in xrspatial.geotiff imported a package name (libdeflate) that does not exist on PyPI, with an API that belongs to a different binding. _HAVE_LIBDEFLATE was therefore False in every install and deflate writes always fell back to stdlib zlib.compress (~3x slower than the libdeflate that GDAL ships bundled). Fixed by importing the actual deflate PyPI binding.
  • deflate.zlib_compress does not release the GIL (measured: 1.2x speedup across 8 threads vs zlib.compress's 5.1x), so a flat switch to libdeflate would have regressed the parallel strip/tile writer that needs thread-pool scaling on large arrays. Added a gil_friendly flag so sequential callers get libdeflate's per-call speedup and parallel callers stay on stdlib zlib.
  • Added a geotiff extras group (pip install xarray-spatial[geotiff]) that pulls in deflate, and a one-shot fallback warning when it is absent.

Bench

Compared with GDAL (via rasterio.open(..., 'w', compress='DEFLATE')) on float32 arrays:

Case Before After vs GDAL Before vs GDAL After
512x512 deflate 25.5 ms 8.8 ms 3.97x 1.48x
2048x2048 deflate 71.2 ms 65.9 ms 0.82x 0.75x

Small-array deflate now lands at ~1.5x of GDAL (the residual gap is per-write Python overhead in the header assembly path). Large-array deflate stays faster than GDAL via the parallel zlib path.

Notes

  • deflate.zlib_compress returns bytearray; cast to bytes to preserve the documented -> bytes contract.
  • test_deflate_higher_level_not_larger widened from a 128x128 noisy gradient to a 512x512 smooth gradient. Tiny noisy inputs are not reliably monotonic in level vs output size across codec backends.
  • Dropped the now-unused thread-local Compressor cache and its test: the deflate binding's compress functions are stateless.

Test plan

  • pytest xrspatial/geotiff/tests/ -k 'deflate or compression or parallel or write' (586 passed, 2 skipped; one pre-existing unrelated failure on main)
  • Re-ran the local 3-way bench (xrspatial vs rioxarray vs GDAL) and confirmed the numbers above
  • CI green

The deflate write path had two bugs that together explain a 3.97x
slowdown vs GDAL on 512x512 float32 arrays:

1. The optional libdeflate backend imported `libdeflate` (no such package
   on PyPI) with a `Compressor(level).compress(..., Format.ZLIB)` API
   that belongs to a different binding. `_HAVE_LIBDEFLATE` was therefore
   always False in every install and deflate always fell back to stdlib
   `zlib.compress`, which is ~3x slower than what GDAL gets from its
   bundled libdeflate. Switched the import to `deflate` (Christian
   Heimes's PyPI binding) and the call to `deflate.zlib_compress`.

2. `deflate.zlib_compress` does not release the GIL: measured 1.2x
   speedup across 8 threads vs `zlib.compress`'s 5.1x. Routing the
   parallel strip/tile path through libdeflate would have regressed
   large-array writes that rely on thread-pool scaling. Added a
   `gil_friendly` flag to `compress`/`deflate_compress`, plumbed through
   `_prepare_strip`, `_prepare_tile`, and `_compress_block`. Sequential
   callers pick up libdeflate's per-call speedup; parallel callers stay
   on stdlib zlib for GIL release.

Bench (xrspatial.geotiff vs GDAL via rasterio, float32, median of 5):

  512x512 deflate:   25.5 ms -> 8.8 ms   (3.97x -> 1.48x of GDAL)
  2048x2048 deflate: 71.2 ms -> 65.9 ms  (0.82x -> 0.75x of GDAL)

Also:
- `setup.cfg`: new `geotiff = deflate` extras group.
- One-shot UserWarning when libdeflate is absent, pointing at
  `pip install xarray-spatial[geotiff]`.
- `deflate.zlib_compress` returns `bytearray`; cast to `bytes` to keep
  the `-> bytes` contract callers and tests rely on.
- Test `test_deflate_higher_level_not_larger` widened from 128x128 noisy
  gradient to 512x512 smooth gradient: small noisy inputs are not
  reliably monotonic in level vs output size across codec backends.
- Dropped the obsolete thread-local Compressor cache and its test (the
  `deflate` binding's compress functions are stateless).
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 13, 2026
@brendancol brendancol requested a review from Copilot May 13, 2026 17:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates GeoTIFF deflate compression to use the deflate package when beneficial, while preserving zlib for parallel writer paths where GIL behavior matters.

Changes:

  • Switches optional deflate backend import from libdeflate to deflate.
  • Adds a gil_friendly compression flag and threads it through parallel strip/tile/streaming compression paths.
  • Adds a geotiff extras group and adjusts deflate-related tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
xrspatial/geotiff/_compression.py Uses the deflate package, adds fallback warning, and exposes gil_friendly compression dispatch.
xrspatial/geotiff/_writer.py Passes gil_friendly=True through parallel compression paths.
xrspatial/geotiff/tests/test_parallel_writer_1800.py Updates deflate backend tests and removes obsolete thread-local cache coverage.
xrspatial/geotiff/tests/test_compression_level.py Makes compression-level size comparison input larger and smoother.
setup.cfg Adds geotiff optional dependency extra for deflate.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -188,12 +188,14 @@ def test_deflate_compress_zlib_wire_compatible():


def test_deflate_compress_fallback_when_libdeflate_missing(monkeypatch):
The thread-local Compressor cache test was the last direct user of the
bare `from concurrent.futures import ThreadPoolExecutor` import.
Remaining tests reach the executor through `cf.ThreadPoolExecutor` (so
they can monkeypatch it), so the top-level import is now unused.
Removing it silences flake8 F401.
Picks up #1704 (windowed VRT resample read) which updated
test_vrt_dstrect_resample_cap_1737.py; the PR branch was running the
pre-#1704 test file against post-#1704 reader behaviour, failing
test_dstrect_at_cap_succeeds with PixelSafetyLimitError.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants