
Fix GPU predictor kernel stride for multi-sample TIFFs #1222

Merged

brendancol merged 1 commit into master from issue-1220 on Apr 19, 2026

Conversation

@brendancol
Contributor

Fixes #1220.

Both call sites of _predictor_decode_kernel passed width = tile_width * samples and bytes_per_sample = itemsize * samples. The kernel body computes row_bytes = width * bytes_per_sample, so row_bytes came out to tile_width * samples**2 * itemsize instead of tile_width * samples * itemsize.
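
Concretely, with illustrative numbers (a sketch of the arithmetic only, not the kernel code):

```python
# Illustrative arithmetic only; variable names mirror the description, not the kernel source.
tile_width, samples, itemsize = 256, 3, 1   # e.g. an RGB uint8 tile 256 pixels wide

# What both call sites passed before the fix:
width = tile_width * samples              # 768
bytes_per_sample = itemsize * samples     # 3
row_bytes = width * bytes_per_sample      # 2304 -- carries a samples**2 factor

# What one tile row actually holds:
actual_row_bytes = tile_width * samples * itemsize   # 768
assert row_bytes == samples * actual_row_bytes       # the loop walks 3x too far per row
```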

For tiled multi-sample TIFFs with predictor=2, that meant the cumulative-sum loop walked samples times as many bytes per row as the row actually contains. On the last tile in the buffer it wrote past the end of d_decomp itself: an out-of-bounds GPU write that surfaces no error.

To reproduce: any tiled TIFF with SamplesPerPixel > 1 (RGB, RGBA) and predictor=2, read via open_geotiff(path, gpu=True) or read_geotiff_gpu, produces pixels that don't match the CPU decode.
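
A minimal reproduction along those lines might look like this (a sketch, not a script from the repo: the xrspatial.geotiff import path and the gpu=False spelling of the CPU path are assumptions, and tifffile is used only to build the input):

```python
import numpy as np
import tifffile
from xrspatial.geotiff import open_geotiff  # import path assumed

# Build a tiled RGB TIFF with the horizontal-differencing predictor (predictor=2).
data = np.random.default_rng(0).integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
tifffile.imwrite("rgb_pred2.tif", data, tile=(256, 256),
                 compression="zlib", predictor=2)

cpu = open_geotiff("rgb_pred2.tif", gpu=False)  # gpu=False assumed for the CPU path
gpu = open_geotiff("rgb_pred2.tif", gpu=True)

# Conversion of a GPU-backed result to NumPy is schematic here (may need cupy.asnumpy).
# Before the fix the decodes diverge; after it they match byte for byte.
np.testing.assert_array_equal(np.asarray(cpu), np.asarray(gpu))
```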

The fix is to pass width = tile_width at both call sites; bytes_per_sample = itemsize * samples stays. No kernel change. This matches what the CPU path already does in _reader._apply_predictor(..., bytes_per_sample * samples).
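
Schematically (hypothetical helper below, not the repository's literal call sites), the argument convention after the change is:

```python
# Hypothetical helper mirroring the description; not the actual kernel launch code.
def predictor_kernel_args(tile_width, samples, itemsize):
    """Arguments both call sites pass to _predictor_decode_kernel after the fix."""
    return {
        "width": tile_width,                     # pixels per row, not tile_width * samples
        "bytes_per_sample": itemsize * samples,  # unchanged
    }

args = predictor_kernel_args(tile_width=256, samples=3, itemsize=1)
# row_bytes now equals the true row size, matching the CPU convention of
# _reader._apply_predictor(..., bytes_per_sample * samples).
assert args["width"] * args["bytes_per_sample"] == 256 * 3 * 1
```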

The new test file xrspatial/geotiff/tests/test_predictor_multisample.py builds tiled multi-sample TIFFs (RGB/RGBA, uint8/uint16, even and uneven tile grids) with predictor=2, decodes on the GPU, and checks byte-for-byte equality with the CPU decode. I confirmed the tests fail on the unpatched code and pass after the fix by stashing the change and re-running.
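
Roughly, the tests have this shape (a condensed sketch, not the file's contents; the import path, TIFF construction via tifffile, the parameter choices, and the coercion to NumPy for comparison are assumptions):

```python
import numpy as np
import pytest
import tifffile

pytest.importorskip("cupy")  # skip cleanly where CuPy/CUDA is unavailable
from xrspatial.geotiff import open_geotiff  # import path assumed


@pytest.mark.parametrize("samples,dtype", [(3, np.uint8), (4, np.uint16)])
def test_predictor2_multisample_matches_cpu(tmp_path, samples, dtype):
    path = tmp_path / "pred2.tif"
    shape = (300, 300, samples)  # deliberately uneven against a 256-pixel tile grid
    data = np.random.default_rng(1).integers(0, np.iinfo(dtype).max,
                                             size=shape, dtype=dtype)
    tifffile.imwrite(path, data, tile=(256, 256), compression="zlib", predictor=2)

    cpu = np.asarray(open_geotiff(path, gpu=False))
    gpu = np.asarray(open_geotiff(path, gpu=True))
    np.testing.assert_array_equal(cpu, gpu)  # byte-for-byte equality with the CPU decode
```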

Test plan:

  • pytest xrspatial/geotiff/tests/test_predictor_multisample.py passes on a box with CuPy + CUDA.
  • Tests skip cleanly where CUDA isn't available.
  • pytest xrspatial/geotiff/tests/ is clean apart from the pre-existing matplotlib palette recursion failures.

One thing I noticed but didn't fix here: the predictor=3 path has its own argument pattern (bytes_per_sample=itemsize, no * samples) that also looks wrong for multi-sample data. Out of scope for this PR.

@github-actions bot added the performance label (PR touches performance-sensitive code) on Apr 19, 2026
@brendancol merged commit 0fb7f4c into master on Apr 19, 2026
11 checks passed
@brendancol deleted the issue-1220 branch on May 4, 2026

Development

Successfully merging this pull request may close these issues:

GPU predictor decode kernel over-indexes d_decomp for multi-sample tiled TIFFs (#1220)
