
Fix GPU predictor kernel stride for multi-sample TIFFs #1222

Merged

brendancol merged 1 commit into master from issue-1220 on Apr 19, 2026

Conversation

@brendancol
Contributor

Fixes #1220.

Both call sites of _predictor_decode_kernel passed width = tile_width * samples and bytes_per_sample = itemsize * samples. The kernel body computes row_bytes = width * bytes_per_sample, so row_bytes came out to tile_width * samples**2 * itemsize instead of tile_width * samples * itemsize.
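
Concretely, with illustrative numbers (a sketch of the arithmetic only, not the kernel code):

```python
# Illustrative arithmetic only; variable names mirror the description, not the kernel source.
tile_width, samples, itemsize = 256, 3, 1   # e.g. an RGB uint8 tile 256 pixels wide

# What both call sites passed before the fix:
width = tile_width * samples              # 768
bytes_per_sample = itemsize * samples     # 3
row_bytes = width * bytes_per_sample      # 2304 -- carries a samples**2 factor

# What one tile row actually holds:
actual_row_bytes = tile_width * samples * itemsize   # 768
assert row_bytes == samples * actual_row_bytes       # the loop walks 3x too far per row
```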

For tiled multi-sample TIFFs with predictor=2, that meant the cumulative-sum loop walked samples times as many bytes per row as the row actually contains. On the last tile in the buffer it wrote past the end of d_decomp itself: an out-of-bounds GPU write that surfaces no error.

To reproduce: any tiled TIFF with SamplesPerPixel > 1 (RGB, RGBA) and predictor=2, read via open_geotiff(path, gpu=True) or read_geotiff_gpu, produces pixels that don't match the CPU decode.
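
A minimal reproduction along those lines might look like this (a sketch, not a script from the repo: the xrspatial.geotiff import path and the gpu=False spelling of the CPU path are assumptions, and tifffile is used only to build the input):

```python
import numpy as np
import tifffile
from xrspatial.geotiff import open_geotiff  # import path assumed

# Build a tiled RGB TIFF with the horizontal-differencing predictor (predictor=2).
data = np.random.default_rng(0).integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
tifffile.imwrite("rgb_pred2.tif", data, tile=(256, 256),
                 compression="zlib", predictor=2)

cpu = open_geotiff("rgb_pred2.tif", gpu=False)  # gpu=False assumed for the CPU path
gpu = open_geotiff("rgb_pred2.tif", gpu=True)

# Conversion of a GPU-backed result to NumPy is schematic here (may need cupy.asnumpy).
# Before the fix the decodes diverge; after it they match byte for byte.
np.testing.assert_array_equal(np.asarray(cpu), np.asarray(gpu))
```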

The fix is to pass width = tile_width at both call sites; bytes_per_sample = itemsize * samples stays. No kernel change. This matches what the CPU path already does in _reader._apply_predictor(..., bytes_per_sample * samples).
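
Schematically (hypothetical helper below, not the repository's literal call sites), the argument convention after the change is:

```python
# Hypothetical helper mirroring the description; not the actual kernel launch code.
def predictor_kernel_args(tile_width, samples, itemsize):
    """Arguments both call sites pass to _predictor_decode_kernel after the fix."""
    return {
        "width": tile_width,                     # pixels per row, not tile_width * samples
        "bytes_per_sample": itemsize * samples,  # unchanged
    }

args = predictor_kernel_args(tile_width=256, samples=3, itemsize=1)
# row_bytes now equals the true row size, matching the CPU convention of
# _reader._apply_predictor(..., bytes_per_sample * samples).
assert args["width"] * args["bytes_per_sample"] == 256 * 3 * 1
```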

The new test file xrspatial/geotiff/tests/test_predictor_multisample.py builds tiled multi-sample TIFFs (RGB/RGBA, uint8/uint16, even and uneven tile grids) with predictor=2, decodes on the GPU, and checks byte-for-byte equality with the CPU decode. I confirmed the tests fail on the unpatched code and pass after the fix by stashing the change and re-running.
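
Roughly, the tests have this shape (a condensed sketch, not the file's contents; the import path, TIFF construction via tifffile, the parameter choices, and the coercion to NumPy for comparison are assumptions):

```python
import numpy as np
import pytest
import tifffile

pytest.importorskip("cupy")  # skip cleanly where CuPy/CUDA is unavailable
from xrspatial.geotiff import open_geotiff  # import path assumed


@pytest.mark.parametrize("samples,dtype", [(3, np.uint8), (4, np.uint16)])
def test_predictor2_multisample_matches_cpu(tmp_path, samples, dtype):
    path = tmp_path / "pred2.tif"
    shape = (300, 300, samples)  # deliberately uneven against a 256-pixel tile grid
    data = np.random.default_rng(1).integers(0, np.iinfo(dtype).max,
                                             size=shape, dtype=dtype)
    tifffile.imwrite(path, data, tile=(256, 256), compression="zlib", predictor=2)

    cpu = np.asarray(open_geotiff(path, gpu=False))
    gpu = np.asarray(open_geotiff(path, gpu=True))
    np.testing.assert_array_equal(cpu, gpu)  # byte-for-byte equality with the CPU decode
```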

Test plan:

  • pytest xrspatial/geotiff/tests/test_predictor_multisample.py passes on a box with CuPy + CUDA.
  • Tests skip cleanly where CUDA isn't available.
  • pytest xrspatial/geotiff/tests/ is clean apart from the pre-existing matplotlib palette recursion failures.

One thing I noticed but didn't fix here: the predictor=3 path has its own argument pattern (bytes_per_sample=itemsize, no * samples) that also looks wrong for multi-sample data. Out of scope for this PR.

@github-actions bot added the performance label (PR touches performance-sensitive code) on Apr 19, 2026
@brendancol merged commit 0fb7f4c into master on Apr 19, 2026
11 checks passed
@brendancol deleted the issue-1220 branch on May 4, 2026

Development

Successfully merging this pull request may close these issues:

GPU predictor decode kernel over-indexes d_decomp for multi-sample tiled TIFFs (#1220)
