
feat: Add GPU-accelerated operations via PyTorch #21

Closed

maragall wants to merge 18 commits into Cephla-Lab:main from maragall:feature/gpu-acceleration

Conversation

@maragall maragall commented Mar 7, 2026

Summary

  • Add optional GPU acceleration for core utility functions (phase_cross_correlation, shift_array, match_histograms, block_reduce, compute_ssim) using PyTorch CUDA, with automatic CPU fallback when no GPU is available
  • Preserve input dtypes across all accelerated operations for consistency
  • Replace magic numbers with named constants (_FFT_EPS, _SSIM_K1, _SSIM_K2, _PARABOLIC_EPS)
  • Add comprehensive test suite: unit tests for each GPU operation, CPU fallback tests, dtype preservation tests, and subpixel phase correlation tests
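The automatic CPU fallback described above can be sketched as follows. This is illustrative only: the dispatch pattern and the `_block_reduce_torch` helper name are assumptions, not the PR's actual symbols.

```python
# Sketch of GPU dispatch with automatic CPU fallback and dtype preservation.
# Names and structure are hypothetical; the PR's real code may differ.
import numpy as np

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:
    torch = None
    _HAS_CUDA = False

def _block_reduce_torch(image: np.ndarray, block_size: int) -> np.ndarray:
    """GPU path: average pooling on CUDA (only reached when CUDA is available)."""
    import torch.nn.functional as F
    t = torch.as_tensor(image, dtype=torch.float32, device="cuda")
    out = F.avg_pool2d(t[None, None], kernel_size=block_size)[0, 0]
    return out.cpu().numpy().astype(image.dtype)  # preserve input dtype

def block_reduce(image: np.ndarray, block_size: int = 2) -> np.ndarray:
    """Downsample by averaging block_size x block_size tiles."""
    if _HAS_CUDA:
        return _block_reduce_torch(image, block_size)
    # CPU fallback: plain NumPy reshape-and-mean
    h, w = image.shape
    h, w = h - h % block_size, w - w % block_size
    out = image[:h, :w].reshape(h // block_size, block_size,
                                w // block_size, block_size).mean(axis=(1, 3))
    return out.astype(image.dtype)  # preserve input dtype
```

The same try/except guard lets the module import cleanly when PyTorch is absent, which is what the CPU-fallback tests exercise.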

Benchmark results (S5_reg1, 184 tiles)

| Phase         | CPU (s) | GPU (s) | Speedup |
|---------------|---------|---------|---------|
| Registration  | 4.94    | 3.16    | 1.57x   |
| Full pipeline | 59.54   | 57.27   | 1.04x   |

GPU speedup is most visible in the registration phase. The full pipeline is dominated by I/O-bound fusion, where GPU provides minimal benefit.

Test plan

  • pytest tests/test_block_reduce.py tests/test_fft.py tests/test_histogram_match.py tests/test_shift_array.py tests/test_ssim.py tests/test_cpu_fallback.py — all new GPU tests pass
  • Verify CPU-only fallback works when PyTorch/CUDA is unavailable
  • Run end-to-end registration on a tiled dataset with GPU enabled
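The compute_ssim path covered by tests/test_ssim.py uses conv2d for the local statistics, per the commit notes. A minimal CPU sketch of that approach is below; the window size is assumed, and `_SSIM_K1`/`_SSIM_K2` are set to the standard SSIM constants since the PR's exact values are not shown here.

```python
# Sketch of SSIM via uniform-window conv2d local statistics.
# Window size and constant values are assumptions, not the PR's code.
import numpy as np
import torch
import torch.nn.functional as F

_SSIM_K1, _SSIM_K2 = 0.01, 0.03  # standard SSIM constants (assumed values)

def compute_ssim_torch(a, b, win: int = 7, data_range: float = 1.0,
                       device: str = "cpu") -> float:
    """Mean SSIM using a uniform window; conv2d computes local means/variances."""
    x = torch.as_tensor(a, dtype=torch.float32, device=device)[None, None]
    y = torch.as_tensor(b, dtype=torch.float32, device=device)[None, None]
    kernel = torch.ones(1, 1, win, win, device=device) / (win * win)
    mu_x = F.conv2d(x, kernel)
    mu_y = F.conv2d(y, kernel)
    var_x = F.conv2d(x * x, kernel) - mu_x ** 2
    var_y = F.conv2d(y * y, kernel) - mu_y ** 2
    cov = F.conv2d(x * y, kernel) - mu_x * mu_y
    c1 = (_SSIM_K1 * data_range) ** 2
    c2 = (_SSIM_K2 * data_range) ** 2
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return float(ssim_map.mean())
```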

🤖 Generated with Claude Code

hongquanli and others added 18 commits March 7, 2026 04:56
Consolidates PRs Cephla-Lab#4-8 into a single feature branch:

- phase_cross_correlation: GPU FFT via torch.fft (~46x speedup)
- shift_array: GPU grid_sample for subpixel shifts (~6.7x speedup)
- match_histograms: GPU sort/quantile mapping (~13.3x speedup)
- block_reduce: GPU avg_pool2d (~4x speedup)
- compute_ssim: GPU conv2d for local statistics (~6.4x speedup)

All functions include automatic CPU fallback when CUDA is unavailable.
Replaces cupy/cucim dependency with PyTorch for broader compatibility.
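The grid_sample-based subpixel shift can be sketched roughly as below. This runs on CPU by default so it works anywhere; the PR's actual implementation targets CUDA and also handles dtype preservation, which is omitted here.

```python
# Sketch of subpixel shift via bilinear grid_sample.
# Illustrative only; the PR's real function signature may differ.
import numpy as np
import torch
import torch.nn.functional as F

def shift_array_torch(image: np.ndarray, shift_vec, device: str = "cpu") -> np.ndarray:
    """Shift a 2D array by (dy, dx) pixels: output[y, x] = input[y - dy, x - dx]."""
    dy, dx = shift_vec
    t = torch.as_tensor(image, dtype=torch.float32, device=device)[None, None]
    _, _, H, W = t.shape
    ys = torch.arange(H, dtype=torch.float32, device=device)
    xs = torch.arange(W, dtype=torch.float32, device=device)
    # sample locations: each output pixel reads from the input at (y - dy, x - dx)
    gy, gx = torch.meshgrid(ys - dy, xs - dx, indexing="ij")
    # normalize to [-1, 1]; grid_sample expects the last dim ordered (x, y)
    gx = 2 * gx / (W - 1) - 1
    gy = 2 * gy / (H - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)[None]
    out = F.grid_sample(t, grid, mode="bilinear", align_corners=True)
    return out[0, 0].cpu().numpy()
```

Out-of-bounds samples at the borders are zero-filled by grid_sample's default padding, analogous to constant-mode boundary handling.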

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add type hints to phase_cross_correlation, shift_array, match_histograms, block_reduce, compute_ssim
- Add return type hints to to_numpy and to_device
- Import Callable, Any, Union from typing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix incorrect pixel assignment in _match_histograms_torch
  The previous code used unnecessary indexing that permuted results incorrectly
- Simplify to_device return type from Union[Any, np.ndarray] to Any
- Remove unused Union import
- Add pixel-by-pixel test comparing GPU vs skimage results
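The sort-based mapping, including the scatter-by-rank assignment that this fix restores, can be illustrated in NumPy; the PR does the equivalent with torch sort/quantile ops on the GPU. All names here are illustrative.

```python
# Sketch of sort-based histogram matching: each source pixel is replaced by
# the reference value of equal rank. Illustrative, not the PR's code.
import numpy as np

def match_histograms_sort(source: np.ndarray, reference: np.ndarray) -> np.ndarray:
    s_flat = source.ravel()
    order = np.argsort(s_flat, kind="stable")   # ranks of the source pixels
    ref_sorted = np.sort(reference.ravel())
    # interpolate reference quantiles onto the source's rank positions
    ranks = np.linspace(0.0, 1.0, s_flat.size)
    ref_q = np.linspace(0.0, 1.0, ref_sorted.size)
    matched = np.empty_like(s_flat, dtype=ref_sorted.dtype)
    # scatter back by rank: the k-th smallest source pixel gets the k-th quantile
    matched[order] = np.interp(ranks, ref_q, ref_sorted)
    return matched.reshape(source.shape)
```

The scatter `matched[order] = ...` is the step that is easy to get wrong: gathering with `order` instead of scattering permutes the output, which matches the symptom described in the fix.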

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove redundant `torch is not None` check in to_numpy
- Add type hint to _shift_array_torch shift_vec parameter
- Fix shift_array CPU path to compute in float64 for API consistency
  (preserve_dtype=False now returns float on both GPU and CPU paths)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The GPU implementation returns 0.0 for error and phasediff values
since these are not computed. Added notes to docstring to clarify.
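A rough sketch of the torch.fft-based correlation with the 0.0 placeholder returns is below. The `_FFT_EPS` value is assumed, the subpixel (parabolic) refinement is omitted, and the function name is illustrative.

```python
# Sketch: integer-pixel phase cross-correlation via the normalized
# cross-power spectrum. Error/phasediff are 0.0 placeholders, as the
# commit message notes for the GPU path.
import numpy as np
import torch

_FFT_EPS = 1e-10  # stabilizer named in the PR summary; exact value assumed

def phase_cross_correlation_torch(ref, mov, device: str = "cpu"):
    a = torch.as_tensor(ref, dtype=torch.float32, device=device)
    b = torch.as_tensor(mov, dtype=torch.float32, device=device)
    cross = torch.fft.fft2(a) * torch.conj(torch.fft.fft2(b))
    cross = cross / (torch.abs(cross) + _FFT_EPS)  # keep phase, drop magnitude
    corr = torch.fft.ifft2(cross).real
    peak = torch.argmax(corr)  # flattened index of the correlation peak
    py, px = np.unravel_index(int(peak), corr.shape)
    # wrap indices past the midpoint to negative shifts
    dy = py - corr.shape[0] if py > corr.shape[0] // 2 else py
    dx = px - corr.shape[1] if px > corr.shape[1] // 2 else px
    return np.array([dy, dx], dtype=float), 0.0, 0.0
```

The returned vector follows the skimage convention: it is the shift to apply to the moving image to register it onto the reference.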

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

maragall commented Mar 7, 2026

Closing in favor of #13 which is the original PR for this work.

I've rebased the branch onto latest main (zero conflicts) and benchmarked it — see my comment on #13 with results and the rebased branch reference.

@maragall maragall closed this Mar 7, 2026
