Development of a high-performance Python application for GPU-accelerated permutation testing on fMRI neuroimaging data, inspired by FSL's randomise program but leveraging modern GPU architectures (CUDA and Apple Metal Performance Shaders) for substantial performance improvements.
- Current Challenge: FSL randomise is CPU-bound and can take hours to days for large datasets with many permutations
- Opportunity: Modern GPUs can parallelize permutation testing, potentially achieving 10-100x speedup
- Target Users: Neuroimaging researchers, computational neuroscientists, and clinical research teams
- Achieve minimum 10x performance improvement over CPU-based FSL randomise
- Maintain statistical accuracy within 0.001% of reference implementation
- Support both NVIDIA (CUDA) and Apple Silicon (MPS) GPUs
- Provide comprehensive test coverage (>90%)
The requirements for this package represent a subset of the full capability of the FSL randomise tool. Additional features may be added in the future.
-
General Linear Model (GLM) support
- Support for design matrices and contrast matrices
- Handle multiple contrasts simultaneously
- Support for F-tests and t-tests
-
Permutation Strategies
- Sign-flipping for paired/repeated measures designs
- Full permutation enumeration (for small sample sizes)
- Monte Carlo permutation sampling (for large sample sizes)
- Block permutation for exchangeability blocks
-
Variance smoothing
- Variance smoothing for t-stats
- Multiple Comparison Corrections
- Family-wise error rate (FWER) control
- Threshold-free cluster enhancement (TFCE)
- Cluster-based thresholding
- Voxel-wise correction
- T-statistic computation
- F-statistic computation
-
Neuroimaging Image Formats
- NIfTI (.nii, .nii.gz)
-
Design Files
- Text-based design matrices (CSV, TSV)
- Text-based contrast files (CSV, TSV)
- Output maps should use same format as input maps
- T/F-statistic maps
- Uncorrected p-value maps
- Voxel-based familywise error (FWE) corrected maps
- Cluster-based FWE corrected maps
- Cluster tables
- TFCE-corrected maps
- MPS backend for M1/M2/M3 Mac systems
- Automatic fallback to CPU for unsupported operations
- Memory optimization for Apple's unified memory architecture
- CUDA compute capability 6.0+ support
- Dynamic kernel optimization based on data dimensions
- Efficient memory management with unified memory where applicable
- CUDA streams for overlapping computation and data transfer
- Runtime detection of available GPU resources
- Intelligent backend selection based on:
- Data size
- Available GPU memory
- Operation complexity
- User preferences
- Throughput: Process 100,000+ voxels with 10,000 permutations in under 1 minute on modern GPU
- Memory Efficiency: Handle datasets up to available GPU memory (typically 8-48GB)
- Latency: Interactive response (<100ms) for parameter adjustment in GUI mode
-
Modular Design
- Clear separation of concerns (statistics, I/O, GPU kernels)
- Dependency injection for testability
-
Design Patterns
- Strategy pattern for backend selection
- Factory pattern for data loader creation
- Observer pattern for progress reporting
- Command pattern for operation queuing
- PEP 8 compliance for Python code
- Type hints for all public APIs
- Comprehensive docstrings (NumPy style)
- Preferred maximum function length of 50 lines
- Unit test coverage: >90%
- Integration test coverage: >80%
- GPU kernel testing with known inputs/outputs
- Statistical validation against FSL randomise outputs
- pytest for test orchestration
- pytest-benchmark for performance regression testing
- pytest-cov for coverage reporting
- Mock GPU backends for CI/CD testing
- Unit Tests: Individual function validation
- Integration Tests: Module interaction testing
- Statistical Tests: Accuracy validation against known results
- Performance Tests: Regression testing for speed/memory
- Stress Tests: Large dataset and edge case handling
-
User Documentation
- Installation guide for different platforms
- Quick start tutorial
- API reference documentation
- Statistical methodology documentation
- Migration guide from FSL randomise
-
Developer Documentation
- Architecture overview
- Contributing guidelines
- GPU kernel development guide
- Plugin development guide
┌─────────────────────────────────────────────────┐
│ CLI/GUI Interface │
├─────────────────────────────────────────────────┤
│ Core Engine │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Statistics│ │Permutation│ │Correction│ │
│ │ Module │ │ Engine │ │ Module │ │
│ └──────────┘ └──────────┘ └──────────┘ │
├─────────────────────────────────────────────────┤
│ Backend Abstraction Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ MPS │ │ CUDA │ │ CPU │ │
│ │ Backend │ │ Backend │ │ Backend │ │
│ └──────────┘ └──────────┘ └──────────┘ │
├─────────────────────────────────────────────────┤
│ Data I/O Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ NIfTI │ │ CIFTI │ │ Design │ │
│ │ Loader │ │ Loader │ │ Loader │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────┘
- Orchestrator: Coordinates workflow execution
- Statistics Module: Implements GLM and test statistics
- Permutation Engine: Manages permutation generation and application
- Correction Module: Implements multiple comparison corrections
- Backend Interface: Abstract base class for compute backends
- MPS Backend: PyTorch MPS or Metal-cpp implementation
- CUDA Backend: CuPy/PyTorch-based CUDA implementation
- CPU Backend: NumPy/Numba fallback implementation
- Data Loaders: Format-specific data reading
- Memory Manager: GPU memory allocation and caching
- Data Transformer: Preprocessing and format conversion
- Python: 3.12+ (for modern type hints and features)
- NumPy: Array operations and CPU backend
- SciPy: Statistical functions and distributions
- Nibabel: Neuroimaging data I/O
- PyTorch: GPU acceleration framework (CUDA and MPS)
- uv: Package management
- pytest: Testing framework
- ruff: Code formatting and linting
- mypy: Static type checking
- sphinx: Documentation generation
- Red Phase: Write failing tests for new features
- Green Phase: Implement minimum code to pass tests
- Refactor Phase: Optimize and clean up code
- Feature Planning: Create detailed specifications with test cases
- Research: Examine existing FSL code and generate pseudocode for relevant functions
- Test Implementation: Write comprehensive test suite
- Implementation: Develop feature following TDD
- Code Review: Peer review with focus on performance and correctness
- Integration Testing: Validate with real fMRI datasets
- Performance Profiling: Optimize bottlenecks
- Documentation: Update user and API documentation
- CI Pipeline
- Automated testing on push
- Code quality checks (linting, formatting, type checking)
- Performance regression testing
- Documentation building
- Compare outputs with FSL randomise on standard datasets
- Synthetic data testing with known ground truth
- Edge case testing (small samples, extreme values)
-
Benchmark Datasets
- Small: 1000 voxels, 20 subjects, 1000 permutations
- Large: 250,000 voxels, 100 subjects, 10000 permutations
-
Metrics
- Wall-clock time
- GPU utilization
- Memory usage (GPU and system)
- Speedup vs CPU implementation
- Scaling efficiency (multi-GPU)
- Project setup and CI pipeline
- Core architecture implementation
- Basic data I/O for NIfTI files
- CPU backend with basic GLM
- MPS backend implementation
- CUDA backend implementation
- Backend abstraction layer
- GPU memory management
- Permutation strategies implementation
- TFCE implementation
- Multiple comparison corrections
- Statistical validation suite
- Performance optimization
- Comprehensive documentation
- Example notebooks and tutorials
- Final testing and validation
- Package distribution setup
- Release documentation
- Community feedback integration
-
Risk: GPU memory limitations for large datasets
- Mitigation: Implement chunking and out-of-core processing
-
Risk: Numerical precision differences between GPU and CPU
- Mitigation: Extensive validation testing, mixed precision options
-
Risk: Platform-specific GPU issues
- Mitigation: Comprehensive testing matrix, fallback mechanisms
-
Risk: Scope creep from additional statistical methods
- Mitigation: Clear phase boundaries
-
Risk: Performance targets not met
- Mitigation: Early profiling, alternative algorithmic approaches
- Performance improvement: >10x speedup vs FSL randomise
- Test coverage: >90% for unit tests
- Statistical accuracy: <0.001% deviation from reference
- Documentation completeness and clarity
- Ease of migration from FSL randomise
- fMRI: Functional Magnetic Resonance Imaging
- GLM: General Linear Model
- TFCE: Threshold-Free Cluster Enhancement
- FWER: Family-Wise Error Rate
- MPS: Metal Performance Shaders
- FSL Randomise Documentation: https://fsl.fmrib.ox.ac.uk/fsl/docs/#/statistics/randomise
- Winkler et al. (2014) - Permutation inference for the GLM
- Smith & Nichols (2009) - Threshold-free cluster enhancement
- Freedman & Lane (1983) - Permutation tests for linear models
- One-sample t-test on fMRI data: randomise -i -o -1 -T -c 3.0 -v 10
- One-sample t-test on fMRI data with exchangeability blocks: randomise -i -o -1 -T -c 3.0 -v 10 -e <design.grp> --permuteBlocks
- Multiple regression with t contrasts on fMRI data: randomise -i -o -d -t -T -c 3.0 -v 10 -e <design.grp> --permuteBlocks