# Daily Perf Improver - Optimize vector × matrix multiplication with SIMD #26
Conversation
This commit significantly improves the performance of row vector × matrix multiplication by reorganizing the computation to exploit row-major storage and SIMD acceleration.

## Key Changes

- Rewrote `Matrix.multiplyRowVector` to use a weighted sum of matrix rows
  - Original: column-wise accumulation with strided memory access
  - Optimized: row-wise accumulation with contiguous memory and SIMD

## Performance Improvements

Compared to baseline (from PR #20):

| Size | Before | After | Improvement |
|---------|-----------|-----------|--------------|
| 10×10 | 84.3 ns | 55.2 ns | 34.5% faster |
| 50×50 | 1,958 ns | 622.6 ns | 68.2% faster |
| 100×100 | 9,208 ns | 1,905 ns | 79.3% faster |

The optimization achieves 3.5-4.8× speedup for larger matrices by:

1. Eliminating strided column access patterns
2. Enabling SIMD vectorization on contiguous row data
3. Broadcasting vector weights efficiently across SIMD lanes
4. Skipping zero weights to reduce unnecessary computation

## Implementation Details

The new implementation computes:

result = v[0]*row0 + v[1]*row1 + ... + v[n-1]*row(n-1)

This approach:

- Accesses matrix rows contiguously (cache-friendly)
- Broadcasts each weight v[i] to all SIMD lanes
- Accumulates weighted rows directly into the result vector
- Falls back to the original scalar implementation for small matrices

## Testing

- All 132 existing tests pass
- Benchmark infrastructure added (Matrix.fs benchmarks)
- Memory allocations unchanged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
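The reorganization rests on a simple identity: a row vector times a matrix equals the weighted sum of the matrix's rows. A hedged sketch of that identity in plain Python (illustrative only, not the FsMath F# source; function names are hypothetical):

```python
# Illustration of the identity behind the optimization:
# result = v[0]*row0 + v[1]*row1 + ... + v[n-1]*row(n-1)

def row_vec_times_matrix_naive(v, M):
    """Column-wise definition: result[j] = sum_i v[i] * M[i][j]."""
    n, m = len(M), len(M[0])
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(m)]

def row_vec_times_matrix_rowsum(v, M):
    """Weighted sum of rows: accumulate v[i] * row_i into the result."""
    n, m = len(M), len(M[0])
    result = [0.0] * m
    for i in range(n):
        w = v[i]
        if w == 0:          # skip zero weights, as the commit describes
            continue
        row = M[i]          # contiguous row access in row-major storage
        for j in range(m):
            result[j] += w * row[j]
    return result

v = [1.0, 0.0, 2.0]
M = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
assert row_vec_times_matrix_naive(v, M) == row_vec_times_matrix_rowsum(v, M)
print(row_vec_times_matrix_rowsum(v, M))  # [11.0, 14.0]
```

Both functions compute the same product; only the loop order and memory access pattern differ, which is exactly what the commit exploits.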
The numbers do seem to check out. This is running on my laptop, where I checked Before (same benchmarks), adding After:
📊 Code Coverage Report

📈 Coverage Analysis

🟡 Good Coverage: Your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

📋 What These Numbers Mean

🔗 Detailed Reports

📋 Download Full Coverage Report - Check the 'coverage-report' artifact for the detailed HTML coverage report.

Coverage report generated on 2025-10-14 at 15:36:53 UTC
This is very good. I did not think of this option.

@muehlhaus So interesting, isn't it?
## Summary

This PR optimizes row-vector × matrix multiplication (`v × M`), achieving a 3.5-4.8× speedup for typical matrix sizes by reorganizing the computation to exploit row-major storage and SIMD acceleration.

## Performance Goal
Goal Selected: Optimize vector × matrix multiplication (Phase 2, Priority: MEDIUM)
Rationale: The research plan from Discussion #11 and benchmarks from PR #20 identified that `VectorMatrixMultiply` (vector × matrix) was 4-5× slower than `MatrixVectorMultiply` (matrix × vector). This asymmetry was caused by column-wise memory access patterns that don't align with row-major storage.

## Changes Made
### Core Optimization
**File Modified:** `src/FsMath/Matrix.fs` - `multiplyRowVector` function (lines 581-645)

**Original Implementation:**

**Optimized Implementation:**
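The original and optimized F# bodies are not reproduced in this capture. As a rough sketch of the two strategies over flat row-major storage (plain Python for illustration, not the actual FsMath code), the difference looks like this:

```python
# Hypothetical sketch of the two strategies over a flat row-major array
# (plain Python, not the FsMath F# source).

def multiply_row_vector_columnwise(v, data, n, m):
    """Original style: for each output column j, walk down the column
    with stride m (data[i*m + j]) -- strided, cache-unfriendly."""
    return [sum(v[i] * data[i * m + j] for i in range(n)) for j in range(m)]

def multiply_row_vector_rowwise(v, data, n, m):
    """Optimized style: accumulate each weighted row into the result,
    reading data[i*m .. (i+1)*m] contiguously."""
    result = [0.0] * m
    for i in range(n):
        w = v[i]
        if w == 0:          # skip zero weights
            continue
        base = i * m
        for j in range(m):
            result[j] += w * data[base + j]
    return result

# 2x3 matrix [[1,2,3],[4,5,6]] stored in row-major order
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
v = [2.0, 1.0]
assert multiply_row_vector_columnwise(v, data, 2, 3) == \
       multiply_row_vector_rowwise(v, data, 2, 3)
print(multiply_row_vector_rowwise(v, data, 2, 3))  # [6.0, 9.0, 12.0]
```

In the real F# implementation the inner loop over `j` is additionally vectorized with `System.Numerics.Vector<'T>`, which the scalar Python sketch cannot show.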
### Benchmark Infrastructure
Added comprehensive matrix operation benchmarks from PR #20:
- `benchmarks/FsMath.Benchmarks/Matrix.fs` (108 lines, 14 benchmarks)
- `FsMath.Benchmarks.fsproj` - Added Matrix.fs to compilation
- `Program.fs` - Registered MatrixBenchmarks class

## Approach
## Performance Measurements

### Test Environment

### Results Summary

### Detailed Benchmark Results

### Key Observations
- `VectorMatrixMultiply` now reaches parity with `MatrixVectorMultiply` performance

## Why This Works
The optimization addresses three key bottlenecks:
**Memory Access Pattern:**

- Before: strided column access (`data[i*m + j]`) - cache-unfriendly
- After: contiguous row access (`data[i*m..(i+1)*m]`) - cache-friendly

**SIMD Utilization:**

**Computational Efficiency:**
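The SIMD broadcast and zero-skip ideas can be mimicked in plain Python (a sketch, not the FsMath code; `LANES` stands in for `System.Numerics.Vector<'T>.Count`, and real SIMD processes each chunk in one instruction rather than an inner loop):

```python
# Sketch of "broadcast the weight across SIMD lanes" plus zero-weight skipping.
# LANES stands in for System.Numerics.Vector<'T>.Count on .NET.
LANES = 4

def accumulate_weighted_row(result, weight, row):
    """result += weight * row, processed in fixed-width chunks."""
    if weight == 0:          # zero weights contribute nothing; skip them
        return
    m = len(row)
    j = 0
    while j + LANES <= m:    # whole chunks: one "SIMD" step each
        for k in range(LANES):
            result[j + k] += weight * row[j + k]
        j += LANES
    while j < m:             # scalar tail for leftover elements
        result[j] += weight * row[j]
        j += 1

result = [0.0] * 6
accumulate_weighted_row(result, 2.0, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
accumulate_weighted_row(result, 0.0, [9.0] * 6)   # skipped entirely
print(result)  # [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
```

The chunked loop and scalar tail mirror the usual structure of a .NET `Vector<'T>` kernel: whole vectors first, leftovers handled element by element.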
## Replicating the Performance Measurements
To replicate these benchmarks:
Results are saved to `BenchmarkDotNet.Artifacts/results/` in GitHub MD, HTML, and CSV formats.

## Testing
✅ All 132 tests pass
✅ VectorMatrixMultiply benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 3.5-4.8× for target sizes
✅ Correctness verified across all test cases
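As a supplementary sanity check (plain Python, not part of the repo's F# test suite), the row-wise reformulation can be cross-checked against the column-wise definition on random inputs:

```python
# Randomized cross-check: the row-wise weighted-sum reformulation must agree
# with the column-wise definition of v x M on arbitrary inputs.
import random
import math

def columnwise(v, data, n, m):
    return [sum(v[i] * data[i * m + j] for i in range(n)) for j in range(m)]

def rowwise(v, data, n, m):
    result = [0.0] * m
    for i in range(n):
        w, base = v[i], i * m
        if w == 0:
            continue
        for j in range(m):
            result[j] += w * data[base + j]
    return result

random.seed(42)
for _ in range(100):
    n, m = random.randint(1, 8), random.randint(1, 8)
    data = [random.uniform(-1, 1) for _ in range(n * m)]
    v = [random.choice([0.0, random.uniform(-1, 1)]) for _ in range(n)]
    a, b = columnwise(v, data, n, m), rowwise(v, data, n, m)
    assert all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))
print("column-wise and row-wise results agree on 100 random cases")
```

Floating-point sums are compared with a small absolute tolerance because skipping zero weights changes the accumulation order.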
## Implementation Details

### Optimization Techniques Applied
- Expressed `v × M` as a linear combination of matrix rows
- Used `Numerics.Vector<'T>(weight)` to broadcast each scalar weight across SIMD lanes
- Skipped rows whose weight `v[i] == 0`

### Code Quality
## Next Steps
This PR establishes parity between vector × matrix and matrix × vector operations. Based on the performance plan, remaining Phase 2 work includes:
- `getCol` still has strided access

### Future Optimization Opportunities
From this work, I identified additional optimization targets:
- `getCol`: could use SIMD gather instructions

## Related Issues/Discussions
🤖 Generated with Claude Code