chore: simplify dot implementation to use auto-vectorization #2645

Merged
wjones127 merged 2 commits into main from lei/simplify_dot
Nov 10, 2025
@eddyxu (Member) commented Jul 26, 2024

This change makes the auto-vectorization version of dot(f32) as fast as manually written SIMD.
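
For readers unfamiliar with the technique, here is a minimal sketch of a dot product written so the compiler can auto-vectorize it. It is not the code from this PR: the function name `dot_f32` and the chunk width `LANES` are hypothetical, chosen only for illustration. The idea is to process fixed-size chunks with one independent accumulator per lane, so the inner loop has a known trip count and no dependency between lanes.

```rust
/// Hypothetical chunk width; a real implementation would tune this per target.
const LANES: usize = 16;

/// Sketch of an auto-vectorization-friendly dot product (illustrative only).
pub fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "inputs must have the same length");

    let a_chunks = a.chunks_exact(LANES);
    let b_chunks = b.chunks_exact(LANES);

    // Elements that do not fill a whole chunk are handled by a scalar loop.
    let tail: f32 = a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder())
        .map(|(x, y)| x * y)
        .sum();

    // One accumulator per lane: each lane is summed independently, which lets
    // the compiler map the inner loop onto SIMD multiply-add instructions.
    let mut acc = [0.0f32; LANES];
    for (ca, cb) in a_chunks.zip(b_chunks) {
        for ((s, x), y) in acc.iter_mut().zip(ca).zip(cb) {
            *s += x * y;
        }
    }

    // Horizontal reduction of the lane accumulators, plus the scalar tail.
    acc.iter().sum::<f32>() + tail
}
```

Because the lanes are summed separately and reduced at the end, the floating-point result can differ slightly from a strictly sequential sum. Whether the loop is actually vectorized depends on the compiler and on flags such as `-C target-cpu=native`.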

Run the benchmarks with:

export RUSTFLAGS="-C target-cpu=native"
git checkout main
cargo bench --bench dot -- --save-baseline dot_main f32
git checkout lei/simplify_dot
cargo bench --bench dot -- --baseline dot_main f32

On MacBook M2 Max

Dot(f32, auto-vectorization)
                        time:   [88.812 ms 89.654 ms 90.306 ms]
                        change: [-2.5819% -1.6876% -0.6964%] (p = 0.01 < 0.10)
                        Change within noise threshold.

AMD 5900X

Dot(f32, auto-vectorization)
                        time:   [172.50 ms 176.41 ms 179.41 ms]
                        change: [-2.3545% +0.6133% +3.5448%] (p = 0.69 > 0.10)
                        No change in performance detected.

Intel Sapphire

Dot(f32, auto-vectorization)
                        time:   [331.36 ms 331.62 ms 331.93 ms]
                        change: [-2.3160% -1.1226% -0.3451%] (p = 0.04 < 0.10)
                        Change within noise threshold.

Graviton3

Benchmarking Dot(f32, auto-vectorization): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.8s or enable flat sampling.
Dot(f32, auto-vectorization)
                        time:   [160.62 ms 160.70 ms 160.76 ms]
                        change: [-1.1157% -0.6868% -0.2951%] (p = 0.00 < 0.10)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
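
The labels in the reports above ("No change in performance detected", "Change within noise threshold") can be read with the following sketch. It is a hypothetical classifier, not the benchmark harness's actual code: the `Verdict` enum and `classify` function are invented for illustration. The 0.10 significance level comes from the output above, while the 1% noise threshold is an assumption that is not stated anywhere in the reports.

```rust
/// Hypothetical verdicts mirroring the labels printed in the reports.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// Classify a relative change given its confidence interval and p-value.
/// `lower` and `upper` are fractions (for example, -0.025819 for -2.5819%).
/// `noise` is an assumed threshold; it does not appear in the reports.
pub fn classify(lower: f64, upper: f64, p_value: f64, significance: f64, noise: f64) -> Verdict {
    // A large p-value means the difference is not statistically significant.
    if p_value >= significance {
        return Verdict::NoChange;
    }
    // A significant change only counts if the whole interval lies outside
    // the noise band on one side.
    if lower < -noise && upper < -noise {
        Verdict::Improved
    } else if lower > noise && upper > noise {
        Verdict::Regressed
    } else {
        Verdict::WithinNoise
    }
}
```

Under these assumptions, the M2 Max interval of [-2.5819%, -0.6964%] with p = 0.01 is significant, but its upper bound sits inside the assumed ±1% band, so it is labeled as within noise. The AMD 5900X result with p = 0.69 is labeled as no change.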

@github-actions github-actions Bot added the chore label Jul 26, 2024
@codecov-commenter commented Jul 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.73%. Comparing base (3edfa50) to head (62228e5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2645      +/-   ##
==========================================
- Coverage   79.81%   79.73%   -0.08%     
==========================================
  Files         224      224              
  Lines       65871    65827      -44     
  Branches    65871    65827      -44     
==========================================
- Hits        52572    52489      -83     
- Misses      10225    10256      +31     
- Partials     3074     3082       +8     
Flag Coverage Δ
unittests 79.73% <100.00%> (-0.08%) ⬇️


@eddyxu eddyxu marked this pull request as ready for review July 27, 2024 19:16
@eddyxu eddyxu force-pushed the lei/simplify_dot branch 2 times, most recently from d2fbabd to e33a1c5 Compare July 29, 2024 15:35
@github-actions github-actions Bot added the Stale label Nov 6, 2025
@wjones127 wjones127 merged commit 96cfdf2 into main Nov 10, 2025
30 of 31 checks passed
@wjones127 wjones127 deleted the lei/simplify_dot branch November 10, 2025 19:52
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
…ormat#2645)
