Skip to content

Optimize x86/aarch64 MD5 implementation#25737

Closed
AWSjswinney wants to merge 1 commit into
openssl:masterfrom
AWSjswinney:master
Closed

Optimize x86/aarch64 MD5 implementation#25737
AWSjswinney wants to merge 1 commit into
openssl:masterfrom
AWSjswinney:master

Conversation

@AWSjswinney
Copy link
Copy Markdown
Contributor

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms.

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function,
we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z))
is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform
those additions independently, leaving our dependency on x to the final
addition. This speeds it up around 5% on both platforms.

Signed-off-by: Oli Gillespie <ogillesp@amazon.com>
@paulidale paulidale added triaged: feature The issue/pr requests/adds a feature help wanted labels Oct 20, 2024
@Montana
Copy link
Copy Markdown

Montana commented Oct 21, 2024

Hi @AWSjswinney,

Have you observed any variance in the speedup between older and newer generations of CPUs on either platform? It would be interesting to know if the optimization yields similar benefits across a wider range of hardware or if it’s more pronounced in certain cases.

Cheers,
Montana.

@t8m t8m added branch: master Applies to master branch approval: review pending This pull request needs review by a committer tests: exempted The PR is exempt from requirements for testing triaged: performance The issue/pr reports/fixes a performance concern and removed help wanted triaged: feature The issue/pr requests/adds a feature labels Oct 21, 2024
@olivergillespie
Copy link
Copy Markdown

Hi @AWSjswinney,

Have you observed any variance in the speedup between older and newer generations of CPUs on either platform? It would be interesting to know if the optimization yields similar benefits across a wider range of hardware or if it’s more pronounced in certain cases.

Cheers, Montana.

This implements the GOpt optimization from https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#speedup-vs-standard - you can see a range of CPU families tested there, and I expect this implementation to behave the same. Copied from there:

Method Merom Haswell Skylake-X Airmont K10 Piledriver Jaguar
Standard 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
GOpt +4.99% +5.37% +5.35% 0.00% +0.85% +6.27% -0.38%

I have personally tested it on Intel(R) Xeon(R) Platinum 8259CL (Cascade Lake) and aarch64 Neoverse N1 CPU and both saw 5% improvement.

Intuitively (though I'm no expert) it should help a similar amount anywhere that instruction-level parallelism is advanced enough to take advantage of the shorter dependency path.

@Montana
Copy link
Copy Markdown

Montana commented Oct 21, 2024

Hi @AWSjswinney,
Have you observed any variance in the speedup between older and newer generations of CPUs on either platform? It would be interesting to know if the optimization yields similar benefits across a wider range of hardware or if it’s more pronounced in certain cases.
Cheers, Montana.

This implements the GOpt optimization from https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#speedup-vs-standard - you can see a range of CPU families tested there, and I expect this implementation to behave the same. Copied from there:

Method
Merom
Haswell
Skylake-X
Airmont
K10
Piledriver
Jaguar

Standard
0.00%
0.00%
0.00%
0.00%
0.00%
0.00%
0.00%

GOpt
+4.99%
+5.37%
+5.35%
0.00%
+0.85%
+6.27%
-0.38%

I have personally tested it on Intel(R) Xeon(R) Platinum 8259CL (Cascade Lake) and aarch64 Neoverse N1 CPU and both saw 5% improvement.

Intuitively (though I'm no expert) it should help a similar amount anywhere that instruction-level parallelism is advanced enough to take advantage of the shorter dependency path.

Thank you Oliver!

@openssl-machine
Copy link
Copy Markdown
Collaborator

This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago

@AWSjswinney
Copy link
Copy Markdown
Contributor Author

Can this PR be reviewed so it can move forward?

@davidzengxhsh
Copy link
Copy Markdown

I got ~5% improvement on Ampere Altra processor with the speed test:

size 16 64 256 1024 8192 16384
Improvement 102.3% 103.2% 104.5% 105.1% 105.3% 105.5%

@hlandau hlandau added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Jan 4, 2025
@openssl-machine openssl-machine added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Jan 5, 2025
@openssl-machine
Copy link
Copy Markdown
Collaborator

This pull request is ready to merge

@t8m
Copy link
Copy Markdown
Member

t8m commented Jan 6, 2025

Merged to the master branch. Thank you for your contribution.

@t8m t8m closed this Jan 6, 2025
openssl-machine pushed a commit that referenced this pull request Jan 6, 2025
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function,
we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z))
is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform
those additions independently, leaving our dependency on x to the final
addition. This speeds it up around 5% on both platforms.

Signed-off-by: Oli Gillespie <ogillesp@amazon.com>

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Hugo Landau <hlandau@devever.net>
(Merged from #25737)
@tom-cosgrove-arm
Copy link
Copy Markdown
Contributor

I know this has been reviewed and merged now (thanks - I've had problems with my vision in Nov and Dec which has restricted what I have been able to do) but pinging @paul-elliott-arm for visibility anyway

@AWSjswinney
Copy link
Copy Markdown
Contributor Author

Thank you!

Sashan pushed a commit to Sashan/openssl that referenced this pull request Apr 23, 2025
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function,
we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z))
is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform
those additions independently, leaving our dependency on x to the final
addition. This speeds it up around 5% on both platforms.

Signed-off-by: Oli Gillespie <ogillesp@amazon.com>

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Hugo Landau <hlandau@devever.net>
(Merged from openssl#25737)
eclipse-oniro-oh-bot pushed a commit to eclipse-oniro-mirrors/third_party_openssl that referenced this pull request Nov 22, 2025
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function,
we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z))
is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform
those additions independently, leaving our dependency on x to the final
addition. This speeds it up around 5% on both platforms.

Signed-off-by: Oli Gillespie <ogillesp@amazon.com>

Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Hugo Landau <hlandau@devever.net>
(Merged from openssl/openssl#25737)

Signed-off-by: zxz*3 <zhangxiaozan1@huawei.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Applies to master branch tests: exempted The PR is exempt from requirements for testing triaged: performance The issue/pr reports/fixes a performance concern

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants