Optimize x86/aarch64 MD5 implementation by AWSjswinney · Pull Request #25737 · openssl/openssl

AWSjswinney · 2024-10-18T17:11:48Z

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y & ~z)) in this scenario: the two terms never have a set bit in the same position, so the addition cannot carry. The additions can then be performed independently, leaving the dependency on x to the final addition. This speeds up MD5 by around 5% on both platforms.
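The identity can be checked with a few lines of C. This is only an illustrative sketch, not code from the PR (which changes the assembly implementations); the function names `g_or`, `g_add`, and `g_forms_agree` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MD5 round-2 selector as usually written: take bits of x where z is 1,
 * and bits of y where z is 0. */
static uint32_t g_or(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

/* Same selector with '+' in place of '|'. (x & z) and (y & ~z) can never
 * both have a 1 in the same bit position, so the addition never carries
 * and gives the same result as the OR. */
static uint32_t g_add(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) + (y & ~z);
}

/* Hypothetical helper: returns 1 if both forms agree on every combination
 * of a small set of test patterns, 0 otherwise. */
static int g_forms_agree(void)
{
    static const uint32_t v[] = { 0u, 1u, 0x80000000u, 0xffffffffu,
                                  0x67452301u, 0xefcdab89u, 0x5a5a5a5au };
    const size_t n = sizeof(v) / sizeof(v[0]);

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                if (g_or(v[i], v[j], v[k]) != g_add(v[i], v[j], v[k]))
                    return 0;
    return 1;
}
```

Because each bit position is selected independently, agreement on mixed bit patterns like these reflects the general argument rather than a coincidence of the chosen values.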

Signed-off-by: Oli Gillespie <ogillesp@amazon.com>

Montana · 2024-10-21T02:34:03Z

Hi @AWSjswinney,

Have you observed any variance in the speedup between older and newer generations of CPUs on either platform? It would be interesting to know if the optimization yields similar benefits across a wider range of hardware or if it’s more pronounced in certain cases.

Cheers,
Montana.

olivergillespie · 2024-10-21T11:55:48Z

> Hi @AWSjswinney,
>
> Have you observed any variance in the speedup between older and newer generations of CPUs on either platform? It would be interesting to know if the optimization yields similar benefits across a wider range of hardware or if it’s more pronounced in certain cases.
>
> Cheers, Montana.

This implements the GOpt optimization from https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#speedup-vs-standard - you can see a range of CPU families tested there, and I expect this implementation to behave the same. Copied from there:

| Method | Merom | Haswell | Skylake-X | Airmont | K10 | Piledriver | Jaguar |
|---|---|---|---|---|---|---|---|
| Standard | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| GOpt | +4.99% | +5.37% | +5.35% | 0.00% | +0.85% | +6.27% | -0.38% |

I have personally tested it on an Intel(R) Xeon(R) Platinum 8259CL (Cascade Lake) CPU and an aarch64 Neoverse N1 CPU, and both saw about a 5% improvement.

Intuitively (though I'm no expert) it should help a similar amount anywhere that instruction-level parallelism is advanced enough to take advantage of the shorter dependency path.
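To make the "shorter dependency path" concrete, here is a minimal C sketch of one MD5 round-2 step written both ways. It is an illustration under assumptions, not the PR's code: the function names are hypothetical, and `b` stands for the value produced by the previous step (the 'x' of the PR description).

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by s bits; MD5 only uses shifts in the range 1..31. */
static uint32_t rotl32(uint32_t v, unsigned s)
{
    return (v << s) | (v >> (32 - s));
}

/* Standard form: G must be fully computed from b before it can be added,
 * so b feeds AND -> OR -> ADD -> ROTATE -> ADD. */
static uint32_t step_g_standard(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                                uint32_t m, uint32_t k, unsigned s)
{
    uint32_t g = (b & d) | (c & ~d);
    return b + rotl32(a + g + m + k, s);
}

/* Reordered form: everything that does not involve b is summed first,
 * so b feeds only AND -> ADD -> ROTATE -> ADD. */
static uint32_t step_g_reordered(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                                 uint32_t m, uint32_t k, unsigned s)
{
    uint32_t t = a + m + k;  /* independent of b */
    t += (c & ~d);           /* still independent of b */
    t += (b & d);            /* b enters only here */
    return b + rotl32(t, s);
}
```

Both functions return the same value for any inputs, since 32-bit unsigned addition is associative modulo 2^32. A C compiler may well reorder either form on its own; the difference matters mainly in hand-written assembly, where the instruction order is fixed by the author.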

Montana · 2024-10-21T14:29:02Z

Thank you Oliver!

openssl-machine · 2024-11-21T00:08:40Z

This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago

AWSjswinney · 2024-12-09T17:53:10Z

Can this PR be reviewed so it can move forward?

davidzengxhsh · 2024-12-31T01:12:24Z

I got ~5% improvement on Ampere Altra processor with the speed test:

| size | 16 | 64 | 256 | 1024 | 8192 | 16384 |
|---|---|---|---|---|---|---|
| Improvement | 102.3% | 103.2% | 104.5% | 105.1% | 105.3% | 105.5% |

openssl-machine · 2025-01-05T08:00:29Z

This pull request is ready to merge

t8m · 2025-01-06T10:44:07Z

Merged to the master branch. Thank you for your contribution.

Signed-off-by: Oli Gillespie <ogillesp@amazon.com>
Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Hugo Landau <hlandau@devever.net>
(Merged from #25737)

tom-cosgrove-arm · 2025-01-06T11:48:00Z

I know this has been reviewed and merged now (thanks - I've had problems with my vision in Nov and Dec which has restricted what I have been able to do) but pinging @paul-elliott-arm for visibility anyway

AWSjswinney · 2025-01-06T14:56:37Z

Thank you!


(Merged from openssl/openssl#25737)
Signed-off-by: zxz*3 <zhangxiaozan1@huawei.com>

paulidale added the "triaged: feature" and "help wanted" labels on Oct 20, 2024

paulidale approved these changes on Dec 9, 2024

hlandau approved these changes on Jan 4, 2025

hlandau added the "approval: done" label and removed the "approval: review pending" label on Jan 4, 2025

openssl-machine added the "approval: ready to merge" label and removed the "approval: done" label on Jan 5, 2025

t8m closed this on Jan 6, 2025

Optimize x86/aarch64 MD5 implementation #25737: AWSjswinney wants to merge 1 commit into openssl:master from AWSjswinney:master
