Optimize x86/aarch64 MD5 implementation #25737
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y & ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms. Signed-off-by: Oli Gillespie <ogillesp@amazon.com>
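The OR can be replaced by an addition because (x & z) and (y & ~z) never have the same bit set: every bit position is taken either from x (where z is 1) or from y (where z is 0), so the sum produces no carries. A minimal C sketch that checks this on pseudo-random inputs follows; the function names `g_or`, `g_add` and `g_forms_agree` are hypothetical and do not come from the OpenSSL sources.

```c
#include <stdint.h>

/* MD5 G function as usually written: take x where z is set, else y. */
static uint32_t g_or(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

/* Same value via addition: the two masked terms share no set bits,
 * so the sum never carries and equals the bitwise OR. */
static uint32_t g_add(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) + (y & ~z);
}

/* Hypothetical helper: returns 1 if both forms agree on n triples
 * generated by a xorshift32 sequence (seed must be non-zero). */
static int g_forms_agree(uint32_t seed, int n)
{
    uint32_t s = seed;
    for (int i = 0; i < n; i++) {
        uint32_t v[3];
        for (int j = 0; j < 3; j++) {
            s ^= s << 13;
            s ^= s >> 17;
            s ^= s << 5;
            v[j] = s;
        }
        if (g_or(v[0], v[1], v[2]) != g_add(v[0], v[1], v[2]))
            return 0;
    }
    return 1;
}
```

Calling `g_forms_agree(12345u, 100000)` is one way to spot-check the identity; it is a sanity check, not a proof.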
Hi @AWSjswinney, have you observed any variance in the speedup between older and newer generations of CPUs on either platform? It would be interesting to know whether the optimization yields similar benefits across a wider range of hardware or is more pronounced in certain cases. Cheers,
This implements the
I have personally tested it on an Intel(R) Xeon(R) Platinum 8259CL (Cascade Lake) CPU and an aarch64 Neoverse N1 CPU, and both saw a 5% improvement. Intuitively (though I'm no expert) it should help a similar amount anywhere that instruction-level parallelism is advanced enough to take advantage of the shorter dependency path.
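To make the shorter dependency path concrete, here is a rough C sketch of one MD5 round-2 step written both ways. In that round `b` is the value produced by the previous step, so it plays the role of 'x'. The function names and the standalone step structure are assumptions for illustration only; the actual change lives in OpenSSL's assembly generators, and a C compiler is free to reorder either version.

```c
#include <stdint.h>

/* Rotate left; s must be in 1..31 (MD5 round 2 uses 5, 9, 14, 20). */
static uint32_t rotl32(uint32_t v, unsigned s)
{
    return (v << s) | (v >> (32 - s));
}

/* Conventional step: the whole G value must be formed from b
 * before it can be added to the rest of the sum. */
static uint32_t step_g_plain(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                             uint32_t m, uint32_t k, unsigned s)
{
    uint32_t g = (b & d) | (c & ~d);
    return b + rotl32(a + g + m + k, s);
}

/* Reordered step: everything that does not involve b is summed first,
 * so once b is available only AND, ADD, ROTATE, ADD remain. */
static uint32_t step_g_delayed(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                               uint32_t m, uint32_t k, unsigned s)
{
    uint32_t t = a + m + k + (c & ~d); /* independent of b */
    t += b & d;                        /* first use of b */
    return b + rotl32(t, s);
}
```

Both functions return the same value for the same inputs, since unsigned 32-bit addition wraps and is associative; only the order in which `b` is consumed differs.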
Thank you, Oliver!
This PR is in a state where it requires action by @openssl/committers but the last update was 30 days ago
Can this PR be reviewed so it can move forward?
I got ~5% improvement on an Ampere Altra processor with the speed test:
This pull request is ready to merge
Merged to the master branch. Thank you for your contribution.
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y & ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms. Signed-off-by: Oli Gillespie <ogillesp@amazon.com> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Hugo Landau <hlandau@devever.net> (Merged from #25737)
I know this has been reviewed and merged now (thanks - I've had problems with my vision in Nov and Dec which has restricted what I have been able to do) but pinging @paul-elliott-arm for visibility anyway
Thank you!
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y & ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms. Signed-off-by: Oli Gillespie <ogillesp@amazon.com> Reviewed-by: Paul Dale <ppzgs1@gmail.com> Reviewed-by: Hugo Landau <hlandau@devever.net> (Merged from openssl/openssl#25737) Signed-off-by: zxz*3 <zhangxiaozan1@huawei.com>