polyval: implement Karatsuba multiplication for arm64 #181
tarcieri merged 2 commits into RustCrypto:master
Improves performance by ~200 MB/s on a 2020 M1.
Signed-off-by: Eric Lagergren <eric@ericlagergren.com>
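For readers unfamiliar with the technique named in the title, here is a minimal, portable sketch of Karatsuba carry-less multiplication. It is not the code from this PR: `clmul64` is a hypothetical bit-by-bit stand-in for a hardware 64x64 carry-less multiply instruction, and the function names are made up for illustration. It only shows the unreduced 128x128 -> 256-bit product; the POLYVAL field reduction is omitted.

```rust
/// Hypothetical software stand-in for a hardware 64x64 -> 128-bit
/// carry-less multiply. Addition of partial products is XOR.
fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Schoolbook 128x128 -> 256-bit carry-less product: four multiplies.
/// Returns (low 128 bits, high 128 bits).
fn clmul128_schoolbook(x: u128, y: u128) -> (u128, u128) {
    let (x0, x1) = (x as u64, (x >> 64) as u64);
    let (y0, y1) = (y as u64, (y >> 64) as u64);
    let mid = clmul64(x0, y1) ^ clmul64(x1, y0);
    (clmul64(x0, y0) ^ (mid << 64), clmul64(x1, y1) ^ (mid >> 64))
}

/// Karatsuba 128x128 -> 256-bit carry-less product: three multiplies.
/// Returns (low 128 bits, high 128 bits).
fn clmul128_karatsuba(x: u128, y: u128) -> (u128, u128) {
    let (x0, x1) = (x as u64, (x >> 64) as u64);
    let (y0, y1) = (y as u64, (y >> 64) as u64);
    let lo = clmul64(x0, y0);
    let hi = clmul64(x1, y1);
    // (x0 ^ x1)(y0 ^ y1) = x0*y0 ^ x0*y1 ^ x1*y0 ^ x1*y1, so XORing out
    // lo and hi leaves exactly the two cross terms.
    let mid = clmul64(x0 ^ x1, y0 ^ y1) ^ lo ^ hi;
    (lo ^ (mid << 64), hi ^ (mid >> 64))
}
```

Both functions should return the same pair for any inputs; the Karatsuba form trades one multiply for a few extra XORs, which is typically a win when the multiply instruction is the bottleneck.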
The code is taken from https://github.com/ericlagergren/polyval-rs/tree/dev, which also has "wide" implementations (8 blocks at a time) with significantly better performance (~0.17 cycles per byte instead of ~2).
I also have an x86 version I can submit if you'd like.
Parallel and x86 versions would be appreciated, although perhaps as separate PRs to ease reviewability.
tarcieri left a comment:
Tested locally on an M2 Max, where I observed the reported speedups.
Percentage-wise it's about a 17% speedup.
Actually, your x86 implementation only uses 3 clmul instructions, so I don't think the serial version can be improved much. I'll look at adding parallel implementations. Offhand, do you know if the current API supports it? The input probably needs to be in one contiguous buffer. (Maybe not?) But that's the common case, at least for stuff like non-interleaved AES-GCM-SIV or HCTR2.
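To make the contiguous-buffer point concrete, here is a rough sketch of how a "wide" (8 blocks at a time) evaluation can walk one contiguous slice. Everything in it is an assumption for illustration: it uses a toy GF(2^8) field with one-byte "blocks" instead of POLYVAL's GF(2^128) and 16-byte blocks, and `gf_mul`, `hash_serial`, `hash_wide`, and `WIDTH` are hypothetical names, not part of the `polyval` crate API. It only demonstrates that the aggregated form gives the same result as the serial Horner loop.

```rust
/// Toy GF(2^8) multiply (reduction polynomial x^8 + x^4 + x^3 + x + 1).
/// Stand-in for the real GF(2^128) multiply; assumption for illustration.
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut acc = 0u8;
    while b != 0 {
        if b & 1 == 1 {
            acc ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1b;
        }
        b >>= 1;
    }
    acc
}

/// Serial Horner evaluation: y = (y ^ x) * h for each block in order.
fn hash_serial(h: u8, blocks: &[u8]) -> u8 {
    blocks.iter().fold(0, |y, &x| gf_mul(y ^ x, h))
}

const WIDTH: usize = 8;

/// Wide evaluation: for each run of WIDTH blocks compute
/// (y ^ x1)*h^8 ^ x2*h^7 ^ ... ^ x8*h. The WIDTH products are independent
/// of each other, which is what lets real implementations pipeline them.
fn hash_wide(h: u8, blocks: &[u8]) -> u8 {
    // powers[i] holds h^(i + 1).
    let mut powers = [0u8; WIDTH];
    powers[0] = h;
    for i in 1..WIDTH {
        powers[i] = gf_mul(powers[i - 1], h);
    }

    let mut y = 0u8;
    let mut chunks = blocks.chunks_exact(WIDTH);
    for chunk in &mut chunks {
        let mut acc = 0u8;
        for (i, &x) in chunk.iter().enumerate() {
            // Only the first block of the chunk absorbs the running state.
            let term = if i == 0 { y ^ x } else { x };
            acc ^= gf_mul(term, powers[WIDTH - 1 - i]);
        }
        y = acc;
    }
    // Leftover blocks (fewer than WIDTH) fall back to the serial step.
    for &x in chunks.remainder() {
        y = gf_mul(y ^ x, h);
    }
    y
}
```

In this toy version every product is reduced immediately; a real wide implementation would typically accumulate the unreduced products and reduce once per chunk. The `chunks_exact` walk is why a single contiguous input slice is convenient: input that arrives one block at a time would have to be buffered before the wide path could run.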
Take a look at
Added
- add `new_with_init_block` (RustCrypto#195)

Changed
- implement Karatsuba multiplication for arm64 (RustCrypto#181)