Conversation
Thanks! Could you do a quick benchmark?
It is actually 1–2% slower. Looking at the ASM, the optimizer does the right thing with the bytewise loop (it unrolls the loop and moves data in SIMD chunks), but it doesn't see through the wordwise loop. However, I found that if I manually unroll the loop, the optimizer produces SIMD output equivalent to the current bytewise version.

Benchmarks (SSE4.1 machine):

Before this PR:
After original PR:
After updated PR:
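For concreteness, here is a minimal sketch of the kind of manual unrolling described, assuming a hypothetical `copy_words` helper; the name and signature are illustrative, not the PR's actual code:

```rust
// Illustrative manual unroll (hypothetical helper, not the PR's code):
// copy `state` words to `out` in four-word chunks so the optimizer
// can emit SIMD moves instead of scalar word-by-word stores.
fn copy_words(out: &mut [u32], state: &[u32]) {
    let mut out_chunks = out.chunks_exact_mut(4);
    let mut in_chunks = state.chunks_exact(4);
    for (o, s) in (&mut out_chunks).zip(&mut in_chunks) {
        // Four independent stores per iteration; LLVM typically
        // merges these into a single 128-bit move on SSE targets.
        o[0] = s[0];
        o[1] = s[1];
        o[2] = s[2];
        o[3] = s[3];
    }
    // Copy any tail words that do not fill a whole four-word chunk.
    for (o, s) in out_chunks
        .into_remainder()
        .iter_mut()
        .zip(in_chunks.remainder())
    {
        *o = *s;
    }
}
```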
What if you use explicit indexing (a 0..4 loop)?
This reverts commit 7d9607a. (It had a bug; after fixing the bug, performance was poor.)
@Ralith: Thanks for the idea, but in this case I'm getting poor performance with a 0..4 loop. I tried it as follows:
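The original snippet was stripped from the thread; the following is a hypothetical reconstruction of such an explicit-indexing attempt, using the same illustrative `copy_words` shape as above:

```rust
// Hypothetical reconstruction (the original snippet was not
// preserved): explicit indexing with a fixed 0..4 loop per chunk.
fn copy_words_indexed(out: &mut [u32], state: &[u32]) {
    for (o, s) in out.chunks_exact_mut(4).zip(state.chunks_exact(4)) {
        for i in 0..4 {
            o[i] = s[i];
        }
    }
}
```

In principle LLVM can fully unroll a fixed 0..4 loop, but as the comment notes, it did not produce good code in this case.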
Ah well, thanks for trying!
Perf numbers for another PR I'm making weren't what I expected, but I narrowed the results down to this change: it's 15–29% slower. (CPU is a 5800X, aka Vermeer/Zen 3.)
What is "before", and what is "after"? Your numbers look faster. @kazcw Did you perform your benchmarks with native optimizations or without? |
I also observe the performance regression on a Ryzen 9 4900HS, independent of native optimizations. So it looks like the new code does not optimize properly for AVX?

Before:
After:
mod guts was originally designed for the byteslice interface that RustCrypto APIs require, but the algorithm operates on u32 words internally, and rand wants a wordslice interface, so we were converting to bytes in mod guts and converting back to words in mod chacha. We can instead output directly to a wordslice in guts. This is simpler, it may be marginally faster, and it avoids an unsafe block (cf. #1170).
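To make the before/after concrete, here is a schematic sketch with hypothetical names (refill_bytes and refill_words are not the actual guts API): before, the u32 state was serialized to bytes and the caller reassembled words; after, guts writes the words directly.

```rust
// Schematic of the change described above (names are hypothetical,
// not the actual `guts` API).

// Before: `guts` serialized its u32 state to bytes, and callers
// that wanted words had to reassemble them from those bytes.
fn refill_bytes(state: &[u32; 16], out: &mut [u8; 64]) {
    for (chunk, word) in out.chunks_exact_mut(4).zip(state) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
}

// After: `guts` writes the words directly into a u32 slice, so no
// byte round-trip (and no unsafe conversion) is needed.
fn refill_words(state: &[u32; 16], out: &mut [u32; 16]) {
    out.copy_from_slice(state);
}
```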