Chacha: performance improvements #1192
|
I noticed the Travis tests failing to build the ppv-lite update, so I yanked that version. The problem affects non-x86_64 platforms. I'll take this out of draft once I have that working. |
On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.
Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).
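To illustrate the counter trick described above, here is a minimal sketch in plain Rust, not the code from this PR. It assumes the 64-bit pos counter occupies the first two 32-bit words of the last state row, low word first; the function names `increment_wordwise` and `increment_le_lanes` are hypothetical. The second function models a little-endian 128-bit register with a byte array, so it gives the same result on any host.

```rust
/// Reference version: add `n` to the 64-bit counter stored in words 0 and 1
/// (low word first), propagating the carry between the two words by hand.
fn increment_wordwise(mut row: [u32; 4], n: u64) -> [u32; 4] {
    let (lo, carry) = row[0].overflowing_add(n as u32);
    row[0] = lo;
    row[1] = row[1]
        .wrapping_add((n >> 32) as u32)
        .wrapping_add(carry as u32);
    row
}

/// Little-endian shortcut: treat the same 16 bytes as two u64 lanes and do a
/// single lane-wise add of [n, 0]. The carry from word 0 into word 1 falls out
/// of the 64-bit addition, so no per-word carry handling is needed.
fn increment_le_lanes(row: [u32; 4], n: u64) -> [u32; 4] {
    // Lay the row out as it would sit in little-endian memory.
    let mut bytes = [0u8; 16];
    for (i, w) in row.iter().enumerate() {
        bytes[4 * i..4 * i + 4].copy_from_slice(&w.to_le_bytes());
    }
    // Reinterpret as two u64 lanes and add [n, 0] lane by lane.
    let addend = [n, 0u64];
    for (i, a) in addend.iter().enumerate() {
        let mut lane = [0u8; 8];
        lane.copy_from_slice(&bytes[8 * i..8 * i + 8]);
        let sum = u64::from_le_bytes(lane).wrapping_add(*a);
        bytes[8 * i..8 * i + 8].copy_from_slice(&sum.to_le_bytes());
    }
    // Read the bytes back as four u32 words.
    let mut out = [0u32; 4];
    for (i, w) in out.iter_mut().enumerate() {
        let mut word = [0u8; 4];
        word.copy_from_slice(&bytes[4 * i..4 * i + 4]);
        *w = u32::from_le_bytes(word);
    }
    out
}
```

Both functions agree for any input, including when the low word overflows. On a big-endian target the u64 view would put the two words in the wrong order within each lane, which is presumably why the fast path is limited to little-endian platforms.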
|
Also for comparison on Zen3 5800X: |
|
Those are some very impressive improvements! Could you please also update the |
|
I'm not sure if calling it a 48% improvement in the changelog is honest. These are my numbers for the 0.8.0 release: Compared to that, this PR's performance is +7%, +5%, +4% (almost margin of error). #1181 is what decreased performance; this is just recovering it. |
|
Oh, that's disappointing. I didn't realize that PR was never reverted. |
|
That PR removes an I can't cleanly revert #1181 on top of this PR (for testing purposes). |
|
My thinking was that no incremental improvement to #1181 was possible--the compiler couldn't see through that choice of abstraction for AVX2, so that was it. The goal was still valid, but that attempt had failed. I expected it would be reverted and didn't realize that it hadn't yet. (I should have communicated more there and made sure we were all on the same page.) I started this as a new attempt at improving outputting, since after #1181 I realized that the optimal approach for AVX2 would need an explicit 4x4(128b)-transpose. Because I made a wrong assumption about which outputting code I was replacing, I was mistaken about what the baseline was. Sorry for the confusion, it made this small win a lot less exciting. But anyway, progress is progress. |
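For readers unfamiliar with the 4x4 (128b) transpose mentioned above, a rough scalar sketch follows. It is not the AVX2 code from this PR. It assumes four blocks are computed side by side, with each of the four state rows held as four 128-bit chunks (one per block); the names `Chunk`, `transpose4`, and `write_out` are hypothetical. The transpose regroups the chunks so that each block's 64 bytes become contiguous before being copied to the output buffer.

```rust
/// One 128-bit chunk, modelled as four 32-bit words.
type Chunk = [u32; 4];

/// Transpose a 4x4 grid of 128-bit chunks.
/// Input:  rows[r][b]   = state row `r` of block `b` (row-major, as computed).
/// Output: blocks[b][r] = state row `r` of block `b` (block-major, as output).
fn transpose4(rows: [[Chunk; 4]; 4]) -> [[Chunk; 4]; 4] {
    let mut blocks = [[[0u32; 4]; 4]; 4];
    for r in 0..4 {
        for b in 0..4 {
            blocks[b][r] = rows[r][b];
        }
    }
    blocks
}

/// Flatten the block-major layout into a 64-word output buffer:
/// block 0 occupies words 0..16, block 1 occupies words 16..32, and so on.
fn write_out(blocks: [[Chunk; 4]; 4]) -> [u32; 64] {
    let mut out = [0u32; 64];
    for (b, block) in blocks.iter().enumerate() {
        for (r, chunk) in block.iter().enumerate() {
            let start = b * 16 + r * 4;
            out[start..start + 4].copy_from_slice(chunk);
        }
    }
    out
}
```

In a vectorized version the loops would be replaced by lane permutes on 256-bit registers; the sketch only shows the data movement that has to happen.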
|
Okay — so to go from here, shall I merge this, then you can optionally create a new PR to revert/adjust #1181? Also, it would be useful to have a changelog entry of some kind. |
|
Updated changelog. Nothing more need be done about #1181. |
|
Thanks @kazcw. |
Improve AVX2 vectorizability of copying results to buffer
Also use a faster method of incrementing the pos counter on LE.
Total performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).
CPU: E5-2620 v3 (avx2)
BEFORE
AFTER
CPU: X5640L (no avx2)
BEFORE
AFTER