Chacha: performance improvements by kazcw · Pull Request #1192 · rust-random/rand

kazcw · 2021-10-14T05:40:19Z

Improve AVX2 vectorizability of copying results to buffer

Also use a faster method of incrementing the pos counter on LE.
Total performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

CPU: E5-2620 v3 (avx2)

BEFORE

test gen_bytes_chacha8       ... bench:     501,639 ns/iter (+/- 411) = 2041 MB/s
test gen_bytes_chacha12      ... bench:     637,970 ns/iter (+/- 485) = 1605 MB/s
test gen_bytes_chacha20      ... bench:     927,819 ns/iter (+/- 6,716) = 1103 MB/s

AFTER

test gen_bytes_chacha8       ... bench:     367,249 ns/iter (+/- 205) = 2788 MB/s (+37%)
test gen_bytes_chacha12      ... bench:     512,407 ns/iter (+/- 1,121) = 1998 MB/s (+24%)
test gen_bytes_chacha20      ... bench:     803,428 ns/iter (+/- 1,592) = 1274 MB/s (+16%)

CPU: X5640L (no avx2)

BEFORE

test gen_bytes_chacha8       ... bench:     873,669 ns/iter (+/- 682) = 1172 MB/s
test gen_bytes_chacha12      ... bench:   1,179,485 ns/iter (+/- 2,889) = 868 MB/s
test gen_bytes_chacha20      ... bench:   1,768,024 ns/iter (+/- 3,030) = 579 MB/s

AFTER

test gen_bytes_chacha8       ... bench:     826,541 ns/iter (+/- 1,599) = 1238 MB/s (+6%)
test gen_bytes_chacha12      ... bench:   1,146,041 ns/iter (+/- 3,173) = 893 MB/s (+3%)
test gen_bytes_chacha20      ... bench:   1,756,848 ns/iter (+/- 4,348) = 582 MB/s (+1%)

kazcw · 2021-10-14T05:47:16Z

I noticed travis tests failing to build the ppv-lite update, so I yanked that version. The problem relates to non-x86_64 platforms. I'll take this out of draft when I have that working.

On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.

Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

dhardy · 2021-10-14T10:00:32Z

Also for comparison on Zen3 5800X:

# BEFORE
test gen_bytes_chacha8       ... bench:     228,002 ns/iter (+/- 2,042) = 4491 MB/s
test gen_bytes_chacha12      ... bench:     297,280 ns/iter (+/- 2,171) = 3444 MB/s
test gen_bytes_chacha20      ... bench:     435,253 ns/iter (+/- 3,133) = 2352 MB/s
# AFTER
test gen_bytes_chacha8       ... bench:     153,607 ns/iter (+/- 1,184) = 6666 MB/s (+48%)
test gen_bytes_chacha12      ... bench:     223,379 ns/iter (+/- 8,808) = 4584 MB/s (+33%)
test gen_bytes_chacha20      ... bench:     361,063 ns/iter (+/- 3,144) = 2836 MB/s (+21%)

vks · 2021-10-14T10:46:59Z

Those are some very impressive improvements! Could you please also update the rand_chacha changelog?

dhardy · 2021-10-14T16:48:44Z

I'm not sure if calling it a 48% improvement in the changelog is honest. These are my numbers for the 0.8.0 release:

test gen_bytes_chacha8       ... bench:     164,178 ns/iter (+/- 5,078) = 6237 MB/s
test gen_bytes_chacha12      ... bench:     233,660 ns/iter (+/- 5,813) = 4382 MB/s
test gen_bytes_chacha20      ... bench:     374,233 ns/iter (+/- 6,502) = 2736 MB/s

Compared to that, this PR's performance is +7%, +5%, +4% (almost margin of error).

#1181 is what decreased performance; this is just recovering it.

dhardy

Anyway, thanks for the fix @kazcw! Other than the misleading changelog I approve the PR.

kazcw · 2021-10-14T16:59:33Z

Oh, that's disappointing. I didn't realize that PR was never reverted.

dhardy · 2021-10-15T10:03:20Z

That PR removes an unsafe and ideally wants more looking into. Sorry, I assumed this was an attempt to optimise the result.

I can't cleanly revert #1181 on top of this PR (for testing purposes).

kazcw · 2021-10-15T15:42:25Z

My thinking was that no incremental improvement to #1181 was possible--the compiler couldn't see through that choice of abstraction for AVX2, so that was it. The goal was still valid, but that attempt had failed. I expected it would be reverted and didn't realize that it hadn't yet. (I should have communicated more there and made sure we were all on the same page.)

I started this as a new attempt at improving outputting, since after #1181 I realized that the optimal approach for AVX2 would need an explicit 4x4(128b)-transpose. Because I made a wrong assumption about which outputting code I was replacing, I was mistaken about what the baseline was. Sorry for the confusion, it made this small win a lot less exciting.

But anyway, progress is progress.

dhardy · 2021-10-18T09:28:50Z

Okay — so to go from here, shall I merge this, then you can optionally create a new PR to revert/adjust #1181?

Also, it would be useful to have a changelog entry of some kind.

kazcw · 2021-10-22T16:21:46Z

Updated changelog.

Nothing more need be done about #1181.

dhardy · 2021-10-23T10:40:36Z

Thanks @kazcw.

kazcw marked this pull request as draft October 14, 2021 05:45

kazcw marked this pull request as ready for review October 14, 2021 06:05

kazcw force-pushed the chacha-faster branch from f92197f to ad45758 Compare October 14, 2021 06:06

kazcw marked this pull request as draft October 14, 2021 06:09

kazcw added 2 commits October 13, 2021 23:20

rand_chacha: improve little-endian performance incrementing pos counter

6e6cc35

On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.

rand_chacha: optimize outputting

6455dc4

Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

kazcw force-pushed the chacha-faster branch from ad45758 to 6455dc4 Compare October 14, 2021 06:20

kazcw marked this pull request as ready for review October 14, 2021 06:26

dhardy approved these changes Oct 14, 2021

View reviewed changes

kazcw force-pushed the chacha-faster branch from f76c1f0 to 6455dc4 Compare October 14, 2021 16:59

rand_chacha: update changelog

fb7af73

dhardy merged commit 0f4fc6b into rust-random:master Oct 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Chacha: performance improvements#1192

Chacha: performance improvements#1192
dhardy merged 3 commits intorust-random:masterfrom
kazcw:chacha-faster

kazcw commented Oct 14, 2021

Uh oh!

kazcw commented Oct 14, 2021

Uh oh!

dhardy commented Oct 14, 2021 •

edited

Loading

Uh oh!

vks commented Oct 14, 2021

Uh oh!

dhardy commented Oct 14, 2021 •

edited

Loading

Uh oh!

dhardy left a comment

Uh oh!

kazcw commented Oct 14, 2021

Uh oh!

dhardy commented Oct 15, 2021 •

edited

Loading

Uh oh!

kazcw commented Oct 15, 2021

Uh oh!

dhardy commented Oct 18, 2021

Uh oh!

kazcw commented Oct 22, 2021

Uh oh!

dhardy commented Oct 23, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

kazcw commented Oct 14, 2021

Improve AVX2 vectorizability of copying results to buffer

CPU: E5-2620 v3 (avx2)

CPU: X5640L (no avx2)

Uh oh!

kazcw commented Oct 14, 2021

Uh oh!

dhardy commented Oct 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vks commented Oct 14, 2021

Uh oh!

dhardy commented Oct 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dhardy left a comment

Choose a reason for hiding this comment

Uh oh!

kazcw commented Oct 14, 2021

Uh oh!

dhardy commented Oct 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kazcw commented Oct 15, 2021

Uh oh!

dhardy commented Oct 18, 2021

Uh oh!

kazcw commented Oct 22, 2021

Uh oh!

dhardy commented Oct 23, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dhardy commented Oct 14, 2021 •

edited

Loading

dhardy commented Oct 14, 2021 •

edited

Loading

dhardy commented Oct 15, 2021 •

edited

Loading