Queue<T> optimization of (Try)Dequeue by gfoidl · Pull Request #26087 · dotnet/corefx

gfoidl · 2017-12-28T16:15:00Z

Description

enabled RCE on array-access
in MoveNext the JIT can produce better code (see SpanHelpers IndexOfAny-methods use sentinel-value #25970 (comment))

By Dequeue the effect on value types is not so big, than for reference types (two array accesses).

This PR is a kind of extension to #17318, similar to #26086

Benchmarks

Notes

Code for benchmarks lives here

Due the use of http://benchmarkdotnet.org the benchmarks were done a couple of times, because some crazy results with perf x2 were reported and this seems too strange. The results shown here are the more realistic ones. Individual results are in the linked repo above.
The changes from this PR never showed a decrease in perf.

Dequeue

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled	ScaledSD
Dequeue_Default	0.9702 ns	0.1013 ns	0.1421 ns	1.00	0.00
Dequeue_New	0.8880 ns	0.1042 ns	0.1770 ns	0.93	0.22

TryDequeue

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

Method	Mean	Error	StdDev	Scaled	ScaledSD
TryDequeue_Default	3.990 ns	0.1666 ns	0.2545 ns	1.00	0.00
TryDequeue_New	3.559 ns	0.1585 ns	0.2273 ns	0.90	0.08

Notes

Enqueue

Didn't find a way to improve perf, besides the effect of better JIT-codegen in MoveNext.
Remained Untouched.

(Try)Peek

Didn't find a way to improve perf, and "RCE-if" does not bring a win, it's getting rather slower, because the code gets bigger. I didn't keep benchmark-records for this case.
Remained untouched.

With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win. In MoveNext the conditional is with if, so the JIT can produce better code.

benaadams · 2017-12-28T16:58:39Z

-            index = (tmp == _array.Length) ? 0 : tmp;
+            // It is tempting to use the remainder operator here but it is actually much slower
+            // than a simple comparison and a rarely taken branch.
+            // JIT produces better code than with ternary operator ?:


What's the change here?

e.g. is keeping the tmp var better? (does it push to register vs memory location on index)?

The if produces better code than the ternary.

The point with the register I have to validate.

The tmp-Variant uses registers, whilst index pushes to memory. I'll update the PR, though the effect might be minimal, but it's still better 😉

For reference:

Original

000007fe`7b6ec5f0 488d4110 lea rax,[rcx+10h] 000007fe`7b6ec5f4 8b10 mov edx,dword ptr [rax] 000007fe`7b6ec5f6 ffc2 inc edx 000007fe`7b6ec5f8 488b4908 mov rcx,qword ptr [rcx+8] 000007fe`7b6ec5fc 395108 cmp dword ptr [rcx+8],edx 000007fe`7b6ec5ff 7402 je 000007fe`7b6ec603 000007fe`7b6ec601 eb02 jmp 000007fe`7b6ec605 000007fe`7b6ec603 33d2 xor edx,edx 000007fe`7b6ec605 8910 mov dword ptr [rax],edx 000007fe`7b6ec607 c3 ret

Index only

000007fe`7b71c5f0 488d4110 lea rax,[rcx+10h] 000007fe`7b71c5f4 ff00 inc dword ptr [rax] ; doesn't use register 000007fe`7b71c5f6 8b10 mov edx,dword ptr [rax] 000007fe`7b71c5f8 488b4908 mov rcx,qword ptr [rcx+8] 000007fe`7b71c5fc 3b5108 cmp edx,dword ptr [rcx+8] 000007fe`7b71c5ff 7504 jne 000007fe`7b71c605 000007fe`7b71c601 33d2 xor edx,edx 000007fe`7b71c603 8910 mov dword ptr [rax],edx 000007fe`7b71c605 c3 ret

tmp-Variable

000007fe`7b71c5f0 488d4110 lea rax,[rcx+10h] 000007fe`7b71c5f4 8b10 mov edx,dword ptr [rax] 000007fe`7b71c5f6 ffc2 inc edx ; uses register 000007fe`7b71c5f8 488b4908 mov rcx,qword ptr [rcx+8] 000007fe`7b71c5fc 395108 cmp dword ptr [rcx+8],edx 000007fe`7b71c5ff 7502 jne 000007fe`7b71c603 000007fe`7b71c601 33d2 xor edx,edx 000007fe`7b71c603 8910 mov dword ptr [rax],edx 000007fe`7b71c605 c3 ret

karelz · 2017-12-28T17:56:55Z

cc @valenis

safern · 2017-12-30T08:07:31Z

        {
-            if (_size == 0)
+            int head = _head;
+            T[] array = _array;


Are the array copies here necessary?

It may not be necessary but the JIT is bugged and not having this copy may result in incorrect range check elimination. See the discussion in the similar Stack PR: #26086 (comment)

karelz · 2018-01-27T18:35:47Z

@safern what are next steps of this PR? Do we take it, or not?

stephentoub · 2018-01-29T14:58:58Z

+            int head = _head;
+            T[] array = _array;
+
+            if (_size == 0 || (uint)head >= (uint)array.Length)


Why is the latter condition necessary?

To enable the range check elimination on the following array[head]. The real win is on reference types, where this access is two times.

Isn't it strange that we can do the check in this manner more efficiently than the JIT can? What prevents the JIT from doing its bounds check in a similar manner?

Or, I guess the issue isn't the check itself, but the JIT has additional logic for what happens if the check fails, and since we know it won't fail, we can avoid that extra logic?

The JIT has no way of knowing that head is a valid array index so it generates normal bounds checks for array[head], that's something like if ((uint)head >= (uint)array.Length) throw IndexOutOfRangeException();.

If we do the check manually we can piggy back on the existing throw instead of generating a separate throw IndexOutOfRangeException().

It's a rather creative use of the (uint)head >= (uint)array.Length trick. Not entirely sure it's worth it.

@gfoidl, does it actually improve throughput, or just decrease the asm size?

You are right, I forgot about the change in MoveNext that also has an influence to the numbers. So I'll do some benchmark that focuses on this.

BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125) Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8 Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC .NET Core SDK=2.1.4 [Host] : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT

Method Mean Error StdDev Scaled ScaledSD

Dequeue_PR 0.7148 ns 0.0234 ns 0.0195 ns 1.00 0.00

Dequeue_wo_RCE 0.7514 ns 0.0791 ns 0.0740 ns 1.05 0.10

Dequeue_PR is the method from this PR.
Dequeue_wo_RCE is:

public T Dequeue() { if (_size == 0) { ThrowForEmptyQueue(); } T removed = _array[_head]; if (RuntimeHelpers.IsReferenceOrContainsReferences<T>()) { _array[_head] = default; } MoveNext(ref _head); _size--; _version++; return removed; }

What if you use the code exactly as it is in your PR, and just remove || (uint)head >= (uint)array.Length? That's what I intended to ask about.

The improvement here most likely comes from caching _array and _head in local variables, not from the manual bounds check.

What if you use the code exactly as it is in your PR

Then it is equal in perf (within noise, so I can't say which one is faster).

The dasm is exactely as @mikedn pointed out in #26087 (comment)

With the manual bounds check

G_M11457_IG02: 448B7318 mov r14d, dword ptr [rbx+24] 4C8B7B08 mov r15, gword ptr [rbx+8] 837B2000 cmp dword ptr [rbx+32], 0 7474 je SHORT G_M11457_IG08 418B7F08 mov edi, dword ptr [r15+8] 413BFE cmp edi, r14d 766B jbe SHORT G_M11457_IG08 ; more code G_M11457_IG08: 488BFB mov rdi, rbx E8D6F6FFFF call ...:ThrowForEmptyQueue():this CC int3

Without the manual bounds check:

G_M7328_IG02: 448B7318 mov r14d, dword ptr [rbx+24] 4C8B7B08 mov r15, gword ptr [rbx+8] 837B2000 cmp dword ptr [rbx+32], 0 7471 je SHORT G_M7328_IG08 G_M7328_IG03: 453B7708 cmp r14d, dword ptr [r15+8] 7373 jae SHORT G_M7328_IG09 ; more code G_M7328_IG08: 488BFB mov rdi, rbx E859F7FFFF call ...:ThrowForEmptyQueue():this G_M7328_IG09: E88C379878 call CORINFO_HELP_RNGCHKFAIL CC int3

It is cool that the JIT does emit only one bounds check for

T removed = array[head]; if (RuntimeHelpers.IsReferenceOrContainsReferences<T>()) { array[head] = default; }

G_M7328_IG04: 488BF8 mov rdi, rax E8222965FF call System.Runtime.CompilerServices.RuntimeHelpers:IsReferenceOrContainsReferences():bool 85C0 test eax, eax 740A je SHORT G_M7328_IG05 4963C6 movsxd rax, r14d 33FF xor rdi, rdi 49897CC710 mov gword ptr [r15+8*rax+16], rdi ; no bounds check

I didn't know about that fact, and have to admit that I didn't investigate this case thoroughly enough, hence the manual check in this PR.

So in conclusion there is no real benefit from this manual check -- except the saving of 2 bytes of code. But the code is too big for inlining anyway, therefore this should not matter.

I will update the PR to remove the manual bound check.

stephentoub · 2018-01-29T14:59:31Z

+            int head = _head;
+            T[] array = _array;
+
+            if (_size == 0 || (uint)head >= (uint)array.Length)


Same question; why is the latter condition necessary?

Same reason as above.

To enable the range check elimination on the following array[head]. The real win is on reference types, where this access is two times.

I'm investigating this case right now.

No clear difference measurable. The asm is quite similar, except the jumps at the beginning of the method.

With manual range check:

G_M54913_IG02: 448B7318 mov r14d, dword ptr [rbx+24] 4C8B7B08 mov r15, gword ptr [rbx+8] 837B2000 cmp dword ptr [rbx+32], 0 7409 je SHORT G_M54913_IG03 418B4708 mov eax, dword ptr [r15+8] 413BC6 cmp eax, r14d 7710 ja SHORT G_M54913_IG05 G_M54913_IG03: 33C0 xor rax, rax 488906 mov qword ptr [rsi], rax G_M54913_IG04: 488D65E8 lea rsp, [rbp-18H] 5B pop rbx 415E pop r14 415F pop r15 5D pop rbp C3 ret G_M54913_IG05: 488975D8 mov bword ptr [rbp-28H], rsi 4963FE movsxd rdi, r14d 498B74FF10 mov rsi, gword ptr [r15+8*rdi+16] 488B7DD8 mov rdi, bword ptr [rbp-28H] E8CF54AD78 call CORINFO_HELP_CHECKED_ASSIGN_REF ; more code (equal)

Without manual range check:

G_M50722_IG02: 448B7318 mov r14d, dword ptr [rbx+24] 4C8B7B08 mov r15, gword ptr [rbx+8] 837B2000 cmp dword ptr [rbx+32], 0 7510 jne SHORT G_M50722_IG04 33C0 xor rax, rax 488906 mov qword ptr [rsi], rax G_M50722_IG03: 488D65E8 lea rsp, [rbp-18H] 5B pop rbx 415E pop r14 415F pop r15 5D pop rbp C3 ret G_M50722_IG04: 488975D8 mov bword ptr [rbp-28H], rsi 418B7F08 mov edi, dword ptr [r15+8] 443BF7 cmp r14d, edi 7374 jae SHORT G_M50722_IG09 4963FE movsxd rdi, r14d 498B74FF10 mov rsi, gword ptr [r15+8*rdi+16] 488B7DD8 mov rdi, bword ptr [rbp-28H] E8EF53AD78 call CORINFO_HELP_CHECKED_ASSIGN_REF ; more code (equal) G_M50722_IG09: E896B79678 call CORINFO_HELP_RNGCHKFAIL CC int3

So I'll also go with the cleaner code and remove the manual range check?

So I'll also go with the cleaner code and remove the manual range check?

Yes please. Thanks.

stephentoub · 2018-01-29T15:04:57Z

+            {
+                tmp = 0;
+            }
+            index = tmp;


@AndyAyersMS, is there a good reason for the difference in codegen here? It's unfortunate if the more concise / arguably simpler form results in worse code. (That said, I'm only basing the "JIT produces better code" on the comment.)

Just for reference:

public class Program { private static int[] _array = new int[3]; static Program() {} public static void Main(string[] args) { int index = 3; MoveNext1(ref index); MoveNext2(ref index); } private static void MoveNext1(ref int index) { int tmp = index + 1; index = (tmp == _array.Length) ? 0 : tmp; } private static void MoveNext2(ref int index) { int tmp = index + 1; if (tmp == _array.Length) tmp = 0; index = tmp; } }

; Assembly listing for method ConsoleApplication.Program:MoveNext1(byref) ; Emitting BLENDED_CODE for X64 CPU with AVX ; optimized code ; rbp based frame ; partially interruptible ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> rdi ; V01 loc0 [V01,T01] ( 3, 2.50) int -> rax ; V02 tmp0 [V02,T02] ( 3, 2 ) byref -> rdi ; V03 tmp1 [V03,T03] ( 3, 2 ) byref -> rdi ; V04 tmp2 [V04,T04] ( 3, 2 ) int -> rax ;# V05 OutArgs [V05 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] ; ; Lcl frame size = 0 G_M9976_IG01: 55 push rbp 488BEC mov rbp, rsp G_M9976_IG02: 8B07 mov eax, dword ptr [rdi] FFC0 inc eax 48BEB8070040367F0000 mov rsi, 0x7F36400007B8 488B36 mov rsi, gword ptr [rsi] 394608 cmp dword ptr [rsi+8], eax 7402 je SHORT G_M9976_IG03 EB02 jmp SHORT G_M9976_IG04 G_M9976_IG03: 33C0 xor eax, eax G_M9976_IG04: 8907 mov dword ptr [rdi], eax G_M9976_IG05: 5D pop rbp C3 ret ; Total bytes of code 34, prolog size 4 for method ConsoleApplication.Program:MoveNext1(byref) ; ============================================================ ; Assembly listing for method ConsoleApplication.Program:MoveNext2(byref) ; Emitting BLENDED_CODE for X64 CPU with AVX ; optimized code ; rsp based frame ; partially interruptible ; Final local variable assignments ; ; V00 arg0 [V00,T00] ( 4, 4 ) byref -> rdi ; V01 loc0 [V01,T01] ( 4, 3.50) int -> rax ;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] ; ; Lcl frame size = 0 G_M9979_IG01: G_M9979_IG02: 8B07 mov eax, dword ptr [rdi] FFC0 inc eax 48BEB8070040367F0000 mov rsi, 0x7F36400007B8 488B36 mov rsi, gword ptr [rsi] 394608 cmp dword ptr [rsi+8], eax 7502 jne SHORT G_M9979_IG03 33C0 xor eax, eax G_M9979_IG03: 8907 mov dword ptr [rdi], eax G_M9979_IG04: C3 ret ; Total bytes of code 27, prolog size 0 for method ConsoleApplication.Program:MoveNext2(byref) ; ============================================================

Thanks for noticing this. Looks like (based on the above) there is a pattern that the jit's flow optimizer doesn't spot that leads to a bit of extra branching.

On Windows the codegen for MoveNext1 has an extra mov thrown in just because:

G_M38324_IG02: 8B01 mov eax, dword ptr [rcx] FFC0 inc eax 488BD1 mov rdx, rcx 48B9E028009012020000 mov rcx, 0x212900028E0 488B09 mov rcx, gword ptr [rcx] 394108 cmp dword ptr [rcx+8], eax 7402 je SHORT G_M38324_IG03 EB02 jmp SHORT G_M38324_IG04 G_M38324_IG03: 33C0 xor eax, eax G_M38324_IG04: 8902 mov dword ptr [rdx], eax G_M38324_IG05: C3 ret

Opened dotnet/coreclr#16079.

* #26087 (review) * #26087 (review)

stephentoub

Thanks

* Queue RCE With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win. In MoveNext the conditional is with if, so the JIT can produce better code. * MoveNext uses temp-Variable for register access instead of memory access * Addressed PR feedback * dotnet/corefx#26087 (review) * dotnet/corefx#26087 (review) Commit migrated from dotnet/corefx@fc7cd1b

Queue RCE

bd9c7da

With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win. In MoveNext the conditional is with if, so the JIT can produce better code.

benaadams reviewed Dec 28, 2017

View reviewed changes

karelz added the area-System.Collections label Dec 28, 2017

karelz assigned gfoidl, ianhays and safern Dec 28, 2017

MoveNext uses temp-Variable for register access instead of memory access

2964c4d

gfoidl added a commit to gfoidl/Benchmarks that referenced this pull request Dec 28, 2017

Updated Queue1 -- dotnet/corefx#26087 (comment)

c715ae0

safern reviewed Dec 30, 2017

View reviewed changes

stephentoub reviewed Jan 29, 2018

View reviewed changes

Addressed PR feedback

244839c

* #26087 (review) * #26087 (review)

stephentoub approved these changes Jan 30, 2018

View reviewed changes

stephentoub merged commit fc7cd1b into dotnet:master Jan 30, 2018

gfoidl deleted the queue-dequeue branch January 30, 2018 10:02

karelz added this to the 2.1.0 milestone Feb 4, 2018

Ruikuan mentioned this pull request Nov 28, 2018

.Net Core 2.1 性能改进 Ruikuan/blog#26

Closed

AndyAyersMS mentioned this pull request Jan 31, 2020

JIT: missing flow optimization pattern dotnet/runtime#9623

Open

Method	Mean	Error	StdDev	Scaled	ScaledSD
Dequeue_PR	0.7148 ns	0.0234 ns	0.0195 ns	1.00	0.00
Dequeue_wo_RCE	0.7514 ns	0.0791 ns	0.0740 ns	1.05	0.10

Conversation

gfoidl commented Dec 28, 2017

Description

Benchmarks

Notes

Dequeue

TryDequeue

Notes

Enqueue

(Try)Peek

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benaadams Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Original

Index only

tmp-Variable

Uh oh!

karelz commented Dec 28, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karelz commented Jan 27, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stephentoub Jan 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stephentoub Jan 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stephentoub left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

benaadams Dec 28, 2017 •

edited

Loading

stephentoub Jan 29, 2018 •

edited

Loading

stephentoub Jan 29, 2018 •

edited

Loading