Queue<T> optimization of (Try)Dequeue#26087
Conversation
| index = (tmp == _array.Length) ? 0 : tmp; | ||
| // It is tempting to use the remainder operator here but it is actually much slower | ||
| // than a simple comparison and a rarely taken branch. | ||
| // JIT produces better code than with ternary operator ?: |
There was a problem hiding this comment.
e.g. is keeping the tmp var better? (does it push to register vs memory location on index)?
There was a problem hiding this comment.
The if produces better code than the ternary.
The point with the register I have to validate.
There was a problem hiding this comment.
The tmp-Variant uses registers, whilst index pushes to memory. I'll update the PR, though the effect might be minimal, but it's still better 😉
For reference:
Original
000007fe`7b6ec5f0 488d4110 lea rax,[rcx+10h]
000007fe`7b6ec5f4 8b10 mov edx,dword ptr [rax]
000007fe`7b6ec5f6 ffc2 inc edx
000007fe`7b6ec5f8 488b4908 mov rcx,qword ptr [rcx+8]
000007fe`7b6ec5fc 395108 cmp dword ptr [rcx+8],edx
000007fe`7b6ec5ff 7402 je 000007fe`7b6ec603
000007fe`7b6ec601 eb02 jmp 000007fe`7b6ec605
000007fe`7b6ec603 33d2 xor edx,edx
000007fe`7b6ec605 8910 mov dword ptr [rax],edx
000007fe`7b6ec607 c3 retIndex only
000007fe`7b71c5f0 488d4110 lea rax,[rcx+10h]
000007fe`7b71c5f4 ff00 inc dword ptr [rax] ; doesn't use register
000007fe`7b71c5f6 8b10 mov edx,dword ptr [rax]
000007fe`7b71c5f8 488b4908 mov rcx,qword ptr [rcx+8]
000007fe`7b71c5fc 3b5108 cmp edx,dword ptr [rcx+8]
000007fe`7b71c5ff 7504 jne 000007fe`7b71c605
000007fe`7b71c601 33d2 xor edx,edx
000007fe`7b71c603 8910 mov dword ptr [rax],edx
000007fe`7b71c605 c3 rettmp-Variable
000007fe`7b71c5f0 488d4110 lea rax,[rcx+10h]
000007fe`7b71c5f4 8b10 mov edx,dword ptr [rax]
000007fe`7b71c5f6 ffc2 inc edx ; uses register
000007fe`7b71c5f8 488b4908 mov rcx,qword ptr [rcx+8]
000007fe`7b71c5fc 395108 cmp dword ptr [rcx+8],edx
000007fe`7b71c5ff 7502 jne 000007fe`7b71c603
000007fe`7b71c601 33d2 xor edx,edx
000007fe`7b71c603 8910 mov dword ptr [rax],edx
000007fe`7b71c605 c3 ret|
cc @valenis |
| { | ||
| if (_size == 0) | ||
| int head = _head; | ||
| T[] array = _array; |
There was a problem hiding this comment.
Are the array copies here necessary?
There was a problem hiding this comment.
It may not be necessary but the JIT is bugged and not having this copy may result in incorrect range check elimination. See the discussion in the similar Stack PR: #26086 (comment)
|
@safern what are next steps of this PR? Do we take it, or not? |
| int head = _head; | ||
| T[] array = _array; | ||
|
|
||
| if (_size == 0 || (uint)head >= (uint)array.Length) |
There was a problem hiding this comment.
Why is the latter condition necessary?
There was a problem hiding this comment.
To enable the range check elimination on the following array[head]. The real win is on reference types, where this access is two times.
There was a problem hiding this comment.
Isn't it strange that we can do the check in this manner more efficiently than the JIT can? What prevents the JIT from doing its bounds check in a similar manner?
Or, I guess the issue isn't the check itself, but the JIT has additional logic for what happens if the check fails, and since we know it won't fail, we can avoid that extra logic?
There was a problem hiding this comment.
The JIT has no way of knowing that head is a valid array index so it generates normal bounds checks for array[head], that's something like if ((uint)head >= (uint)array.Length) throw IndexOutOfRangeException();.
If we do the check manually we can piggy back on the existing throw instead of generating a separate throw IndexOutOfRangeException().
It's a rather creative use of the (uint)head >= (uint)array.Length trick. Not entirely sure it's worth it.
There was a problem hiding this comment.
@gfoidl, does it actually improve throughput, or just decrease the asm size?
There was a problem hiding this comment.
You are right, I forgot about the change in MoveNext that also has an influence to the numbers. So I'll do some benchmark that focuses on this.
There was a problem hiding this comment.
BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.4
[Host] : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
| Method | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|
| Dequeue_PR | 0.7148 ns | 0.0234 ns | 0.0195 ns | 1.00 | 0.00 |
| Dequeue_wo_RCE | 0.7514 ns | 0.0791 ns | 0.0740 ns | 1.05 | 0.10 |
Dequeue_PR is the method from this PR.
Dequeue_wo_RCE is:
public T Dequeue()
{
if (_size == 0)
{
ThrowForEmptyQueue();
}
T removed = _array[_head];
if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
{
_array[_head] = default;
}
MoveNext(ref _head);
_size--;
_version++;
return removed;
}There was a problem hiding this comment.
What if you use the code exactly as it is in your PR, and just remove || (uint)head >= (uint)array.Length? That's what I intended to ask about.
There was a problem hiding this comment.
The improvement here most likely comes from caching _array and _head in local variables, not from the manual bounds check.
There was a problem hiding this comment.
What if you use the code exactly as it is in your PR
Then it is equal in perf (within noise, so I can't say which one is faster).
The dasm is exactely as @mikedn pointed out in #26087 (comment)
With the manual bounds check
G_M11457_IG02:
448B7318 mov r14d, dword ptr [rbx+24]
4C8B7B08 mov r15, gword ptr [rbx+8]
837B2000 cmp dword ptr [rbx+32], 0
7474 je SHORT G_M11457_IG08
418B7F08 mov edi, dword ptr [r15+8]
413BFE cmp edi, r14d
766B jbe SHORT G_M11457_IG08
; more code
G_M11457_IG08:
488BFB mov rdi, rbx
E8D6F6FFFF call ...:ThrowForEmptyQueue():this
CC int3 Without the manual bounds check:
G_M7328_IG02:
448B7318 mov r14d, dword ptr [rbx+24]
4C8B7B08 mov r15, gword ptr [rbx+8]
837B2000 cmp dword ptr [rbx+32], 0
7471 je SHORT G_M7328_IG08
G_M7328_IG03:
453B7708 cmp r14d, dword ptr [r15+8]
7373 jae SHORT G_M7328_IG09
; more code
G_M7328_IG08:
488BFB mov rdi, rbx
E859F7FFFF call ...:ThrowForEmptyQueue():this
G_M7328_IG09:
E88C379878 call CORINFO_HELP_RNGCHKFAIL
CC int3 It is cool that the JIT does emit only one bounds check for
T removed = array[head];
if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
{
array[head] = default;
}G_M7328_IG04:
488BF8 mov rdi, rax
E8222965FF call System.Runtime.CompilerServices.RuntimeHelpers:IsReferenceOrContainsReferences():bool
85C0 test eax, eax
740A je SHORT G_M7328_IG05
4963C6 movsxd rax, r14d
33FF xor rdi, rdi
49897CC710 mov gword ptr [r15+8*rax+16], rdi ; no bounds check I didn't know about that fact, and have to admit that I didn't investigate this case thoroughly enough, hence the manual check in this PR.
So in conclusion there is no real benefit from this manual check -- except the saving of 2 bytes of code. But the code is too big for inlining anyway, therefore this should not matter.
I will update the PR to remove the manual bound check.
| int head = _head; | ||
| T[] array = _array; | ||
|
|
||
| if (_size == 0 || (uint)head >= (uint)array.Length) |
There was a problem hiding this comment.
Same question; why is the latter condition necessary?
There was a problem hiding this comment.
Same reason as above.
To enable the range check elimination on the following
array[head]. The real win is on reference types, where this access is two times.
There was a problem hiding this comment.
I'm investigating this case right now.
There was a problem hiding this comment.
No clear difference measurable. The asm is quite similar, except the jumps at the beginning of the method.
With manual range check:
G_M54913_IG02:
448B7318 mov r14d, dword ptr [rbx+24]
4C8B7B08 mov r15, gword ptr [rbx+8]
837B2000 cmp dword ptr [rbx+32], 0
7409 je SHORT G_M54913_IG03
418B4708 mov eax, dword ptr [r15+8]
413BC6 cmp eax, r14d
7710 ja SHORT G_M54913_IG05
G_M54913_IG03:
33C0 xor rax, rax
488906 mov qword ptr [rsi], rax
G_M54913_IG04:
488D65E8 lea rsp, [rbp-18H]
5B pop rbx
415E pop r14
415F pop r15
5D pop rbp
C3 ret
G_M54913_IG05:
488975D8 mov bword ptr [rbp-28H], rsi
4963FE movsxd rdi, r14d
498B74FF10 mov rsi, gword ptr [r15+8*rdi+16]
488B7DD8 mov rdi, bword ptr [rbp-28H]
E8CF54AD78 call CORINFO_HELP_CHECKED_ASSIGN_REF
; more code (equal)Without manual range check:
G_M50722_IG02:
448B7318 mov r14d, dword ptr [rbx+24]
4C8B7B08 mov r15, gword ptr [rbx+8]
837B2000 cmp dword ptr [rbx+32], 0
7510 jne SHORT G_M50722_IG04
33C0 xor rax, rax
488906 mov qword ptr [rsi], rax
G_M50722_IG03:
488D65E8 lea rsp, [rbp-18H]
5B pop rbx
415E pop r14
415F pop r15
5D pop rbp
C3 ret
G_M50722_IG04:
488975D8 mov bword ptr [rbp-28H], rsi
418B7F08 mov edi, dword ptr [r15+8]
443BF7 cmp r14d, edi
7374 jae SHORT G_M50722_IG09
4963FE movsxd rdi, r14d
498B74FF10 mov rsi, gword ptr [r15+8*rdi+16]
488B7DD8 mov rdi, bword ptr [rbp-28H]
E8EF53AD78 call CORINFO_HELP_CHECKED_ASSIGN_REF
; more code (equal)
G_M50722_IG09:
E896B79678 call CORINFO_HELP_RNGCHKFAIL
CC int3 So I'll also go with the cleaner code and remove the manual range check?
There was a problem hiding this comment.
So I'll also go with the cleaner code and remove the manual range check?
Yes please. Thanks.
| { | ||
| tmp = 0; | ||
| } | ||
| index = tmp; |
There was a problem hiding this comment.
@AndyAyersMS, is there a good reason for the difference in codegen here? It's unfortunate if the more concise / arguably simpler form results in worse code. (That said, I'm only basing the "JIT produces better code" on the comment.)
There was a problem hiding this comment.
Just for reference:
public class Program
{
private static int[] _array = new int[3];
static Program() {}
public static void Main(string[] args)
{
int index = 3;
MoveNext1(ref index);
MoveNext2(ref index);
}
private static void MoveNext1(ref int index)
{
int tmp = index + 1;
index = (tmp == _array.Length) ? 0 : tmp;
}
private static void MoveNext2(ref int index)
{
int tmp = index + 1;
if (tmp == _array.Length) tmp = 0;
index = tmp;
}
}; Assembly listing for method ConsoleApplication.Program:MoveNext1(byref)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 4, 4 ) byref -> rdi
; V01 loc0 [V01,T01] ( 3, 2.50) int -> rax
; V02 tmp0 [V02,T02] ( 3, 2 ) byref -> rdi
; V03 tmp1 [V03,T03] ( 3, 2 ) byref -> rdi
; V04 tmp2 [V04,T04] ( 3, 2 ) int -> rax
;# V05 OutArgs [V05 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M9976_IG01:
55 push rbp
488BEC mov rbp, rsp
G_M9976_IG02:
8B07 mov eax, dword ptr [rdi]
FFC0 inc eax
48BEB8070040367F0000 mov rsi, 0x7F36400007B8
488B36 mov rsi, gword ptr [rsi]
394608 cmp dword ptr [rsi+8], eax
7402 je SHORT G_M9976_IG03
EB02 jmp SHORT G_M9976_IG04
G_M9976_IG03:
33C0 xor eax, eax
G_M9976_IG04:
8907 mov dword ptr [rdi], eax
G_M9976_IG05:
5D pop rbp
C3 ret
; Total bytes of code 34, prolog size 4 for method ConsoleApplication.Program:MoveNext1(byref)
; ============================================================
; Assembly listing for method ConsoleApplication.Program:MoveNext2(byref)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 4, 4 ) byref -> rdi
; V01 loc0 [V01,T01] ( 4, 3.50) int -> rax
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M9979_IG01:
G_M9979_IG02:
8B07 mov eax, dword ptr [rdi]
FFC0 inc eax
48BEB8070040367F0000 mov rsi, 0x7F36400007B8
488B36 mov rsi, gword ptr [rsi]
394608 cmp dword ptr [rsi+8], eax
7502 jne SHORT G_M9979_IG03
33C0 xor eax, eax
G_M9979_IG03:
8907 mov dword ptr [rdi], eax
G_M9979_IG04:
C3 ret
; Total bytes of code 27, prolog size 0 for method ConsoleApplication.Program:MoveNext2(byref)
; ============================================================
There was a problem hiding this comment.
Thanks for noticing this. Looks like (based on the above) there is a pattern that the jit's flow optimizer doesn't spot that leads to a bit of extra branching.
On Windows the codegen for MoveNext1 has an extra mov thrown in just because:
G_M38324_IG02:
8B01 mov eax, dword ptr [rcx]
FFC0 inc eax
488BD1 mov rdx, rcx
48B9E028009012020000 mov rcx, 0x212900028E0
488B09 mov rcx, gword ptr [rcx]
394108 cmp dword ptr [rcx+8], eax
7402 je SHORT G_M38324_IG03
EB02 jmp SHORT G_M38324_IG04
G_M38324_IG03:
33C0 xor eax, eax
G_M38324_IG04:
8902 mov dword ptr [rdx], eax
G_M38324_IG05:
C3 retOpened dotnet/coreclr#16079.
* Queue RCE With value types the effect is not so big, because there is still one (manual) check for bounds. For reference types one bounds-check can be saved, so there is a win. In MoveNext the conditional is with if, so the JIT can produce better code. * MoveNext uses temp-Variable for register access instead of memory access * Addressed PR feedback * dotnet/corefx#26087 (review) * dotnet/corefx#26087 (review) Commit migrated from dotnet/corefx@fc7cd1b
Description
By Dequeue the effect on value types is not so big, than for reference types (two array accesses).
This PR is a kind of extension to #17318, similar to #26086
Benchmarks
Notes
Code for benchmarks lives here
Due the use of http://benchmarkdotnet.org the benchmarks were done a couple of times, because some crazy results with perf x2 were reported and this seems too strange. The results shown here are the more realistic ones. Individual results are in the linked repo above.
The changes from this PR never showed a decrease in perf.
Dequeue
TryDequeue
Notes
Enqueue
Didn't find a way to improve perf, besides the effect of better JIT-codegen in MoveNext.
Remained Untouched.
(Try)Peek
Didn't find a way to improve perf, and "RCE-if" does not bring a win, it's getting rather slower, because the code gets bigger. I didn't keep benchmark-records for this case.
Remained untouched.