This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Queue<T> optimization of (Try)Dequeue #26087

Merged
stephentoub merged 3 commits into dotnet:master from gfoidl:queue-dequeue
Jan 30, 2018

Conversation

@gfoidl gfoidl commented Dec 28, 2017

Description

For Dequeue, the effect on value types is smaller than for reference types, where the array is accessed twice.

This PR is a kind of extension to #17318, similar to #26086

Benchmarks

Notes

Code for benchmarks lives here

The benchmarks use BenchmarkDotNet (http://benchmarkdotnet.org) and were run several times, because some runs reported implausible results, such as a 2x perf gain. The results shown here are the more realistic ones. Individual results are in the linked repo above.
The changes from this PR never showed a decrease in perf.

Dequeue

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|
| Dequeue_Default | 0.9702 ns | 0.1013 ns | 0.1421 ns | 1.00 | 0.00 |
| Dequeue_New | 0.8880 ns | 0.1042 ns | 0.1770 ns | 0.93 | 0.22 |

TryDequeue

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Xeon CPU 2.60GHz, ProcessorCount=2
.NET Core SDK=2.1.3
  [Host]     : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.4 (Framework 4.6.0.0), 64bit RyuJIT

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|
| TryDequeue_Default | 3.990 ns | 0.1666 ns | 0.2545 ns | 1.00 | 0.00 |
| TryDequeue_New | 3.559 ns | 0.1585 ns | 0.2273 ns | 0.90 | 0.08 |

Notes

Enqueue

I didn't find a way to improve perf, beyond the effect of better JIT codegen in MoveNext.
Remained untouched.

(Try)Peek

I didn't find a way to improve perf. The "RCE-if" (a manual check to enable range check elimination) does not bring a win; it is actually slightly slower, because the code gets bigger. I didn't keep benchmark records for this case.
Remained untouched.

With value types the effect is small, because there is still one (manual) bounds check.
For reference types one bounds check can be saved, so there is a win.

In MoveNext the conditional is written with if instead of the ternary operator, so the JIT can produce better code.
index = (tmp == _array.Length) ? 0 : tmp;
// It is tempting to use the remainder operator here but it is actually much slower
// than a simple comparison and a rarely taken branch.
// JIT produces better code than with ternary operator ?:
Member

What's the change here?

@benaadams benaadams Dec 28, 2017

E.g., is keeping the tmp variable better? (Does it keep the value in a register instead of going through the memory location of index?)

Member Author

The if produces better code than the ternary.

I still have to verify the point about the register.

Member Author

The tmp variant uses a register, whereas operating on index directly goes through memory. I'll update the PR; the effect might be minimal, but it's still better 😉

For reference:

Original

000007fe`7b6ec5f0 488d4110        lea     rax,[rcx+10h]
000007fe`7b6ec5f4 8b10            mov     edx,dword ptr [rax]
000007fe`7b6ec5f6 ffc2            inc     edx
000007fe`7b6ec5f8 488b4908        mov     rcx,qword ptr [rcx+8]
000007fe`7b6ec5fc 395108          cmp     dword ptr [rcx+8],edx
000007fe`7b6ec5ff 7402            je      000007fe`7b6ec603
000007fe`7b6ec601 eb02            jmp     000007fe`7b6ec605
000007fe`7b6ec603 33d2            xor     edx,edx
000007fe`7b6ec605 8910            mov     dword ptr [rax],edx
000007fe`7b6ec607 c3              ret

Index only

000007fe`7b71c5f0 488d4110        lea     rax,[rcx+10h]
000007fe`7b71c5f4 ff00            inc     dword ptr [rax]           ; doesn't use register
000007fe`7b71c5f6 8b10            mov     edx,dword ptr [rax]
000007fe`7b71c5f8 488b4908        mov     rcx,qword ptr [rcx+8]
000007fe`7b71c5fc 3b5108          cmp     edx,dword ptr [rcx+8]
000007fe`7b71c5ff 7504            jne     000007fe`7b71c605
000007fe`7b71c601 33d2            xor     edx,edx
000007fe`7b71c603 8910            mov     dword ptr [rax],edx
000007fe`7b71c605 c3              ret

tmp-Variable

000007fe`7b71c5f0 488d4110        lea     rax,[rcx+10h]
000007fe`7b71c5f4 8b10            mov     edx,dword ptr [rax]
000007fe`7b71c5f6 ffc2            inc     edx                       ; uses register
000007fe`7b71c5f8 488b4908        mov     rcx,qword ptr [rcx+8]
000007fe`7b71c5fc 395108          cmp     dword ptr [rcx+8],edx
000007fe`7b71c5ff 7502            jne     000007fe`7b71c603
000007fe`7b71c601 33d2            xor     edx,edx
000007fe`7b71c603 8910            mov     dword ptr [rax],edx
000007fe`7b71c605 c3              ret
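
For readers who want the source-level picture behind the two listings, here is a minimal sketch. The C# for the "Index only" variant is not shown in this thread, so its shape is an assumption inferred from the `inc dword ptr [rax]` instruction. The class name, method names, and the `length` parameter (standing in for `_array.Length`) are hypothetical. The remainder form that the code comment warns against is included only for comparison.

```csharp
// Hypothetical helper class for illustration; not part of Queue<T>.
public static class MoveNextSketch
{
    // "Index only" (assumed shape): increments through the ref,
    // i.e. directly on the memory location behind index.
    public static void IndexOnly(ref int index, int length)
    {
        index++;
        if (index == length) index = 0;
    }

    // "tmp-Variable": works on a local copy and writes back once,
    // which lets the JIT keep the value in a register.
    public static void TmpVariable(ref int index, int length)
    {
        int tmp = index + 1;
        if (tmp == length) tmp = 0;
        index = tmp;
    }

    // Remainder form, shown for comparison only. It yields the same
    // index for 0 <= index < length but needs an integer division.
    public static void Remainder(ref int index, int length)
    {
        index = (index + 1) % length;
    }
}
```

All three forms advance the index identically for valid inputs; they differ only in the machine code the JIT is likely to emit.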

karelz commented Dec 28, 2017

cc @valenis

gfoidl added a commit to gfoidl/Benchmarks that referenced this pull request Dec 28, 2017
{
if (_size == 0)
int head = _head;
T[] array = _array;
Member

Are the array copies here necessary?

It may not be necessary, but the JIT has a bug, and omitting this copy may result in incorrect range check elimination. See the discussion in the similar Stack PR: #26086 (comment)

karelz commented Jan 27, 2018

@safern what are next steps of this PR? Do we take it, or not?

int head = _head;
T[] array = _array;

if (_size == 0 || (uint)head >= (uint)array.Length)
Member

Why is the latter condition necessary?

Member Author

To enable range check elimination on the following array[head]. The real win is on reference types, where this access happens twice.

@stephentoub stephentoub Jan 29, 2018

Isn't it strange that we can do the check in this manner more efficiently than the JIT can? What prevents the JIT from doing its bounds check in a similar manner?

Or, I guess the issue isn't the check itself, but the JIT has additional logic for what happens if the check fails, and since we know it won't fail, we can avoid that extra logic?

The JIT has no way of knowing that head is a valid array index so it generates normal bounds checks for array[head], that's something like if ((uint)head >= (uint)array.Length) throw IndexOutOfRangeException();.

If we do the check manually we can piggyback on the existing throw instead of generating a separate throw IndexOutOfRangeException().

It's a rather creative use of the (uint)head >= (uint)array.Length trick. Not entirely sure it's worth it.
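
As an illustration of the trick described above, here is a minimal sketch. The `BoundsSketch` class and all of its members are hypothetical names, not part of Queue<T>. It shows that a single unsigned comparison is equivalent to the usual two-sided check, and how a manual check can route an out-of-range index to a throw helper that already exists. Whether the JIT then drops its own range check depends on the JIT version.

```csharp
using System;

// Hypothetical helper class for illustration; not part of Queue<T>.
public static class BoundsSketch
{
    // Two-sided check written out explicitly.
    public static bool InRangeTwoSided(int index, int length) =>
        index >= 0 && index < length;

    // Single unsigned comparison: a negative index reinterpreted as uint
    // is at least 2^31, which exceeds any valid array length.
    public static bool InRangeUnsigned(int index, int length) =>
        (uint)index < (uint)length;

    // Manual check that shares one throw site for "empty" and "out of range".
    public static T GetOrThrowEmpty<T>(T[] array, int head, int size)
    {
        if (size == 0 || (uint)head >= (uint)array.Length)
        {
            ThrowForEmpty();
        }
        return array[head];
    }

    private static void ThrowForEmpty() =>
        throw new InvalidOperationException("Queue empty.");
}
```

The trade-off is the one raised in this thread: the condition is harder to read, and the exception reported for a corrupted index would be misleading.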

Member

@gfoidl, does it actually improve throughput, or just decrease the asm size?

Member Author

You are right, I forgot about the change in MoveNext, which also influences the numbers. I'll run a benchmark that focuses on this.

Member Author

BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|
| Dequeue_PR | 0.7148 ns | 0.0234 ns | 0.0195 ns | 1.00 | 0.00 |
| Dequeue_wo_RCE | 0.7514 ns | 0.0791 ns | 0.0740 ns | 1.05 | 0.10 |

Dequeue_PR is the method from this PR.
Dequeue_wo_RCE is:

public T Dequeue()
{
    if (_size == 0)
    {
        ThrowForEmptyQueue();
    }

    T removed = _array[_head];
    if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        _array[_head] = default;
    }
    MoveNext(ref _head);
    _size--;
    _version++;
    return removed;
}

@stephentoub stephentoub Jan 29, 2018

What if you use the code exactly as it is in your PR, and just remove || (uint)head >= (uint)array.Length? That's what I intended to ask about.

The improvement here most likely comes from caching _array and _head in local variables, not from the manual bounds check.

Member Author

> What if you use the code exactly as it is in your PR

Then it is equal in perf (within noise, so I can't say which one is faster).

The disassembly is exactly as @mikedn pointed out in #26087 (comment).

With the manual bounds check

G_M11457_IG02:
       448B7318             mov      r14d, dword ptr [rbx+24]
       4C8B7B08             mov      r15, gword ptr [rbx+8]
       837B2000             cmp      dword ptr [rbx+32], 0
       7474                 je       SHORT G_M11457_IG08
       418B7F08             mov      edi, dword ptr [r15+8]
       413BFE               cmp      edi, r14d
       766B                 jbe      SHORT G_M11457_IG08

; more code 

G_M11457_IG08:
       488BFB               mov      rdi, rbx
       E8D6F6FFFF           call     ...:ThrowForEmptyQueue():this
       CC                   int3    

Without the manual bounds check:

G_M7328_IG02:
       448B7318             mov      r14d, dword ptr [rbx+24]
       4C8B7B08             mov      r15, gword ptr [rbx+8]
       837B2000             cmp      dword ptr [rbx+32], 0
       7471                 je       SHORT G_M7328_IG08

G_M7328_IG03:
       453B7708             cmp      r14d, dword ptr [r15+8]
       7373                 jae      SHORT G_M7328_IG09

; more code

G_M7328_IG08:
       488BFB               mov      rdi, rbx
       E859F7FFFF           call     ...:ThrowForEmptyQueue():this

G_M7328_IG09:
       E88C379878           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

It is nice that the JIT emits only one bounds check for

T removed = array[head];
if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
{
	array[head] = default;
}
G_M7328_IG04:
       488BF8               mov      rdi, rax
       E8222965FF           call     System.Runtime.CompilerServices.RuntimeHelpers:IsReferenceOrContainsReferences():bool
       85C0                 test     eax, eax
       740A                 je       SHORT G_M7328_IG05
       4963C6               movsxd   rax, r14d
       33FF                 xor      rdi, rdi
       49897CC710           mov      gword ptr [r15+8*rax+16], rdi    ; no bounds check 

I didn't know about that, and I have to admit that I didn't investigate this case thoroughly enough, hence the manual check in this PR.

So in conclusion there is no real benefit from this manual check, apart from saving 2 bytes of code. But the code is too big for inlining anyway, so this should not matter.

I will update the PR to remove the manual bounds check.
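
A minimal sketch of the shape this leaves, assuming a simplified fixed-capacity ring buffer: the `RingSketch` type and its members are hypothetical and not the actual Queue<T> source. The fields are still cached in locals, only the emptiness check remains, and the range check on `array[head]` is left to the JIT.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical, simplified ring buffer; not the actual Queue<T> implementation.
public sealed class RingSketch<T>
{
    private readonly T[] _array;
    private int _head;
    private int _tail;
    private int _size;

    public RingSketch(int capacity) => _array = new T[capacity];

    public void Enqueue(T item)
    {
        if (_size == _array.Length) throw new InvalidOperationException("Queue full.");
        _array[_tail] = item;
        MoveNext(ref _tail);
        _size++;
    }

    public T Dequeue()
    {
        // Cache the fields in locals so both array accesses use the same values.
        int head = _head;
        T[] array = _array;

        if (_size == 0)
        {
            ThrowForEmptyQueue();
        }

        T removed = array[head]; // the JIT inserts its own range check here
        if (RuntimeHelpers.IsReferenceOrContainsReferences<T>())
        {
            array[head] = default!; // clear the slot so the item can be collected
        }
        MoveNext(ref _head);
        _size--;
        return removed;
    }

    // Wrap-around with a compare and a rarely taken branch, using a temporary.
    private void MoveNext(ref int index)
    {
        int tmp = index + 1;
        if (tmp == _array.Length) tmp = 0;
        index = tmp;
    }

    private static void ThrowForEmptyQueue() =>
        throw new InvalidOperationException("Queue empty.");
}
```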

int head = _head;
T[] array = _array;

if (_size == 0 || (uint)head >= (uint)array.Length)
Member

Same question; why is the latter condition necessary?

Member Author

Same reason as above.

To enable range check elimination on the following array[head]. The real win is on reference types, where this access happens twice.

Member Author

I'm investigating this case right now.

Member Author

No clear difference is measurable. The asm is quite similar, except for the jumps at the beginning of the method.

With manual range check:

G_M54913_IG02:
       448B7318             mov      r14d, dword ptr [rbx+24]
       4C8B7B08             mov      r15, gword ptr [rbx+8]
       837B2000             cmp      dword ptr [rbx+32], 0
       7409                 je       SHORT G_M54913_IG03
       418B4708             mov      eax, dword ptr [r15+8]
       413BC6               cmp      eax, r14d
       7710                 ja       SHORT G_M54913_IG05

G_M54913_IG03:
       33C0                 xor      rax, rax
       488906               mov      qword ptr [rsi], rax

G_M54913_IG04:
       488D65E8             lea      rsp, [rbp-18H]
       5B                   pop      rbx
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      

G_M54913_IG05:
       488975D8             mov      bword ptr [rbp-28H], rsi
       4963FE               movsxd   rdi, r14d
       498B74FF10           mov      rsi, gword ptr [r15+8*rdi+16]
       488B7DD8             mov      rdi, bword ptr [rbp-28H]
       E8CF54AD78           call     CORINFO_HELP_CHECKED_ASSIGN_REF
      
; more code (equal)

Without manual range check:

G_M50722_IG02:
       448B7318             mov      r14d, dword ptr [rbx+24]
       4C8B7B08             mov      r15, gword ptr [rbx+8]
       837B2000             cmp      dword ptr [rbx+32], 0
       7510                 jne      SHORT G_M50722_IG04
       33C0                 xor      rax, rax
       488906               mov      qword ptr [rsi], rax

G_M50722_IG03:
       488D65E8             lea      rsp, [rbp-18H]
       5B                   pop      rbx
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      

G_M50722_IG04:
       488975D8             mov      bword ptr [rbp-28H], rsi
       418B7F08             mov      edi, dword ptr [r15+8]
       443BF7               cmp      r14d, edi
       7374                 jae      SHORT G_M50722_IG09
       4963FE               movsxd   rdi, r14d
       498B74FF10           mov      rsi, gword ptr [r15+8*rdi+16]
       488B7DD8             mov      rdi, bword ptr [rbp-28H]
       E8EF53AD78           call     CORINFO_HELP_CHECKED_ASSIGN_REF

; more code (equal)

G_M50722_IG09:
       E896B79678           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3   

So I'll also go with the cleaner code and remove the manual range check?

Member

> So I'll also go with the cleaner code and remove the manual range check?

Yes please. Thanks.

{
tmp = 0;
}
index = tmp;
Member

@AndyAyersMS, is there a good reason for the difference in codegen here? It's unfortunate if the more concise / arguably simpler form results in worse code. (That said, I'm only basing the "JIT produces better code" on the comment.)

Member Author

Just for reference:

public class Program
{
    private static int[] _array = new int[3];

    static Program() {}

    public static void Main(string[] args)
    {
        int index = 3;
        MoveNext1(ref index);
        MoveNext2(ref index);
    }

    private static void MoveNext1(ref int index)
    {
        int tmp = index + 1;
        index = (tmp == _array.Length) ? 0 : tmp;
    }

    private static void MoveNext2(ref int index)
    {
        int tmp = index + 1;
        if (tmp == _array.Length) tmp = 0;

        index = tmp;
    }
}
; Assembly listing for method ConsoleApplication.Program:MoveNext1(byref)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->  rdi        
;  V01 loc0         [V01,T01] (  3,  2.50)     int  ->  rax        
;  V02 tmp0         [V02,T02] (  3,  2   )   byref  ->  rdi        
;  V03 tmp1         [V03,T03] (  3,  2   )   byref  ->  rdi        
;  V04 tmp2         [V04,T04] (  3,  2   )     int  ->  rax        
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M9976_IG01:
       55                   push     rbp
       488BEC               mov      rbp, rsp

G_M9976_IG02:
       8B07                 mov      eax, dword ptr [rdi]
       FFC0                 inc      eax
       48BEB8070040367F0000 mov      rsi, 0x7F36400007B8
       488B36               mov      rsi, gword ptr [rsi]
       394608               cmp      dword ptr [rsi+8], eax
       7402                 je       SHORT G_M9976_IG03
       EB02                 jmp      SHORT G_M9976_IG04

G_M9976_IG03:
       33C0                 xor      eax, eax

G_M9976_IG04:
       8907                 mov      dword ptr [rdi], eax

G_M9976_IG05:
       5D                   pop      rbp
       C3                   ret      

; Total bytes of code 34, prolog size 4 for method ConsoleApplication.Program:MoveNext1(byref)
; ============================================================
; Assembly listing for method ConsoleApplication.Program:MoveNext2(byref)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->  rdi        
;  V01 loc0         [V01,T01] (  4,  3.50)     int  ->  rax        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M9979_IG01:

G_M9979_IG02:
       8B07                 mov      eax, dword ptr [rdi]
       FFC0                 inc      eax
       48BEB8070040367F0000 mov      rsi, 0x7F36400007B8
       488B36               mov      rsi, gword ptr [rsi]
       394608               cmp      dword ptr [rsi+8], eax
       7502                 jne      SHORT G_M9979_IG03
       33C0                 xor      eax, eax

G_M9979_IG03:
       8907                 mov      dword ptr [rdi], eax

G_M9979_IG04:
       C3                   ret      

; Total bytes of code 27, prolog size 0 for method ConsoleApplication.Program:MoveNext2(byref)
; ============================================================

Member

Thanks for noticing this. Looks like (based on the above) there is a pattern that the jit's flow optimizer doesn't spot that leads to a bit of extra branching.

On Windows the codegen for MoveNext1 has an extra mov thrown in just because:

G_M38324_IG02:
       8B01                 mov      eax, dword ptr [rcx]
       FFC0                 inc      eax
       488BD1               mov      rdx, rcx
       48B9E028009012020000 mov      rcx, 0x212900028E0
       488B09               mov      rcx, gword ptr [rcx]
       394108               cmp      dword ptr [rcx+8], eax
       7402                 je       SHORT G_M38324_IG03
       EB02                 jmp      SHORT G_M38324_IG04

G_M38324_IG03:
       33C0                 xor      eax, eax

G_M38324_IG04:
       8902                 mov      dword ptr [rdx], eax

G_M38324_IG05:
       C3                   ret

Opened dotnet/coreclr#16079.

@stephentoub stephentoub left a comment

Thanks

@stephentoub stephentoub merged commit fc7cd1b into dotnet:master Jan 30, 2018
@gfoidl gfoidl deleted the queue-dequeue branch January 30, 2018 10:02
@karelz karelz added this to the 2.1.0 milestone Feb 4, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Queue RCE

With value types the effect is small, because there is still one (manual) bounds check.
For reference types one bounds check can be saved, so there is a win.

In MoveNext the conditional is written with if instead of the ternary operator, so the JIT can produce better code.

* MoveNext uses temp-Variable for register access instead of memory access

* Addressed PR feedback

* dotnet/corefx#26087 (review)
* dotnet/corefx#26087 (review)


Commit migrated from dotnet/corefx@fc7cd1b