This repository was archived by the owner on Jan 23, 2023. It is now read-only.
Closed
@@ -5,6 +5,7 @@
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Contracts;
using System.Runtime.CompilerServices;

namespace System.Collections.Immutable
{
@@ -247,7 +248,28 @@ public void Insert(int index, T item)
/// Adds an item to the <see cref="ICollection{T}"/>.
/// </summary>
/// <param name="item">The object to add to the <see cref="ICollection{T}"/>.</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member

How much of the improvement you're showing is due to AggressiveInlining vs due to the changes in the method body? I'm not convinced this should be AggressiveInlining.

Member Author

Benchmark

Notes

| Method | Description |
| --- | --- |
| TweakedAdd | current implementation |
| SplitAdd | this PR |

  • Methods with the _NoInline suffix are attributed with [MethodImpl(MethodImplOptions.NoInlining)].
  • Methods with the _Inline suffix are attributed with [MethodImpl(MethodImplOptions.AggressiveInlining)].
  • Methods without a suffix carry no attribute.
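
To make the naming convention concrete, here is a minimal sketch of how the three variants of one method could be declared. `VariantDemo`, its `List<long>` backing store, and `VariantProgram` are hypothetical stand-ins, not the benchmark's actual code; only the attribute placement reflects the notes above.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for the benchmarked type. The bodies are deliberately
// trivial; only the attribute on each variant matters here.
public sealed class VariantDemo
{
    private readonly List<long> _items = new List<long>();

    public int Count => _items.Count;

    // No suffix: no attribute, so the JIT's own heuristics decide.
    public void SplitAdd(long item) => _items.Add(item);

    // _NoInline suffix: the JIT must keep this as a real call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void SplitAdd_NoInline(long item) => _items.Add(item);

    // _Inline suffix: the JIT is asked to inline this whenever it can.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void SplitAdd_Inline(long item) => _items.Add(item);
}

public static class VariantProgram
{
    public static void Main()
    {
        var demo = new VariantDemo();
        demo.SplitAdd(1);
        demo.SplitAdd_NoInline(2);
        demo.SplitAdd_Inline(3);

        // The attributes change code generation, not behavior.
        Console.WriteLine(demo.Count);
    }
}
```

All three variants are functionally identical, so any difference between them in a benchmark comes from the inlining decision alone.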

Results

BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008384
  [Host]     : .NET Core 2.1.0-preview2-26313-01 (Framework 4.6.26310.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview2-26313-01 (Framework 4.6.26310.01), 64bit RyuJIT

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | --- | --- | --- | --- | --- |
| TweakedAdd_NoInline | 4.532 us | 0.0900 us | 0.1668 us | 2.00 | 0.10 |
| TweakedAdd | 2.272 us | 0.0479 us | 0.0813 us | 1.00 | 0.00 |
| TweakedAdd_Inline | 2.317 us | 0.0464 us | 0.0824 us | 1.02 | 0.05 |
| SplitAdd_NoInline | 2.699 us | 0.0505 us | 0.0473 us | 1.19 | 0.04 |
| SplitAdd | 3.034 us | 0.0601 us | 0.0715 us | 1.34 | 0.05 |
| SplitAdd_Inline | 2.088 us | 0.0416 us | 0.0696 us | 0.92 | 0.04 |

Discussion

SplitAdd

The JIT won't inline SplitAdd; it reports [FAILED: unprofitable inline] Builder:SplitAdd(long):this. That seems strange to me, because the dasm for this method is:

; Assembly listing for method Builder:SplitAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  8,  6.50)     ref  ->  rdi         this class-hnd
;  V01 arg1         [V01,T01] (  5,  3.50)    long  ->  rsi        
;  V02 loc0         [V02,T02] (  6,  4   )     int  ->  rax        
;  V03 loc1         [V03,T03] (  5,  4   )     ref  ->  rdx         class-hnd
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M50053_IG01:

G_M50053_IG02:
       8B4710               mov      eax, dword ptr [rdi+16]
       488B5708             mov      rdx, gword ptr [rdi+8]
       394208               cmp      dword ptr [rdx+8], eax
       760E                 jbe      SHORT G_M50053_IG04
       4863C8               movsxd   rcx, eax
       488974CA10           mov      qword ptr [rdx+8*rcx+16], rsi
       FFC0                 inc      eax
       894710               mov      dword ptr [rdi+16], eax

G_M50053_IG03:
       C3                   ret      

G_M50053_IG04:
       48B8981431A3AC7F0000 mov      rax, 0x7FACA3311498

G_M50053_IG05:
       48FFE0               rex.jmp  rax

; Total bytes of code 39, prolog size 0 for method Builder:SplitAdd(long):this
; ============================================================

Really not much code.

So SplitAdd isn't inlined; why, then, does SplitAdd_NoInline show different numbers in the benchmark? It's because of the different prolog and the rex.jmp (I have to admit that I don't know what rex.jmp is or where it comes from; yes, I could search for it):

; Assembly listing for method Builder:SplitAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  8,  6.50)     ref  ->  rdi         this class-hnd
;  V01 arg1         [V01,T01] (  5,  3.50)    long  ->  rsi        
;  V02 loc0         [V02,T02] (  6,  4   )     int  ->  rax        
;  V03 loc1         [V03,T03] (  5,  4   )     ref  ->  rdx         class-hnd
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 8

G_M50053_IG01:
       50                   push     rax

G_M50053_IG02:
       8B4710               mov      eax, dword ptr [rdi+16]
       488B5708             mov      rdx, gword ptr [rdi+8]
       394208               cmp      dword ptr [rdx+8], eax
       7612                 jbe      SHORT G_M50053_IG04
       4863C8               movsxd   rcx, eax
       488974CA10           mov      qword ptr [rdx+8*rcx+16], rsi
       FFC0                 inc      eax
       894710               mov      dword ptr [rdi+16], eax

G_M50053_IG03:
       4883C408             add      rsp, 8
       C3                   ret      

G_M50053_IG04:
       E8B4F9FFFF           call     Builder:AddWithResize(long):this
       90                   nop      

G_M50053_IG05:
       4883C408             add      rsp, 8
       C3                   ret      

; Total bytes of code 42, prolog size 1 for method Builder:SplitAdd(long):this
; ============================================================

Note: with AggressiveInlining the JIT emits a call and no rex.jmp instruction.

Side note: in #28177 (comment) it may be that there was no AggressiveInlining.

TweakedAdd

The JIT inlines this method by default, although its dasm is much larger than that of SplitAdd (the dasm shown here is from TweakedAdd_NoInline):

; Assembly listing for method Builder:TweakedAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] ( 10, 10   )     ref  ->  rbx         this class-hnd
;  V01 arg1         [V01,T03] (  4,  4   )    long  ->  r14        
;  V02 loc0         [V02,T04] (  4,  4   )     int  ->  r15        
;  V03 tmp0         [V03,T01] (  6, 12   )     ref  ->  rax        
;  V04 tmp1         [V04,T02] (  6, 12   )     int  ->  rdi        
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M41370_IG01:
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       488BDF               mov      rbx, rdi
       4C8BF6               mov      r14, rsi

G_M41370_IG02:
       8B7B10               mov      edi, dword ptr [rbx+16]
       448D7F01             lea      r15d, [rdi+1]
       488BFB               mov      rdi, rbx
       418BF7               mov      esi, r15d
       E893FFFFFF           call     Builder:EnsureCapacity(int):this
       488B4308             mov      rax, gword ptr [rbx+8]
       8B7B10               mov      edi, dword ptr [rbx+16]
       3B7808               cmp      edi, dword ptr [rax+8]
       7312                 jae      SHORT G_M41370_IG04
       4863FF               movsxd   rdi, edi
       4C8974F810           mov      qword ptr [rax+8*rdi+16], r14
       44897B10             mov      dword ptr [rbx+16], r15d

G_M41370_IG03:
       5B                   pop      rbx
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret      

G_M41370_IG04:
       E8B0080F79           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 65, prolog size 5 for method Builder:TweakedAdd(long):this
; ============================================================

Conclusion

The "raw implementation" (comparing the _NoInline variants) of SplitAdd is considerably faster than that of TweakedAdd (2.699 us vs. 4.532 us). Because the JIT won't inline SplitAdd on its own, we could

  • improve the JIT's inlining heuristics
  • force the method to inline (AggressiveInlining)

List.Add uses a similar pattern with a fast path and a cold path, and it is also marked with AggressiveInlining. So I don't see any reason why we shouldn't go with that.
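
For readers who want to try the fast-path/cold-path split in isolation, the following is a minimal, self-contained sketch. `SketchBuilder<T>`, its indexer, `SketchProgram`, the initial capacity of 4, and the doubling policy in `EnsureCapacity` are assumptions made for illustration; only the field names and the shape of `Add` and `AddWithResize` follow the diff under review.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical minimal builder. The growth policy is an assumption,
// not the actual ImmutableArray<T>.Builder implementation.
public sealed class SketchBuilder<T>
{
    private T[] _elements = new T[4];
    private int _count;

    public int Count => _count;

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_count)
            {
                throw new ArgumentOutOfRangeException(nameof(index));
            }
            return _elements[index];
        }
    }

    // Hot path: small enough that forcing it inline is cheap for callers.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(T item)
    {
        int count = _count;
        T[] elements = _elements;

        // The unsigned compare doubles as the bounds check for the store below.
        if ((uint)count < (uint)elements.Length)
        {
            elements[count] = item;
            _count = count + 1;
        }
        else
        {
            AddWithResize(item);
        }
    }

    // Cold path: kept out of line so that inlining Add does not copy the
    // resize logic into every call site.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AddWithResize(T item)
    {
        int newCount = _count + 1;
        EnsureCapacity(newCount);
        _elements[_count] = item;
        _count = newCount;
    }

    private void EnsureCapacity(int capacity)
    {
        if (_elements.Length < capacity)
        {
            // Assumed policy: double the array, or grow to the requested size.
            int newCapacity = Math.Max(_elements.Length * 2, capacity);
            Array.Resize(ref _elements, newCapacity);
        }
    }
}

public static class SketchProgram
{
    public static void Main()
    {
        var builder = new SketchBuilder<long>();
        for (long i = 0; i < 10; i++)
        {
            // With a starting capacity of 4, the 5th and 9th adds take the cold path.
            builder.Add(i);
        }

        Console.WriteLine(builder.Count);
        Console.WriteLine(builder[9]);
    }
}
```

The point of the split is that only the few instructions of the fast path end up at each call site, while the rarely taken resize stays behind a single call.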

You mentioned:

| Method | Description |
| --- | --- |
| TweakedAdd | current implementation |
| SplitAdd | this PR |

But if I look at just those two rows of your results table in the same comment, it looks like "this PR" slows down the scenario. Am I reading it right? If so, I don't know why we would take this PR.

Member Author

These results come from benchmarks meant to answer the question about the influence of aggressive inlining.
So the suffixes, such as _Inline, have to be taken into account when reading the results.

This PR, as implemented, corresponds to the SplitAdd_Inline row, and there is an improvement over TweakedAdd (2.088 us vs. 2.272 us, Scaled 0.92). Not a huge one, but still noticeably faster.

public void Add(T item)
{
    int count = _count;
    T[] elements = _elements;

    // PERF: The uint-casts allow the JIT to eliminate bound-checks.
    // https://github.com/dotnet/coreclr/pull/9773
    if ((uint)count < (uint)elements.Length)
    {
        elements[count] = item;
        _count = count + 1;
    }
    else
    {
        AddWithResize(item);
    }
}

// Specify NoInlining so that we are guaranteed an opportunity to service this method
Member

@AArnott, what did you mean by "guaranteed an opportunity to service this method"? I thought the NoInlining was here to keep the slow path out of the caller, which is itself aggressively inlined, so that it doesn't bloat that method's callers unnecessarily.

I can see it has the effect you say. I thought that aggressive inlining limited our options to change the method later due to ngen. My proposed comment was focused on that.

[MethodImpl(MethodImplOptions.NoInlining)]
private void AddWithResize(T item)
{
    int newCount = _count + 1;
    this.EnsureCapacity(newCount);