ImmutableArray<T>.Builder.Add split into fast and cold path #28184
gfoidl wants to merge 3 commits into
Conversation
@dotnet-bot test Windows x86 Release Build

cc: @stephentoub
/// Adds an item to the <see cref="ICollection{T}"/>.
/// </summary>
/// <param name="item">The object to add to the <see cref="ICollection{T}"/>.</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
How much of the improvement you're showing is due to AggressiveInlining vs due to the changes in the method body? I'm not convinced this should be AggressiveInlining.
Benchmark
Notes
| Method | Description |
|---|---|
| TweakedAdd | current implementation |
| SplitAdd | this PR |
_NoInline methods are attributed with [MethodImpl(MethodImplOptions.NoInlining)].
_Inline methods are attributed with [MethodImpl(MethodImplOptions.AggressiveInlining)].
Methods without a _Xxx suffix carry no attributes.
Results
BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008384
[Host] : .NET Core 2.1.0-preview2-26313-01 (Framework 4.6.26310.01), 64bit RyuJIT
DefaultJob : .NET Core 2.1.0-preview2-26313-01 (Framework 4.6.26310.01), 64bit RyuJIT
| Method | Mean | Error | StdDev | Scaled | ScaledSD |
|---|---|---|---|---|---|
| TweakedAdd_NoInline | 4.532 us | 0.0900 us | 0.1668 us | 2.00 | 0.10 |
| TweakedAdd | 2.272 us | 0.0479 us | 0.0813 us | 1.00 | 0.00 |
| TweakedAdd_Inline | 2.317 us | 0.0464 us | 0.0824 us | 1.02 | 0.05 |
| SplitAdd_NoInline | 2.699 us | 0.0505 us | 0.0473 us | 1.19 | 0.04 |
| SplitAdd | 3.034 us | 0.0601 us | 0.0715 us | 1.34 | 0.05 |
| SplitAdd_Inline | 2.088 us | 0.0416 us | 0.0696 us | 0.92 | 0.04 |
Discussion
SplitAdd
The JIT won't inline SplitAdd due to [FAILED: unprofitable inline] Builder:SplitAdd(long):this, which seems strange to me because the dasm for this method is:
; Assembly listing for method Builder:SplitAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 8, 6.50) ref -> rdi this class-hnd
; V01 arg1 [V01,T01] ( 5, 3.50) long -> rsi
; V02 loc0 [V02,T02] ( 6, 4 ) int -> rax
; V03 loc1 [V03,T03] ( 5, 4 ) ref -> rdx class-hnd
;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M50053_IG01:
G_M50053_IG02:
8B4710 mov eax, dword ptr [rdi+16]
488B5708 mov rdx, gword ptr [rdi+8]
394208 cmp dword ptr [rdx+8], eax
760E jbe SHORT G_M50053_IG04
4863C8 movsxd rcx, eax
488974CA10 mov qword ptr [rdx+8*rcx+16], rsi
FFC0 inc eax
894710 mov dword ptr [rdi+16], eax
G_M50053_IG03:
C3 ret
G_M50053_IG04:
48B8981431A3AC7F0000 mov rax, 0x7FACA3311498
G_M50053_IG05:
48FFE0 rex.jmp rax
; Total bytes of code 39, prolog size 0 for method Builder:SplitAdd(long):this
; ============================================================
Really not much code.
So SplitAdd isn't inlined; then why does SplitAdd_NoInline from the benchmark show different numbers? It's because of the different prolog and the rex.jmp (although I have to admit that I don't know what rex.jmp is (yeah, I could search for it) or where it comes from):
; Assembly listing for method Builder:SplitAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 8, 6.50) ref -> rdi this class-hnd
; V01 arg1 [V01,T01] ( 5, 3.50) long -> rsi
; V02 loc0 [V02,T02] ( 6, 4 ) int -> rax
; V03 loc1 [V03,T03] ( 5, 4 ) ref -> rdx class-hnd
;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 8
G_M50053_IG01:
50 push rax
G_M50053_IG02:
8B4710 mov eax, dword ptr [rdi+16]
488B5708 mov rdx, gword ptr [rdi+8]
394208 cmp dword ptr [rdx+8], eax
7612 jbe SHORT G_M50053_IG04
4863C8 movsxd rcx, eax
488974CA10 mov qword ptr [rdx+8*rcx+16], rsi
FFC0 inc eax
894710 mov dword ptr [rdi+16], eax
G_M50053_IG03:
4883C408 add rsp, 8
C3 ret
G_M50053_IG04:
E8B4F9FFFF call Builder:AddWithResize(long):this
90 nop
G_M50053_IG05:
4883C408 add rsp, 8
C3 ret
; Total bytes of code 42, prolog size 1 for method Builder:SplitAdd(long):this
; ============================================================
Note: with AggressiveInlining the JIT emits a call and no rex.jmp instruction.
Side note: in #28177 (comment) it may be that there was no AggressiveInlining.
TweakedAdd
The JIT will inline this method by default, although the dasm is much larger than that of SplitAdd (the dasm shown here is from TweakedAdd_NoInline):
; Assembly listing for method Builder:TweakedAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 this [V00,T00] ( 10, 10 ) ref -> rbx this class-hnd
; V01 arg1 [V01,T03] ( 4, 4 ) long -> r14
; V02 loc0 [V02,T04] ( 4, 4 ) int -> r15
; V03 tmp0 [V03,T01] ( 6, 12 ) ref -> rax
; V04 tmp1 [V04,T02] ( 6, 12 ) int -> rdi
;# V05 OutArgs [V05 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M41370_IG01:
4157 push r15
4156 push r14
53 push rbx
488BDF mov rbx, rdi
4C8BF6 mov r14, rsi
G_M41370_IG02:
8B7B10 mov edi, dword ptr [rbx+16]
448D7F01 lea r15d, [rdi+1]
488BFB mov rdi, rbx
418BF7 mov esi, r15d
E893FFFFFF call Builder:EnsureCapacity(int):this
488B4308 mov rax, gword ptr [rbx+8]
8B7B10 mov edi, dword ptr [rbx+16]
3B7808 cmp edi, dword ptr [rax+8]
7312 jae SHORT G_M41370_IG04
4863FF movsxd rdi, edi
4C8974F810 mov qword ptr [rax+8*rdi+16], r14
44897B10 mov dword ptr [rbx+16], r15d
G_M41370_IG03:
5B pop rbx
415E pop r14
415F pop r15
C3 ret
G_M41370_IG04:
E8B0080F79 call CORINFO_HELP_RNGCHKFAIL
CC int3
; Total bytes of code 65, prolog size 5 for method Builder:TweakedAdd(long):this
; ============================================================
Conclusion
The "raw implementation" (comparing the NoInline variants) of SplitAdd is way faster than TweakedAdd. Because the JIT won't inline SplitAdd, we could
- improve the JIT's inlining heuristics
- force the method to inline (AggressiveInlining)
List<T>.Add uses the same fast-path/cold-path pattern, and it also uses AggressiveInlining. So I don't see any reason why we shouldn't go with that.
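The fast-path/cold-path split under discussion can be sketched roughly as follows. This is a minimal stand-in, not the actual ImmutableArray<T>.Builder source: the class name SimpleBuilder, the initial capacity, and the doubling growth policy are illustrative assumptions.

```csharp
using System;
using System.Runtime.CompilerServices;

// Minimal stand-in illustrating the fast-path/cold-path split.
public sealed class SimpleBuilder<T>
{
    private T[] _elements = new T[4];
    private int _count;

    public int Count => _count;

    // Fast path: store without resizing. Small enough that
    // aggressive inlining into callers stays cheap.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(T item)
    {
        int count = _count;
        T[] elements = _elements;

        // The uint casts let the JIT elide the bounds check on
        // elements[count]: one unsigned compare proves the index valid.
        if ((uint)count < (uint)elements.Length)
        {
            elements[count] = item;
            _count = count + 1;
        }
        else
        {
            AddWithResize(item);
        }
    }

    // Cold path: grow, then store. NoInlining keeps this uncommon
    // code out of callers that inline Add.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AddWithResize(T item)
    {
        int count = _count;
        Array.Resize(ref _elements, _elements.Length * 2);
        _elements[count] = item;
        _count = count + 1;
    }
}
```

Keeping Add this small is what makes AggressiveInlining reasonable; the resize logic, which would otherwise bloat every inlined call site, lives behind the NoInlining cold path.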
You mentioned:
| Method | Description |
|---|---|
| TweakedAdd | current implementation |
| SplitAdd | this PR |
But if I look at just those two rows of your results table in the same comment, it looks like "this PR" slows down the scenario. Am I reading it right? If so, I don't know why we would take this PR.
These results are for benchmarks to answer the question about the influence of aggressive inlining.
So the suffixes like _Inline have to be taken into account when reading the results.
This PR, as implemented, corresponds to the SplitAdd_Inline row, and there is an improvement. Not a huge one, but still noticeably faster.
AArnott
left a comment
I'm hesitant to recommend accepting this PR. It's not clear that there's an improvement (that may just be a misunderstanding), but even if it is faster, is the perf increase important and significant enough to warrant the loss of servicing we'll take for the Add method?
int count = _count;
T[] elements = _elements;

if ((uint)count < (uint)elements.Length)
With the uint casts the JIT is able to eliminate the bounds check on elements[count]. Cf. dotnet/coreclr#9773
Can you add a code comment to that effect? That sounds significant, but it's not at all clear; I'd be afraid someone would just remove the casts and never know they eliminated the optimization.
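The cast folds two checks into one: reinterpreted as uint, a negative count becomes a value larger than any valid array length, so a single unsigned comparison covers both count >= 0 and count < elements.Length. A small standalone illustration (the names here are made up, not the PR's code):

```csharp
using System;

public static class BoundsCheckDemo
{
    // One unsigned compare replaces "index >= 0 && index < length":
    // (uint)(-1) == 4294967295, which fails the comparison as well.
    public static bool InRange(int index, int length) =>
        (uint)index < (uint)length;

    public static void Main()
    {
        Console.WriteLine(InRange(3, 10));   // True
        Console.WriteLine(InRange(-1, 10));  // False
        Console.WriteLine(InRange(10, 10));  // False
    }
}
```

When the JIT sees this shape guarding elements[count], it can prove the subsequent access in range and drop its own bounds check (cf. dotnet/coreclr#9773).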
    }
}

// Improve code quality as uncommon path
I don't understand what this code comment means. Is this a TODO item that we need to improve the code quality? What does "as uncommon path" have to do with it?
The AddWithResize path is called from Add and is on the cold / uncommon path. The comment is there to explain why NoInlining is added. Similar to dotnet/coreclr#9539 (comment)
Do you have a better text for the comment, to make it instantly clear what is meant?
How about:
// Specify NoInlining so that we are guaranteed an opportunity to service this method

/// Adds an item to the <see cref="ICollection{T}"/>.
/// </summary>
/// <param name="item">The object to add to the <see cref="ICollection{T}"/>.</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
You mentioned:
| Method | Description |
|---|---|
| TweakedAdd | current implementation |
| SplitAdd | this PR |
But if I look at just those two rows of your results table in the same comment, it looks like "this PR" slows down the scenario. Am I reading it right? If so, I don't know why we would take this PR.
It is. Please see #28184 (comment) and the PR description.
I can't judge this, but splitting into a fast path and a cold path is quite a common technique, employed in several collection types.
AArnott
left a comment
Thanks. I'm satisfied with the changes and the research you've done.
The perf improvement still seems small to me. It would be interesting to see the perf improvement that came from dotnet/coreclr#9539 (the change in List<T>), but I don't see any measurement comparisons (just an asm comparison) on that PR. (So thanks for including measurements in yours!)
I suspect @stephentoub is best equipped to make the call on the perf benefit being worthwhile. But the code change looks good.
    }
}

// Improve code quality as uncommon path
How about:
// Specify NoInlining so that we are guaranteed an opportunity to service this method

    }
}

// Specify NoInlining so that we are guaranteed an opportunity to service this method
@AArnott, what did you mean by "guaranteed an opportunity to service this method"? I thought the NoInlining was there to avoid including the slow path when the caller is aggressively inlined, bloating that caller unnecessarily.
I can see it has the effect you say. I thought that aggressive inlining limited our options to change the method later due to ngen. My proposed comment was focused on that.
@AArnott's comments raise a good question: The perf testing done here is on .NET Core, with the latest JIT. But this library is used on many previous versions of .NET Framework, right? @gfoidl, I assume you've not validated that this is actually an improvement across all of those? You asked why this library might be special when we've made similar changes to core collection types elsewhere in coreclr/corefx... to me, that's why.
I validated just for .NET Core, under the assumption that this code is used only by .NET Core and may eventually be backported to .NET Full. As I see now, this assumption is incorrect. For .NET Full I get these numbers:
BenchmarkDotNet=v0.10.11, OS=Windows 10.0.17134
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742190 Hz, Resolution=364.6720 ns, Timer=TSC
[Host] : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3110.0
DefaultJob : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3110.0
BenchmarkDotNet=v0.10.11, OS=Windows 10.0.17134
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742190 Hz, Resolution=364.6720 ns, Timer=TSC
.NET Core SDK=2.1.300
[Host] : .NET Core 2.1.0 (Framework 4.6.26515.07), 64bit RyuJIT
Clr : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3110.0
Core : .NET Core 2.1.0 (Framework 4.6.26515.07), 64bit RyuJIT
So on .NET Full it's not really an improvement; rather it stays on par (within noise). Dasm for .NET Full and .NET Core is equal.
Only the loop is shown. In total the dasm is similar.
; netcoreapp2.1
; ...
00007FFCB2011939 mov ecx,dword ptr [rsi+10h]
00007FFCB201193C mov rdx,qword ptr [rsi+8]
00007FFCB2011940 mov eax,dword ptr [rdx+8]
00007FFCB2011943 cmp eax,ecx
00007FFCB2011945 jbe 00007FFCB2011956
00007FFCB2011947 movsxd rax,ecx
00007FFCB201194A mov qword ptr [rdx+rax*8+10h],rdi
00007FFCB201194F inc ecx
00007FFCB2011951 mov dword ptr [rsi+10h],ecx
00007FFCB2011954 jmp 00007FFCB2011961
00007FFCB2011956 mov rcx,rsi
00007FFCB2011959 mov rdx,rdi
00007FFCB201195C call 00007FFCB2011310
00007FFCB2011961 inc rdi
00007FFCB2011964 mov rcx,qword ptr [rsi+8]
00007FFCB2011968 mov ecx,dword ptr [rcx+8]
00007FFCB201196B movsxd rcx,ecx
00007FFCB201196E cmp rcx,rdi
00007FFCB2011971 jg 00007FFCB2011939
; ...
;------------------------------------------------------------------------------
; net471
; ...
00007FFCDE5C0919 mov ecx,dword ptr [rsi+10h]
00007FFCDE5C091C mov rdx,qword ptr [rsi+8]
00007FFCDE5C0920 mov eax,dword ptr [rdx+8]
00007FFCDE5C0923 cmp eax,ecx
00007FFCDE5C0925 jbe 00007FFCDE5C0936
00007FFCDE5C0927 movsxd rax,ecx
00007FFCDE5C092A mov qword ptr [rdx+rax*8+10h],rdi
00007FFCDE5C092F inc ecx
00007FFCDE5C0931 mov dword ptr [rsi+10h],ecx
00007FFCDE5C0934 jmp 00007FFCDE5C0941
00007FFCDE5C0936 mov rcx,rsi
00007FFCDE5C0939 mov rdx,rdi
00007FFCDE5C093C call 00007FFCDE5C02F0
00007FFCDE5C0941 inc rdi
00007FFCDE5C0944 mov rcx,qword ptr [rsi+8]
00007FFCDE5C0948 mov ecx,dword ptr [rcx+8]
00007FFCDE5C094B movsxd rcx,ecx
00007FFCDE5C094E cmp rcx,rdi
00007FFCDE5C0951 jg 00007FFCDE5C0919
; ...
Didn't test other targets.
Is this indicated by --, or how can I see on which targets a library will be used (in order to take this into account next time)? So what should we do?
Thanks, @gfoidl. The potential improvement here is small, and this kind of refactoring complicates the code; special-casing it per target would require a lot more than just ifdef'ing (we'd need to ship multiple binaries, deal with the packaging, etc.). I appreciate your efforts here, but I think we should just leave it as-is.
Description
Based on #28177 (comment)
Add is split into a fast path without resizing the array, and a cold path that does the resize. On the fast path the bounds check for the array access is also eliminated.
Benchmarks
Code for benchmarks is taken from svick
SimpleAdd is the original code, i.e. before #28177
TweakedAdd is the code of #28177
SplitAdd is the code of this PR.
win-x64
linux-x64
linux-x64 (different CPU)
Notes
In #28177 (comment) @svick reports that this change decreases perf on his machine. That's why I tested on three different machines, and all show a perf improvement. List.Add, Stack.Push and similar classes use this pattern and all show an improvement.