ImmutableArray<T>.Builder.Add splitted in fast- and cold-path #28184

stephentoub · 2018-03-21T02:34:30Z

How much of the improvement you're showing is due to AggressiveInlining vs due to the changes in the method body? I'm not convinced this should be AggressiveInlining.

Benchmark

Notes

| Method | Description |
| --- | --- |
| TweakedAdd | current implementation |
| SplitAdd | this PR |

_NoInline methods are attributed with [MethodImpl(MethodImplOptions.NoInlining)].
_Inline methods are attributed with [MethodImpl(MethodImplOptions.AggressiveInlining)].
Methods without a _Xxx suffix have no attributes.
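
To make the naming convention concrete, here is a minimal sketch of the three attribute variants. The `Counter` type and its `Increment*` methods are hypothetical and only illustrate the suffixes; they are not the benchmark code from this PR.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical type, used only to show the three attribute variants.
public sealed class Counter
{
    private int _value;

    public int Value => _value;

    // No attribute: the JIT's own heuristics decide whether to inline.
    public void Increment() => _value++;

    // _NoInline: the JIT must not inline this method into its callers.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Increment_NoInline() => _value++;

    // _Inline: the JIT is asked to inline this method if possible.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Increment_Inline() => _value++;
}
```

All three variants behave identically; only the inlining decision, and therefore the measured call overhead, can differ.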

Results

BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008384
  [Host]     : .NET Core 2.1.0-preview2-26313-01 (Framework 4.6.26310.01), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview2-26313-01 (Framework 4.6.26310.01), 64bit RyuJIT

| Method | Mean | Error | StdDev | Scaled | ScaledSD |
| --- | ---: | ---: | ---: | ---: | ---: |
| TweakedAdd_NoInline | 4.532 us | 0.0900 us | 0.1668 us | 2.00 | 0.10 |
| TweakedAdd | 2.272 us | 0.0479 us | 0.0813 us | 1.00 | 0.00 |
| TweakedAdd_Inline | 2.317 us | 0.0464 us | 0.0824 us | 1.02 | 0.05 |
| SplitAdd_NoInline | 2.699 us | 0.0505 us | 0.0473 us | 1.19 | 0.04 |
| SplitAdd | 3.034 us | 0.0601 us | 0.0715 us | 1.34 | 0.05 |
| SplitAdd_Inline | 2.088 us | 0.0416 us | 0.0696 us | 0.92 | 0.04 |

Discussion

SplitAdd

The JIT won't inline SplitAdd, reporting [FAILED: unprofitable inline] Builder:SplitAdd(long):this. That seems strange to me, because the dasm for this method is:

; Assembly listing for method Builder:SplitAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  8,  6.50)     ref  ->  rdi         this class-hnd
;  V01 arg1         [V01,T01] (  5,  3.50)    long  ->  rsi
;  V02 loc0         [V02,T02] (  6,  4   )     int  ->  rax
;  V03 loc1         [V03,T03] (  5,  4   )     ref  ->  rdx         class-hnd
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M50053_IG01:
G_M50053_IG02:
       8B4710               mov      eax, dword ptr [rdi+16]
       488B5708             mov      rdx, gword ptr [rdi+8]
       394208               cmp      dword ptr [rdx+8], eax
       760E                 jbe      SHORT G_M50053_IG04
       4863C8               movsxd   rcx, eax
       488974CA10           mov      qword ptr [rdx+8*rcx+16], rsi
       FFC0                 inc      eax
       894710               mov      dword ptr [rdi+16], eax
G_M50053_IG03:
       C3                   ret
G_M50053_IG04:
       48B8981431A3AC7F0000 mov      rax, 0x7FACA3311498
G_M50053_IG05:
       48FFE0               rex.jmp  rax
; Total bytes of code 39, prolog size 0 for method Builder:SplitAdd(long):this
; ============================================================

Really not much code.

So SplitAdd isn't inlined. Then why does SplitAdd_NoInline show different numbers in the benchmark? It's because of the different prolog and the rex.jmp (although I have to admit that I don't know what rex.jmp is or where it comes from; yeah, I could search for it):

; Assembly listing for method Builder:SplitAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] (  8,  6.50)     ref  ->  rdi         this class-hnd
;  V01 arg1         [V01,T01] (  5,  3.50)    long  ->  rsi
;  V02 loc0         [V02,T02] (  6,  4   )     int  ->  rax
;  V03 loc1         [V03,T03] (  5,  4   )     ref  ->  rdx         class-hnd
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 8
G_M50053_IG01:
       50                   push     rax
G_M50053_IG02:
       8B4710               mov      eax, dword ptr [rdi+16]
       488B5708             mov      rdx, gword ptr [rdi+8]
       394208               cmp      dword ptr [rdx+8], eax
       7612                 jbe      SHORT G_M50053_IG04
       4863C8               movsxd   rcx, eax
       488974CA10           mov      qword ptr [rdx+8*rcx+16], rsi
       FFC0                 inc      eax
       894710               mov      dword ptr [rdi+16], eax
G_M50053_IG03:
       4883C408             add      rsp, 8
       C3                   ret
G_M50053_IG04:
       E8B4F9FFFF           call     Builder:AddWithResize(long):this
       90                   nop
G_M50053_IG05:
       4883C408             add      rsp, 8
       C3                   ret
; Total bytes of code 42, prolog size 1 for method Builder:SplitAdd(long):this
; ============================================================

Note: with AggressiveInlining the JIT emits a call and no rex.jmp instruction.

Side note: in #28177 (comment) it may be that there was no AggressiveInlining.

TweakedAdd

The JIT will inline this method by default, although its dasm is much larger than that of SplitAdd (the dasm shown here is from TweakedAdd_NoInline):

; Assembly listing for method Builder:TweakedAdd(long):this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this         [V00,T00] ( 10, 10   )     ref  ->  rbx         this class-hnd
;  V01 arg1         [V01,T03] (  4,  4   )    long  ->  r14
;  V02 loc0         [V02,T04] (  4,  4   )     int  ->  r15
;  V03 tmp0         [V03,T01] (  6, 12   )     ref  ->  rax
;  V04 tmp1         [V04,T02] (  6, 12   )     int  ->  rdi
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M41370_IG01:
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       488BDF               mov      rbx, rdi
       4C8BF6               mov      r14, rsi
G_M41370_IG02:
       8B7B10               mov      edi, dword ptr [rbx+16]
       448D7F01             lea      r15d, [rdi+1]
       488BFB               mov      rdi, rbx
       418BF7               mov      esi, r15d
       E893FFFFFF           call     Builder:EnsureCapacity(int):this
       488B4308             mov      rax, gword ptr [rbx+8]
       8B7B10               mov      edi, dword ptr [rbx+16]
       3B7808               cmp      edi, dword ptr [rax+8]
       7312                 jae      SHORT G_M41370_IG04
       4863FF               movsxd   rdi, edi
       4C8974F810           mov      qword ptr [rax+8*rdi+16], r14
       44897B10             mov      dword ptr [rbx+16], r15d
G_M41370_IG03:
       5B                   pop      rbx
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret
G_M41370_IG04:
       E8B0080F79           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3
; Total bytes of code 65, prolog size 5 for method Builder:TweakedAdd(long):this
; ============================================================

Conclusion

The "raw implementation" of SplitAdd (comparing the NoInline variants) is much faster than that of TweakedAdd: 2.699 us versus 4.532 us. Because the JIT won't inline SplitAdd on its own, we could

- improve the JIT's inlining heuristics
- force the method to inline (AggressiveInlining)

List<T>.Add uses a similar pattern with a fast path and a cold path, and it is also marked AggressiveInlining. So I don't see any reason why we shouldn't go with that.
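
For readers unfamiliar with the fast-path/cold-path pattern discussed here, the following is a minimal sketch. `SketchBuilder<T>`, its fields, the initial capacity of 4, and the doubling growth policy are assumptions made for illustration; this is not the source of ImmutableArray<T>.Builder or of List<T>.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a builder type; illustration only.
public sealed class SketchBuilder<T>
{
    private T[] _elements = new T[4]; // assumed initial capacity
    private int _count;

    public int Count => _count;

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_count) throw new IndexOutOfRangeException();
            return _elements[index];
        }
    }

    // Fast path: one compare, one store, one increment.
    // Small enough that forcing it into callers costs little code size.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Add(T item)
    {
        T[] elements = _elements;
        int count = _count;
        if ((uint)count < (uint)elements.Length)
        {
            elements[count] = item;
            _count = count + 1;
        }
        else
        {
            AddWithResize(item);
        }
    }

    // Cold path: kept out of line so the resize logic is not
    // copied into every call site that inlines Add.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AddWithResize(T item)
    {
        int count = _count;
        Array.Resize(ref _elements, Math.Max(4, count * 2)); // assumed growth policy
        _elements[count] = item;
        _count = count + 1;
    }
}
```

The unsigned comparison against `elements.Length` doubles as the array bounds check, which is why the fast path can stay as short as the SplitAdd listing suggests.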

You mentioned:

| Method | Description |
| --- | --- |
| TweakedAdd | current implementation |
| SplitAdd | this PR |

But if I look at just those two rows of your results table in the same comment, it looks like "this PR" slows down the scenario. Am I reading it right? If so, I don't know why we would take this PR.

These results come from benchmarks meant to answer the question about the influence of aggressive inlining.
So the suffixes like _Inline have to be taken into account when reading the results.

This PR, as implemented, corresponds to the SplitAdd_Inline row, and there is an improvement. Not a huge one, but still noticeably faster.

stephentoub · 2018-06-22T02:23:47Z

@AArnott, what did you mean by "guaranteed an opportunity to service this method"? I thought the NoInlining was here to avoid including the slow path as part of the caller getting aggressively inlined and bloating its caller unnecessarily.

I can see it has the effect you say. I thought that aggressive inlining limited our options to change the method later due to ngen. My proposed comment was focused on that.

stephentoub Mar 21, 2018

gfoidl Mar 21, 2018

AArnott Jun 20, 2018

gfoidl Jun 21, 2018

stephentoub Jun 22, 2018

AArnott Jun 22, 2018
