Prefer forward branches to decision tree targets #11619
|
Note that the slowdowns reported by Bartosz in the tweet were more dramatic than those shown here. I'm going to work on this a little more. The codegen in the second case is still not perfect, see the corresponding C# codegen here: |
|
I did some further work and the code for As expected a whole bunch of baselines need updating, making a list of the first round of them here for reference |
|
This is a test of single calls. Results under a nanosecond should be discarded (1 | 3), since a method call alone can cost up to 4 cycles. For these cases I will construct a better test: since the JIT prefers early exits on conditions, we can create a long condition chain and force it to execute a set number of instructions before we start measuring performance.

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

[DisassemblyDiagnoser]
[HardwareCounters(
    HardwareCounter.BranchMispredictions,
    HardwareCounter.BranchInstructions)]
public class Bench
{
    Random rnd = new Random();

    [Benchmark]
    [Arguments(1)]
    [Arguments(2)]
    [Arguments(3)]
    [Arguments(4)]
    [Arguments(5)]
    [Arguments(6)]
    [Arguments(7)]
    public int CSharp(int x)
    {
        return Cond(x);
    }

    [Benchmark]
    [Arguments(1)]
    [Arguments(2)]
    [Arguments(3)]
    [Arguments(4)]
    [Arguments(5)]
    [Arguments(6)]
    [Arguments(7)]
    public int FSharp(int x)
    {
        return FSharpCond.Bench.condition_1(x);
    }

    //
    // FSharp will not inline the code so we shouldn't either.
    //
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Cond(int x)
    {
        if (x == 1 || x == 2) return 1;
        else if (x == 3 || x == 4) return 2;
        else if (x == 5 || x == 6) return 3;
        else return 4;
    }
}

namespace FSharpCond
module Bench =
    let condition_1 x =
        if (x = 1 || x = 2) then 1
        elif (x = 3 || x = 4) then 2
        elif (x = 5 || x = 6) then 3
        else 4

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19042.985 (20H2/October2020Update)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
DefaultJob : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
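As a rough sanity check on the sub-nanosecond cutoff mentioned above, a cycle count can be converted to time at the CPU's clock rate. The sketch below assumes the 2.60 GHz base clock from the environment header (the real clock under turbo boost may be higher), and `cyclesToNanoseconds` is a hypothetical helper name, not part of the benchmark.

```fsharp
/// Convert a cycle count to nanoseconds at a given clock rate in GHz.
/// One GHz is one cycle per nanosecond, so time = cycles / GHz.
let cyclesToNanoseconds (clockGHz: float) (cycles: float) : float =
    cycles / clockGHz

// Assumption: 2.60 GHz base clock taken from the header above; turbo is ignored.
let callOverheadNs = cyclesToNanoseconds 2.60 4.0

printfn "4 cycles at 2.60 GHz is about %.2f ns" callOverheadNs
```

At that clock rate four cycles come to roughly 1.5 ns, so a reading below 1 ns is smaller than the call overhead it is supposed to include.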
|
|
This is a randomized test in which the branch predictor has a hard time guessing the correct branch; since backward branches are usually reserved for loops, on my CPU it gets more of the backward branches wrong.

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

[DisassemblyDiagnoser]
[HardwareCounters(
    HardwareCounter.BranchMispredictions,
    HardwareCounter.BranchInstructions)]
public class Bench3_Random
{
    Random rnd = new Random(12345678);
    const int size = 1024 * 1024;
    private int[] s = new int[size];

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 0; i < size; i++)
        {
            s[i] = rnd.Next(0, 8);
        }
    }

    [Benchmark]
    public void CSharp()
    {
        var _s = s;
        for (int i = 0; i < _s.Length; i++)
            Cond(_s[i]);
    }

    [Benchmark]
    public void FSharp()
    {
        var _s = s;
        for (int i = 0; i < _s.Length; i++)
            FSharpCond.Bench.condition_1(_s[i]);
    }

    //
    // FSharp will not inline the code so we shouldn't either.
    //
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Cond(int x)
    {
        if (x == 1 || x == 2) return 1;
        else if (x == 3 || x == 4) return 2;
        else if (x == 5 || x == 6) return 3;
        else return 4;
    }
}

namespace FSharpCond
module Bench =
    let condition_1 x =
        if (x = 1 || x = 2) then 1
        elif (x = 3 || x = 4) then 2
        elif (x = 5 || x = 6) then 3
        else 4

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19042.985 (20H2/October2020Update)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
DefaultJob : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
|
|
This bench shows what really happens when we are dealing with a mix of forward and backward branches: the performance alternates between identical to C# and slower than C#. This is because, to reach the condition for x = 11, we first have to cross the backward branches. On most Intel CPUs, if the predictor has not yet seen a branch, it predicts forward branches as not taken and backward branches as taken. New AMD CPUs assume that unseen branches are not taken, but since they use a deep dynamic predictor, it fares extremely poorly with a small number of alternating branches: https://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf

This doesn't explain why the difference still shows up with a fully initialized branch predictor that guesses correctly 100% of the time, but it might be that dynamic predictors, and CPUs in general, don't like mixing branch directions.

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

[DisassemblyDiagnoser]
[HardwareCounters(
    HardwareCounter.BranchMispredictions,
    HardwareCounter.BranchInstructions)]
public class Bench2
{
    [Benchmark]
    [Arguments(10)]
    [Arguments(11)]
    [Arguments(12)]
    [Arguments(13)]
    [Arguments(14)]
    [Arguments(15)]
    public int CSharp(int x)
    {
        return Cond(x);
    }

    [Benchmark]
    [Arguments(10)]
    [Arguments(11)]
    [Arguments(12)]
    [Arguments(13)]
    [Arguments(14)]
    [Arguments(15)]
    public int FSharp(int x)
    {
        return FSharpCond.Bench.condition_2(x);
    }

    //
    // FSharp will not inline the code so we shouldn't either.
    //
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Cond(int x)
    {
        if (x == 1 || x == 2) return 1;
        else if (x == 3 || x == 4) return 2;
        else if (x == 5 || x == 6) return 3;
        else if (x == 7 || x == 8) return 4;
        else if (x == 9 || x == 10) return 5;
        else if (x == 11 || x == 12) return 6;
        else if (x == 13 || x == 14) return 7;
        else return 8;
    }
}

let condition_2 x =
    if (x = 1 || x = 2) then 1
    elif (x = 3 || x = 4) then 2
    elif (x = 5 || x = 6) then 3
    elif (x = 7 || x = 8) then 4
    elif (x = 9 || x = 10) then 5
    elif (x = 11 || x = 12) then 6
    elif (x = 13 || x = 14) then 7
    else 8

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19042.985 (20H2/October2020Update)
Intel Core i7-6700HQ CPU 2.60GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
DefaultJob : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
.NET 5.0.0 (5.0.20.51904), X64 RyuJIT
; FSharpCondBench.Bench2.CSharp(Int32)
mov ecx,edx
jmp near ptr FSharpCondBench.Bench2.Cond(Int32)
; Total bytes of code 7
; FSharpCondBench.Bench2.Cond(Int32)
cmp ecx,1
je short M01_L00
cmp ecx,2
jne short M01_L01
M01_L00:
mov eax,1
ret
M01_L01:
cmp ecx,3
je short M01_L02
cmp ecx,4
jne short M01_L03
M01_L02:
mov eax,2
ret
M01_L03:
cmp ecx,5
je short M01_L04
cmp ecx,6
jne short M01_L05
M01_L04:
mov eax,3
ret
M01_L05:
cmp ecx,7
je short M01_L06
cmp ecx,8
jne short M01_L07
M01_L06:
mov eax,4
jmp short M01_L14
M01_L07:
cmp ecx,9
je short M01_L08
cmp ecx,0A
jne short M01_L09
M01_L08:
mov eax,5
jmp short M01_L14
M01_L09:
cmp ecx,0B
je short M01_L10
cmp ecx,0C
jne short M01_L11
M01_L10:
mov eax,6
jmp short M01_L14
M01_L11:
cmp ecx,0D
je short M01_L12
cmp ecx,0E
jne short M01_L13
M01_L12:
mov eax,7
jmp short M01_L14
M01_L13:
mov eax,8
M01_L14:
ret
; Total bytes of code 122
.NET 5.0.0 (5.0.20.51904), X64 RyuJIT
; FSharpCondBench.Bench2.FSharp(Int32)
mov ecx,edx
jmp near ptr FSharpCond.Bench.condition_2(Int32)
; Total bytes of code 7
; FSharpCond.Bench.condition_2(Int32)
cmp ecx,1
jne short M01_L01
M01_L00:
mov eax,1
ret
M01_L01:
cmp ecx,2
je short M01_L00
cmp ecx,3
jne short M01_L03
M01_L02:
mov eax,2
ret
M01_L03:
cmp ecx,4
je short M01_L02
cmp ecx,5
jne short M01_L05
M01_L04:
mov eax,3
ret
M01_L05:
cmp ecx,6
je short M01_L04
cmp ecx,7
jne short M01_L07
M01_L06:
mov eax,4
jmp short M01_L14
M01_L07:
cmp ecx,8
je short M01_L06
cmp ecx,9
jne short M01_L09
M01_L08:
mov eax,5
jmp short M01_L14
M01_L09:
cmp ecx,0A
je short M01_L08
cmp ecx,0B
jne short M01_L11
M01_L10:
mov eax,6
jmp short M01_L14
M01_L11:
cmp ecx,0C
je short M01_L10
cmp ecx,0D
jne short M01_L13
M01_L12:
mov eax,7
jmp short M01_L14
M01_L13:
cmp ecx,0E
je short M01_L12
mov eax,8
M01_L14:
ret
; Total bytes of code 122 |
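The heuristic described in this comment (an unseen forward branch is predicted not taken, an unseen backward branch is predicted taken) can be illustrated with a toy model. The sketch below is not a simulation of any real CPU; it only counts how many conditional jumps in each of the two layouts from the listings above would be guessed wrong by that static rule on a first, cold pass. All names (`Branch`, `forwardOnly`, `mixed`, `coldMispredictions`) are hypothetical.

```fsharp
type Direction = Forward | Backward

/// One conditional jump as executed for a given input.
type Branch = { Direction: Direction; Taken: bool }

/// Static rule assumed here: backward jumps are predicted taken,
/// forward jumps are predicted not taken.
let mispredicted (b: Branch) =
    let predictedTaken = (b.Direction = Backward)
    predictedTaken <> b.Taken

/// C#-style layout, every jump is forward:
///   cmp x,lo / je body      cmp x,hi / jne next
let forwardOnly (pairs: int) (x: int) : Branch list =
    let rec go k acc =
        if k > pairs then List.rev acc
        else
            let lo, hi = 2 * k - 1, 2 * k
            let je = { Direction = Forward; Taken = (x = lo) }
            if je.Taken then List.rev (je :: acc)
            else
                let jne = { Direction = Forward; Taken = (x <> hi) }
                if jne.Taken then go (k + 1) (jne :: je :: acc)
                else List.rev (jne :: je :: acc)
    go 1 []

/// Old F#-style layout, forward and backward jumps alternate:
///   cmp x,lo / jne next     next: cmp x,hi / je body (backward)
let mixed (pairs: int) (x: int) : Branch list =
    let rec go k acc =
        if k > pairs then List.rev acc
        else
            let lo, hi = 2 * k - 1, 2 * k
            let jne = { Direction = Forward; Taken = (x <> lo) }
            if not jne.Taken then List.rev (jne :: acc)
            else
                let je = { Direction = Backward; Taken = (x = hi) }
                if je.Taken then List.rev (je :: jne :: acc)
                else go (k + 1) (je :: jne :: acc)
    go 1 []

/// Number of jumps the static rule gets wrong for the 7-pair chain.
let coldMispredictions (layout: int -> int -> Branch list) (x: int) =
    layout 7 x |> List.filter mispredicted |> List.length

for x in 10 .. 15 do
    printfn "x = %d: forward-only %d, mixed %d"
        x (coldMispredictions forwardOnly x) (coldMispredictions mixed x)
```

Under these assumptions the mixed layout gets 10 cold mispredictions for x = 11 against 6 for the forward-only layout. As the comment itself notes, a cold-pass model like this does not account for the difference that remains once the dynamic predictor is trained.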
|
@badamczewski Just to check, are those results on this branch, or main, or a release version in the .NET SDK? Also what's the high level summary of the results please :) |
|
@dsyme, this is the release version of .NET 5. The high-level summary is that there's a performance gain when conditions are composed of forward branches on Intel CPUs (Skylake, Haswell) and most likely on AMD CPUs as well 🙂 |
|
Code-wise this is ready (once green). We should add a BenchmarkDotNet perf suite project to test this sort of thing out |
|
This would be a good place to add it, likely a new project to the solution: https://github.com/dotnet/fsharp/tree/main/tests/benchmarks |
|
@dsyme this kind of branch organization is considered good practice. That being said, it would be great to test this on an AMD Ryzen CPU as well. |
|
I added a benchmark. Here are the perf results on my (old) Xeon processor. The new perf results simply make the F# identical to the C# so are uninteresting to list, you can tell the differences below. OLD: |
|
This is now ready. We can integrate it, simply on the basis that we now generate the same code as C# for To do that (this is for Windows, step 1 will need adjustment on Linux)
thanks |
|
NEW OLD |
|
@dsyme, @badamczewski, here you go, a 2019 Ryzen CPU. It looks a bit wild, but I wasn't really using the PC at the time.
Old:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19043
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21258.4
[Host] : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT DEBUG
DefaultJob : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
New:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19043
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21258.4
[Host] : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT DEBUG
DefaultJob : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
New, run 2:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19043
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21258.4
[Host] : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT DEBUG
DefaultJob : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
If anyone wants to run this with a preview of VS, the correct tool path is |
|
@dsyme more failures :) |
|
@dominikprzywara @kerams Could you also run the benchmark under this post: Ryzen CPUs are crazy fast and they need to execute more instructions for the results to be more stable, and thus the test in the list will force the code to run more instructions per test without changing the branching code in any way: |
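The benchmark referred to in this comment is not reproduced here. Purely as a sketch of the general idea (more instructions per measured invocation, same branching code), one could wrap the condition in a fixed-count loop. `runMany` and the iteration count are hypothetical, and `condition_2` is copied from the earlier comment in this thread.

```fsharp
open System.Runtime.CompilerServices

// Same decision chain as condition_2 earlier in the thread. NoInlining keeps
// the call, and therefore its branch layout, intact inside the loop below.
[<MethodImpl(MethodImplOptions.NoInlining)>]
let condition_2 x =
    if (x = 1 || x = 2) then 1
    elif (x = 3 || x = 4) then 2
    elif (x = 5 || x = 6) then 3
    elif (x = 7 || x = 8) then 4
    elif (x = 9 || x = 10) then 5
    elif (x = 11 || x = 12) then 6
    elif (x = 13 || x = 14) then 7
    else 8

/// Hypothetical helper: run the chain `iterations` times for one input and
/// return the sum, so that the calls cannot be treated as dead code.
let runMany (iterations: int) (x: int) : int =
    let mutable acc = 0
    for i = 1 to iterations do
        acc <- acc + condition_2 x
    acc

printfn "%d" (runMany 1000 11)
```

A benchmark method would then call `runMany` with a fixed iteration count, and the reported time would be divided by that count to get a per-call figure.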
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19043
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21258.4
[Host] : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT DEBUG
DefaultJob : .NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
Looks good to me. The generated assembly is below.
.NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
; TaskPerf.Benchmarks.CSharp(Int32)
mov ecx,edx
jmp near ptr MicroPerfCSharp.Cond(Int32)
; Total bytes of code 7
; MicroPerfCSharp.Cond(Int32)
cmp ecx,1
je short M01_L00
cmp ecx,2
jne short M01_L01
M01_L00:
mov eax,1
ret
M01_L01:
cmp ecx,3
je short M01_L02
cmp ecx,4
jne short M01_L03
M01_L02:
mov eax,2
ret
M01_L03:
cmp ecx,5
je short M01_L04
cmp ecx,6
jne short M01_L05
M01_L04:
mov eax,3
ret
M01_L05:
cmp ecx,7
je short M01_L06
cmp ecx,8
jne short M01_L07
M01_L06:
mov eax,4
jmp short M01_L14
M01_L07:
cmp ecx,9
je short M01_L08
cmp ecx,0A
jne short M01_L09
M01_L08:
mov eax,5
jmp short M01_L14
M01_L09:
cmp ecx,0B
je short M01_L10
cmp ecx,0C
jne short M01_L11
M01_L10:
mov eax,6
jmp short M01_L14
M01_L11:
cmp ecx,0D
je short M01_L12
cmp ecx,0E
jne short M01_L13
M01_L12:
mov eax,7
jmp short M01_L14
M01_L13:
mov eax,8
M01_L14:
ret
; Total bytes of code 122
.NET Core 5.0.5 (CoreCLR 5.0.521.16609, CoreFX 5.0.521.16609), X64 RyuJIT
; TaskPerf.Benchmarks.FSharp(Int32)
mov ecx,edx
jmp near ptr TaskPerf.Code.condition_2(Int32)
; Total bytes of code 7
; TaskPerf.Code.condition_2(Int32)
cmp ecx,1
je short M01_L00
cmp ecx,2
jne short M01_L01
M01_L00:
mov eax,1
ret
M01_L01:
cmp ecx,3
je short M01_L02
cmp ecx,4
jne short M01_L03
M01_L02:
mov eax,2
ret
M01_L03:
cmp ecx,5
je short M01_L04
cmp ecx,6
jne short M01_L05
M01_L04:
mov eax,3
ret
M01_L05:
cmp ecx,7
je short M01_L06
cmp ecx,8
jne short M01_L07
M01_L06:
mov eax,4
jmp short M01_L14
M01_L07:
cmp ecx,9
je short M01_L08
cmp ecx,0A
jne short M01_L09
M01_L08:
mov eax,5
jmp short M01_L14
M01_L09:
cmp ecx,0B
je short M01_L10
cmp ecx,0C
jne short M01_L11
M01_L10:
mov eax,6
jmp short M01_L14
M01_L11:
cmp ecx,0D
je short M01_L12
cmp ecx,0E
jne short M01_L13
M01_L12:
mov eax,7
jmp short M01_L14
M01_L13:
mov eax,8
M01_L14:
ret
; Total bytes of code 122 |
Should be fixed now. |
|
Sorry for the delay :)
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 9 5900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.104
[Host] : .NET Core 5.0.6 (CoreCLR 5.0.621.22011, CoreFX 5.0.621.22011), X64 RyuJIT DEBUG
DefaultJob : .NET Core 5.0.6 (CoreCLR 5.0.621.22011, CoreFX 5.0.621.22011), X64 RyuJIT
ASM:
.NET Core 5.0.6 (CoreCLR 5.0.621.22011, CoreFX 5.0.621.22011), X64 RyuJIT
; TaskPerf.Benchmarks.CSharp(Int32)
mov ecx,edx
jmp near ptr MicroPerfCSharp.Cond(Int32)
; Total bytes of code 7
; MicroPerfCSharp.Cond(Int32)
cmp ecx,1
je short M01_L00
cmp ecx,2
jne short M01_L01
M01_L00:
mov eax,1
ret
M01_L01:
cmp ecx,3
je short M01_L02
cmp ecx,4
jne short M01_L03
M01_L02:
mov eax,2
ret
M01_L03:
cmp ecx,5
je short M01_L04
cmp ecx,6
jne short M01_L05
M01_L04:
mov eax,3
ret
M01_L05:
cmp ecx,7
je short M01_L06
cmp ecx,8
jne short M01_L07
M01_L06:
mov eax,4
jmp short M01_L14
M01_L07:
cmp ecx,9
je short M01_L08
cmp ecx,0A
jne short M01_L09
M01_L08:
mov eax,5
jmp short M01_L14
M01_L09:
cmp ecx,0B
je short M01_L10
cmp ecx,0C
jne short M01_L11
M01_L10:
mov eax,6
jmp short M01_L14
M01_L11:
cmp ecx,0D
je short M01_L12
cmp ecx,0E
jne short M01_L13
M01_L12:
mov eax,7
jmp short M01_L14
M01_L13:
mov eax,8
M01_L14:
ret
; Total bytes of code 122
.NET Core 5.0.6 (CoreCLR 5.0.621.22011, CoreFX 5.0.621.22011), X64 RyuJIT
; TaskPerf.Benchmarks.FSharp(Int32)
mov ecx,edx
jmp near ptr TaskPerf.Code.condition_2(Int32)
; Total bytes of code 7
; TaskPerf.Code.condition_2(Int32)
cmp ecx,1
je short M01_L00
cmp ecx,2
jne short M01_L01
M01_L00:
mov eax,1
ret
M01_L01:
cmp ecx,3
je short M01_L02
cmp ecx,4
jne short M01_L03
M01_L02:
mov eax,2
ret
M01_L03:
cmp ecx,5
je short M01_L04
cmp ecx,6
jne short M01_L05
M01_L04:
mov eax,3
ret
M01_L05:
cmp ecx,7
je short M01_L06
cmp ecx,8
jne short M01_L07
M01_L06:
mov eax,4
jmp short M01_L14
M01_L07:
cmp ecx,9
je short M01_L08
cmp ecx,0A
jne short M01_L09
M01_L08:
mov eax,5
jmp short M01_L14
M01_L09:
cmp ecx,0B
je short M01_L10
cmp ecx,0C
jne short M01_L11
M01_L10:
mov eax,6
jmp short M01_L14
M01_L11:
cmp ecx,0D
je short M01_L12
cmp ecx,0E
jne short M01_L13
M01_L12:
mov eax,7
jmp short M01_L14
M01_L13:
mov eax,8
M01_L14:
ret
; Total bytes of code 122 |
TIHan
left a comment
The change is pretty short; the majority of the changes are baseline file updates.
Awesome.
This micro-perf analysis by Bartosz Adamczewski (@badamczewski01) revealed that we are emitting backward branches for match targets reachable via multiple paths in the decision tree, e.g.
This also applies to some instances of boolean logic, as shown here: https://twitter.com/badamczewski01/status/1399460065254547458
In the first example we emit <target-code> when first encountered (Pattern1); the code emitted for a successful Pattern2 match then branched backwards to that target code. The backward branch evidently has a perf cost for branch prediction and instruction decoding on typical microprocessors. The second example is similar.
This PR changes codegen to emit the target code only at the last pattern that matches, not the first.
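For illustration only, a match of the following shape has targets that are reachable via more than one path in the decision tree. This is a hypothetical sketch, not the example from the original description: the name `classify` is an assumption, and the constants are modeled on the comparisons visible in the disassembly posted in this thread.

```fsharp
// Hypothetical illustration; `classify` is not taken from the PR.
// Each or-pattern (e.g. `1 | 2`) gives one target two paths in the
// decision tree: one via the first constant, one via the second.
let classify (x: int) : int =
    match x with
    | 1 | 2 -> 1
    | 3 | 4 -> 2
    | 5 | 6 -> 3
    | 7 | 8 -> 4
    | 9 | 10 -> 5
    | 11 | 12 -> 6
    | 13 | 14 -> 7
    | _ -> 8

// Quick sanity check of the mapping.
[ 1; 2; 7; 14; 15 ] |> List.map classify |> printfn "%A"
```

Under the change described above, the target for `1 | 2` would be placed at the last matching pattern (`2`), so the test for `1` can reach it with a forward branch rather than a backward one.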
NOTE: I had to reduce the size limit for one of the "large record" tests from 1000 to 970 for .NET Core due to some change in stack usage in the compiler. I don't think it indicates anything wrong, though I don't know the specific reason why stack usage increased. In general, all the EmittedIL tests show improvement in generated code, removing unnecessary branches etc. One of the other neighbouring tests was already disabled, and I re-enabled that at size 970.
I did a quick perf test using ad hoc timing techniques following the code in the tweet:
main branch compiler:
this PR:
It is true the n=1 and n=3 cases have degraded in this quick test, though the results now match what C# does (the generated code is now the same), and overall they have become more uniform, with the larger variance due to the backward branches avoided. I think it's the right change to make given we're now matching C#.
Note that the slowdowns reported by Bartosz in the tweet were more dramatic than those shown here. I think that's because my home machine is using an old Xeon processor (yes, I need a new one).
Update baselines
Verify the expected perf improvement in this branch.
BEFORE:
AFTER (updated!)