Do not mark op2 as delayRegFree if op1==op2 #53964

kunalspathak · 2021-06-10T01:00:50Z

If op1 == op2, do not mark op2 as delayFree so we can reuse the register.

kunalspathak · 2021-06-10T01:01:17Z

With COMPlus_EnableAVX=0, here are the improvements:

Benchmarks.run.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3694
Total bytes of diff: 3625
Total bytes of delta: -69 (-1.87% of base)
Total relative delta: -0.72
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -12 : 21279.dasm (-3.45% of base)
          -6 : 13679.dasm (-9.84% of base)
          -6 : 13670.dasm (-9.09% of base)
          -6 : 26511.dasm (-7.69% of base)
          -6 : 26107.dasm (-7.32% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -12 (-3.45% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this
          -6 (-7.32% of base) : 26107.dasm - System.Numerics.Tests.Perf_Vector4:DistanceBenchmark():float:this

Top method improvements (percentages):
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-9.68% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
          -6 (-9.23% of base) : 12839.dasm - System.Numerics.Tests.Perf_Vector4:LengthBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this

11 total methods with Code Size differences (11 improved, 0 regressed), 0 unchanged.


Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 63325.46
Total PerfScoreUnits of diff: 63294.56
Total PerfScoreUnits of delta: -30.90 (-0.05% of base)
Total relative delta: -0.26
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (PerfScoreUnits):
      -18.60 : 59.dasm (-0.03% of base)
       -2.20 : 21279.dasm (-0.84% of base)
       -1.30 : 60.dasm (-0.83% of base)
       -1.10 : 13679.dasm (-3.91% of base)
       -1.10 : 13670.dasm (-2.93% of base)

11 total files with Perf Score differences (11 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
      -18.60 (-0.03% of base) : 59.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
       -2.20 (-0.84% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
       -1.30 (-0.83% of base) : 60.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
       -1.10 (-3.91% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
       -1.10 (-2.93% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this

Top method improvements (percentages):
       -1.10 (-4.20% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
       -1.10 (-3.91% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
       -1.10 (-3.14% of base) : 27223.dasm - System.Numerics.Tests.Perf_Vector2:DistanceSquaredBenchmark():float:this
       -1.10 (-2.93% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
       -1.10 (-2.84% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this

11 total methods with Perf Score differences (11 improved, 0 regressed), 0 unchanged.

Coreclr_tests.pmi.windows.x64.checked.1


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3742
Total bytes of diff: 3613
Total bytes of delta: -129 (-3.45% of base)
Total relative delta: -1.58
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -30 : 216439.dasm (-3.94% of base)
         -12 : 241647.dasm (-3.10% of base)
          -9 : 216440.dasm (-3.41% of base)
          -9 : 216442.dasm (-3.35% of base)
          -9 : 216441.dasm (-1.65% of base)

15 total files with Code Size differences (15 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -30 (-3.94% of base) : 216439.dasm - IntelHardwareIntrinsicTest:init()
         -12 (-3.10% of base) : 241647.dasm - VectorInitTest:VectorInit(float):int
          -9 (-3.41% of base) : 216440.dasm - IntelHardwareIntrinsicTest:F1_v128(float):System.Runtime.Intrinsics.Vector128`1[Single]
          -9 (-3.35% of base) : 216442.dasm - IntelHardwareIntrinsicTest:F1_v128i(int):System.Runtime.Intrinsics.Vector128`1[Int16]
          -9 (-1.65% of base) : 216441.dasm - IntelHardwareIntrinsicTest:F2_v128(float):System.Runtime.Intrinsics.Vector128`1[Single]

Top method improvements (percentages):
          -6 (-37.50% of base) : 240233.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeAs():float:this
          -6 (-37.50% of base) : 240235.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeReadUnaligned():float:this
          -6 (-24.00% of base) : 240234.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeRead():float:this
          -6 (-16.22% of base) : 240232.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredManualVectors():float:this
          -6 (-7.06% of base) : 240231.dasm - UnsafeTesting.Program:LengthSquaredManualVectors():float

15 total methods with Code Size differences (15 improved, 0 regressed), 0 unchanged.


Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 1528.75
Total PerfScoreUnits of diff: 1503.61
Total PerfScoreUnits of delta: -25.14 (-1.64% of base)
Total relative delta: -0.40
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (PerfScoreUnits):
       -7.00 : 216439.dasm (-1.62% of base)
       -2.20 : 241647.dasm (-1.47% of base)
       -1.65 : 216440.dasm (-1.23% of base)
       -1.65 : 216442.dasm (-1.51% of base)
       -1.30 : 5.dasm (-0.83% of base)

15 total files with Perf Score differences (15 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
       -7.00 (-1.62% of base) : 216439.dasm - IntelHardwareIntrinsicTest:init()
       -2.20 (-1.47% of base) : 241647.dasm - VectorInitTest:VectorInit(float):int
       -1.65 (-1.23% of base) : 216440.dasm - IntelHardwareIntrinsicTest:F1_v128(float):System.Runtime.Intrinsics.Vector128`1[Single]
       -1.65 (-1.51% of base) : 216442.dasm - IntelHardwareIntrinsicTest:F1_v128i(int):System.Runtime.Intrinsics.Vector128`1[Int16]
       -1.30 (-0.83% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long

Top method improvements (percentages):
       -1.10 (-5.76% of base) : 240233.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeAs():float:this
       -1.10 (-5.76% of base) : 240235.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeReadUnaligned():float:this
       -1.10 (-4.94% of base) : 240234.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeRead():float:this
       -1.10 (-3.90% of base) : 240232.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredManualVectors():float:this
       -1.10 (-2.97% of base) : 240231.dasm - UnsafeTesting.Program:LengthSquaredManualVectors():float

15 total methods with Perf Score differences (15 improved, 0 regressed), 0 unchanged.

Libraries.pmi.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1042
Total bytes of diff: 1021
Total bytes of delta: -21 (-2.02% of base)
Total relative delta: -0.41
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
          -9 : 16072.dasm (-20.93% of base)
          -6 : 16073.dasm (-18.75% of base)
          -3 : 17524.dasm (-0.62% of base)
          -3 : 5.dasm (-0.62% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long

Top method improvements (percentages):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

4 total methods with Code Size differences (4 improved, 0 regressed), 0 unchanged.


Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 368.32
Total PerfScoreUnits of diff: 362.97
Total PerfScoreUnits of delta: -5.35 (-1.45% of base)
Total relative delta: -0.12
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (PerfScoreUnits):
       -1.65 : 16072.dasm (-6.27% of base)
       -1.30 : 17524.dasm (-0.83% of base)
       -1.30 : 5.dasm (-0.83% of base)
       -1.10 : 16073.dasm (-3.97% of base)

4 total files with Perf Score differences (4 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
       -1.65 (-6.27% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
       -1.30 (-0.83% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long
       -1.30 (-0.83% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
       -1.10 (-3.97% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int

Top method improvements (percentages):
       -1.65 (-6.27% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
       -1.10 (-3.97% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
       -1.30 (-0.83% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
       -1.30 (-0.83% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

4 total methods with Perf Score differences (4 improved, 0 regressed), 0 unchanged.

kunalspathak · 2021-06-10T01:01:41Z

@dotnet/jit-contrib

sandreenko

LGTM

src/coreclr/jit/gentree.cpp

tannergooding · 2021-06-10T13:31:53Z

src/coreclr/jit/lsraxarch.cpp

-
-                        srcCount += BuildDelayFreeUses(op2);
+                        // Unless, op1 and op2 are same, in which case we can overwrite op2.
+                        if (GenTree::NodesAreEquivalentLeaves(op1, op2))


Are there other RMW locations setting BuildDelayFreeUses that could/should also be updated? Is this already handled on the scalar operation (that is int + int or float - float) code path, for example?

I'd imagine, for example, the check a few lines down dealing with making BuildDelayFreeUses(op3) would also benefit from this check: https://github.com/dotnet/runtime/pull/53964/files#diff-9626112837daf480c93d401a587012b3e398dd90a953c870797d472cff36839dR2510

Maybe the right fix is to call this overload of BuildDelayFreeUses:

runtime/src/coreclr/jit/lsrabuild.cpp

Line 3059 in c2c92c4

int LinearScan::BuildDelayFreeUses(GenTree* node, GenTree* rmwNode, regMaskTP candidates)

It looks to already do some checks on rmwNode vs the node being set as delayFree and might be a natural place to add this NodesAreEquivalentLeaves check, if it's generally applicable?

Edit: It looks like it's not an overload, just one method where rmwNode defaults to nullptr and where many places (at least for HWIntrinsics) aren't passing in the rmwNode, likely because it (the rmwNode parameter) was added ~6 months back: #45135

> Are there other RMW locations setting BuildDelayFreeUses that could/should also be updated?

I am not super familiar with the RMW semantics. I can sync up offline to understand which other scenarios could benefit.

> and where many places (at least for HWIntrinsics) aren't passing in the rmwNode,

That's right, in most places we pass rmwNode as nullptr, and I am not 100% sure of the impact of refactoring it so that the NodesAreEquivalentLeaves() check fits inside it. For this bug, I would keep it where it is currently.

> a few lines down dealing with making BuildDelayFreeUses(op3) would also benefit from this

Can you give an example of what the RMW semantics are for this, and which two nodes I should be checking for equivalence?

> Is this already handled on the scalar operation (that is int + int or float - float) code path, for example?

Again, could you give an example for this?

> I am not super familiar with the RMW semantics.

RMW (Read-Modify-Write) just means one of the sources is also the destination and so is "destructive" (the value isn't preserved).

This means that if the value needs to be preserved (generally because src1 is not "last use") that you need an additional mov to copy the value to the destination:

; Operation is logically `dst = src1 + src2`
mov dst, src1
add dst, src2

Because of this, it also means that dst != src2 since, if it was, the move would overwrite src2 and change the logical operation. BuildDelayFreeUses exists to cover this scenario by ensuring that dst != src2:

; If `dst == src2 && src1 != src2`, operation becomes `dst = src1 + src1` and so is problematic
mov dst, src1
add dst, src2

All of this works great when src1 is last use and when src1 != src2 because this means the register allocator can make dst == src1 and we can elide the copy generating just:

; Operation is logically `src1 += src2`
add src1, src2

However, in certain cases such as when src1 == src2, setting src2 to be "delay free" is problematic because this forces the register allocator to configure it as dst != src2, but since src1 == src2 this also means the register allocator configures dst != src1, thus forcing us to generate a move. In practice however, it is safe for this exact scenario to not be delay free because we would never overwrite either input.
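The cases described above can be checked with a toy model. This is only a sketch under stated assumptions: `RegFile`, `emitMov`, `emitAdd`, and `genRmwAdd` are hypothetical names invented for illustration, register numbers are plain array indices, and none of this is JIT code.

```cpp
#include <array>
#include <cassert>

// Hypothetical toy register file; indices stand in for physical registers.
constexpr int kNumRegs = 4;
using RegFile = std::array<int, kNumRegs>;

inline bool isValidReg(int r) { return r >= 0 && r < kNumRegs; }

// Models `mov dst, src`.
inline void emitMov(RegFile& regs, int dst, int src) { regs[dst] = regs[src]; }

// Models the destructive two-operand `add dst, src`: dst = dst + src.
inline void emitAdd(RegFile& regs, int dst, int src) { regs[dst] += regs[src]; }

// Computes `dst = src1 + src2` the way RMW codegen would: copy src1 into dst
// unless they already share a register, then add src2 in place.
// Returns the number of instructions emitted.
inline int genRmwAdd(RegFile& regs, int dst, int src1, int src2)
{
    assert(isValidReg(dst) && isValidReg(src1) && isValidReg(src2));
    int count = 0;
    if (dst != src1)
    {
        emitMov(regs, dst, src1); // clobbers src2 when dst == src2
        ++count;
    }
    emitAdd(regs, dst, src2);
    ++count;
    return count;
}
```

With registers holding `{5, 7, 0, 0}`, `genRmwAdd(regs, 2, 0, 1)` takes two instructions and leaves 12 in register 2. Choosing `dst == src2` with `src1 != src2` (`genRmwAdd(regs, 1, 0, 1)`) yields 10 instead of 12, which is the hazard delay free guards against. When `src1 == src2`, reusing the register (`genRmwAdd(regs, 0, 0, 0)`) needs a single `add`, whereas forcing a different destination costs the extra `mov`.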

The PR you are providing here "fixes" the issue by avoiding us setting "delay free" when rmwOp == delayFreeOp (this is safe since it means we won't ever overwrite delayFreeOp). This "should be" applicable to all RMW setups, whether it's specifically this SIMD example or even other examples like integer additions, and so I'd think that we want to make the support for this "broader" so we can always generate the more efficient code.

As far as I can tell, the fix you have here should work for any BuildDelayFreeUses call, and so I think we could just extend most calls to BuildDelayFreeUses to pass in rmwOp when it exists and have it do the if (GenTree::NodesAreEquivalentLeaves(delayFreeOp, rmwOp)) { /* don't set delay free */ } else { /* set delay free */ }

For the 2-operand RMW instructions (like add, sub, mul, div, etc) this is what you are already doing for this specific SIMD case.

It should likewise extend to 3-operand (or more) RMW instructions (like fma where you have dst = (src1 * src2) + src3). In the three operand case, eliding the move requires: (dst == src1) && (dst != src2) && (dst != src3). However the latter two restrictions (which are achieved by setting "delay free") can be relaxed when src1 == src2 or when src1 == src3, respectively (since when rmwOp == delayFreeOp, we can't overwrite the delayFreeOp).
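The relaxed three-operand rule can be written down as a small predicate. This is a sketch only: `Operand`, `sameValue`, and `fmaDelayFree` are hypothetical names, operand equivalence is modeled as "same local variable number", and src1 is assumed to be the RMW operand.

```cpp
#include <cassert>

// Hypothetical stand-in for a JIT operand: two operands are treated as
// equivalent when they refer to the same local variable number.
struct Operand
{
    int lclNum;
};

inline bool sameValue(const Operand& a, const Operand& b)
{
    return a.lclNum == b.lclNum;
}

// Which of the non-RMW sources still need the "delay free" restriction
// for dst = (src1 * src2) + src3, with src1 as the RMW operand.
struct FmaDelayFree
{
    bool src2;
    bool src3;
};

inline FmaDelayFree fmaDelayFree(const Operand& src1, const Operand& src2, const Operand& src3)
{
    // A source equal to the RMW operand cannot be overwritten by the copy
    // into dst, so its restriction can be dropped.
    return FmaDelayFree{!sameValue(src1, src2), !sameValue(src1, src3)};
}
```

For three distinct operands both restrictions remain; for `(x * x) + y` only src3 stays delay free; for `(x * y) + x` only src2 does.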

Noting that I'm not super familiar with the register allocator logic myself, just stating logically how I'd expect this to be handled, and so it isn't clear to me if this is already "being handled" by the following check:

runtime/src/coreclr/jit/lsrabuild.cpp

Line 3100 in c2c92c4

if ((use->getInterval() != rmwInterval) || (!rmwIsLastUse && !use->lastUse))

If it is, it might be simple enough to just ensure we pass in rmwOp everywhere and the right things will happen.

Thanks for the wonderful explanation @tannergooding. I think I heard about delayRegFree from Carol, but your concrete examples helped me understand it more clearly. Regarding RMW (the last I heard of it was when working on Arm64), I thought that only certain instructions fall under that category (and that's why I was asking which ones those are), but looks like it is applicable to even GP register cases like add. I will investigate and try to come up with a "broader" solution.

Also, from what I understand, rmwOp is essentially the one that is acting as source as well as destination operand, while other operand (e.g. node parameter in BuildDelayFreeUses()) refers to other operand of that operation. I need to look deeper, but what determines what should be node parameter to BuildDelayFreeUses()? Assuming op1 is always considered as potential source and destination for 2/3/4 operand cases (and hence it is rmwNode), but for 3+ operand cases, should node be op2 or op3 or it depends on the GenTree?

> but looks like it is applicable to even GP register cases like add

Right. On ARM64 most instructions are not RMW; that is, they can separately encode dst, src1, and src2. Only a handful of instructions end up being RMW, and many of them are more advanced SIMD instructions, so it's a rarity to deal with.
On x86/x64, on the other hand, most instructions are RMW and, outside some newer instructions or those using the VEX encoding, it's something that codegen frequently has to deal with.
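The difference between the two encoding styles can be sketched as follows. `Encoding` and `lowerAdd` are hypothetical names, and the emitted strings are generic pseudo-assembly rather than exact ARM64 or x64 syntax.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical encoding styles: a destructive two-operand form (typical of
// legacy x86/x64 encodings) and a non-destructive three-operand form.
enum class Encoding
{
    Destructive,
    ThreeOperand
};

// Lowers `dst = src1 + src2` to pseudo-assembly for the given style.
inline std::vector<std::string> lowerAdd(Encoding           enc,
                                         const std::string& dst,
                                         const std::string& src1,
                                         const std::string& src2)
{
    std::vector<std::string> code;
    if (enc == Encoding::ThreeOperand)
    {
        // dst, src1 and src2 are encoded separately; no copy is needed.
        code.push_back("add " + dst + ", " + src1 + ", " + src2);
        return code;
    }

    // Destructive form: dst must not alias src2 unless it also aliases src1,
    // otherwise the copy below would clobber src2.
    assert(dst == src1 || dst != src2);
    if (dst != src1)
    {
        code.push_back("mov " + dst + ", " + src1);
    }
    code.push_back("add " + dst + ", " + src2);
    return code;
}
```

In this model the three-operand form always produces one instruction, while the destructive form produces one only when the allocator manages to give dst and src1 the same register.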

> Also, from what I understand, rmwOp is essentially the one that is acting as source as well as destination operand, while other operand (e.g. node parameter in BuildDelayFreeUses()) refers to other operand of that operation

Right. rmwOp is the operand that is a source but also the destination.

> Assuming op1 is always considered as potential source and destination for 2/3/4 operand cases (and hence it is rmwNode), but for 3+ operand cases, should node be op2 or op3 or it depends on the GenTree?

We have to build uses for all operands and so typically we:

Set the rmwOp (which is typically, but not always, op1) to be tgtPrefUse:

runtime/src/coreclr/jit/lsraxarch.cpp

Line 2455 in c2c92c4

tgtPrefUse = BuildUse(op1);

For each other operand, we call BuildDelayFreeUses:

runtime/src/coreclr/jit/lsraxarch.cpp

Line 2491 in c2c92c4

srcCount += BuildDelayFreeUses(op2);

and

runtime/src/coreclr/jit/lsraxarch.cpp

Line 2510 in c2c92c4

srcCount += isRMW ? BuildDelayFreeUses(op3) : BuildOperandUses(op3);

There are many examples of this throughout lsraxarch.cpp.

One of the 4 operand examples is here:

runtime/src/coreclr/jit/lsraxarch.cpp

Lines 2422 to 2425 in c2c92c4

srcCount += BuildOperandUses(op1);

srcCount += BuildDelayFreeUses(op2);

srcCount += BuildDelayFreeUses(op3);

srcCount += BuildDelayFreeUses(op4);

and it would be nice if we could make this something like:

srcCount += BuildOperandUses(op1);
srcCount += BuildDelayFreeUses(op2, op1);
srcCount += BuildDelayFreeUses(op3, op1);
srcCount += BuildDelayFreeUses(op4, op1);

Rather than needing to do something like the following:

srcCount += BuildOperandUses(op1);
srcCount += GenTree::NodesAreEquivalentLeaves(op2, op1) ? BuildOperandUses(op2) : BuildDelayFreeUses(op2, op1);
srcCount += GenTree::NodesAreEquivalentLeaves(op3, op1) ? BuildOperandUses(op3) : BuildDelayFreeUses(op3, op1);
srcCount += GenTree::NodesAreEquivalentLeaves(op4, op1) ? BuildOperandUses(op4) : BuildDelayFreeUses(op4, op1);

(assuming there isn't some complexity in the register allocator preventing this)
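One way to picture folding the equivalence check into the helper, so that call sites only pass the RMW operand, is the simplified model below. It is a sketch under assumptions: `Node`, `UseKind`, `nodesAreEquivalentLeaves`, and `buildDelayFreeUse` are hypothetical stand-ins, not the JIT's real types or signatures, and the real register allocator tracks far more state than this.

```cpp
#include <cassert>

// Hypothetical, simplified operand node; the real GenTree is far richer.
struct Node
{
    int lclNum;
};

enum class UseKind
{
    Normal,
    DelayFree
};

// Stand-in for an equivalence check on leaf nodes: here two nodes are
// equivalent when both exist and name the same local variable.
inline bool nodesAreEquivalentLeaves(const Node* a, const Node* b)
{
    return (a != nullptr) && (b != nullptr) && (a->lclNum == b->lclNum);
}

// Decides how to build the use for `node`. When `node` is equivalent to the
// RMW operand it can never be overwritten, so a normal use is enough.
// Callers that pass no rmwNode keep the conservative delay-free behavior.
inline UseKind buildDelayFreeUse(const Node* node, const Node* rmwNode = nullptr)
{
    assert(node != nullptr);
    return nodesAreEquivalentLeaves(node, rmwNode) ? UseKind::Normal : UseKind::DelayFree;
}
```

Under this model each call site stays a single call, and only operands that differ from the RMW operand end up delay free.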

@tannergooding - once again thanks for the insights. The change was easier than I thought because BuildDelayFreeUses() captures those scenarios.

kunalspathak · 2021-06-11T23:46:04Z

Here are some improvements with COMPlus_EnableAVX=0:

Benchmarks.run.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3694
Total bytes of diff: 3625
Total bytes of delta: -69 (-1.87% of base)
Total relative delta: -0.72
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -12 : 21279.dasm (-3.45% of base)
          -6 : 12839.dasm (-9.23% of base)
          -6 : 27223.dasm (-7.50% of base)
          -6 : 14151.dasm (-9.68% of base)
          -6 : 13679.dasm (-9.84% of base)
          -6 : 26511.dasm (-7.69% of base)
          -6 : 13670.dasm (-9.09% of base)
          -6 : 59.dasm (-0.26% of base)
          -6 : 26107.dasm (-7.32% of base)
          -6 : 26871.dasm (-7.14% of base)
          -3 : 60.dasm (-0.62% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -12 (-3.45% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-9.23% of base) : 12839.dasm - System.Numerics.Tests.Perf_Vector4:LengthBenchmark():float:this
          -6 (-7.50% of base) : 27223.dasm - System.Numerics.Tests.Perf_Vector2:DistanceSquaredBenchmark():float:this
          -6 (-9.68% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-0.26% of base) : 59.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-7.32% of base) : 26107.dasm - System.Numerics.Tests.Perf_Vector4:DistanceBenchmark():float:this
          -6 (-7.14% of base) : 26871.dasm - System.Numerics.Tests.Perf_Vector2:DistanceBenchmark():float:this
          -3 (-0.62% of base) : 60.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long

Top method improvements (percentages):
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-9.68% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
          -6 (-9.23% of base) : 12839.dasm - System.Numerics.Tests.Perf_Vector4:LengthBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this
          -6 (-7.50% of base) : 27223.dasm - System.Numerics.Tests.Perf_Vector2:DistanceSquaredBenchmark():float:this
          -6 (-7.32% of base) : 26107.dasm - System.Numerics.Tests.Perf_Vector4:DistanceBenchmark():float:this
          -6 (-7.14% of base) : 26871.dasm - System.Numerics.Tests.Perf_Vector2:DistanceBenchmark():float:this
         -12 (-3.45% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -3 (-0.62% of base) : 60.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -6 (-0.26% of base) : 59.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int

11 total methods with Code Size differences (11 improved, 0 regressed), 0 unchanged.

Libraries.crossgen2.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 4579
Total bytes of diff: 4440
Total bytes of delta: -139 (-3.04% of base)
Total relative delta: -3.07
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -49 : 45410.dasm (-3.74% of base)
         -12 : 45459.dasm (-3.45% of base)
          -6 : 45338.dasm (-20.00% of base)
          -6 : 45202.dasm (-30.00% of base)
          -6 : 45228.dasm (-27.27% of base)
          -6 : 45311.dasm (-28.57% of base)
          -6 : 45229.dasm (-23.08% of base)
          -6 : 40489.dasm (-0.25% of base)
          -6 : 45310.dasm (-35.29% of base)
          -6 : 45201.dasm (-37.50% of base)
          -6 : 45337.dasm (-23.08% of base)
          -3 : 39996.dasm (-15.79% of base)
          -3 : 39998.dasm (-11.54% of base)
          -3 : 45271.dasm (-4.05% of base)
          -3 : 40003.dasm (-11.11% of base)
          -3 : 40849.dasm (-3.66% of base)
          -3 : 41208.dasm (-3.66% of base)
          -3 : 40001.dasm (-15.79% of base)
          -3 : 45217.dasm (-9.09% of base)

19 total files with Code Size differences (19 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4

Top method improvements (percentages):
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int

19 total methods with Code Size differences (19 improved, 0 regressed), 0 unchanged.

Libraries.pmi.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1042
Total bytes of diff: 1021
Total bytes of delta: -21 (-2.02% of base)
Total relative delta: -0.41
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
          -9 : 16072.dasm (-20.93% of base)
          -6 : 16073.dasm (-18.75% of base)
          -3 : 5.dasm (-0.62% of base)
          -3 : 17524.dasm (-0.62% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

Top method improvements (percentages):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

4 total methods with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Here are some improvements with COMPlus_EnableAVX=1:

Libraries.crossgen2.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 4579
Total bytes of diff: 4440
Total bytes of delta: -139 (-3.04% of base)
Total relative delta: -3.07
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -49 : 45410.dasm (-3.74% of base)
         -12 : 45459.dasm (-3.45% of base)
          -6 : 40489.dasm (-0.25% of base)
          -6 : 45311.dasm (-28.57% of base)
          -6 : 45201.dasm (-37.50% of base)
          -6 : 45310.dasm (-35.29% of base)
          -6 : 45337.dasm (-23.08% of base)
          -6 : 45202.dasm (-30.00% of base)
          -6 : 45338.dasm (-20.00% of base)
          -6 : 45228.dasm (-27.27% of base)
          -6 : 45229.dasm (-23.08% of base)
          -3 : 39998.dasm (-11.54% of base)
          -3 : 39996.dasm (-15.79% of base)
          -3 : 40001.dasm (-15.79% of base)
          -3 : 41208.dasm (-3.66% of base)
          -3 : 40003.dasm (-11.11% of base)
          -3 : 40849.dasm (-3.66% of base)
          -3 : 45217.dasm (-9.09% of base)
          -3 : 45271.dasm (-4.05% of base)

19 total files with Code Size differences (19 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3

Top method improvements (percentages):
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int

19 total methods with Code Size differences (19 improved, 0 regressed), 0 unchanged.

Libraries.crossgen2.windows.x86.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 7474
Total bytes of diff: 7294
Total bytes of delta: -180 (-2.41% of base)
Total relative delta: -1.68
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
         -60 : 132888.dasm (-4.29% of base)
         -12 : 132931.dasm (-1.07% of base)
         -12 : 132937.dasm (-3.55% of base)
          -9 : 128688.dasm (-1.59% of base)
          -9 : 137845.dasm (-1.27% of base)
          -9 : 128340.dasm (-1.59% of base)
          -6 : 132927.dasm (-0.92% of base)
          -6 : 132905.dasm (-1.08% of base)
          -3 : 132680.dasm (-11.11% of base)
          -3 : 127493.dasm (-15.79% of base)
          -3 : 132804.dasm (-4.35% of base)
          -3 : 132936.dasm (-2.94% of base)
          -3 : 132816.dasm (-7.32% of base)
          -3 : 127492.dasm (-13.04% of base)
          -3 : 132679.dasm (-13.04% of base)
          -3 : 132789.dasm (-10.71% of base)
          -3 : 132706.dasm (-8.57% of base)
          -3 : 132932.dasm (-0.55% of base)
          -3 : 132707.dasm (-7.69% of base)
          -3 : 132883.dasm (-1.29% of base)

27 total files with Code Size differences (27 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -60 (-4.29% of base) : 132888.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
         -12 (-1.07% of base) : 132931.dasm - System.Numerics.Matrix4x4:CreateConstrainedBillboard(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
         -12 (-3.55% of base) : 132937.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -9 (-1.59% of base) : 128688.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii_Sse2(int,int,int):int
          -9 (-1.27% of base) : 137845.dasm - System.String:MakeSeparatorListVectorized(byref,ushort,ushort,ushort):this
          -9 (-1.59% of base) : 128340.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1_Sse2(int,int,int):int
          -6 (-0.92% of base) : 132927.dasm - System.Numerics.Matrix4x4:CreateLookAt(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
          -6 (-1.08% of base) : 132905.dasm - System.Numerics.Matrix4x4:CreateWorld(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
          -3 (-11.11% of base) : 132680.dasm - System.Numerics.Vector4:Length():float:this
          -3 (-15.79% of base) : 127493.dasm - System.Runtime.Intrinsics.Vector128:Create(float):System.Runtime.Intrinsics.Vector128`1[System.Single]
          -3 (-4.35% of base) : 132804.dasm - System.Numerics.Vector2:Normalize(System.Numerics.Vector2):System.Numerics.Vector2
          -3 (-2.94% of base) : 132936.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4
          -3 (-7.32% of base) : 132816.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-13.04% of base) : 127492.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-13.04% of base) : 132679.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -3 (-10.71% of base) : 132789.dasm - System.Numerics.Vector2:Length():float:this
          -3 (-8.57% of base) : 132706.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-0.55% of base) : 132932.dasm - System.Numerics.Matrix4x4:CreateBillboard(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
          -3 (-7.69% of base) : 132707.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-1.29% of base) : 132883.dasm - System.Numerics.Plane:CreateFromVertices(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Plane

Top method improvements (percentages):
          -3 (-15.79% of base) : 127493.dasm - System.Runtime.Intrinsics.Vector128:Create(float):System.Runtime.Intrinsics.Vector128`1[System.Single]
          -3 (-13.04% of base) : 127492.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-13.04% of base) : 132679.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -3 (-13.04% of base) : 127497.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-12.50% of base) : 132788.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -3 (-11.11% of base) : 132680.dasm - System.Numerics.Vector4:Length():float:this
          -3 (-10.71% of base) : 132789.dasm - System.Numerics.Vector2:Length():float:this
          -3 (-8.57% of base) : 132706.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-8.57% of base) : 132760.dasm - System.Numerics.Vector3:DistanceSquared(System.Numerics.Vector3,System.Numerics.Vector3):float
          -3 (-8.11% of base) : 132815.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-7.69% of base) : 132707.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-7.69% of base) : 132761.dasm - System.Numerics.Vector3:Distance(System.Numerics.Vector3,System.Numerics.Vector3):float
          -3 (-7.32% of base) : 132816.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-4.35% of base) : 132804.dasm - System.Numerics.Vector2:Normalize(System.Numerics.Vector2):System.Numerics.Vector2
         -60 (-4.29% of base) : 132888.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
          -3 (-3.95% of base) : 132749.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
         -12 (-3.55% of base) : 132937.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -3 (-2.94% of base) : 132936.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4
          -3 (-1.97% of base) : 132897.dasm - System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4
          -9 (-1.59% of base) : 128688.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii_Sse2(int,int,int):int

27 total methods with Code Size differences (27 improved, 0 regressed), 0 unchanged.

src/coreclr/jit/lsraxarch.cpp

tannergooding

LGTM. It would probably be good for @echesakovMSFT to take a second look at the ARM64 code.

tannergooding · 2021-06-12T02:32:20Z

It would probably also be good to run the ISA stress tests: runtime-coreclr jitstress-isas-x86 and runtime-coreclr jitstress-isas-arm.

kunalspathak · 2021-06-13T04:13:30Z

Failures are unrelated.

src/coreclr/jit/lsraarm64.cpp

echesakov · 2021-06-14T16:42:32Z

-                        varNum3 = intrin.op3->AsLclVar()->GetLclNum();
-                        op1LastUse |= ((varNum1 == varNum3) && intrin.op3->HasLastUse());
+                        varNum2 = intrin.op2->AsLclVar()->GetLclNum();
+                        assert((varNum1 == varNum2) && intrin.op2->HasLastUse());


The whole block under #ifdef DEBUG doesn't make sense to me. Why are we asserting that, for Vector64.GetElement(vector, index) and Vector128.GetElement(vector, index), the operands correspond to the same local? This will never be true, since op1 is a SIMD value and op2 is an int.

In fact, I don't think it will ever execute - neither NI_Vector64_GetElement nor NI_Vector128_GetElement is a RMW intrinsic.

That's right. I misread the code, thinking that the only way assert(!op2DelayFree); in the previous code would get triggered is if (isRMW == true) && (varNum1 == varNum2), but I didn't realize that it can also be because isRMW == false. I will remove the #ifdef DEBUG block.

echesakov

Looks good.

Can you please publish jit-diff results for arm64?

kunalspathak · 2021-06-14T19:49:37Z

No diffs, because arm64 was already doing the right thing. I just refactored the code to handle it inside BuildDelayFreeUses().

echesakov · 2021-06-14T19:53:50Z

Thanks for confirming!

Do not mark op2 as delayRegFree if op1==op2

c2c92c4

ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Jun 10, 2021

sandreenko self-requested a review June 10, 2021 04:02

sandreenko approved these changes Jun 10, 2021

src/coreclr/jit/gentree.cpp (outdated, resolved)

tannergooding reviewed Jun 10, 2021

src/coreclr/jit/gentree.cpp (outdated, resolved)

tannergooding reviewed Jun 10, 2021

kunalspathak added 4 commits June 11, 2021 16:37

Revert NodesAreEquivalentLeaves change

dee1a09

Pass rmwNode to BuildDelayFreeUses() which does the right thing

a451208

Make similar change in arm64

bcd1313

remove TODO comment

1e76781

tannergooding reviewed Jun 12, 2021

src/coreclr/jit/lsraxarch.cpp (resolved)

tannergooding approved these changes Jun 12, 2021

echesakov suggested changes Jun 14, 2021

review feedback

cae4038

echesakov approved these changes Jun 14, 2021

kunalspathak merged commit cc9fdad into dotnet:main Jun 14, 2021

ghost locked as resolved and limited conversation to collaborators Jul 14, 2021

	srcCount += BuildOperandUses(op1);
	srcCount += BuildDelayFreeUses(op2);
	srcCount += BuildDelayFreeUses(op3);
	srcCount += BuildDelayFreeUses(op4);
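
For readers unfamiliar with the term, the rule in this PR's title can be sketched in isolation. The following is a minimal C++ illustration, not the JIT's code: `Operand` and `ShouldDelayFree` are hypothetical names invented for this sketch, and it deliberately ignores last-use information, which the arm64 diff quoted earlier in the thread also checks via `HasLastUse()`.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for a JIT operand; this is NOT the real GenTree.
struct Operand
{
    bool     isLocalVar; // true if the operand reads a local variable
    unsigned lclNum;     // local variable number (meaningful only if isLocalVar)
};

// Hypothetical helper: decides whether `op` should be marked delay-free
// relative to the read-modify-write (RMW) operand `rmwOp`.
//
// Delay-free means the operand's register stays reserved until after the
// instruction's destination is defined, so the destination cannot overwrite it.
bool ShouldDelayFree(const Operand& op, const Operand& rmwOp)
{
    const bool sameLocal = op.isLocalVar && rmwOp.isLocalVar && (op.lclNum == rmwOp.lclNum);

    // If both operands are the same local, they already hold the same value,
    // so letting them share the destination register is harmless and saves
    // a register. Otherwise the operand must be kept alive (delay-free).
    return !sameLocal;
}
```

With this sketch, an operand that reads the same local as the RMW operand is not marked delay-free, while an operand reading a different local, or a non-local value, still is.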
