@kunalspathak
Contributor

If op1 == op2, do not mark op2 as delayFree so we can reuse the register.

Fixes: #9896

@ghost added the area-CodeGen-coreclr label Jun 10, 2021
@kunalspathak
Contributor Author

With COMPlus_EnableAVX=0, here are the improvements:

Benchmarks.run.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3694
Total bytes of diff: 3625
Total bytes of delta: -69 (-1.87% of base)
Total relative delta: -0.72
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -12 : 21279.dasm (-3.45% of base)
          -6 : 13679.dasm (-9.84% of base)
          -6 : 13670.dasm (-9.09% of base)
          -6 : 26511.dasm (-7.69% of base)
          -6 : 26107.dasm (-7.32% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -12 (-3.45% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this
          -6 (-7.32% of base) : 26107.dasm - System.Numerics.Tests.Perf_Vector4:DistanceBenchmark():float:this

Top method improvements (percentages):
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-9.68% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
          -6 (-9.23% of base) : 12839.dasm - System.Numerics.Tests.Perf_Vector4:LengthBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this

11 total methods with Code Size differences (11 improved, 0 regressed), 0 unchanged.



Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 63325.46
Total PerfScoreUnits of diff: 63294.560000000005
Total PerfScoreUnits of delta: -30.90 (-0.05% of base)
Total relative delta: -0.26
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (PerfScoreUnits):
      -18.60 : 59.dasm (-0.03% of base)
       -2.20 : 21279.dasm (-0.84% of base)
       -1.30 : 60.dasm (-0.83% of base)
       -1.10 : 13679.dasm (-3.91% of base)
       -1.10 : 13670.dasm (-2.93% of base)

11 total files with Perf Score differences (11 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
      -18.60 (-0.03% of base) : 59.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
       -2.20 (-0.84% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
       -1.30 (-0.83% of base) : 60.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
       -1.10 (-3.91% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
       -1.10 (-2.93% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this

Top method improvements (percentages):
       -1.10 (-4.20% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
       -1.10 (-3.91% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
       -1.10 (-3.14% of base) : 27223.dasm - System.Numerics.Tests.Perf_Vector2:DistanceSquaredBenchmark():float:this
       -1.10 (-2.93% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
       -1.10 (-2.84% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this

11 total methods with Perf Score differences (11 improved, 0 regressed), 0 unchanged.


Coreclr_tests.pmi.windows.x64.checked.1


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3742
Total bytes of diff: 3613
Total bytes of delta: -129 (-3.45% of base)
Total relative delta: -1.58
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -30 : 216439.dasm (-3.94% of base)
         -12 : 241647.dasm (-3.10% of base)
          -9 : 216440.dasm (-3.41% of base)
          -9 : 216442.dasm (-3.35% of base)
          -9 : 216441.dasm (-1.65% of base)

15 total files with Code Size differences (15 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -30 (-3.94% of base) : 216439.dasm - IntelHardwareIntrinsicTest:init()
         -12 (-3.10% of base) : 241647.dasm - VectorInitTest:VectorInit(float):int
          -9 (-3.41% of base) : 216440.dasm - IntelHardwareIntrinsicTest:F1_v128(float):System.Runtime.Intrinsics.Vector128`1[Single]
          -9 (-3.35% of base) : 216442.dasm - IntelHardwareIntrinsicTest:F1_v128i(int):System.Runtime.Intrinsics.Vector128`1[Int16]
          -9 (-1.65% of base) : 216441.dasm - IntelHardwareIntrinsicTest:F2_v128(float):System.Runtime.Intrinsics.Vector128`1[Single]

Top method improvements (percentages):
          -6 (-37.50% of base) : 240233.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeAs():float:this
          -6 (-37.50% of base) : 240235.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeReadUnaligned():float:this
          -6 (-24.00% of base) : 240234.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeRead():float:this
          -6 (-16.22% of base) : 240232.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredManualVectors():float:this
          -6 (-7.06% of base) : 240231.dasm - UnsafeTesting.Program:LengthSquaredManualVectors():float

15 total methods with Code Size differences (15 improved, 0 regressed), 0 unchanged.



Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 1528.75
Total PerfScoreUnits of diff: 1503.6100000000004
Total PerfScoreUnits of delta: -25.14 (-1.64% of base)
Total relative delta: -0.40
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (PerfScoreUnits):
       -7.00 : 216439.dasm (-1.62% of base)
       -2.20 : 241647.dasm (-1.47% of base)
       -1.65 : 216440.dasm (-1.23% of base)
       -1.65 : 216442.dasm (-1.51% of base)
       -1.30 : 5.dasm (-0.83% of base)

15 total files with Perf Score differences (15 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
       -7.00 (-1.62% of base) : 216439.dasm - IntelHardwareIntrinsicTest:init()
       -2.20 (-1.47% of base) : 241647.dasm - VectorInitTest:VectorInit(float):int
       -1.65 (-1.23% of base) : 216440.dasm - IntelHardwareIntrinsicTest:F1_v128(float):System.Runtime.Intrinsics.Vector128`1[Single]
       -1.65 (-1.51% of base) : 216442.dasm - IntelHardwareIntrinsicTest:F1_v128i(int):System.Runtime.Intrinsics.Vector128`1[Int16]
       -1.30 (-0.83% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long

Top method improvements (percentages):
       -1.10 (-5.76% of base) : 240233.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeAs():float:this
       -1.10 (-5.76% of base) : 240235.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeReadUnaligned():float:this
       -1.10 (-4.94% of base) : 240234.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredUnsafeRead():float:this
       -1.10 (-3.90% of base) : 240232.dasm - UnsafeTesting.QuaternionStruct:LengthSquaredManualVectors():float:this
       -1.10 (-2.97% of base) : 240231.dasm - UnsafeTesting.Program:LengthSquaredManualVectors():float

15 total methods with Perf Score differences (15 improved, 0 regressed), 0 unchanged.


Libraries.pmi.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1042
Total bytes of diff: 1021
Total bytes of delta: -21 (-2.02% of base)
Total relative delta: -0.41
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
          -9 : 16072.dasm (-20.93% of base)
          -6 : 16073.dasm (-18.75% of base)
          -3 : 17524.dasm (-0.62% of base)
          -3 : 5.dasm (-0.62% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long

Top method improvements (percentages):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

4 total methods with Code Size differences (4 improved, 0 regressed), 0 unchanged.



Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 368.32
Total PerfScoreUnits of diff: 362.97
Total PerfScoreUnits of delta: -5.35 (-1.45% of base)
Total relative delta: -0.12
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (PerfScoreUnits):
       -1.65 : 16072.dasm (-6.27% of base)
       -1.30 : 17524.dasm (-0.83% of base)
       -1.30 : 5.dasm (-0.83% of base)
       -1.10 : 16073.dasm (-3.97% of base)

4 total files with Perf Score differences (4 improved, 0 regressed), 0 unchanged.

Top method improvements (PerfScoreUnits):
       -1.65 (-6.27% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
       -1.30 (-0.83% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long
       -1.30 (-0.83% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
       -1.10 (-3.97% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int

Top method improvements (percentages):
       -1.65 (-6.27% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
       -1.10 (-3.97% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
       -1.30 (-0.83% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
       -1.30 (-0.83% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

4 total methods with Perf Score differences (4 improved, 0 regressed), 0 unchanged.


@kunalspathak
Contributor Author

@dotnet/jit-contrib

@sandreenko self-requested a review June 10, 2021 04:02
Contributor

@sandreenko left a comment

LGTM


srcCount += BuildDelayFreeUses(op2);
// Unless, op1 and op2 are same, in which case we can overwrite op2.
if (GenTree::NodesAreEquivalentLeaves(op1, op2))
Member

@tannergooding Jun 10, 2021

Are there other RMW locations setting BuildDelayFreeUses that could/should also be updated? Is this already handled on the scalar operation (that is int + int or float - float) code path, for example?

Member

@tannergooding Jun 10, 2021

I'd imagine, for example, the check a few lines down dealing with making BuildDelayFreeUses(op3) would also benefit from this check: https://github.com/dotnet/runtime/pull/53964/files#diff-9626112837daf480c93d401a587012b3e398dd90a953c870797d472cff36839dR2510

Maybe the right fix is to call this overload of BuildDelayFreeUses:

int LinearScan::BuildDelayFreeUses(GenTree* node, GenTree* rmwNode, regMaskTP candidates)

It looks to already do some checks on rmwNode vs the node being set as delayFree and might be a natural place to add this NodesAreEquivalentLeaves check, if it's generally applicable?

Edit: It looks like it's not an overload, just one method where rmwNode defaults to nullptr and where many places (at least for HWIntrinsics) aren't passing in the rmwNode, likely because it (the rmwNode parameter) was added ~6 months back: #45135

Contributor Author

@kunalspathak Jun 10, 2021

> Are there other RMW locations setting BuildDelayFreeUses that could/should also be updated?

I am not super familiar with the RMW semantics. I can sync up offline to understand which other scenarios would benefit.

> and where many places (at least for HWIntrinsics) aren't passing in the rmwNode,

That's right, in most places we pass rmwNode as nullptr, and I am not 100% sure of the impact of refactoring it so that the NodesAreEquivalentLeaves() check fits inside it. For this bug, I would keep the check where it currently is.

> a few lines down dealing with making BuildDelayFreeUses(op3) would also benefit from this

Can you give an example of the RMW semantics for this, and of which two nodes I should check for equivalence?

> Is this already handled on the scalar operation (that is int + int or float - float) code path, for example?

Again, could you give an example for this?

Member

> I am not super familiar with the RMW semantics.

RMW (Read-Modify-Write) just means one of the sources is also the destination and so is "destructive" (the value isn't preserved).

This means that if the value needs to be preserved (generally because src1 is not "last use"), you need an additional mov to copy the value to the destination:

; Operation is logically `dst = src1 + src2`
mov dst, src1
add dst, src2

This in turn requires dst != src2: if they were the same register, the mov would overwrite src2 before the add reads it and change the logical operation. BuildDelayFreeUses exists to cover this scenario by ensuring that dst != src2:

; If `dst == src2 && src1 != src2`, operation becomes `dst = src1 + src1` and so is problematic
mov dst, src1
add dst, src2
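
The clobbering described above can be reproduced with a toy register file. This is only an illustrative sketch: `Regs` and `emitRmwAdd` are hypothetical names, not JIT code, and the array indices merely stand in for physical registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy register file; each index stands in for one physical register.
using Regs = std::array<int, 4>;

// Emulates the two-instruction RMW sequence for `dst = src1 + src2`:
//   mov dst, src1
//   add dst, src2
// Takes the register file by value and returns what ends up in `dst`.
int emitRmwAdd(Regs regs, std::size_t dst, std::size_t src1, std::size_t src2)
{
    assert(dst < regs.size() && src1 < regs.size() && src2 < regs.size());
    regs[dst] = regs[src1];  // mov dst, src1 (clobbers src2 when dst == src2)
    regs[dst] += regs[src2]; // add dst, src2
    return regs[dst];
}
```

With registers holding `{3, 4, 0, 0}`, picking a fresh `dst` (index 2) or `dst == src1` (index 0) yields 7, while `dst == src2` (index 1) yields 6, i.e. `src1 + src1`.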

All of this works great when src1 is last use and when src1 != src2 because this means the register allocator can make dst == src1 and we can elide the copy generating just:

; Operation is logically `src1 += src2`
add src1, src2

However, in certain cases, such as when src1 == src2, marking src2 as "delay free" is problematic. It forces the register allocator to pick dst != src2, and since src1 == src2, that also means dst != src1, so we are forced to generate a move. In practice, it is safe for src2 not to be delay free in this exact scenario, because neither input is overwritten before it is read.
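
To make the src1 == src2 case concrete, here is a small sketch in the same toy-register style. `Emitted` and `emitSelfAdd` are hypothetical helpers for illustration only; they count emitted instructions and do not model the real allocator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Regs = std::array<int, 4>; // toy register file

struct Emitted
{
    int value;        // value left in dst
    int instructions; // 1 when the copy is elided, otherwise 2
};

// Computes `dst = src + src`, where both inputs live in the same register.
Emitted emitSelfAdd(Regs regs, std::size_t dst, std::size_t src)
{
    assert(dst < regs.size() && src < regs.size());
    int count = 0;
    if (dst != src)
    {
        regs[dst] = regs[src]; // mov dst, src (needed only when dst differs)
        ++count;
    }
    regs[dst] += regs[src]; // add dst, src
    ++count;
    return {regs[dst], count};
}
```

Both register choices compute the same value; letting `dst` alias the shared source simply saves the mov, which is the saving the code size diffs in this PR reflect.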

The PR here "fixes" the issue by not setting "delay free" when rmwOp == delayFreeOp (this is safe since it means we won't ever overwrite delayFreeOp). This "should be" applicable to all RMW setups, whether it's specifically this SIMD example or other examples like integer additions, so I'd think we want to make the support for this "broader" so we can always generate the more efficient code.

As far as I can tell, the fix you have here should work for any BuildDelayFreeUses call, so I think we could extend most calls to BuildDelayFreeUses to pass in rmwOp when it exists and have it do: if (GenTree::NodesAreEquivalentLeaves(delayFreeOp, rmwOp)) { /* don't set delay free */ } else { /* set delay free */ }

  • For the 2-operand RMW instructions (like add, sub, mul, div, etc) this is what you are already doing for this specific SIMD case.
  • It should likewise extend to 3-operand (or more) RMW instructions (like fma where you have dst = (src1 * src2) + src3). In the three operand case, eliding the move requires: (dst == src1) && (dst != src2) && (dst != src3). However the latter two restrictions (which are achieved by setting "delay free") can be relaxed when src1 == src2 or when src1 == src3, respectively (since when rmwOp == delayFreeOp, we can't overwrite the delayFreeOp).
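
The relaxation in the two bullets above can be sketched as a per-operand decision. This is a simplified model under stated assumptions: `Operand`, `equivalentLeaves`, and `delayFreeMask` are hypothetical stand-ins, and "equivalent" is reduced to "names the same local", which is narrower than whatever GenTree::NodesAreEquivalentLeaves actually checks.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a leaf operand, identified by a local's name.
struct Operand
{
    std::string local;
};

// Simplified equivalence: both operands refer to the same local.
bool equivalentLeaves(const Operand& a, const Operand& b)
{
    return a.local == b.local;
}

// For an RMW instruction whose first source is also the destination, report
// for each remaining source whether it still needs to be "delay free".
std::vector<bool> delayFreeMask(const Operand& rmwOp, const std::vector<Operand>& others)
{
    std::vector<bool> mask;
    mask.reserve(others.size());
    for (const Operand& op : others)
    {
        // An operand equivalent to rmwOp cannot be clobbered by the copy.
        mask.push_back(!equivalentLeaves(op, rmwOp));
    }
    return mask;
}
```

For an fma-shaped `dst = (a * a) + b` with `a` as the RMW operand, the mask is `{false, true}`: only `b` keeps the dst != src restriction.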

Member

@tannergooding Jun 11, 2021

Noting that I'm not super familiar with the register allocator logic myself, just stating logically how I'd expect this to be handled, and so it isn't clear to me if this is already "being handled" by the following check:

if ((use->getInterval() != rmwInterval) || (!rmwIsLastUse && !use->lastUse))

If it is, it might be simple enough to just ensure we pass in rmwOp everywhere and the right things will happen.
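
One way to reason about the quoted condition is to flatten its inputs into booleans and enumerate them. The sketch below assumes the condition guards marking the use as delay free; the thread does not show the body, so treat that as an assumption. `UseInfo` and `conditionHolds` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical, flattened inputs to the quoted condition.
struct UseInfo
{
    bool sameIntervalAsRmw; // use->getInterval() == rmwInterval
    bool rmwIsLastUse;      // rmwIsLastUse
    bool useIsLastUse;      // use->lastUse
};

// Mirrors the quoted condition term by term. Assumed (not confirmed by the
// thread) to decide whether the use gets marked delay free.
bool conditionHolds(const UseInfo& u)
{
    return !u.sameIntervalAsRmw || (!u.rmwIsLastUse && !u.useIsLastUse);
}
```

Under this reading, a use on a different interval always satisfies the condition, while a use sharing the RMW operand's interval satisfies it only when neither reference is a last use.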

Contributor Author

Thanks for the wonderful explanation, @tannergooding. I think I heard about delayRegFree from Carol, but your concrete examples helped me understand it more clearly. Regarding RMW, the last I heard of it was when working on Arm64, and I thought that only certain instructions fall into that category (which is why I was asking which ones those are), but looks like it is applicable to even GP register cases like add. I will investigate and try to come up with a "broader" solution.

Contributor Author

Also, from what I understand, rmwOp is essentially the one that is acting as source as well as destination operand, while other operand (e.g. node parameter in BuildDelayFreeUses()) refers to other operand of that operation. I need to look deeper, but what determines what should be node parameter to BuildDelayFreeUses()? Assuming op1 is always considered as potential source and destination for 2/3/4 operand cases (and hence it is rmwNode), but for 3+ operand cases, should node be op2 or op3 or it depends on the GenTree?

Member

> but looks like it is applicable to even GP register cases like add

Right. On ARM64 most instructions are not RMW, that is, they can separately encode dst, src1, and src2. Only a handful of instructions end up being RMW, and many of them are more advanced SIMD instructions, so it's a rarity to deal with.
x86/x64, on the other hand, has most instructions being RMW, and outside of some newer instructions or the VEX encoding, it's something that codegen frequently has to deal with.

> Also, from what I understand, rmwOp is essentially the one that is acting as source as well as destination operand, while other operand (e.g. node parameter in BuildDelayFreeUses()) refers to other operand of that operation

Right. rmwOp is the operand that is a source but also the destination.

> Assuming op1 is always considered as potential source and destination for 2/3/4 operand cases (and hence it is rmwNode), but for 3+ operand cases, should node be op2 or op3 or it depends on the GenTree?

We have to build uses for all operands and so typically we:

There are many examples of this throughout lsraxarch.cpp

Member

@tannergooding Jun 11, 2021

One of the 4 operand examples is here:

srcCount += BuildOperandUses(op1);
srcCount += BuildDelayFreeUses(op2);
srcCount += BuildDelayFreeUses(op3);
srcCount += BuildDelayFreeUses(op4);

and it would be nice if we could make this something like:

                srcCount += BuildOperandUses(op1);
                srcCount += BuildDelayFreeUses(op2, op1);
                srcCount += BuildDelayFreeUses(op3, op1);
                srcCount += BuildDelayFreeUses(op4, op1);

Rather than needing to do something like the following:

                srcCount += BuildOperandUses(op1);
                srcCount += GenTree::NodesAreEquivalentLeaves(op2, op1) ? BuildOperandUses(op2) : BuildDelayFreeUses(op2, op1);
                srcCount += GenTree::NodesAreEquivalentLeaves(op3, op1) ? BuildOperandUses(op3) : BuildDelayFreeUses(op3, op1);
                srcCount += GenTree::NodesAreEquivalentLeaves(op4, op1) ? BuildOperandUses(op4) : BuildDelayFreeUses(op4, op1);

(assuming there isn't some complexity in the register allocator preventing this)
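
A toy model of the preferred shape, where the equivalence check lives inside the delay-free helper instead of at every call site, might look like the sketch below. Everything here is hypothetical (`Node`, `UseBuilder`, `buildFourOperandUses`); it borrows only the idea of an optional rmwNode argument defaulting to nullptr from the discussion and does not reflect the real LinearScan implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical leaf node, identified by the local it reads.
struct Node
{
    std::string local;
};

struct UseBuilder
{
    std::vector<std::string> delayFree; // operands marked delay free, for inspection

    static bool equivalentLeaves(const Node* a, const Node* b)
    {
        return a != nullptr && b != nullptr && a->local == b->local;
    }

    // Builds a plain use; returns the number of sources consumed.
    int buildOperandUses(const Node* node)
    {
        assert(node != nullptr);
        return 1;
    }

    // rmwNode defaults to nullptr, in which case the operand is always
    // marked. When it is supplied, an equivalent operand is left unmarked.
    int buildDelayFreeUses(const Node* node, const Node* rmwNode = nullptr)
    {
        assert(node != nullptr);
        if (!equivalentLeaves(node, rmwNode))
        {
            delayFree.push_back(node->local);
        }
        return 1;
    }
};

// Call site shaped like the 4-operand example: op1 is the RMW operand.
int buildFourOperandUses(UseBuilder& b, const Node& op1, const Node& op2,
                         const Node& op3, const Node& op4)
{
    int srcCount = 0;
    srcCount += b.buildOperandUses(&op1);
    srcCount += b.buildDelayFreeUses(&op2, &op1);
    srcCount += b.buildDelayFreeUses(&op3, &op1);
    srcCount += b.buildDelayFreeUses(&op4, &op1);
    return srcCount;
}
```

The call site stays as short as the original four lines, and an operand that repeats op1 is skipped without any per-call ternary.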

Contributor Author

@tannergooding - once again thanks for the insights. The change was easier than I thought because BuildDelayFreeUses() captures those scenarios.

@kunalspathak
Contributor Author

Here are some improvements with COMPlus_EnableAVX=0:

Benchmarks.run.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 3694
Total bytes of diff: 3625
Total bytes of delta: -69 (-1.87% of base)
Total relative delta: -0.72
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -12 : 21279.dasm (-3.45% of base)
          -6 : 12839.dasm (-9.23% of base)
          -6 : 27223.dasm (-7.50% of base)
          -6 : 14151.dasm (-9.68% of base)
          -6 : 13679.dasm (-9.84% of base)
          -6 : 26511.dasm (-7.69% of base)
          -6 : 13670.dasm (-9.09% of base)
          -6 : 59.dasm (-0.26% of base)
          -6 : 26107.dasm (-7.32% of base)
          -6 : 26871.dasm (-7.14% of base)
          -3 : 60.dasm (-0.62% of base)

11 total files with Code Size differences (11 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -12 (-3.45% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-9.23% of base) : 12839.dasm - System.Numerics.Tests.Perf_Vector4:LengthBenchmark():float:this
          -6 (-7.50% of base) : 27223.dasm - System.Numerics.Tests.Perf_Vector2:DistanceSquaredBenchmark():float:this
          -6 (-9.68% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-0.26% of base) : 59.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-7.32% of base) : 26107.dasm - System.Numerics.Tests.Perf_Vector4:DistanceBenchmark():float:this
          -6 (-7.14% of base) : 26871.dasm - System.Numerics.Tests.Perf_Vector2:DistanceBenchmark():float:this
          -3 (-0.62% of base) : 60.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long

Top method improvements (percentages):
          -6 (-9.84% of base) : 13679.dasm - System.Numerics.Tests.Perf_Vector4:LengthSquaredBenchmark():float:this
          -6 (-9.68% of base) : 14151.dasm - System.Numerics.Tests.Perf_Vector2:LengthSquaredBenchmark():float:this
          -6 (-9.23% of base) : 12839.dasm - System.Numerics.Tests.Perf_Vector4:LengthBenchmark():float:this
          -6 (-9.09% of base) : 13670.dasm - System.Numerics.Tests.Perf_Vector2:LengthBenchmark():float:this
          -6 (-7.69% of base) : 26511.dasm - System.Numerics.Tests.Perf_Vector4:DistanceSquaredBenchmark():float:this
          -6 (-7.50% of base) : 27223.dasm - System.Numerics.Tests.Perf_Vector2:DistanceSquaredBenchmark():float:this
          -6 (-7.32% of base) : 26107.dasm - System.Numerics.Tests.Perf_Vector4:DistanceBenchmark():float:this
          -6 (-7.14% of base) : 26871.dasm - System.Numerics.Tests.Perf_Vector2:DistanceBenchmark():float:this
         -12 (-3.45% of base) : 21279.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -3 (-0.62% of base) : 60.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -6 (-0.26% of base) : 59.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int

11 total methods with Code Size differences (11 improved, 0 regressed), 0 unchanged.


Libraries.crossgen2.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 4579
Total bytes of diff: 4440
Total bytes of delta: -139 (-3.04% of base)
Total relative delta: -3.07
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -49 : 45410.dasm (-3.74% of base)
         -12 : 45459.dasm (-3.45% of base)
          -6 : 45338.dasm (-20.00% of base)
          -6 : 45202.dasm (-30.00% of base)
          -6 : 45228.dasm (-27.27% of base)
          -6 : 45311.dasm (-28.57% of base)
          -6 : 45229.dasm (-23.08% of base)
          -6 : 40489.dasm (-0.25% of base)
          -6 : 45310.dasm (-35.29% of base)
          -6 : 45201.dasm (-37.50% of base)
          -6 : 45337.dasm (-23.08% of base)
          -3 : 39996.dasm (-15.79% of base)
          -3 : 39998.dasm (-11.54% of base)
          -3 : 45271.dasm (-4.05% of base)
          -3 : 40003.dasm (-11.11% of base)
          -3 : 40849.dasm (-3.66% of base)
          -3 : 41208.dasm (-3.66% of base)
          -3 : 40001.dasm (-15.79% of base)
          -3 : 45217.dasm (-9.09% of base)

19 total files with Code Size differences (19 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4

Top method improvements (percentages):
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int

19 total methods with Code Size differences (19 improved, 0 regressed), 0 unchanged.


Libraries.pmi.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1042
Total bytes of diff: 1021
Total bytes of delta: -21 (-2.02% of base)
Total relative delta: -0.41
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
          -9 : 16072.dasm (-20.93% of base)
          -6 : 16073.dasm (-18.75% of base)
          -3 : 5.dasm (-0.62% of base)
          -3 : 17524.dasm (-0.62% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

Top method improvements (percentages):
          -9 (-20.93% of base) : 16072.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int16],System.Numerics.Vector`1[Int16]):short
          -6 (-18.75% of base) : 16073.dasm - System.Numerics.Vector:Dot(System.Numerics.Vector`1[Int32],System.Numerics.Vector`1[Int32]):int
          -3 (-0.62% of base) : 5.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
          -3 (-0.62% of base) : 17524.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1(long,long,long):long

4 total methods with Code Size differences (4 improved, 0 regressed), 0 unchanged.
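
The summary figures can be cross-checked against the per-method rows. The sketch below is not part of the jit-diff tooling: the `MethodDiff` struct and the helper names are hypothetical, and treating "Total relative delta" as the sum of the per-method percentage changes (expressed as fractions) is an assumption inferred from the fact that the listed numbers add up that way.

```cpp
#include <cassert>
#include <vector>

// Hypothetical row type: one entry of a "Top method improvements" list.
struct MethodDiff
{
    int    deltaBytes;    // byte delta for the method, e.g. -9
    double percentOfBase; // relative change in percent, e.g. -20.93
};

// Sum of the per-method byte deltas ("Total bytes of delta").
int TotalDeltaBytes(const std::vector<MethodDiff>& rows)
{
    int total = 0;
    for (const MethodDiff& row : rows)
    {
        total += row.deltaBytes;
    }
    return total;
}

// Sum of the per-method relative changes as fractions
// (assumed meaning of "Total relative delta").
double TotalRelativeDelta(const std::vector<MethodDiff>& rows)
{
    double total = 0.0;
    for (const MethodDiff& row : rows)
    {
        total += row.percentOfBase / 100.0;
    }
    return total;
}

// Overall change as a percentage of the base size.
double PercentOfBase(int deltaBytes, int baseBytes)
{
    assert(baseBytes > 0);
    return 100.0 * deltaBytes / baseBytes;
}

// The four Libraries.pmi.windows.x64.checked rows listed above.
std::vector<MethodDiff> LibrariesPmiRows()
{
    return {{-9, -20.93}, {-6, -18.75}, {-3, -0.62}, {-3, -0.62}};
}
```

For the Libraries.pmi rows, the byte deltas sum to -21 (1021 - 1042), which is about -2.02% of the 1042-byte base, and the per-method percentages sum to roughly -0.41, in line with the summary above.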


Here are some improvements with COMPlus_EnableAVX=1:

Libraries.crossgen2.windows.x64.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 4579
Total bytes of diff: 4440
Total bytes of delta: -139 (-3.04% of base)
Total relative delta: -3.07
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -49 : 45410.dasm (-3.74% of base)
         -12 : 45459.dasm (-3.45% of base)
          -6 : 40489.dasm (-0.25% of base)
          -6 : 45311.dasm (-28.57% of base)
          -6 : 45201.dasm (-37.50% of base)
          -6 : 45310.dasm (-35.29% of base)
          -6 : 45337.dasm (-23.08% of base)
          -6 : 45202.dasm (-30.00% of base)
          -6 : 45338.dasm (-20.00% of base)
          -6 : 45228.dasm (-27.27% of base)
          -6 : 45229.dasm (-23.08% of base)
          -3 : 39998.dasm (-11.54% of base)
          -3 : 39996.dasm (-15.79% of base)
          -3 : 40001.dasm (-15.79% of base)
          -3 : 41208.dasm (-3.66% of base)
          -3 : 40003.dasm (-11.11% of base)
          -3 : 40849.dasm (-3.66% of base)
          -3 : 45217.dasm (-9.09% of base)
          -3 : 45271.dasm (-4.05% of base)

19 total files with Code Size differences (19 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3

Top method improvements (percentages):
          -6 (-37.50% of base) : 45201.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -6 (-35.29% of base) : 45310.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -6 (-30.00% of base) : 45202.dasm - System.Numerics.Vector4:Length():float:this
          -6 (-28.57% of base) : 45311.dasm - System.Numerics.Vector2:Length():float:this
          -6 (-27.27% of base) : 45228.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-23.08% of base) : 45337.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -6 (-23.08% of base) : 45229.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -6 (-20.00% of base) : 45338.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-15.79% of base) : 39996.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.UInt64]
          -3 (-15.79% of base) : 40001.dasm - System.Runtime.Intrinsics.Vector128:Create(long):System.Runtime.Intrinsics.Vector128`1[System.Int64]
          -3 (-11.54% of base) : 39998.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-11.11% of base) : 40003.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-9.09% of base) : 45217.dasm - System.Numerics.Vector4:Normalize(System.Numerics.Vector4):System.Numerics.Vector4
          -3 (-4.05% of base) : 45271.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
         -49 (-3.74% of base) : 45410.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
          -3 (-3.66% of base) : 41208.dasm - System.Text.ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long)
          -3 (-3.66% of base) : 40849.dasm - System.Text.Latin1Utility:NarrowFourUtf16CharsToLatin1AndWriteToBuffer(byref,long)
         -12 (-3.45% of base) : 45459.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -6 (-0.25% of base) : 40489.dasm - System.Text.Unicode.Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int

19 total methods with Code Size differences (19 improved, 0 regressed), 0 unchanged.


Libraries.crossgen2.windows.x86.checked


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 7474
Total bytes of diff: 7294
Total bytes of delta: -180 (-2.41% of base)
Total relative delta: -1.68
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
         -60 : 132888.dasm (-4.29% of base)
         -12 : 132931.dasm (-1.07% of base)
         -12 : 132937.dasm (-3.55% of base)
          -9 : 128688.dasm (-1.59% of base)
          -9 : 137845.dasm (-1.27% of base)
          -9 : 128340.dasm (-1.59% of base)
          -6 : 132927.dasm (-0.92% of base)
          -6 : 132905.dasm (-1.08% of base)
          -3 : 132680.dasm (-11.11% of base)
          -3 : 127493.dasm (-15.79% of base)
          -3 : 132804.dasm (-4.35% of base)
          -3 : 132936.dasm (-2.94% of base)
          -3 : 132816.dasm (-7.32% of base)
          -3 : 127492.dasm (-13.04% of base)
          -3 : 132679.dasm (-13.04% of base)
          -3 : 132789.dasm (-10.71% of base)
          -3 : 132706.dasm (-8.57% of base)
          -3 : 132932.dasm (-0.55% of base)
          -3 : 132707.dasm (-7.69% of base)
          -3 : 132883.dasm (-1.29% of base)

27 total files with Code Size differences (27 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
         -60 (-4.29% of base) : 132888.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
         -12 (-1.07% of base) : 132931.dasm - System.Numerics.Matrix4x4:CreateConstrainedBillboard(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
         -12 (-3.55% of base) : 132937.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -9 (-1.59% of base) : 128688.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii_Sse2(int,int,int):int
          -9 (-1.27% of base) : 137845.dasm - System.String:MakeSeparatorListVectorized(byref,ushort,ushort,ushort):this
          -9 (-1.59% of base) : 128340.dasm - System.Text.Latin1Utility:NarrowUtf16ToLatin1_Sse2(int,int,int):int
          -6 (-0.92% of base) : 132927.dasm - System.Numerics.Matrix4x4:CreateLookAt(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
          -6 (-1.08% of base) : 132905.dasm - System.Numerics.Matrix4x4:CreateWorld(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
          -3 (-11.11% of base) : 132680.dasm - System.Numerics.Vector4:Length():float:this
          -3 (-15.79% of base) : 127493.dasm - System.Runtime.Intrinsics.Vector128:Create(float):System.Runtime.Intrinsics.Vector128`1[System.Single]
          -3 (-4.35% of base) : 132804.dasm - System.Numerics.Vector2:Normalize(System.Numerics.Vector2):System.Numerics.Vector2
          -3 (-2.94% of base) : 132936.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4
          -3 (-7.32% of base) : 132816.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-13.04% of base) : 127492.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-13.04% of base) : 132679.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -3 (-10.71% of base) : 132789.dasm - System.Numerics.Vector2:Length():float:this
          -3 (-8.57% of base) : 132706.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-0.55% of base) : 132932.dasm - System.Numerics.Matrix4x4:CreateBillboard(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Matrix4x4
          -3 (-7.69% of base) : 132707.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-1.29% of base) : 132883.dasm - System.Numerics.Plane:CreateFromVertices(System.Numerics.Vector3,System.Numerics.Vector3,System.Numerics.Vector3):System.Numerics.Plane

Top method improvements (percentages):
          -3 (-15.79% of base) : 127493.dasm - System.Runtime.Intrinsics.Vector128:Create(float):System.Runtime.Intrinsics.Vector128`1[System.Single]
          -3 (-13.04% of base) : 127492.dasm - System.Runtime.Intrinsics.Vector128:Create(ushort):System.Runtime.Intrinsics.Vector128`1[System.UInt16]
          -3 (-13.04% of base) : 132679.dasm - System.Numerics.Vector4:LengthSquared():float:this
          -3 (-13.04% of base) : 127497.dasm - System.Runtime.Intrinsics.Vector128:Create(short):System.Runtime.Intrinsics.Vector128`1[System.Int16]
          -3 (-12.50% of base) : 132788.dasm - System.Numerics.Vector2:LengthSquared():float:this
          -3 (-11.11% of base) : 132680.dasm - System.Numerics.Vector4:Length():float:this
          -3 (-10.71% of base) : 132789.dasm - System.Numerics.Vector2:Length():float:this
          -3 (-8.57% of base) : 132706.dasm - System.Numerics.Vector4:DistanceSquared(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-8.57% of base) : 132760.dasm - System.Numerics.Vector3:DistanceSquared(System.Numerics.Vector3,System.Numerics.Vector3):float
          -3 (-8.11% of base) : 132815.dasm - System.Numerics.Vector2:DistanceSquared(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-7.69% of base) : 132707.dasm - System.Numerics.Vector4:Distance(System.Numerics.Vector4,System.Numerics.Vector4):float
          -3 (-7.69% of base) : 132761.dasm - System.Numerics.Vector3:Distance(System.Numerics.Vector3,System.Numerics.Vector3):float
          -3 (-7.32% of base) : 132816.dasm - System.Numerics.Vector2:Distance(System.Numerics.Vector2,System.Numerics.Vector2):float
          -3 (-4.35% of base) : 132804.dasm - System.Numerics.Vector2:Normalize(System.Numerics.Vector2):System.Numerics.Vector2
         -60 (-4.29% of base) : 132888.dasm - System.Numerics.Matrix4x4:<Invert>g__SseImpl|65_0(System.Numerics.Matrix4x4,byref):bool
          -3 (-3.95% of base) : 132749.dasm - System.Numerics.Vector3:Normalize(System.Numerics.Vector3):System.Numerics.Vector3
         -12 (-3.55% of base) : 132937.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4):System.Numerics.Matrix4x4
          -3 (-2.94% of base) : 132936.dasm - System.Numerics.Matrix4x4:op_Multiply(System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4
          -3 (-1.97% of base) : 132897.dasm - System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4
          -9 (-1.59% of base) : 128688.dasm - System.Text.ASCIIUtility:NarrowUtf16ToAscii_Sse2(int,int,int):int

27 total methods with Code Size differences (27 improved, 0 regressed), 0 unchanged.


Member

@tannergooding tannergooding left a comment

LGTM. It would probably be good for @echesakovMSFT to take a second look at the ARM64 code.

@tannergooding
Member

It would probably also be good to run the ISA stress tests: runtime-coreclr jitstress-isas-x86 and runtime-coreclr jitstress-isas-arm.

@kunalspathak
Contributor Author

Failures are unrelated.

varNum3 = intrin.op3->AsLclVar()->GetLclNum();
op1LastUse |= ((varNum1 == varNum3) && intrin.op3->HasLastUse());
varNum2 = intrin.op2->AsLclVar()->GetLclNum();
assert((varNum1 == varNum2) && intrin.op2->HasLastUse());
Contributor

The whole block under #ifdef DEBUG doesn't make sense to me. Why are we asserting that, for Vector64.GetElement(vector, index) and Vector128.GetElement(vector, index), the operands correspond to the same local? This will never be true, since op1 is a SIMD value and op2 is an int.

In fact, I don't think it will ever execute: neither NI_Vector64_GetElement nor NI_Vector128_GetElement is an RMW intrinsic.

Contributor Author

That's right. I misread the code, thinking that the only way assert(!op2DelayFree); in the previous code would get triggered is if (isRMW == true) && (varNum1 == varNum2), but I didn't realize that it can also be because isRMW == false. I will remove the #ifdef DEBUG block.

Contributor

@echesakov echesakov left a comment

Looks good.

Can you please publish jit-diff results for arm64?

@kunalspathak
Contributor Author

> Can you please publish jit-diff results for arm64?

No diffs, because arm64 was already doing the right thing. I just refactored the code to handle it inside BuildDelayFreeUses().
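
The rule this PR describes (if op1 == op2, do not mark op2 as delayFree, so the register can be reused) can be illustrated with a simplified model. This is only a sketch under stated assumptions: `Operand` and `ShouldDelayFreeOp2` are hypothetical names, not the JIT's real types or the actual body of BuildDelayFreeUses(), and the real code may take additional conditions into account, such as the last-use information visible in the snippet under review.

```cpp
#include <cassert>

// Hypothetical stand-in for a JIT operand; not the real GenTree type.
struct Operand
{
    bool     isLocalVar; // true if the operand is a local variable reference
    unsigned lclNum;     // local variable number, meaningful only if isLocalVar
};

// Simplified decision: should op2 be kept alive (delay-free) past the
// point where the node's target register is written?
bool ShouldDelayFreeOp2(bool isRMW, const Operand& op1, const Operand& op2)
{
    // A non-RMW node does not overwrite op1's register with the result,
    // so op2 needs no special treatment.
    if (!isRMW)
    {
        return false;
    }

    // Both operands refer to the same local: they can share one register,
    // so marking op2 delay-free would only force an extra register.
    if (op1.isLocalVar && op2.isLocalVar && (op1.lclNum == op2.lclNum))
    {
        return false;
    }

    // Otherwise op2 must not be assigned the target register.
    return true;
}
```

In this model, an RMW node whose two operands are the same local does not delay-free op2, while an RMW node with distinct operands still does.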

@echesakov
Contributor

> > Can you please publish jit-diff results for arm64?
>
> No diffs, because arm64 was already doing the right thing. I just refactored the code to handle it inside BuildDelayFreeUses().

Thanks for confirming!

@kunalspathak kunalspathak merged commit cc9fdad into dotnet:main Jun 14, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 14, 2021
Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Development

Successfully merging this pull request may close these issues.

Update lsra to not mark op2 as delay free for certain RMW nodes

5 participants