use YMM registers on x64 for BlkUnroll. #33665
Conversation
PTAL @CarolEidt @dotnet/jit-contrib

Are the loads/stores aligned? If not,

    if (size >= XMM_REGSIZE_BYTES)
    {
    #ifdef FEATURE_SIMD
        bool useYmm = (compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES) && (size >= YMM_REGSIZE_BYTES);
Do we want to tie this to the size of Vector<T> rather than to the underlying ISAs that are available (based on the instructions used)?
Do you mean to replace `compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES` with `compiler->getSIMDSupportLevel() == SIMD_AVX2_Supported`? I do not have a preference here; what is your opinion?
No, to replace it with a compSupports(InstructionSet_AVX2) check so we aren't tied to Vector<T> when we aren't actually using Vector<T>.
The appropriate check might change slightly with #33274, but I imagine this PR would get merged first. CC. @davidwrighton
Got it, will do that before merge, thanks.
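
For reference, a minimal sketch of what the suggested condition could look like in place of the `useYmm` line quoted above (placement and surrounding names are taken from that diff; this is an illustration, not the final code):

```cpp
#ifdef FEATURE_SIMD
    // Sketch: key off actual AVX2 support rather than the Vector<T> register size.
    bool useYmm = compiler->compSupports(InstructionSet_AVX2) && (size >= YMM_REGSIZE_BYTES);
#else
    bool useYmm = false;
#endif // FEATURE_SIMD
```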

    emit->emitIns_R_R(INS_mov_i2xmm, EA_PTRSIZE, srcXmmReg, srcIntReg);
    emit->emitIns_R_R(INS_punpckldq, EA_16BYTE, srcXmmReg, srcXmmReg);
    #ifdef TARGET_X86
    // For x86, we need one more to convert it from 8 bytes to 16 bytes.
This else block could be more efficient as just a shuffle (SSE2) or broadcast (AVX2). You can see the pattern we use for setting all elements of a vector to a value here:
- 32-bit: https://source.dot.net/#System.Private.CoreLib/Vector128.cs,399
- 64-bit: https://source.dot.net/#System.Private.CoreLib/Vector128.cs,434

Likewise, for V256 it can be a shuffle + insert (AVX) or broadcast (AVX2).
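
For context, a self-contained sketch of those two splat patterns using intrinsics rather than the JIT emitter (the function names here are invented for illustration):

```cpp
#include <immintrin.h>
#include <cstdint>

// SSE2: move the GPR value into the low lane, then broadcast lane 0 with a shuffle.
static __m128i Splat128(uint32_t value)
{
    __m128i v = _mm_cvtsi32_si128((int)value); // movd
    return _mm_shuffle_epi32(v, 0x00);         // pshufd xmm, xmm, 0
}

// AVX2: a single broadcast does the whole job.
static __m256i Splat256(uint32_t value)
{
    return _mm256_set1_epi32((int)value);      // typically lowers to vpbroadcastd when AVX2 is available
}
```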
We start with 2 bytes of data after `emit->emitIns_R_R(INS_mov_i2xmm, EA_PTRSIZE, srcXmmReg, srcIntReg);`; if we want to use a broadcast we will need to copy it up to 4 bytes. It could still be profitable; I think it is worth opening an issue from your comment and probably marking it as easy and up-for-grabs.
How do we start with 2 bytes of data? This is `mov_i2xmm`, which is `movd`, which only supports reading 32 bits or 64 bits (also hence the `EA_PTRSIZE`).
Oh yes, we are widening from the 1 byte that initblk accepts to int/long/float size in `fgMorphPromoteLocalInitBlock`.
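
As a small illustration of that widening (a sketch of the arithmetic only; `WidenFillByte` is an invented name, not the actual JIT helper):

```cpp
#include <cstdint>

// Replicate the initblk fill byte into every byte lane of a 32-bit value.
constexpr uint32_t WidenFillByte(uint8_t fill)
{
    return fill * 0x01010101u;
}

static_assert(WidenFillByte(0xAB) == 0xABABABABu, "fill byte replicated into all lanes");
```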
The alignment rules are the same as they were for xmm usages before, and the instructions are the same. As long as it is not slower for a random rcx value, we are good.
I don't know about that.

I think that depends on a lot of factors. The optimization manual has a lot of notes on the penalties for crossing a cache line vs. crossing a page boundary (~150 cycles), and it calls out that it may be more optimal to do two 16-byte operations than a single 32-byte operation.
@adiaaida could you please help us with that?

@DrewScoggins could probably help.

For cases when a 32-byte operation hits the cache/page boundary, one of those two 16-byte operations most likely will hit it too, won't it? (Unless the boundary is between them.)
We currently do not have any PR-level performance testing that is done on actual hardware. The runs that you are seeing are basically just CI for our performance tests, to make sure that changes don't break them. To investigate this issue further, I would go to https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md and follow the instructions on how to run the performance tests locally against a built version of the runtime repo. If you need any help running these, just let me know and I can help wherever you get stuck. Also, after you have checked in, we will run the full battery of tests against the change; I can point you to the report that we generate, and you can look through it to see whether there are any major changes to the benchmarks that we have.
The manual isn't exactly clear on this; it just indicates that it doubles the split rate, etc.
Thanks @DrewScoggins. I will do a local performance run when I have time and a free machine, and post results here.
The same statement is true for two SSE loads if vectors are not 16-byte aligned, basically. So:
I don't believe so, hence the explicit recommendations:
I believe the underlying issue is that a cache line is 64 bytes, and so with YMM you have a cache line split every other access.
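
A small illustration of the arithmetic behind that, assuming a 64-byte cache line (a sketch, not JIT code):

```cpp
#include <cstddef>

// Does an access of `size` bytes starting at `offset` span two cache lines?
constexpr bool CrossesCacheLine(size_t offset, size_t size, size_t line = 64)
{
    return (offset / line) != ((offset + size - 1) / line);
}

static_assert(CrossesCacheLine(48, 32), "one 32-byte store at offset 48 splits across lines");
static_assert(!CrossesCacheLine(48, 16), "the first 16-byte store stays in the first line");
static_assert(!CrossesCacheLine(64, 16), "the second 16-byte store starts exactly on the next line");
```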
@sandreenko - I'm looking forward to seeing performance numbers; it seems that the guidance would indicate that it could be problematic.
Quoting Agner Fog (2011):
I have an old (2012) blog post on the topic: Data alignment for speed: myth or reality? I'm skeptical that switching from ymm to xmm can improve matters. |

    emit->emitIns_R_S(simdMov, EA_ATTR(regSize), tempReg, srcLclNum, srcOffset);
    }
    else
    auto unrollUsingXMM = [&](unsigned regSize, regNumber tempReg) {
I'd prefer to make these actual methods rather than lambdas. To me the lambdas make it more difficult to see what's being modified, and it seems that sometimes the debugger(s) don't support them all that well. That said, doing that here would require some out (pointer) arguments. @BruceForstall do we have any guidance on this for the JIT?
I'm interested in others' thoughts on this @dotnet/jit-contrib. Perhaps it's only me that finds them obfuscating.
Yes, I did not want to make this a method because it would need 7+ arguments, and ~5 of them would be pointers or references.
Right - I realize that there are tradeoffs, but I would probably make them in the other direction. I'd still like to hear from other JIT devs on this question, because I think it will continue to come up.
ping @dotnet/jit-contrib.
Given that this is mainly just used to cover the regular + trailing case, and the trailing case should only ever be one iteration, might it be simpler to just copy the logic for the one trailing case needed?
For just this case I don't think it's worth debating over - but I think this is a good opportunity to discuss these tradeoffs relative to the JIT coding conventions. It may be that most others feel as I do, but it may also be that most JIT devs would prefer the slight obfuscation and possible debug challenges over the admittedly messy approach of passing in a bunch of pointer arguments. As I say, it's a tradeoff and it would be nice to get some consensus on where the balance should lie.
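
To make the tradeoff concrete, a hedged, self-contained sketch (nothing below is actual JIT code): the lambda mutates enclosing locals through its capture, while the standalone function has to take the same state as explicit pointer parameters.

```cpp
#include <cstdio>

// (b)-style helper: every piece of mutated state is an explicit pointer parameter.
static void EmitChunkFn(unsigned regSize, unsigned* pOffset, unsigned* pRemaining)
{
    printf("move %u bytes at offset %u\n", regSize, *pOffset);
    *pOffset += regSize;
    *pRemaining -= regSize;
}

int main()
{
    unsigned offset = 0, remaining = 40;

    // (a) capturing lambda: the updates to offset/remaining are implicit in the capture.
    auto emitChunk = [&](unsigned regSize) {
        printf("move %u bytes at offset %u\n", regSize, offset);
        offset += regSize;
        remaining -= regSize;
    };
    while (remaining >= 16)
        emitChunk(16);
    emitChunk(remaining); // trailing case

    // (b) free function: the mutated values are visible at every call site.
    offset = 0, remaining = 40;
    while (remaining >= 16)
        EmitChunkFn(16, &offset, &remaining);
    EmitChunkFn(remaining, &offset, &remaining); // trailing case
    return 0;
}
```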
This, to my knowledge (and based on past profiling numbers), is only applicable to unaligned loads/stores that don't cross a cache-line or page boundary.
@tannergooding in this comment you quoted the optimization manual, but it seems it doesn't include this green block:
After checking: if you configure things just right, so that the ymm store always spans two cache lines while the two xmm stores avoid crossing a cache line, there is a small benefit to the xmm approach. But as with my Data alignment for speed: myth or reality? post quoted above, these are small differences, on the order of 10%. You are not going to save 50% or anything of the sort.
Blog post with hard numbers and source code (C++): Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores.
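
For concreteness, a hedged sketch of the scenario being measured there (not the blog's actual benchmark code): a 32-byte store that begins 48 bytes into a 64-byte-aligned buffer always spans two cache lines, whereas splitting it into two 16-byte stores keeps each store within a single line.

```cpp
#include <immintrin.h>
#include <cstdint>

// dst is assumed to point 48 bytes past a 64-byte-aligned base address.
void StoreOneYmm(uint8_t* dst, __m256i v)
{
    _mm256_storeu_si256((__m256i*)dst, v);        // bytes 48..79: crosses the line boundary at 64
}

void StoreTwoXmm(uint8_t* dst, __m128i v)
{
    _mm_storeu_si128((__m128i*)dst, v);           // bytes 48..63: stays in the first line
    _mm_storeu_si128((__m128i*)(dst + 16), v);    // bytes 64..79: starts exactly on the next line
}
```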
@EgorBo, looks like I just didn't have the latest copy of the manual; I have the copy from 2018, not the one from Sep 2019. In any case, Skylake is a relatively new baseline, and many of the Azure machine sizes are still on Haswell/Broadwell: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-general.
Right, I wasn't expecting anything crazy, but 10% could still be significant for certain workloads. This also seems to assume that the 16-byte reads/writes will be aligned. It might be beneficial to also see what the numbers are like when none of it is aligned (so you get a cache line split every other load/store for YMM and every 4 loads/stores for XMM). It might also be beneficial to check on a Haswell or Broadwell era processor, as that is a common baseline for Azure and AWS machines.
Thanks everybody for the feedback. I was running performance\benchmarks\micro for a few hours; here are the results from my local machine (Intel Core i7-6700 CPU 3.40GHz (Skylake)), compared by MannWhitney(3ms):

I would be glad to discuss it further and to do more measurements for this change if I had more time.
I'd be fine with this.
Of course, one needs to test on one's actual hardware in one's actual code paths... but I did run these tests informally last night and allude to it in my post. On AMD Rome, the only time the two XMM bested the one YMM is when you use the exact 48-byte offset scenario. That is, the 48-byte offset is designed to make the two XMM approach look as good as possible. It should be reasonably easy to rerun my benchmark anywhere... you just need Linux with perf counter access... and a compiler that does not screw up your code.
Can you help me understand this? I don't see how codegen details can impact inlining decisions.
You might want to create a focused benchmark similar in spirit to the ones we had for the prolog zeroing work (see for example).
Good question: don't we use code size as an inlining metric if the method that we are considering to inline was already compiled? I ran the diffs again, but this time it did not show me the regressions; the improved methods stayed the same. Hm, I have run it a few more times and it shows slightly different results each time.
No, we don't. It creates coupling between the back end of the jit and the front end that makes it hard to work on back end changes. The inliner can extrapolate about the general behavior of the backend (say estimating that this kind of IR typically creates this much native code) but doesn't look at exact sizes.
Just in SPC, right? Did you see this in other assemblies? @erozenfeld has been working hard to try and fix the flakiness, but new nondeterministic modes keep creeping in.
I'll try to repro and fix the non-deterministic PMI behaviour.
I have re-collected the numbers with the updated check:
The non-deterministic PMI behavior is addressed in dotnet/jitutils#255.
I am going to close it for now until I find time for more benchmarking; if somebody wants to finish it, feel free.


PMI SPC diffs:
Total bytes of diff: -7255 (-0.13% of base). The actual diff is better, but we start inlining more, and that shows up as an asm size regression.
Frameworks libraries (I used crossgen without the check for `compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES` to estimate PMI diffs; PMI takes too long): Total bytes of diff: -56175 (-0.17% of base)
Diffs look like a straightforward swap of paired 16-byte SIMD moves for single 32-byte moves.
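
As a rough illustration of the shape of that change, a hedged sketch expressed as emitter calls rather than actual jit-diff output (instruction choices, offsets, and register names are illustrative):

```cpp
// before: a 64-byte copy unrolled as four 16-byte (xmm) load/store pairs
emit->emitIns_R_S(INS_movdqu, EA_16BYTE, tempReg, srcLclNum, 0);
emit->emitIns_S_R(INS_movdqu, EA_16BYTE, tempReg, dstLclNum, 0);
emit->emitIns_R_S(INS_movdqu, EA_16BYTE, tempReg, srcLclNum, 16);
emit->emitIns_S_R(INS_movdqu, EA_16BYTE, tempReg, dstLclNum, 16);
// ... two more 16-byte pairs for offsets 32 and 48 ...

// after: the same copy as two 32-byte (ymm) load/store pairs, encoded as vmovdqu
emit->emitIns_R_S(INS_movdqu, EA_32BYTE, tempReg, srcLclNum, 0);
emit->emitIns_S_R(INS_movdqu, EA_32BYTE, tempReg, dstLclNum, 0);
emit->emitIns_R_S(INS_movdqu, EA_32BYTE, tempReg, srcLclNum, 32);
emit->emitIns_S_R(INS_movdqu, EA_32BYTE, tempReg, dstLclNum, 32);
```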
Also, we are emitting VZEROUPPER everywhere now, so adding YMM usage doesn't add any penalty; that could change with #11496.

Fixes #33617.