
Conversation

@sandreenko
Contributor

@sandreenko sandreenko commented Mar 17, 2020

PMI SPC diffs:
Total bytes of diff: -7255 (-0.13% of base),
the actual diff is better, but we start inlining more and that shows up as an asm size regression.

Details
Top file improvements (bytes):
       -7255 : System.Private.CoreLib.dasm (-0.13% of base)
1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.
Top method regressions (bytes):
          61 (1,220.00% of base) : System.Private.CoreLib.dasm - System.Threading.Thread:GetCurrentProcessorId():int
          54 ( 5.49% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[__Canon][System.__Canon]:Return(System.__Canon[],bool):this
          50 ( 5.52% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[ReadOnlyMemory`1][System.ReadOnlyMemory`1[System.Char]]:Rent(int):System.ReadOnlyMemory`1[System.Char][]:this
          50 ( 4.73% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[ReadOnlyMemory`1][System.ReadOnlyMemory`1[System.Char]]:Return(System.ReadOnlyMemory`1[System.Char][],bool):this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Char][System.Char]:Rent(int):System.Char[]:this
          50 ( 4.90% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Char][System.Char]:Return(System.Char[],bool):this
          50 ( 5.23% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[__Canon][System.__Canon]:Rent(int):System.__Canon[]:this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Byte][System.Byte]:Rent(int):System.Byte[]:this
          50 ( 4.55% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Byte][System.Byte]:Return(System.Byte[],bool):this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Int16][System.Int16]:Rent(int):System.Int16[]:this
          50 ( 4.55% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Int16][System.Int16]:Return(System.Int16[],bool):this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Int32][System.Int32]:Rent(int):System.Int32[]:this
          50 ( 4.55% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Int32][System.Int32]:Return(System.Int32[],bool):this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Double][System.Double]:Rent(int):System.Double[]:this
          50 ( 4.55% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Double][System.Double]:Return(System.Double[],bool):this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Vector`1][System.Numerics.Vector`1[System.Single]]:Rent(int):System.Numerics.Vector`1[System.Single][]:this
          50 ( 4.42% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Vector`1][System.Numerics.Vector`1[System.Single]]:Return(System.Numerics.Vector`1[System.Single][],bool):this
          50 ( 5.81% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Int64][System.Int64]:Rent(int):System.Int64[]:this
          50 ( 4.55% of base) : System.Private.CoreLib.dasm - System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[Int64][System.Int64]:Return(System.Int64[],bool):this
          50 (46.73% of base) : System.Private.CoreLib.dasm - PerCoreLockedStacks[__Canon][System.__Canon]:TryPush(System.__Canon[]):this
Top method improvements (bytes):
       -1164 (-25.59% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HM_S_D(byref,ubyte,byref):bool
       -1152 (-25.50% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HMS_F_D(byref,ubyte,byref):bool
        -190 (-35.98% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:ConditionalSelect(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -147 (-14.53% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_D(byref,ubyte,byref):bool
        -146 (-14.46% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HM(byref,ubyte,byref):bool
        -143 (-14.79% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_DHMSF(byref,ubyte,byref):bool
        -137 (-32.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:ConditionalSelect(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -136 (-31.70% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqualAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-33.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqualAny(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-31.70% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqualAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-33.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqualAny(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -116 (-36.59% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -116 (-36.59% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -84 (-3.24% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:TryParseByFormat(System.ReadOnlySpan`1[Char],System.ReadOnlySpan`1[Char],int,byref):bool
         -83 (-28.82% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:Equals(System.Object):bool:this
         -81 (-4.33% of base) : System.Private.CoreLib.dasm - System.Reflection.CustomAttributeData:.ctor(System.Reflection.RuntimeModule,System.Reflection.MetadataToken,byref):this
         -71 (-30.21% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:GreaterThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -71 (-30.21% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:LessThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -66 (-4.87% of base) : System.Private.CoreLib.dasm - System.DateTimeParse:TryParseExactMultiple(System.ReadOnlySpan`1[Char],System.String[],System.Globalization.DateTimeFormatInfo,int,byref):bool (2 methods)
         -61 (-6.62% of base) : System.Private.CoreLib.dasm - System.Reflection.CustomAttributeData:get_NamedArguments():System.Collections.Generic.IList`1[CustomAttributeNamedArgument]:this

Framework libraries (to estimate the PMI diffs I used crossgen with the compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES check removed; PMI takes too long):
Total bytes of diff: -56175 (-0.17% of base)

Details
Top file improvements (bytes):
      -12924 : Microsoft.CodeAnalysis.CSharp.dasm (-0.62% of base)
       -9681 : System.Private.CoreLib.dasm (-0.25% of base)
       -9012 : System.Linq.Parallel.dasm (-1.54% of base)
       -6304 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.28% of base)
       -2213 : Microsoft.CodeAnalysis.dasm (-0.29% of base)
       -2107 : System.Reflection.Metadata.dasm (-0.65% of base)
       -1897 : System.Private.Xml.dasm (-0.06% of base)
       -1247 : System.Security.Cryptography.Pkcs.dasm (-0.37% of base)
       -1169 : System.Collections.Immutable.dasm (-0.54% of base)
        -968 : System.Data.Common.dasm (-0.09% of base)
        -866 : System.Security.Cryptography.Algorithms.dasm (-0.31% of base)
        -853 : System.Net.Http.dasm (-0.14% of base)
        -782 : System.Text.Json.dasm (-0.18% of base)
        -560 : System.Reflection.MetadataLoadContext.dasm (-0.30% of base)
        -541 : System.Diagnostics.PerformanceCounter.dasm (-0.74% of base)
        -467 : System.Security.Cryptography.Cng.dasm (-0.31% of base)
        -406 : System.Security.Cryptography.X509Certificates.dasm (-0.29% of base)
        -405 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.01% of base)
        -291 : ILCompiler.Reflection.ReadyToRun.dasm (-0.15% of base)
        -277 : System.DirectoryServices.dasm (-0.07% of base)
70 total files with Code Size differences (70 improved, 0 regressed), 161 unchanged.

Diffs look like:
before:

       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+118H], xmm0
       vmovdqu  xmm0, xmmword ptr [rcx+16]
       vmovdqu  xmmword ptr [rsp+128H], xmm0

after:

       vmovdqu  ymm0, ymmword ptr[rcx]
       vmovdqu  ymmword ptr[rsp+118H], ymm0

We are emitting VZEROUPPER everywhere now, so adding YMM usage doesn't add any penalty; that could change with #11496.

Fixes #33617.

@sandreenko sandreenko added enhancement Product code improvement that does NOT require public API changes/additions arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 17, 2020
@sandreenko
Contributor Author

PTAL @CarolEidt @dotnet/jit-contrib

@tannergooding
Member

Are the loads/stores aligned? If not, 12.6 Data Alignment for Intel AVX has a fairly lengthy section detailing the potential perf drawbacks and so it may be worth profiling and including perf numbers as part of this.

if (size >= XMM_REGSIZE_BYTES)
{
#ifdef FEATURE_SIMD
bool useYmm = (compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES) && (size >= YMM_REGSIZE_BYTES);
Member

Do we want to tie this to the size of Vector<T> rather than to the underlying ISAs that are available (based on the instructions used)?

Contributor Author

Do you mean to replace compiler->getSIMDVectorRegisterByteLength() == YMM_REGSIZE_BYTES with compiler->getSIMDSupportLevel() == SIMD_AVX2_Supported? I do not have a preference here; what is your opinion?

Member

No, to replace it with a compSupports(InstructionSet_AVX2) check so we aren't tied to Vector<T> when we aren't actually using Vector<T>.

The appropriate check might change slightly with #33274, but I imagine this PR would get merged first. CC. @davidwrighton
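
For reference, a minimal sketch of what that could look like (assumed shape only; the exact helper names and surrounding code in the JIT may differ):

    #ifdef FEATURE_SIMD
        // Hypothetical: gate YMM unrolling on AVX2 support instead of on the
        // Vector<T> size reported by getSIMDVectorRegisterByteLength().
        bool useYmm = compiler->compSupports(InstructionSet_AVX2) && (size >= YMM_REGSIZE_BYTES);
    #endif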

Contributor Author

Got it, will do that before merge, thanks.

emit->emitIns_R_R(INS_mov_i2xmm, EA_PTRSIZE, srcXmmReg, srcIntReg);
emit->emitIns_R_R(INS_punpckldq, EA_16BYTE, srcXmmReg, srcXmmReg);
#ifdef TARGET_X86
// For x86, we need one more to convert it from 8 bytes to 16 bytes.
Member

This else block could be more efficient as just a shuffle (SSE2) or broadcast (AVX2). You can see the pattern we use for setting all elements of a vector to a value here:

Likewise for V256 it can be a shuffle + insert (AVX) or broadcast (AVX2):

Contributor Author

We start with 2 bytes of data after emit->emitIns_R_R(INS_mov_i2xmm, EA_PTRSIZE, srcXmmReg, srcIntReg); if we want to use a broadcast we will need to widen it to 4 bytes first. It could still be profitable; I think it is worth opening an issue from your comment and probably marking it as easy and up-for-grabs.

Member

How do we start with 2 bytes of data? This is mov_i2xmm, which is movd, which only supports reading 32 bits or 64 bits (hence the EA_PTRSIZE).

Contributor Author

Oh yes, we are widening from the 1 byte that Initblk accepts to int/long/float size in fgMorphPromoteLocalInitBlock.
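
For illustration, a hypothetical emitter sequence for the shuffle/broadcast idea (assumed instruction names and emitter signatures, not the code in this PR): once the fill value has been widened to a 32-bit pattern in srcIntReg, it could be replicated across the whole register instead of going through punpckldq:

    // Hypothetical sketch only.
    emit->emitIns_R_R(INS_mov_i2xmm, EA_PTRSIZE, srcXmmReg, srcIntReg);
    if (compiler->compSupports(InstructionSet_AVX2))
    {
        // Replicate the low dword across all lanes of the 32-byte register.
        emit->emitIns_R_R(INS_vpbroadcastd, EA_32BYTE, srcXmmReg, srcXmmReg);
    }
    else
    {
        // SSE2: shuffle the low dword into all four dword lanes.
        emit->emitIns_R_R_I(INS_pshufd, EA_16BYTE, srcXmmReg, srcXmmReg, 0);
    }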

@sandreenko
Contributor Author

Are the loads/stores aligned? If not, 12.6 Data Alignment for Intel AVX has a fairly lengthy section detailing the potential perf drawbacks and so it may be worth profiling and including perf numbers as part of this.

The alignment rules are the same as they were for the xmm usage before, and the instructions are the same.
So as long as

  vmovdqu  ymm0, ymmword ptr[rcx]
  vmovdqu  ymmword ptr[rsp+118H], ymm0

is faster than

       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+118H], xmm0
       vmovdqu  xmm0, xmmword ptr [rcx+16]
       vmovdqu  xmmword ptr [rsp+128H], xmm0

for a random rcx value we are good.
In most cases rcx will point to a local struct so it could be aligned, but it is not guaranteed.

I see that we have a Performance Windows_NT x64 release netcoreapp5.0 job as part of PR testing; do you know if we can get performance numbers from it?

@tannergooding
Member

I see that we have a Performance Windows_NT x64 release netcoreapp5.0 job as part of PR testing; do you know if we can get performance numbers from it?

I don't know about that

for a random rcx value we are good.

I think that depends on a lot of factors. The optimization manual has a lot of notes on the penalties for crossing a cache line vs crossing a page boundary (~150 cycles) and calls out that it may be better to do two 16-byte operations than a single 32-byte operation.

@sandreenko
Contributor Author

I see that we have a Performance Windows_NT x64 release netcoreapp5.0 job as part of PR testing; do you know if we can get performance numbers from it?

I don't know about that

@adiaaida could you please help us with that?

@michellemcdaniel
Contributor

@DrewScoggins could probably help

@EgorBo
Member

EgorBo commented Mar 17, 2020

I think that depends on a lot of factors. The optimization manual has a lot of notes on the penalties for crossing a cache line vs crossing a page boundary (~150 cycles) and calls out that it may be better to do two 16-byte operations than a single 32-byte operation.

For cases when a 32-byte operation crosses the cache/page boundary, one of those two 16-byte operations most likely will cross it too, won't it? (unless the boundary is between them)

@DrewScoggins
Member

DrewScoggins commented Mar 17, 2020

We currently do not have any PR level performance testing that is done on actual hardware. The runs that you are seeing are basically just CI for our performance tests to make sure that changes don't break our perf tests.

To investigate this issue further I would go here, https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md, and follow the instructions on how to run the performance tests locally against a built version of the runtime repo. If you need any help running these, just let me know and I can help wherever you get stuck.

Also after you have checked in we will run the full battery of tests against the change and I can point you to the report that we generate and you can look through that and see if we are seeing any major changes to the benchmarks that we have.

@tannergooding
Member

For cases when a 32-byte operation crosses the cache/page boundary, one of those two 16-byte operations most likely will cross it too, won't it? (unless the boundary is between them)

The manual isn't exactly clear on this; it just indicates that it doubles the split rate.

When using Intel AVX with unaligned 32-byte vectors, every second load is a cache-line split, since the cache-line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. Even though split line access penalties were reduced significantly since Intel microarchitecture code name Nehalem, a high cache-line split rate in memory-intensive code may cause performance degradation.

Assembly/Compiler Coding Rule 72. (H impact, M generality) Align data to 32-byte boundary
when possible. Prefer store alignment over load alignment.

For best results use Intel AVX 32-byte loads and align data to 32-bytes. However, there are cases where
you cannot align the data, or data alignment is unknown. This can happen when you are writing a library
function and the input data alignment is unknown. In these cases, using 16-Byte memory accesses may
be the best alternative. The following method uses 16-byte loads while still benefiting from the 32-byte
YMM registers.

Consider replacing unaligned 32-byte memory accesses using a combination of VMOVUPS,
VINSERTF128, and VEXTRACTF128 instructions.

etc

@sandreenko
Contributor Author

Thanks @DrewScoggins.

I will do a local performance run when I have time and a free machine and post the results here.
Also, note that we are already using YMM registers for STORE_LCL_VAR (see #33617), so it is not something new.

@EgorBo
Member

EgorBo commented Mar 17, 2020

When using Intel AVX with unaligned 32-byte vectors, every second load is a cache-line split, since the cache-line is 64 bytes.

Basically the same statement is true for two SSE loads if the data is not 16-byte aligned.

So:
if the data is 32-byte aligned -- AVX is faster;
if the data is 16-byte aligned -- SSE is faster;
in all other cases both AVX and 2*SSE will receive a penalty for crossing the boundaries, and the AVX case will be slightly faster, won't it?

@tannergooding
Member

in all other cases both AVX and 2*SSE will receive a penalty for crossing the boundaries, and the AVX case will be slightly faster, won't it?

I don't believe so, hence the explicit recommendations:

However, there are cases where you cannot align the data, or data alignment is unknown. This can happen when you are writing a library function and the input data alignment is unknown. In these cases, using 16-Byte memory accesses may be the best alternative. The following method uses 16-byte loads while still benefiting from the 32-byte YMM registers.

I believe the underlying issue is that a cache line is 64 bytes, so with YMM you get a cache-line split every other access, while with XMM you get one only every 4 accesses. I also believe it is simpler for the CPU to handle the 16-byte split than the 32-byte split.
But it would be good to get one of our contacts from Intel to confirm, if possible, and to check whether the notes in the architecture manual are still relevant to Haswell and newer processors.
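
To make the split-rate arithmetic concrete, a tiny illustrative C++ helper (not from the JIT or the manual) that checks whether a single access straddles a 64-byte cache line:

    #include <cstddef>
    #include <cstdint>

    // True when an access of `size` bytes starting at `addr` crosses a 64-byte
    // cache-line boundary. For a stream of back-to-back unaligned accesses this
    // holds for every other 32-byte access but only for every fourth 16-byte
    // access, which is the asymmetry described above.
    bool crossesCacheLine(uintptr_t addr, size_t size)
    {
        const uintptr_t kLineSize = 64;
        return (addr / kLineSize) != ((addr + size - 1) / kLineSize);
    }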

@CarolEidt
Contributor

@sandreenko - I'm looking forward to seeing performance numbers; it seems that the guidance would indicate that it could be problematic.

@lemire

lemire commented Mar 17, 2020

Quoting Agner Fog (2011):

On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.

I have an old (2012) blog post on the topic: Data alignment for speed: myth or reality?

I'm skeptical that switching from ymm to xmm can improve matters.

emit->emitIns_R_S(simdMov, EA_ATTR(regSize), tempReg, srcLclNum, srcOffset);
}
else
auto unrollUsingXMM = [&](unsigned regSize, regNumber tempReg) {
Contributor

I'd prefer to make these actual methods rather than lambdas. To me the lambdas make it more difficult to see what's being modified, and it seems that sometimes the debugger(s) don't support them all that well. That said, doing that here would require some out (pointer) arguments. @BruceForstall do we have any guidance on this for the JIT?
I'm interested in others' thoughts on this @dotnet/jit-contrib. Perhaps it's only me that finds them obfuscating.

Contributor Author

Yes, I did not want to make this a method because it would need 7+ arguments and ~5 of them would be pointers or references.

Contributor

Right - I realize that there are tradeoffs, but I would probably make them in the other direction. I'd still like to hear from other JIT devs on this question, because I think it will continue to come up.
ping @dotnet/jit-contrib.

Member

Given that this is mainly just used to cover the regular + trailing case, and the trailing case should only ever be one iteration, might it be simpler to just copy the logic for the one trailing case needed?

Contributor

For just this case I don't think it's worth debating over - but I think this is a good opportunity to discuss these tradeoffs relative to the JIT coding conventions. It may be that most others feel as I do, but it may also be that most JIT devs would prefer the slight obfuscation and possible debug challenges over the admittedly messy approach of passing in a bunch of pointer arguments. As I say, it's a tradeoff and it would be nice to get some consensus on where the balance should lie.
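
As a neutral illustration of the tradeoff being discussed (not JIT code), the same unrolling step written both ways: the lambda captures the running offset by reference, while the free helper has to take it through a pointer parameter:

    #include <cstdio>

    // Helper-method shape: mutable state is passed explicitly.
    static void emitChunk(unsigned chunkSize, unsigned* offset)
    {
        std::printf("copy %u bytes at offset %u\n", chunkSize, *offset);
        *offset += chunkSize;
    }

    void emitCopies(unsigned totalSize)
    {
        unsigned offset = 0;

        // Lambda shape: mutable state is captured implicitly by reference.
        auto emitChunkLambda = [&](unsigned chunkSize) {
            std::printf("copy %u bytes at offset %u\n", chunkSize, offset);
            offset += chunkSize;
        };

        while (totalSize - offset >= 32)
            emitChunkLambda(32);    // main loop via the lambda
        if (totalSize - offset >= 16)
            emitChunk(16, &offset); // trailing case via the helper
    }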

@tannergooding
Member

On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands

This, to my knowledge (and based on past profiling numbers), only applies to unaligned loads/stores that don't cross a cache-line or page boundary.
Both Intel and AMD have multiple examples and metrics showing the perf penalties when you do cross one of these boundaries.
They have also spent effort reducing these penalties in newer generations, hence asking for perf numbers showing the difference if possible.

@EgorBo
Member

EgorBo commented Mar 17, 2020

@tannergooding in this comment you quoted the optimization manual but it seems it doesn't include this green block:

(screenshot of the manual excerpt)

@lemire

lemire commented Mar 18, 2020

I'm skeptical that switching from ymm to xmm can improve matters.

After checking: if you configure things just right, so that the ymm store always spans two cache lines while the two xmm stores avoid crossing a cache line, there is a small benefit to the xmm approach.

But as with my Data alignment for speed: myth or reality? post quoted above, these are small differences of the order of 10%. You are not going to save 50% or anything of the sort.

@lemire

lemire commented Mar 18, 2020

Blog post with hard numbers and source code (C++):

Avoiding cache line overlap by replacing one 256-bit store with two 128-bit stores

Screenshot: (benchmark results image omitted)
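
For anyone who wants to try something similar locally, a rough C++ sketch of the two store shapes under comparison (illustrative only, not lemire's actual benchmark; requires AVX2 for the extract):

    #include <immintrin.h>
    #include <cstdint>

    // One unaligned 256-bit store. With 64-byte cache lines, a destination
    // offset of 48 makes this store straddle a line boundary.
    void store_one_ymm(uint8_t* dst, __m256i v)
    {
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), v);
    }

    // The same 32 bytes as two 128-bit stores; at offset 48 each half stays
    // within a single cache line.
    void store_two_xmm(uint8_t* dst, __m256i v)
    {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm256_castsi256_si128(v));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 16), _mm256_extracti128_si256(v, 1));
    }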

@tannergooding
Member

you quoted the optimization manual but it seems it doesn't include this green block:

@EgorBo, looks like I just didn't have the latest copy of the manual. I have the copy from 2018, not the copy from Sep 2019.

In any case, Skylake is a relatively newer baseline and many of the Azure machine sizes are still on Haswell/Broadwell: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-general.

But as with my Data alignment for speed: myth or reality? post quoted above, these are small differences of the order of 10%. You are not going to save 50% or anything of the sort.

Right, I wasn't expecting anything crazy, but 10% could still be significant for certain workloads. This also seems to assume that the 16-byte reads/writes will be aligned. It might be beneficial to also see what the numbers are like when none of it is aligned (so you get a cache line split every other load/store for YMM and every 4 loads/stores for XMM). It might also be beneficial to check on a Haswell or Broadwell era processor as that is a common baseline for Azure and AWS machines.

@sandreenko
Contributor Author

Thanks everybody for the feedback.

I was running performance\benchmarks\micro for a few hours; the results on my local machine (Intel Core i7-6700 CPU 3.40GHz (Skylake)), by MannWhitney(3ms), are:
223 Same;
17 Faster;
13 Slower;
but none of them stayed the same on reruns, so I would say they are all within the noise.

I would be glad to discuss it further and to do more measurements for this change if I had more time.
I want to merge this change because it fixes diffs between STORE_LCL_VAR and STORE_BLK codegen (#33617) and more such diffs are coming with my future changes. However, I could ignore these diffs if people think that the change is dangerous or needs more analysis. Also, we can always open a follow-up issue or revert it if the perf report shows diffs.

@tannergooding
Member

Also, we can always open a follow-up issue or revert it if the perf report shows diffs.

I'd be fine with this.

@lemire

lemire commented Mar 18, 2020

It might be beneficial to also see what the numbers are like when none of it is aligned (so you get a cache line split every other load/store for YMM and every 4 loads/stores for XMM). It might also be beneficial to check on a Haswell or Broadwell era processor as that is a common baseline for Azure and AWS machines.

Of course, one needs to test on one's actual hardware in one's actual code paths... but I did run these tests informally last night and alluded to it in my post. On AMD Rome, the only time the two XMM bested the one YMM is when you hit the exact 48-byte offset scenario. That is, the 48-byte offset is designed to make the two XMM approach look as good as possible.

It should be reasonably easy to rerun my benchmark anywhere... you just need Linux with perf counter access... and a compiler that does not screw up your code.

@AndyAyersMS
Member

the actual diff is better, but we start inlining more and that shows up as an asm size regression.

Can you help me understand this? I don't see how codegen details can impact inlining decisions.

I would be glad to discuss it further and to do more measurements

You might want to create a focused benchmark similar in spirit to the ones we had for the prolog zeroing work (see for example).

@sandreenko
Contributor Author

the actual diff is better, but we start inlining more and that shows up as an asm size regression.

Can you help me understand this? I don't see how codegen details can impact inlining decisions.

Good question: don't we use code size as an inlining metric if the method that we are considering inlining was already compiled?

I saw that System.Threading.Thread:GetCurrentProcessorNumber():int was always inlined with the change, so methods that used it became bigger.

I ran the diffs again, but this time they did not show the regressions; the improved methods stayed the same:

PMI CodeSize Diffs for System.Private.CoreLib.dll for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -10873 (-0.20% of base)
    diff is an improvement.
370 total methods with Code Size differences (370 improved, 0 regressed), 33995 unchanged.
Details
Top file improvements (bytes):
      -10873 : System.Private.CoreLib.dasm (-0.20% of base)
1 total files with Code Size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements (bytes):
       -1164 (-25.59% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HM_S_D(byref,ubyte,byref):bool
       -1152 (-25.50% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HMS_F_D(byref,ubyte,byref):bool
        -190 (-35.98% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:ConditionalSelect(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -147 (-14.53% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_D(byref,ubyte,byref):bool
        -146 (-14.46% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_HM(byref,ubyte,byref):bool
        -143 (-14.79% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:ProcessTerminal_DHMSF(byref,ubyte,byref):bool
        -137 (-32.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:ConditionalSelect(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -136 (-31.70% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqualAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-33.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqualAny(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-31.70% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqualAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-33.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqualAny(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -116 (-36.59% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -116 (-36.59% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -84 (-3.24% of base) : System.Private.CoreLib.dasm - System.Globalization.TimeSpanParse:TryParseByFormat(System.ReadOnlySpan`1[Char],System.ReadOnlySpan`1[Char],int,byref):bool
         -83 (-28.82% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:Equals(System.Object):bool:this
         -81 (-4.33% of base) : System.Private.CoreLib.dasm - System.Reflection.CustomAttributeData:.ctor(System.Reflection.RuntimeModule,System.Reflection.MetadataToken,byref):this
         -71 (-30.21% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:GreaterThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -71 (-30.21% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:LessThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -66 (-4.87% of base) : System.Private.CoreLib.dasm - System.DateTimeParse:TryParseExactMultiple(System.ReadOnlySpan`1[Char],System.String[],System.Globalization.DateTimeFormatInfo,int,byref):bool (2 methods)
         -61 (-92.42% of base) : System.Private.CoreLib.dasm - System.Threading.Thread:GetCurrentProcessorId():int
Top method improvements (percentages):
         -61 (-92.42% of base) : System.Private.CoreLib.dasm - System.Threading.Thread:GetCurrentProcessorId():int
         -10 (-45.45% of base) : System.Private.CoreLib.dasm - System.Runtime.Intrinsics.Vector256DebugView`1[Vector`1][System.Numerics.Vector`1[System.Single]]:.ctor(System.Runtime.Intrinsics.Vector256`1[Vector`1]):this
        -116 (-36.59% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -116 (-36.59% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
        -190 (-35.98% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:ConditionalSelect(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -34 (-34.00% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:EqualsAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
         -22 (-33.33% of base) : System.Private.CoreLib.dasm - System.Numerics.Matrix4x4:Equals(System.Numerics.Matrix4x4):bool:this
        -136 (-33.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqualAny(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-33.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqualAny(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
         -22 (-32.35% of base) : System.Private.CoreLib.dasm - System.Runtime.Intrinsics.Vector256:<WithLower>g__SoftwareFallback|67_0(System.Runtime.Intrinsics.Vector256`1[Vector`1],System.Runtime.Intrinsics.Vector128`1[Vector`1]):System.Runtime.Intrinsics.Vector256`1[Vector`1]
         -10 (-32.26% of base) : System.Private.CoreLib.dasm - System.Runtime.InteropServices.WindowsRuntime.CLRIKeyValuePairImpl`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:.ctor(byref):this
        -137 (-32.01% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:ConditionalSelect(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -50 (-31.85% of base) : System.Private.CoreLib.dasm - PerCoreLockedStacks[__Canon][System.__Canon]:TryPush(System.__Canon[]):this
         -34 (-31.78% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:op_Inequality(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-31.70% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:LessThanOrEqualAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
        -136 (-31.70% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:GreaterThanOrEqualAll(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):bool
         -10 (-31.25% of base) : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.DataCollector:Disable():this
         -22 (-30.56% of base) : System.Private.CoreLib.dasm - System.Runtime.Intrinsics.Vector256:<WithUpper>g__SoftwareFallback|69_0(System.Runtime.Intrinsics.Vector256`1[Vector`1],System.Runtime.Intrinsics.Vector128`1[Vector`1]):System.Runtime.Intrinsics.Vector256`1[Vector`1]
         -41 (-30.37% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector:Negate(System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]
         -71 (-30.21% of base) : System.Private.CoreLib.dasm - System.Numerics.Vector`1[Vector`1][System.Numerics.Vector`1[System.Single]]:GreaterThanOrEqual(System.Numerics.Vector`1[Vector`1],System.Numerics.Vector`1[Vector`1]):System.Numerics.Vector`1[Vector`1]

Hm, I have run it a few more times and it shows slightly different results each time. Sometimes it shows that GetCurrentProcessorId is improved:
-61 (-92.42% of base) : System.Private.CoreLib.dasm - System.Threading.Thread:GetCurrentProcessorId():int, and sometimes it is not in the list of changed methods at all; probably a case of flaky PMI behavior.

@AndyAyersMS
Member

don't we use code size as an inlining metric if the method that we are considering inlining was already compiled?

No, we don't.

It creates coupling between the back end of the jit and the front end that makes it hard to work on back end changes. The inliner can extrapolate about the general behavior of the backend (say estimating that this kind of IR typically creates this much native code) but doesn't look at exact sizes.

a case of flaky PMI behavior.

Just in SPC, right? Did you see this in other assemblies?

@erozenfeld has been working hard to try and fix the flakiness, but new nondeterministic modes keep creeping in.

@erozenfeld
Member

I'll try to repro and fix the non-deterministic pmi behaviour.

@sandreenko
Contributor Author

I have re-collected the numbers with --f and a release CoreRoot, and got these numbers on 3 runs (before, I was using a checked CoreRoot from the base folder):

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -140964 (-0.30% of base)
3111 total methods with Code Size differences (3111 improved, 0 regressed), 236781 unchanged.

@erozenfeld
Member

The non-deterministic pmi behavior is addressed in dotnet/jitutils#255.

@sandreenko
Contributor Author

I am going to close this for now until I find time for more benchmarking; if somebody wants to finish it, feel free.

@sandreenko sandreenko closed this Mar 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Development

Successfully merging this pull request may close these issues:

Unoptimal codegen for Vector<T> init if Vector is not a localVar.