Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Enable block init unroll on ARM32#27450

Merged
sandreenko merged 2 commits into
dotnet:masterfrom
mikedn:arm-init-unroll
Oct 30, 2019
Merged

Enable block init unroll on ARM32#27450
sandreenko merged 2 commits into
dotnet:masterfrom
mikedn:arm-init-unroll

Conversation

@mikedn
Copy link
Copy Markdown

@mikedn mikedn commented Oct 25, 2019

Extracted from #21711

On ARM32 unroll was already done for block copy but not for block init. This was the only target that didn't unroll block init.

I've lowered the existing init unroll limit from 32 to 16, I'm not familiar with ARM perf characteristic but the limit seemed quite high, considering that the current implementation uses only 32 bit stores. Other targets have 64 and even 128 unroll limits but they also use 128 bit stores.

16 also looks like a reasonable code size trade off - with the 32 byte limit the diff ends up 5 kbytes in red but with 16 it is a 15 kbytes improvement.

@mikedn
Copy link
Copy Markdown
Author

mikedn commented Oct 25, 2019

Diff summary (x86 crossgen + arm32 altjit):

Total bytes of diff: -15036 (-0.02% of base)
    diff is an improvement.
Top file regressions by size (bytes):
        1848 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.02% of base)
        1656 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.02% of base)
        1376 : System.Text.RegularExpressions.dasm (0.23% of base)
        1020 : Microsoft.CodeAnalysis.CSharp.dasm (0.01% of base)
         412 : Microsoft.DotNet.Cli.Utils.dasm (0.33% of base)
Top file improvements by size (bytes):
       -5328 : System.Private.Xml.dasm (-0.07% of base)
       -4512 : System.Private.CoreLib.dasm (-0.06% of base)
       -2940 : System.Net.Http.dasm (-0.20% of base)
       -2296 : Microsoft.CodeAnalysis.dasm (-0.07% of base)
       -1640 : Newtonsoft.Json.dasm (-0.11% of base)
63 total files with size differences (43 improved, 20 regressed), 66 unchanged.
Top method regressions by size (bytes):
        1312 ( 2.10% of base) : System.Text.RegularExpressions.dasm - RegexCharClass:.cctor() (4 methods)
         484 ( 1.80% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ApplicationServerTraceEventParser:.cctor() (2 methods)
         160 ( 1.87% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - JScriptTraceEventParser:.cctor() (2 methods)
         140 ( 6.24% of base) : System.Security.AccessControl.dasm - ObjectAce:ParseBinaryForm(ref,int,byref,byref,byref,byref,byref,byref,byref,byref):bool (2 methods)
         120 ( 3.64% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - VisualBasicDataFlowAnalysis:AnalyzeReadWrite():this (4 methods)
Top method improvements by size (bytes):
        -752 (-0.39% of base) : Microsoft.CodeAnalysis.dasm - DesktopAssemblyIdentityComparer:.cctor() (4 methods)
        -736 (-2.17% of base) : System.Net.Http.dasm - Huffman:.cctor() (2 methods)
        -344 (-0.71% of base) : System.Net.Http.dasm - <SendAsyncCore>d__53:MoveNext():this (4 methods)
        -336 (-2.75% of base) : System.Private.CoreLib.dasm - OpCodes:.cctor() (2 methods)
        -268 (-3.07% of base) : System.Private.CoreLib.dasm - Utf8Parser:TryParse(struct,byref,byref,ushort):bool (32 methods)
Top method regressions by size (percentage):
          72 (29.51% of base) : Microsoft.DotNet.Cli.Utils.dasm - <>c:<IsEnd>b__3_0(struct):struct:this (2 methods)
          68 (22.37% of base) : System.Private.Xml.dasm - XsdDateTime:.ctor(struct):this (4 methods)
          24 (21.43% of base) : System.Private.CoreLib.dasm - ValueTask:ConfigureAwait(bool):struct:this (2 methods)
          72 (19.15% of base) : Microsoft.DotNet.Cli.Utils.dasm - Cursor:Advance(ref,int):struct:this (4 methods)
          60 (18.99% of base) : Microsoft.DotNet.Cli.Utils.dasm - <>c:<Ch>b__2_0(struct):struct:this (2 methods)
Top method improvements by size (percentage):
         -40 (-41.67% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - MemberModifiers:.ctor(int,int):this (4 methods)
         -60 (-33.33% of base) : System.Drawing.Primitives.dasm - RectangleF:FromLTRB(float,float,float,float):struct (2 methods)
         -16 (-30.77% of base) : System.Security.Cryptography.Cng.dasm - CngAlgorithmCore:.ctor(ref):this (2 methods)
         -16 (-28.57% of base) : Microsoft.CodeAnalysis.dasm - ConstantValueDefault:get_DateTimeValue():struct:this (4 methods)
         -16 (-26.67% of base) : Microsoft.DotNet.Cli.Utils.dasm - Chain`2:.ctor(ushort,ref):this (2 methods)
2676 total methods with size differences (1611 improved, 1065 regressed), 146406 unchanged.

@mikedn
Copy link
Copy Markdown
Author

mikedn commented Oct 25, 2019

Diff should be better if local destination address containment is enabled (it is enabled for block copy but not for block init) but I'll do that in a separate PR because it affects all targets.

#ifdef _TARGET_ARM64_
attr = EA_1BYTE;
#else
attr = EA_4BYTE;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we have that code here and in genCodeForCpBlkUnroll.
And it takes me time to recall why it is like this. Maybe it can be a separate method with an explanation?

Also, unrelated to this change question, now it is unclear for me why do other places where we call emitActualTypeSize(short) on arm32 work?
For example, inst_SETCC could have a short or byte type, in this case it will call GetEmitter()->emitIns_R_I(INS_mov, emitActualTypeSize(type), dstReg, 0); with emitActualTypeSize(type) == 1 or 2, that should fail with an assert in the emmitter, why doesn't it happen?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it takes me time to recall why it is like this. Maybe it can be a separate method with an explanation?

Well, the proper fix would be to change the ARM64 emitter to not demand EA_1BYTE/EA_2BYTE. Actually I need to doublecheck this because I remember I did a change like this a while ago, maybe I missed a case.

with emitActualTypeSize(type) == 1 or 2, that should fail with an assert in the emmitter, why doesn't it happen?

Not sure I understand, emitActualTypeSize can't be 1 or 2, the "actual" types start with TYP_INT so it's at least 4. In fact, the "actual type" IR model matches well the ARM ISA, which does not have small int type registers nor small int type operations.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratching head - if I remove the ifdefs it works fine with EA_4BYTE on ARM64. It is possible that I wrote this code before my PR that fixed the ARM64 emitter inconsistency was merged. Oh well, I'll check again tonight when I get back from work.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand, emitActualTypeSize can't be 1 or 2, the "actual" types start with TYP_INT so it's at least 4. In fact, the "actual type" IR model matches well the ARM ISA, which does not have small int type registers nor small int type operations.

Makes sense, so there are no any other places where we could call `emit->emitIns_smth(*, EA_2BYTE, *). Thanks.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it turns out that I managed to confuse myself. The inconsistency that I fixed was actually in the ARM32 emitter, it insisted that ldrb/ldrh/strb/strh use EA_1BYTE/EA_2BYTE. That was changed in #20126 so it now requires EA_4BYTE.

The ARM64 never required this and it only cares about the size being EA_8BYTE or something else:

coreclr/src/jit/emitarm64.cpp

Lines 7857 to 7879 in 4795f24

/*static*/ emitter::code_t emitter::insEncodeDatasizeLS(emitter::code_t code, emitAttr size)
{
bool exclusive = ((code & 0x35000000) == 0);
bool atomic = ((code & 0x31200C00) == 0x30200000);
if ((code & 0x00800000) && !exclusive && !atomic) // Is this a sign-extending opcode? (i.e. ldrsw, ldrsh, ldrsb)
{
if ((code & 0x80000000) == 0) // Is it a ldrsh or ldrsb and not ldrsw ?
{
if (EA_SIZE(size) != EA_8BYTE) // Do we need to encode the 32-bit Rt size bit?
{
return 0x00400000; // set the bit at location 22
}
}
}
else if (code & 0x80000000) // Is this a ldr/str/ldur/stur opcode?
{
if (EA_SIZE(size) == EA_8BYTE) // Do we need to encode the 64-bit size bit?
{
return 0x40000000; // set the bit at location 30
}
}
return 0;

So the ifdefs are actually useless. They were copied from the copy block code and the copy block code basically preserved them from the pre-cleanup code, when we had a huge ifdef for the whole copy block loop:

#ifdef _TARGET_ARM64_

I'll remove these ifdefs in the next PR.

Copy link
Copy Markdown

@sandreenko sandreenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you.

@sandreenko sandreenko requested a review from a team October 29, 2019 22:28
@sandreenko sandreenko merged commit 41546b0 into dotnet:master Oct 30, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants