Don't retype struct as primitive types in import. #33225

sandreenko · 2020-03-05T10:44:25Z

This change adds COMPlus_JitNoStructRetyping, which prevents struct retyping for the call struct and return struct cases. Currently this retyping happens in the importer; we want to move it to lowering.
The current retyping prevents later phases from optimizing these values. For example, it affects the code for inlined methods:

struct nativeSizeStruct
{
  int a;
  int b;
}
nativeSizeStruct foo();

If we inline foo(), we won't be able to promote or enregister the fields of nativeSizeStruct, because the importer accesses it as LCL_FLD long and sets doNotEnreg in impFixupCallStructReturn.

Why can't we do the retyping right after inlining? Because other optimizations also need to know the real types. For example, CSE and VN can't propagate through a sequence like this:

ASG(nativeSizeStruct A, nativeSizeStruct B);
return LCL_FLD long A;

but they can if we don't retype.
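
To make the retyping concrete: a struct of two 32-bit ints occupies the same 8 bytes as a long, so it can be viewed either way. The following is a minimal C++ sketch of those two views, for illustration only and not JIT code; the names `NativeSizeStruct`, `asLong`, and `fromLong` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for nativeSizeStruct: two 32-bit fields, 8 bytes total.
struct NativeSizeStruct
{
    int32_t a;
    int32_t b;
};

static_assert(sizeof(NativeSizeStruct) == sizeof(int64_t), "struct must be exactly 8 bytes");

// The "retyped" view: the whole struct as one opaque 64-bit value.
// Code that only sees this value no longer sees the individual fields.
inline int64_t asLong(const NativeSizeStruct& s)
{
    int64_t bits;
    std::memcpy(&bits, &s, sizeof bits);
    return bits;
}

// The inverse reinterpretation: recover the fields from the 64-bit value.
inline NativeSizeStruct fromLong(int64_t bits)
{
    NativeSizeStruct s;
    std::memcpy(&s, &bits, sizeof s);
    return s;
}
```

Both views carry the same bits, but only the struct view exposes `a` and `b` as separate values that an optimizer could track independently.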

Notes:

Note1: right after inlining is the right time to do retyping for methods that use a return buffer; in this PR that retyping is left in the importer.

Note2: this change doesn't fix retyping in impNormStructVal. For cases like
methodWithStructArgs(foo(), foo()) we currently always create a local var for each struct argument and retype it as a primitive type. That will be fixed in a separate PR.

Note3: this change doesn't fix retyping for limited struct promotion, for example for a struct like:
struct PromotedStruct
{
StructWithOneField a; <- we will change the type of this field from struct to int, because we don't have recursive struct promotion.
int b;
}

struct StructWithOneField
{
int a;
}
but this change gets us closer to it.

So lowering is the phase in which we should retype struct types into ABI-specific types. See details in https://github.com/dotnet/runtime/blob/master/docs/design/features/first-class-structs.md and #1231.

The PR is written to produce no diffs if COMPlus_JitNoStructRetyping=0, which helps catch unwanted side effects and makes merging safer. I have checked that there are no diffs on all Windows platforms using altjit and framework assemblies. The change with the flag enabled was tested on SPMI collections, crossgen of framework libraries, and in a pri1 test run.

The overall design is simple: keep structs as struct until lowering, then do retyping for returns and calls, but insert BITCAST back to struct types to keep IR correct. Then teach the next phases (lsra, codegen) to work with the new struct nodes.

The new kinds of struct nodes force us to have a struct handle on all trees (except the right side of ASG), so gtGetStructHandleIfPresent becomes more important.

Some initial diffs:
for the benchmark that motivated this change:

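| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ScalarFloatSinglethreadADT | \Core_Root_base\CoreRun.exe | 4.657 s | 0.0317 s | 0.0296 s | 4.650 s | 4.618 s | 4.710 s | 1.00 | Base | - | - | - | 272 B |
| ScalarFloatSinglethreadADT | \Core_Root_diff\CoreRun.exe | 1.157 s | 0.0083 s | 0.0078 s | 1.157 s | 1.145 s | 1.173 s | 0.25 | Faster | - | - | - | 272 B |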

Some other benchmarks also improve, though less significantly:

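| Method | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RegexRedux_1 | \Core_Root_base\CoreRun.exe | 603.4 ms | 96.37 ms | 110.97 ms | 612.5 ms | 424.6 ms | 762.8 ms | 1.00 | Base | 0.00 | - | - | - | 2.83 MB |
| RegexRedux_1 | \Core_Root_diff\CoreRun.exe | 431.6 ms | 7.89 ms | 6.99 ms | 432.3 ms | 418.2 ms | 443.7 ms | 0.76 | Faster | 0.17 | - | - | - | 2.83 MB |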

code size changes for my small StructABI\structreturn.cs test, improvements happen when we inline the constructor method:

Overall, right now it is a regression; I will start fixing the regressions in the next change. I may push the fixes to this PR, or merge this PR with the flag disabled and fix the regressions in the next one.

sandreenko · 2020-03-06T04:58:15Z

PTAL @CarolEidt @dotnet/jit-contrib, I think this is ready for the first round of review.

CarolEidt

Overall looks good.
It's awesome to remove the retyping from the front-end, but I think that in Lowering when we have something that's returned in a single register, we should go ahead and retype. I think it would simplify things and would not lose information that the backend would need.

src/coreclr/src/jit/compiler.hpp

src/coreclr/src/jit/jitconfigvalues.h

src/coreclr/src/jit/compiler.h

src/coreclr/src/jit/importer.cpp

CarolEidt · 2020-03-06T22:53:34Z

src/coreclr/src/jit/lower.cpp

I would like to see us, perhaps in future, use BITCAST only where we actually require that the bits get moved to a different register file. We should be able to handle those cases much like the way multireg returns are handled. In fact, it's not clear why we need separate handling for that.

src/coreclr/src/jit/lsraxarch.cpp

src/coreclr/src/jit/lower.cpp

src/coreclr/src/jit/lowerxarch.cpp

sandreenko · 2020-04-06T07:25:05Z

I think this is ready for the second round, maybe except codegenxarch.cpp.

No diffs if JitAllowStructRetyping=1 (default). When the new behavior is enabled, it is a regression:

Crossgen CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: 79646 (0.24% of base)

but the biggest part of it is due to ASG(LCL_VAR struct with 1 field, call) cases, which are part of #34105; without these regressions it is a good improvement.

I have checked SPMI/crossgen/pmi locally. I will kick off the appropriate CI testing tonight.

sandreenko · 2020-04-06T07:32:11Z

src/coreclr/src/jit/gentree.cpp

I am not happy with this solution, but the cost of having fieldSeq carry information about overlapping fields was too high: in TP (checking the new flag on each access to it), in memory consumption (24 bytes instead of 16), and in implementation cost (there are dozens of places where we check for NotAField, many of them very old, and it was unclear how they should work with the new type).
Also, we don't often have overlapping fields, so it felt like I was spending too much on such a rare case. The frameworks showed only -12 bytes of improvement from my fix, which was an awful regression in TP and memory.

sandreenko · 2020-04-06T07:33:41Z

src/coreclr/tests/src/JIT/Directed/StructABI/StructWithOverlappingFields.cs

The test uses a class and a struct to do the same logic; guess which case we optimize better. I should probably create an issue about that, but again, overlapping fields are rare.

sandreenko · 2020-04-06T07:55:36Z

> I would like to see us, perhaps in future, use BITCAST only where we actually require that the bits get moved to a different register file. We should be able to handle those cases much like the way multireg returns are handled. In fact, it's not clear why we need separate handling for that.

Hm, I think we are thinking about two different possible representations. With these changes we have the following new nodes in lowering:

- RETURN struct,
- IND struct,
- call struct,
- LCL_VAR/LCL_FLD and similar.

They could be combined in patterns like:

- RETURN struct(IND struct(ADDR ref(LCL_VAR struct))),
- RETURN struct(call struct),
- RETURN struct(SIMD SIMD8),
- STORE_OBJ(byref address, call struct).

I want to keep this struct representation visible even after lowering, so for STORE_OBJ(LCL_VAR struct, call struct) we produce STORE_OBJ(byref address, BITCAST<struct>(call long)). The advantage is that we can see where we have structs and where we don't. Bitcast nodes can do moves when the use and the def need different registers (like call(call struct)), and they produce no code when no move is needed.

Another approach would be to produce the old LIR after lowering, meaning replace all such STORE_OBJ with native-type STORE_IND, retype LCL_VAR and CALL into native types, and try to avoid any changes in codegen or lsra. We would then need bitcast nodes only for SIMD and LCL_FLD nodes.
I remember having issues with STORE_IND byref, but that could be fixed. Another issue with that approach is the STORE_OBJ destination. It is an address, but it knows the type to which we are storing, and that type could be hidden deep in the dst tree. Finding it and changing it to a native type in lowering would be expensive; leaving it as-is would create strange trees like STORE_IND int(ADDR ref(LCL_VAR struct), call int) where dst type != src type.
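
As a rough analogue of what a bitcast means here: the bits stay the same and only the type they are viewed as changes, unlike a value conversion. The sketch below is plain C++ for illustration, not the JIT's GT_BITCAST implementation; `bitcast`, `TwoInts`, and `structFromRegister` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: view the bits of `from` as type To without changing them.
template <typename To, typename From>
To bitcast(const From& from)
{
    static_assert(sizeof(To) == sizeof(From), "bitcast requires equal sizes");
    static_assert(std::is_trivially_copyable<To>::value && std::is_trivially_copyable<From>::value,
                  "bitcast requires trivially copyable types");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

// An 8-byte struct of the kind discussed in this PR.
struct TwoInts
{
    int32_t a;
    int32_t b;
};

// Models BITCAST<struct>(call long): the call produces a 64-bit register value
// and the user views the same bits as a struct again.
inline TwoInts structFromRegister(int64_t returnRegister)
{
    return bitcast<TwoInts>(returnRegister);
}
```

A value such as a double normally lives in a floating-point register, so `bitcast<int64_t>(someDouble)` is the case where an actual move between register files is needed; for the long-to-struct case above no move is required.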

sandreenko · 2020-04-21T07:25:24Z

ping @dotnet/jit-contrib

CarolEidt

I'm still just a bit apprehensive about using GT_BITCAST for structs, but I think the idea is growing on me.
I'd like to better understand why the elimination of retyping is tied to FEATURE_MULTIREG_RET, but otherwise it looks good.

CarolEidt · 2020-04-22T20:41:28Z

src/coreclr/src/jit/jitconfigvalues.h

This will disable this for x64/windows and not for any other targets - is that due to the regressions you described in your PR comments? I'm not sure why this would be tied to FEATURE_MULTIREG_RET.
Also, sorry for the focus on naming - but I think it might be good to give this a name that more clearly reflects that it's the "old" or "bad" way. "Allow" sounds too nice. I always described this as "lying about the types" but maybe that's just a bit too negative. Maybe "JitDoOldStructRetyping"? Or is that too verbose and/or negative?

tannergooding · 2020-04-30T15:04:31Z

How will this impact GT_HWINTRINSIC and GT_SIMD which currently always lose the struct handle and retype to TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, or TYP_SIMD32 + a base type?

CarolEidt · 2020-04-30T19:18:10Z

> How will this impact GT_HWINTRINSIC and GT_SIMD which currently always lose the struct handle and retype to TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, or TYP_SIMD32 + a base type?

I'll let @sandreenko reply to the specific question of how this work impacts those types (I think it's largely agnostic to those types). In future, however, now that we have a ClassLayout on GenTreeJitIntrinsic we should be able to retain the correct handle (and not have to look them up all over the place).

AndyAyersMS · 2020-04-30T21:58:17Z

> when the feature is enabled then it is a regression ... but the biggest part of it due to ASG(LCL_VAR struct with 1 field, call) cases, that are part of #34105, without these regressions it is a good improvement.

Can you say more about what the CS/CQ of this looks like once #34105 is fixed?

sandreenko · 2020-04-30T22:34:57Z

> How will this impact GT_HWINTRINSIC and GT_SIMD which currently always lose the struct handle and retype to TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, or TYP_SIMD32 + a base type?

> I'll let @sandreenko reply to the specific question of how this work impacts those types (I think it's largely agnostic to those types). In future, however, now that we have a ClassLayout on GenTreeJitIntrinsic we should be able to retain the correct handle (and not have to look them up all over the place).

The change is agnostic to GT_HWINTRINSIC. It touches TYP_SIMD* a bit: those types are the difference between varTypeIsStruct(type) and type == TYP_STRUCT checks. As Carol said, it is a move towards using the same logic for all varTypeIsStruct(type) cases, as you can see in Lowering::LowerRet(GenTreeUnOp* ret) and some other places. In future changes we will delete more special handling for TYP_SIMD*.
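
The difference between the two checks can be modeled as below. This is a simplified, hypothetical model for illustration only: the enum lists just a few types and the function names are made up, so it should not be read as the JIT's actual definitions.

```cpp
#include <cassert>

// Simplified, hypothetical model of the JIT type enum; the real var_types has many more entries.
enum var_types
{
    TYP_INT,
    TYP_LONG,
    TYP_STRUCT,
    TYP_SIMD8,
    TYP_SIMD12,
    TYP_SIMD16,
    TYP_SIMD32
};

// Narrow check: true only for the generic struct type.
inline bool isExactlyStruct(var_types type)
{
    return type == TYP_STRUCT;
}

// Broad check in the spirit of varTypeIsStruct: SIMD types count as structs too.
inline bool varTypeIsStructModel(var_types type)
{
    switch (type)
    {
        case TYP_STRUCT:
        case TYP_SIMD8:
        case TYP_SIMD12:
        case TYP_SIMD16:
        case TYP_SIMD32:
            return true;
        default:
            return false;
    }
}
```

Code written against the broad check handles TYP_SIMD* and TYP_STRUCT through one path, which is the direction described above.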

sandreenko · 2020-05-01T20:39:01Z

Ok, the tests are green now, there were a few new failures due to tail call changes.

The PR is ready for review.

@AndyAyersMS of course, I will repeat the analysis for the diffs and post the results here.

CarolEidt

Mostly comments and some non-blocking suggestions or questions.
Looks good!

CarolEidt · 2020-05-01T21:18:17Z

src/coreclr/src/jit/jitconfigvalues.h

I like the new name - thanks :-)

src/coreclr/src/jit/gentree.cpp

CarolEidt · 2020-05-01T21:25:21Z

src/coreclr/src/jit/gentree.cpp

Thanks for adding this TODO

CarolEidt · 2020-05-01T21:39:00Z

src/coreclr/src/jit/lower.cpp

It is a bit unfortunate that we have to transform the call's user. I wonder whether it would be feasible and reasonable to handle these when lowering the user. I think this is fine for now.

You are right, we agreed to do that in the user lowering, but when I did so I saw that:

- we have to process the call in LowerCall when it is unused, so compilation time is spent finding a user in any case;
- the processing goes into 2 functions (in addition to LowerCall), LowerBlockStore and LowerStoreIndir, and they are platform-specific, so the logic to find these cases would have to be duplicated for different platforms;
- it takes more code and operations to find these cases from the user, because each struct call has a user that should be modified, but not all struct stores have such calls as their src.

So after a while I changed it to process everything in one place; I think it is more visible.

src/coreclr/src/jit/lower.cpp

sandreenko · 2020-05-04T09:05:12Z

@AndyAyersMS
examples of some improvements (I will write a complete report when we enable this by default):

the first type, -580 (-40.93% of base) : Microsoft.CodeAnalysis.CSharp.dasm - BinopEasyOut:TypeToIndex(TypeSymbol):Nullable`1

This is the first type of improvement.
This method returns a struct<8> { 0x0 bool, 0x4 int } and has 30 return instructions, so before the change we were retyping Return struct(LCL_VAR struct V01) to Return long(LCL_FLD long V01) for all of them. That prevented independent struct promotion of all the LCL_VARs involved. Now we keep them as structs and create a merge return LCL_VAR, but we copy the results from LCL_VARs that are promoted independently, so instead of this:

N003 (  3,  4) [000707] -A------R---              *  ASG       long  
N002 (  1,  1) [000706] D------N----              +--*  LCL_VAR   long   V35 tmp32        
N001 (  3,  4) [000078] ------------              \--*  LCL_FLD   long   V11 tmp8         [+0]
                                                  \--*    bool   V11.hasValue (offs=0x00) -> V52 tmp49        
                                                  \--*    int    V11.value (offs=0x04) -> V53 tmp50

where optimizations of both V35 and V11 are blocked, we have:

N007 ( 16, 12) [000763] -A----------              *  COMMA     void  
N003 (  9,  7) [000759] -A------R---              +--*  ASG       bool  
N002 (  4,  3) [000757] D------N----              |  +--*  LCL_VAR   bool   V94 tmp91        
N001 (  4,  3) [000758] -------N----              |  \--*  LCL_VAR   bool   V52 tmp49        
N006 (  7,  5) [000762] -A------R---              \--*  ASG       int   
N005 (  3,  2) [000760] D------N----                 +--*  LCL_VAR   int    V95 tmp92        
N004 (  3,  2) [000761] -------N----                 \--*  LCL_VAR   int    V53 tmp50

Then we apply CSE and propagation optimizations that benefit from bool lclVars and propagate that field assignment well. As a result, we get 58 "Removing tree [TREE_ID] in [BB_ID] as useless" messages (0 in base), and the rest is allocated to registers instead of memory.
ASM before:

IN0011: 000054 mov      byte  ptr [V52 rsp+C0H], 0
IN0012: 00005C xor      eax, eax
IN0013: 00005E mov      dword ptr [V53 rsp+C4H], eax
IN0014: 000065 mov      dword ptr [V53 rsp+C4H], 6
IN0015: 000070 mov      byte  ptr [V52 rsp+C0H], 1
IN0016: 000078 mov      rax, qword ptr [V11 rsp+C0H]

ASM after:

IN0011: 000043 mov      eax, 6
IN0012: 000048 mov      ecx, 1
IN0013: 00004D mov      byte  ptr [V94 rsp+20H], cl                                           
IN0014: 000051 mov      dword ptr [V95 rsp+24H], eax
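
The layout and the field-by-field copy from the dumps above can be modeled in C++ roughly as follows. This is a hedged sketch for illustration: `NullableInt` and `buildPromoted` are hypothetical names, and the locals only mirror the roles of V52/V53 and V94/V95 rather than reproduce the JIT's behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the returned struct<8> { 0x0 bool, 0x4 int }.
struct NullableInt
{
    bool    hasValue; // offset 0x00
    int32_t value;    // offset 0x04
};

static_assert(sizeof(NullableInt) == 8, "expected an 8-byte struct");
static_assert(offsetof(NullableInt, hasValue) == 0, "hasValue expected at offset 0");
static_assert(offsetof(NullableInt, value) == 4, "value expected at offset 4");

inline NullableInt buildPromoted()
{
    // The fields live in separate scalar locals (the role of V52 and V53),
    // so constant values can be tracked per field.
    bool    hasValue = false;
    int32_t value    = 0;
    value    = 6;
    hasValue = true;

    // The result is written field by field (the role of V94 and V95),
    // not as one opaque 8-byte load of the whole struct.
    NullableInt result;
    result.hasValue = hasValue;
    result.value    = value;
    return result;
}
```

With the fields kept as separate scalars, the earlier zero stores are dead and only the final constants need to be materialized, which matches the shape of the "after" listing.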

the second type, -203 (-20.86% of base) : Newtonsoft.Json.dasm - DefaultContractResolver:CreatePropertyFromConstructorParameter(JsonProperty,ParameterInfo):JsonProperty:this

This is the second type of improvement, which was the initial motivation for this work (from the SIMD benchmark).

We are inlining a method that returns a small struct:

    [000149] -AC---------              *  ASG       struct (copy)                                                   
    [000147] D------N----              +--*  LCL_VAR   struct<System.Nullable`1[Boolean], 2> V06 loc3         
    [000146] --C---------              \--*  RET_EXPR  struct(inl return from call [000145])

and before we were transforming it to:

    [000150] -AC---------              *  ASG       short 
    [000149] ------------              +--*  IND       short 
    [000148] ------------              |  \--*  ADDR      byref 
    [000147] -------N----              |     \--*  LCL_VAR   struct<System.Nullable`1[Boolean], 2> V06 loc3         
    [000146] --C---------              \--*  RET_EXPR  int   (inl return from call [000145])

That was blocking V06 struct promotion and enregistering; now it does not happen.
So when we copy these values, instead of two moves (from a memory location to a register and from the register to another memory location) we have one register-to-register move. Before:

movzx    rax, byte  ptr [V54 rsp+90H]
mov      byte  ptr [V68 rsp+60H], al

after:

movzx    rcx, dl

the third type, -13 (-81.25% of base) : System.Private.CoreLib.dasm - ValueTuple:Create():ValueTuple

That is a funny one, for IL like:

Importing BB01 (PC=000) of 'System.ValueTuple:Create():System.ValueTuple'
    [ 0]   0 (0x000) ldloca.s 0
    [ 1]   2 (0x002) initobj 020001C9

we were generating

***** BB01
STMT00000 (IL 0x000...0x003)
               [000003] IA----------              *  ASG       struct (init)
               [000000] D------N----              +--*  LCL_VAR   struct<System.ValueTuple, 1> V00 loc0         
               [000002] ------------              \--*  CNS_INT   int    0

***** BB01
STMT00001 (IL 0x008...0x009)
               [000005] ------------              *  RETURN    int   
               [000004] ------------              \--*  LCL_FLD   byte   V00 loc0         [+0]

now it is

N002 (  2,  2) [000005] ------------              *  RETURN    struct
N001 (  1,  1) [000004] ------------              \--*  CNS_INT   int    0

(copy propagation for structs was unblocked in one of the preparation PRs).
so instead of

IN0004: 000000 push     rax
IN0001: 000001 xor      eax, eax
IN0002: 000003 mov      byte  ptr [V00 rsp], al
IN0003: 000006 movsx    rax, byte  ptr [V00 rsp]
IN0005: 00000B add      rsp, 8
IN0006: 00000F ret

we now have

IN0001: 000000 xor      eax, eax
IN0002: 000002 ret

when we create a ValueTuple (wonder if it is popular, like std::map<T,S> in C++).

After #34105 and #11413 are fixed we will probably see some other patterns as well instead of the current regressions.

There are 2 issues that prevent it from being enabled by default. They are causing significant asm regressions.

AndyAyersMS · 2020-05-04T20:45:35Z

> examples of some improvements

@sandreenko looks good. I would expect Nullable<int> to be one of the things that really benefits here. You might want to explore adding this as an instantiation type in PMI to see more broadly what happens. You should also see the Range type (struct of two ints) benefit, see notes over in #11848.

sandreenko · 2020-05-05T00:48:15Z

> @sandreenko looks good. I would expect Nullable<int> to be one of the things that really benefits here. You might want to explore adding this as an instantiation type in PMI to see more broadly what happens. You should also see the Range type (struct of two ints) benefit, see notes over in #11848.

yes, all returns and calls that return structs in a register should benefit.

I have added Type[] typesToTry = typeof(int?) and got an additional 216 improved methods. The full logs are attached.
I have looked at a few regressed methods with int? and they are in the expected form of ASG(LCL_VAR struct with 1 field, call struct) where we now block independent promotion of the struct.
pmiDiffsWithNullInt.txt
crossgenDiffs.txt
pmiDiffs.txt

sandreenko added the os-windows, arch-x64, and area-CodeGen-coreclr labels Mar 5, 2020

sandreenko force-pushed the optRetStruct-3-4-forPR branch from 96ff2ec to 6cb6b6c Compare March 6, 2020 03:17

sandreenko force-pushed the optRetStruct-3-4-forPR branch from f7875df to 2660e14 Compare March 6, 2020 07:56

CarolEidt reviewed Mar 6, 2020

AndyAyersMS mentioned this pull request Mar 7, 2020

JIT: remove GTF_INX_REFARR_LAYOUT #33098

Merged

This was referenced Mar 26, 2020

Improve codegen for Unsafe<> same size casts. #34156

Closed

Make compCurBB available for fgMorphBlockReturn. #34184

Merged

sandreenko force-pushed the optRetStruct-3-4-forPR branch 2 times, most recently from 25f7779 to 2e00f95 Compare April 6, 2020 07:16

sandreenko commented Apr 6, 2020

sandreenko mentioned this pull request Apr 13, 2020

Fix an incorrect CSE case with struct retyping. #34676

Merged

CarolEidt reviewed Apr 22, 2020

sandreenko force-pushed the optRetStruct-3-4-forPR branch from 2e00f95 to c215a49 Compare April 30, 2020 02:17

sandreenko marked this pull request as ready for review April 30, 2020 02:17

sandreenko force-pushed the optRetStruct-3-4-forPR branch from c215a49 to e131051 Compare April 30, 2020 11:49

sandreenko force-pushed the optRetStruct-3-4-forPR branch from e131051 to 6cc7061 Compare May 1, 2020 10:04

CarolEidt approved these changes May 1, 2020

Add two tests.

2545d08

Sergey Andreenko added 3 commits May 3, 2020 19:26

Turn on for Windows x64 by default for tests.

abf34fb

Don't retype struct returns x64 windows.

fe93871

fix a few typos/leftovers.

342b12d

sandreenko force-pushed the optRetStruct-3-4-forPR branch from 6cc7061 to 342b12d Compare May 4, 2020 07:58

Disable for Win64 by default.

169ce1a

There are 2 issues that prevent it from being enabled by default. They are causing significant asm regressions.

sandreenko force-pushed the optRetStruct-3-4-forPR branch from e06016f to 169ce1a Compare May 4, 2020 19:54

sandreenko merged commit 5da855d into dotnet:master May 5, 2020

briansull mentioned this pull request Jul 8, 2020

Assertion failed 'fieldCorType == CORINFO_TYPE_VALUECLASS' during 'Optimize Valnum CSEs' #38541

Closed

ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
