Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Fold away Sse.StaticCast and Avx.StaticCast in the importer#18519

Merged
tannergooding merged 3 commits into
dotnet:masterfrom
tannergooding:hwintrin
Jul 3, 2018
Merged

Fold away Sse.StaticCast and Avx.StaticCast in the importer#18519
tannergooding merged 3 commits into
dotnet:masterfrom
tannergooding:hwintrin

Conversation

@tannergooding
Copy link
Copy Markdown
Member

FYI. @CarolEidt, @eerhardt, @fiigii

This improves the codegen for code using StaticCast as the rest of the JIT no longer has to care that it exists at all.

@tannergooding
Copy link
Copy Markdown
Member Author

Beginning PMI Diffs for assemblies in E:\Users\tagoo\Repos\coreclr\bin\tests\Windows_NT.x64.Checked
Completed PMI Diffs for assemblies in E:\Users\tagoo\Repos\coreclr\bin\tests\Windows_NT.x64.Checked in 896.77s
Analyzing diffs...
Found files with textual diffs.
Summary:
(Note: Lower is better)
Total bytes of diff: -411 (0.00% of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
        -165 : JIT\HardwareIntrinsics\X86\Avx\StaticCast_ro\StaticCast_ro.dasm (-9.49% of base)
        -124 : JIT\HardwareIntrinsics\X86\Sse\StaticCast_ro\StaticCast_ro.dasm (-10.23% of base)
        -100 : JIT\HardwareIntrinsics\X86\Sse2\Sse2_ro\Sse2_ro.dasm (-0.01% of base)
         -12 : JIT\HardwareIntrinsics\X86\Avx\StaticCast_r\StaticCast_r.dasm (-0.61% of base)
         -10 : JIT\HardwareIntrinsics\X86\General\VectorRet_ro\VectorRet_ro.dasm (-0.10% of base)
5 total files with size differences (5 improved, 0 regressed), 2684 unchanged.
Top method improvements by size (bytes):
        -165 : JIT\HardwareIntrinsics\X86\Avx\StaticCast_ro\StaticCast_ro.dasm - IntelHardwareIntrinsicTest.Program:Main(ref):int
        -124 : JIT\HardwareIntrinsics\X86\Sse\StaticCast_ro\StaticCast_ro.dasm - IntelHardwareIntrinsicTest.Program:Main(ref):int
         -12 : JIT\HardwareIntrinsics\X86\Avx\StaticCast_r\StaticCast_r.dasm - IntelHardwareIntrinsicTest.Program:Main(ref):int
          -5 : JIT\HardwareIntrinsics\X86\General\VectorRet_ro\VectorRet_ro.dasm - IntelHardwareIntrinsicTest:F1_v128i(int):struct
          -5 : JIT\HardwareIntrinsics\X86\General\VectorRet_ro\VectorRet_ro.dasm - IntelHardwareIntrinsicTest:F2_v128i(short):struct
25 total methods with size differences (25 improved, 0 regressed), 185912 unchanged.
Completed analysis in 272.79s

@tannergooding
Copy link
Copy Markdown
Member Author

For AVX, this gets rid of some stack shuffling on method entry, and in some cases removes the need for vzeroupper:

-vmovaps  xmm8, qword ptr [rsp+100H]
-vmovaps  xmm9, qword ptr [rsp+F0H]
-vzeroupper 

However, in some cases, it results in stack spillage where it was previously just using an extra register:

lea      rcx, bword ptr [rbp-*8H]
call     System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
vmovupd  ymm6, ymmword ptr[rax]
-vmovdqa  ymm7, ymm6
+vmovupd  ymmword ptr[rbp-B0H], ymm6
lea      rcx, bword ptr [rbp-*0H]
-vextractf128 ymm8, ymm6, 1
-vextractf128 ymm9, ymm7, 1
+vextractf128 ymm7, ymm6, 1
call     System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
-vinsertf128 ymm7, ymm9, 1
-vinsertf128 ymm6, ymm8, 1
-vmovupd  ymmword ptr[rax], ymm7
+vinsertf128 ymm6, ymm7, 1
+vmovupd  ymm0, ymmword ptr[rbp-B0H]
+vmovupd  ymmword ptr[rax], ymm0

@mikedn
Copy link
Copy Markdown

mikedn commented Jun 17, 2018

AFAIR last time when I tried this an assert fired. Some bit of code does not like a struct handle, most likely because vector with different base type are really different types. Not sure if anything has changed since then.

@tannergooding
Copy link
Copy Markdown
Member Author

AFAIR last time when I tried this an assert fired.

I think we've cleaned up some of the code since then. There weren't any asserts firing locally

However, there are asserts firing if I try to do the same with ConvertToSingle and the like, since the difference is TYP_SIMD16 -> TYP_FLOAT or for GetLowerHalf where it is TYP_SIMD32 -> TYP_SIMD16

@mikedn
Copy link
Copy Markdown

mikedn commented Jun 17, 2018

I think we've cleaned up some of the code since then. There weren't any asserts firing locally

Cool. Make sure to try it with the examples from #18069, just in case. I think it's the one I used to experiment with.

@tannergooding
Copy link
Copy Markdown
Member Author

Seems we also get worse codegen occasionally for Sse.StaticCast:

lea      rcx, bword ptr [rbp-**H]
call     System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
-vmovupd  xmm6, xmmword ptr [rax]
-vmovaps  xmm7, xmm6
+mov      rcx, qword ptr [rax]
+mov      qword ptr [rbp-60H], rcx
+mov      rcx, qword ptr [rax+8]
+mov      qword ptr [rbp-58H], rcx
+vmovapd  xmm0, xmmword ptr [rbp-60H]
+vmovapd  xmmword ptr [rbp-70H], xmm0
lea      rcx, bword ptr [rbp-**H]
call     System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
-vmovapd  xmmword ptr [rbp-B0H], xmm7
-vmovupd  xmmword ptr [rax], xmm7
+mov      rcx, qword ptr [rbp-70H]
+mov      rdx, qword ptr [rbp-68H]
+mov      qword ptr [rax], rcx
+mov      qword ptr [rax+8], rdx
mov      rcx, 0xD1FFAB1E

This looks like it is part of 1st class structs, however and is due to Vector128<T> having two private ulong fields.

@tannergooding
Copy link
Copy Markdown
Member Author

The non-VEX diff is basically the same as #18519 (comment), just replace vmovapd with movapd (etc).

Will get diffs for #18069 shortly...

@AndyAyersMS, @CarolEidt. Is there some smaller change that could fix the stack shuffling that happens for Vector128<T> (aside from the larger 1st class struct work)?

{
// We fold away the static cast here, as it only exists to satisfy
// the type system. It is safe to do this here since the retNode type
// and the signature return type are both TYP_SIMD16.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can an assert be added for this comment as well? (assert the return type)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. Will update.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment thread src/jit/hwintrinsicxarch.cpp Outdated
// the type system. It is safe to do this here since the retNode type
// and the signature return type are both TYP_SIMD16.
assert(sig->numArgs == 1);
retNode = impPopStack().val;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change handle the nested StaticCast?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any nested StaticCast should have already been handled by this point.

Copy link
Copy Markdown

@fiigii fiigii Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But impPopStack().val seems not working with SIMD types. Shall we use impSIMDPopStack(TYP_SIMDxx)?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason that you decided not to use impSIMDPopStack?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an oversight. Fixing now, will push an update shortly (after build completes locally).

Also, to address @fiigii's comment:

But impPopStack().val seems not working with SIMD types.

impPopStack() works just fine with SIMD types, and is what impSIMDPopStack uses under the hood. The benefit of using impSIMDPopStack is that it performs additional validation on the type and, in certain cases, normalization of the type for existing nodes.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

impPopStack() works just fine with SIMD types

In ordinary situation (i.e., on local stack), it indeed works fine. But, IIRC, impSIMDPopStack is necessary for some complex environment, such as passing/returning or referred SIMD variables.

@fiigii
Copy link
Copy Markdown

fiigii commented Jun 18, 2018

This looks like it is part of 1st class structs, however and is due to Vector128 having two private ulong fields.

@tannergooding Could you take a look at the JITDump? I remember that Vector128/256<T> would still be promoted sometimes.

@CarolEidt
Copy link
Copy Markdown

Is there some smaller change that could fix the stack shuffling that happens for Vector128 (aside from the larger 1st class struct work)?

I'd have to look at the dump first to better understand what's going on. If you could post a dump, that would save me having to download & build your changes.

@tannergooding
Copy link
Copy Markdown
Member Author

I'll get dumps put up sometime this evening.

@tannergooding
Copy link
Copy Markdown
Member Author

Full JitDumps for both AVX and SSE are here: StaticCast_ro.zip

@tannergooding
Copy link
Copy Markdown
Member Author

From what I can tell, the struct is being promoted at times.

For example:

*************** In fgPromoteStructs()

lvaTable before fgPromoteStructs
; Initial local variable assignments
;
;  V00 RetBuf          byref 
;  V01 arg0           simd16 
;  V02 OutArgs        lclBlk (na) 
;  V03 tmp1           struct (16) 

Promoting struct local V01 (System.Runtime.Intrinsics.Vector128`1[Single]):
lvaGrabTemp returning 4 (V04 tmp2) (a long lifetime temp) called for field V01._00 (fldOffset=0x0).
Set preferred register for V04 to [rdx]
New refCnts for V04: refCnt =  1, refCntWtd = 1   

lvaGrabTemp returning 5 (V05 tmp3) (a long lifetime temp) called for field V01._01 (fldOffset=0x8).
Set preferred register for V05 to [rdx]
New refCnts for V05: refCnt =  1, refCntWtd = 1   

Promoting struct local V03 (System.Runtime.Intrinsics.Vector128`1[Num]):
lvaGrabTemp returning 6 (V06 tmp4) (a long lifetime temp) called for field V03._00 (fldOffset=0x0).

lvaGrabTemp returning 7 (V07 tmp5) (a long lifetime temp) called for field V03._01 (fldOffset=0x8).

lvaTable after fgPromoteStructs
; Initial local variable assignments
;
;  V00 RetBuf          byref 
;  V01 arg0           simd16 
;  V02 OutArgs        lclBlk (na) 
;  V03 tmp1           struct (16) 
;  V04 tmp2             long  V01._00(offs=0x00) P-INDEP
;  V05 tmp3             long  V01._01(offs=0x08) P-INDEP
;  V06 tmp4             long  V03._00(offs=0x00) P-INDEP
;  V07 tmp5             long  V03._01(offs=0x08) P-INDEP

@CarolEidt
Copy link
Copy Markdown

@tannergooding - sorry to have taken so long to get around to looking at this.

I don't think this is really related to first class structs. This is, I believe, due to problems with the way we decide when to promote SIMD types. In general, if a struct can be promoted, it will be. In the SIMD case, however, we don't necessarily want to always do that. Looking at the code, it seems that the only case that we disable promotion is when lvUsedInSIMDIntrinsic is true, and what's more, we only set that if we see some sequential references to SIMD fields, see impMarkContiguousSIMDFieldAssignments() in simd.cpp. It transforms some field references into the get/set intrinsics, That, and the related code, needs some serious overhaul, I think.

On thing would be to try modifying lvaShouldPromoteStructVar() so that it only promotes a SIMD lclVar if it has direct field references, and lvUsedInSIMDIntrinsic is false. That would, I believe, address a number of cases where we shouldn't be promoting.

@tannergooding
Copy link
Copy Markdown
Member Author

@CarolEidt, thanks!

Would it also make sense, in the case of HardwareIntrinsics, to never promote, since there are no (public) fields to access?

@CarolEidt
Copy link
Copy Markdown

Would it also make sense, in the case of HardwareIntrinsics, to never promote, since there are no (public) fields to access?

Yes, that would be even better, I would say.

@tannergooding
Copy link
Copy Markdown
Member Author

The simplest "fix" for the struct promotion issue is to continue calling SetOpLclRelatedToSIMDIntrinsic on the retNode (it was previously called for it when creating the GenTree node).

We now get

vmovupd  xmm6, xmmword ptr [rax]
vmovapd  xmmword ptr [rbp-80H], xmm6
lea      rcx, bword ptr [rbp-50H]

instead of

mov      rcx, qword ptr [rax]
mov      qword ptr [rbp-60H], rcx
mov      rcx, qword ptr [rax+8]
mov      qword ptr [rbp-58H], rcx
vmovapd  xmm0, xmmword ptr [rbp-60H]
vmovapd  xmmword ptr [rbp-70H], xmm0
lea      rcx, bword ptr [rbp-38H]

@tannergooding
Copy link
Copy Markdown
Member Author

Going to see if I can get a bigger/better fix as part of a separate PR.

@tannergooding
Copy link
Copy Markdown
Member Author

Rebased onto master.

@tannergooding
Copy link
Copy Markdown
Member Author

@CarolEidt, could you review when you get the chance?

#18519 (comment) shows the codegen after ensuring the operands are not promoted.

https://github.com/dotnet/coreclr/issues/18069 shows the better codgen for the issue this is fixing.

Copy link
Copy Markdown

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, but have one question.

Comment thread src/jit/hwintrinsicxarch.cpp Outdated
// the type system. It is safe to do this here since the retNode type
// and the signature return type are both TYP_SIMD16.
assert(sig->numArgs == 1);
retNode = impPopStack().val;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason that you decided not to use impSIMDPopStack?

@CarolEidt
Copy link
Copy Markdown

... and sorry for the delay in reviewing!

@tannergooding
Copy link
Copy Markdown
Member Author

Fixed to use impSIMDPopStack

Copy link
Copy Markdown

@fiigii fiigii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@tannergooding
Copy link
Copy Markdown
Member Author

JitDump is here: JitDump.zip

Using impSIMDPopStack here breaks some things when using generics:

Assert failure(PID 5768 [0x00001688], Thread: 22180 [0x56a4]): Assertion failed 'src->gtObj.gtClass == structHnd' in 'IntelHardwareIntrinsicTest:Vector128Add(struct,struct):struct' (IL size 595)

    File: e:\users\tagoo\repos\coreclr\src\jit\importer.cpp Line: 1255
    Image: E:\Users\tagoo\Repos\coreclr\bin\tests\Windows_NT.x64.Debug\Tests\Core_Root\corerun.exe

The assertion is here: https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L1255
The method causing the failure is here: https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/HardwareIntrinsics/X86/General/VectorHelpers.cs#L14

I'll try to take a closer look tomorrow.

@tannergooding
Copy link
Copy Markdown
Member Author

NOTE: The break only happens for the _r tests (optimizations disabled), so perhaps it has something to do with the unused codepaths.

@tannergooding
Copy link
Copy Markdown
Member Author

Found the issue.

impSIMDPopStack normalizes the tree for GT_RET_EXPR, GT_CALL, and isParam == true: https://github.com/dotnet/coreclr/blob/master/src/jit/simd.cpp#L1050

In this case, impNormStructVal makes the type canonical by ensuring it is wrapped in a GT_OBJ: https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L1583. This assigns it the backing struct handle for the original node, and not the struct handle of the StaticCast return type.

Updating the GT_OBJ returned by impSIMDPopStack to have the correct return type resolves the issue...

  • I'm wondering if passing the "target" struct handle down to impSIMDPopStack might be a better fix, since that would also gove the gtIndex and gtRetExpr case...

@CarolEidt, thoughts?

@tannergooding
Copy link
Copy Markdown
Member Author

I opted for modifying impSIMDPopStack to optionally take a CORINFO_CLASS_HANDLE that takes the place of the stack struct type (for the normalization case). This ensures that all normalization paths (GT_OBJ, GT_CALL, GT_RET_EXPR, etc) set the structHnd to the type that would have been returned by the StaticCast that was folded away.

@tannergooding
Copy link
Copy Markdown
Member Author

No diffs between the impPopStack().val implementation (7b82ba1) and the new impSIMDPopStack implementation (a935e3a).

@tannergooding
Copy link
Copy Markdown
Member Author

Test failures for COMPlus_FeatureSIMD=0 and COMPlus_EnableAVX=0 are unrelated. Fix is up here: #18734

@tannergooding
Copy link
Copy Markdown
Member Author

@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x64 Checked jitnox86hwintrinsic

@dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Windows_NT x86 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Windows_NT x86 Checked jitnox86hwintrinsic

@dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
@dotnet-bot test Ubuntu x64 Checked jitx86hwintrinsicnosimd
@dotnet-bot test Ubuntu x64 Checked jitnox86hwintrinsic

@tannergooding
Copy link
Copy Markdown
Member Author

Rebased onto dotnet/master to pick-up the test fixes

@CarolEidt, when you get the chance, it would be nice if you could review the additional commit made since you signed-off (which passes the class handle down through impSIMDPopStack)

Copy link
Copy Markdown

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment suggestion.

Comment thread src/jit/simd.cpp Outdated
// Arguments:
// type - the type of value that the caller expects to be popped off the stack.
// expectAddr - if true indicates we are expecting type stack entry to be a TYP_BYREF.
// structType - the class handle to use when normalizing if it is not the same as the stack entry class handle
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like for this comment to explain the scenario where this matters - even though I originally understood the context, once I came back to this for review, I couldn't imagine why it wouldn't be the same as the stack entry class handle (even though it's obvious in the context of folding away a static cast!). Anyway, it wouldn't hurt to add something like "(this can happen, e.g., when folding away a static cast, where we want the value popped to have the type that would have been returned)"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@tannergooding
Copy link
Copy Markdown
Member Author

tannergooding commented Jul 2, 2018

Thanks @CarolEidt!

I am not going to rerun the full suite (of HWIntrinsic tests) since the delta was just clarifying the parameter documentation comment (and all tests were passing).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants