[RyuJIT] Fix bad VEX encoding to avoid false register dependency by fiigii · Pull Request #14225 · dotnet/coreclr

fiigii · 2017-09-27T17:55:44Z

This codegen issue was detected from SqrtDouble and SqrtSinge benchmarks.

Disassembly of a hot loop in SqrtDouble shows the second operand of vsqrtsd always set to xmm0 (the default value of VEX.vvvv in RyuJIT).

Block 2:
--
vaddsd xmm1, xmm1, qword ptr   [rip+0xd9]
vsqrtsd xmm2, xmm0, xmm1  ;;; xmm0 is not allocated to vsqrtsd
vaddsd xmm0, xmm0, xmm2
inc edi
cmp edi, 0x1388
jl 0x7fc0b96f3efe <Block 2>

This codegen introduces false register dependency on xmm0 that causes obviously higher CPI. Meanwhile, we recommend that keep the second operand same as the third one rather than same as the destination for this kind of instructions.

code	execution time
`vsqrtsd dst, xmm0, src`	3.07s
`vsqrtsd dst, dst, src`	2.25s
`vsqrtsd dst, src, src`	0.84s

The codegen after this change

Block 2:
--
vaddsd xmm1, xmm1, qword ptr   [rip+0xd4]
vsqrtsd xmm2, xmm1, xmm1
vaddsd xmm0, xmm0, xmm2
inc edi
cmp edi, 0x1388
jl 0x7fe8c0da44ab <Block 2>

fiigii · 2017-09-27T17:56:17Z

cc @russellhadley @CarolEidt @tannergooding @jkotas

fiigii · 2017-09-27T17:58:21Z

#14134 has the same concern on VEX-encoded roundss/d, @tannergooding you may be interested 😄

CarolEidt

Thanks for identifying these. I think the naming should be made clearer.

fiigii · 2017-09-27T18:25:19Z

@CarolEidt how about IsDstDstSrcAVXInstruction?

tannergooding · 2017-09-27T18:41:24Z

@fiigii, I wonder if this might be the issue I was encountering, on machines with AVX, when using Vector<double> to vectorize code one of the BenchmarkGame benchmarks.

I was actually seeing slightly worse performance than the non-vectorized implementation and, based on a cursory glance, it looked to be poor codegen.

CarolEidt · 2017-09-27T18:47:52Z

how about IsDstDstSrcAVXInstruction?

I like it! Better that either alternative I proposed.

fiigii · 2017-09-27T18:54:19Z

@tannergooding If you mixed vectorized code with the scalar sqrt calculations, it might be. I recommend you to run VTune on the BenchmarkGame benchmarks with "advanced hotspot" or "general exploration" analysis to locate the problematic basic blocks.

tannergooding · 2017-09-27T19:08:26Z

@fiigii, thanks for the tip, I'll definitely try that out.

For reference:
In this case, its a block that looks like this:

for (int i = 0; i < n; i += 4)
{
	var tmp = vi * invN;

	Unsafe.Write(((byte*)pCrb) + i, tmp - onePtFive);
	Unsafe.Write(((byte*)pCib) + i, tmp - onePtZero);

	vi += add;
}

Which is generating:

	xor         edx,edx
	test        ebx,ebx
	jle         end
loop:
	vmulpd      ymm1,ymm0,ymm6
	vsubpd      ymm2,ymm1,ymm8
	mov         rcx,qword ptr [rsp+48h]
	movsxd      rax,edx
	vmovupd     ymmword ptr [rcx+rax],ymm2
	vsubpd      ymm1,ymm1,ymm7
	mov         rcx,qword ptr [rsp+40h]
	movsxd      rax,edx
	vmovupd     ymmword ptr [rcx+rax],ymm1
	vaddpd      ymm0,ymm0,ymm9
	add         edx,4
	cmp         edx,ebx
	jl          loop
end:
	xor         edx,edx
	mov         qword ptr [rsp+40h],rdx
	mov         qword ptr [rsp+48h],rdx

Where:

ymm0 = vi
ymm6 = invN
ymm7 = onePtZero
ymm8 = onePtFive
ymm9 = add

fiigii · 2017-09-27T19:27:03Z

@tannergooding the codgen of this program looks fine, but there are two ymm stores in the loop body. Could you try to measure the cache line split in VTune? That requires you to collect customized PMU events and analysis.

Please see the section B.5.4.4 in Intel® 64 and IA-32 Architectures Optimization Reference Manual.

fiigii · 2017-09-27T19:45:34Z

@tannergooding BTW, what was the CPU you used for the benchmark?

tannergooding · 2017-09-27T19:49:56Z

I checked on an AMD Ryzen 1800X, an Intel i7-6600U, and a Intel i7-4790

fiigii · 2017-09-27T19:57:31Z

Test VectorConvert failed in the CI system but passed on my local machine...

CarolEidt · 2017-09-27T20:18:31Z

Test VectorConvert failed in the CI system but passed on my local machine...

What machine are you testing on? Based on the output, it appears that the highest order 64-bits of a 256-bit Vector conversion from Vector<int64> to Vector<double> are not being correctly converted. It appears to be a real failure.

fiigii · 2017-09-27T21:25:04Z

I am using Intel Core i7 6700K (Skylake). I will look into the codegen, thanks.

CarolEidt · 2017-09-27T22:26:26Z

I am using Intel Core i7 6700K (Skylake). I will look into the codegen, thanks.

That's odd; I believe it's the case that the codegen is the same for any AVX2-capable target, and clearly the CI system was an AVX2-capable target, since it's using 256-bit vectors. I'll be curious to see if you can find out why you are unable to duplicate the failure.

tannergooding · 2017-09-27T23:22:12Z

@fiigii, they are using Azure VM's. @mmitche might know which CPU they are using. Otherwise, I know where to look It up from and can update in a bit.

mmitche · 2017-09-27T23:34:39Z

@tannergooding Not sure. We're using D3_v2's, which might have a mix of types.

fiigii · 2017-09-28T07:28:31Z

@CarolEidt @tannergooding I checked the manual again that cvtsi2ss/d is different from sqrtss/d(my mistake 😞), so revoked the change on cvtsi2ss/d.
Passed all the tests. Could you please merge this PR?

CarolEidt

LGTM Thanks!

dnfclas added the cla-already-signed label Sep 27, 2017

CarolEidt suggested changes Sep 27, 2017

View reviewed changes

fiigii force-pushed the falsedependency branch from ad69974 to 7cf6dee Compare September 27, 2017 19:02

fiigii force-pushed the falsedependency branch from 7cf6dee to f3d3420 Compare September 27, 2017 19:12

fiigii force-pushed the falsedependency branch from f3d3420 to 0bb4642 Compare September 28, 2017 03:57

fix bad VEX.vvvv to avoid false register dependency

57c4021

fiigii force-pushed the falsedependency branch from 0bb4642 to 57c4021 Compare September 28, 2017 04:50

CarolEidt approved these changes Sep 28, 2017

View reviewed changes

CarolEidt merged commit f8b5f53 into dotnet:master Sep 28, 2017

fiigii mentioned this pull request Sep 30, 2017

[RyuJIT] Fix VEX.vvvv for cvtss2sd and cvtsd2ss #14274

Merged

fiigii deleted the falsedependency branch October 25, 2017 18:01

fiigii mentioned this pull request Oct 29, 2017

Enable CORINFO_INTRINSIC Round, Ceiling, and Floor to generate ROUNDSS and ROUNDSD #14736

Merged

Conversation

fiigii commented Sep 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

CarolEidt left a comment

Choose a reason for hiding this comment

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

tannergooding commented Sep 27, 2017

Uh oh!

CarolEidt commented Sep 27, 2017

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

tannergooding commented Sep 27, 2017

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

tannergooding commented Sep 27, 2017

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

CarolEidt commented Sep 27, 2017

Uh oh!

fiigii commented Sep 27, 2017

Uh oh!

CarolEidt commented Sep 27, 2017

Uh oh!

tannergooding commented Sep 27, 2017

Uh oh!

mmitche commented Sep 27, 2017

Uh oh!

fiigii commented Sep 28, 2017

Uh oh!

CarolEidt left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

fiigii commented Sep 27, 2017 •

edited

Loading