Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

[RyuJIT] Fix bad VEX encoding to avoid false register dependency#14225

Merged
CarolEidt merged 1 commit into
dotnet:masterfrom
fiigii:falsedependency
Sep 28, 2017
Merged

[RyuJIT] Fix bad VEX encoding to avoid false register dependency#14225
CarolEidt merged 1 commit into
dotnet:masterfrom
fiigii:falsedependency

Conversation

@fiigii
Copy link
Copy Markdown

@fiigii fiigii commented Sep 27, 2017

This codegen issue was detected from SqrtDouble and SqrtSinge benchmarks.

Disassembly of a hot loop in SqrtDouble shows the second operand of vsqrtsd always set to xmm0 (the default value of VEX.vvvv in RyuJIT).

Block 2:
--
vaddsd xmm1, xmm1, qword ptr   [rip+0xd9]
vsqrtsd xmm2, xmm0, xmm1  ;;; xmm0 is not allocated to vsqrtsd
vaddsd xmm0, xmm0, xmm2
inc edi
cmp edi, 0x1388
jl 0x7fc0b96f3efe <Block 2>

This codegen introduces false register dependency on xmm0 that causes obviously higher CPI. Meanwhile, we recommend that keep the second operand same as the third one rather than same as the destination for this kind of instructions.

code execution time
vsqrtsd dst, xmm0, src 3.07s
vsqrtsd dst, dst, src 2.25s
vsqrtsd dst, src, src 0.84s

The codegen after this change

Block 2:
--
vaddsd xmm1, xmm1, qword ptr   [rip+0xd4]
vsqrtsd xmm2, xmm1, xmm1
vaddsd xmm0, xmm0, xmm2
inc edi
cmp edi, 0x1388
jl 0x7fe8c0da44ab <Block 2>

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

#14134 has the same concern on VEX-encoded roundss/d, @tannergooding you may be interested 😄

Copy link
Copy Markdown

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for identifying these. I think the naming should be made clearer.

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

@CarolEidt how about IsDstDstSrcAVXInstruction?

@tannergooding
Copy link
Copy Markdown
Member

@fiigii, I wonder if this might be the issue I was encountering, on machines with AVX, when using Vector<double> to vectorize code one of the BenchmarkGame benchmarks.

I was actually seeing slightly worse performance than the non-vectorized implementation and, based on a cursory glance, it looked to be poor codegen.

@CarolEidt
Copy link
Copy Markdown

how about IsDstDstSrcAVXInstruction?

I like it! Better that either alternative I proposed.

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

@tannergooding If you mixed vectorized code with the scalar sqrt calculations, it might be. I recommend you to run VTune on the BenchmarkGame benchmarks with "advanced hotspot" or "general exploration" analysis to locate the problematic basic blocks.

@tannergooding
Copy link
Copy Markdown
Member

@fiigii, thanks for the tip, I'll definitely try that out.

For reference:
In this case, its a block that looks like this:

for (int i = 0; i < n; i += 4)
{
	var tmp = vi * invN;

	Unsafe.Write(((byte*)pCrb) + i, tmp - onePtFive);
	Unsafe.Write(((byte*)pCib) + i, tmp - onePtZero);

	vi += add;
}

Which is generating:

	xor         edx,edx
	test        ebx,ebx
	jle         end
loop:
	vmulpd      ymm1,ymm0,ymm6
	vsubpd      ymm2,ymm1,ymm8
	mov         rcx,qword ptr [rsp+48h]
	movsxd      rax,edx
	vmovupd     ymmword ptr [rcx+rax],ymm2
	vsubpd      ymm1,ymm1,ymm7
	mov         rcx,qword ptr [rsp+40h]
	movsxd      rax,edx
	vmovupd     ymmword ptr [rcx+rax],ymm1
	vaddpd      ymm0,ymm0,ymm9
	add         edx,4
	cmp         edx,ebx
	jl          loop
end:
	xor         edx,edx
	mov         qword ptr [rsp+40h],rdx
	mov         qword ptr [rsp+48h],rdx

Where:

ymm0 = vi
ymm6 = invN
ymm7 = onePtZero
ymm8 = onePtFive
ymm9 = add

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

@tannergooding the codgen of this program looks fine, but there are two ymm stores in the loop body. Could you try to measure the cache line split in VTune? That requires you to collect customized PMU events and analysis.
image
Please see the section B.5.4.4 in Intel® 64 and IA-32 Architectures Optimization Reference Manual.

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

@tannergooding BTW, what was the CPU you used for the benchmark?

@tannergooding
Copy link
Copy Markdown
Member

I checked on an AMD Ryzen 1800X, an Intel i7-6600U, and a Intel i7-4790

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

Test VectorConvert failed in the CI system but passed on my local machine...

@CarolEidt
Copy link
Copy Markdown

Test VectorConvert failed in the CI system but passed on my local machine...

What machine are you testing on? Based on the output, it appears that the highest order 64-bits of a 256-bit Vector conversion from Vector<int64> to Vector<double> are not being correctly converted. It appears to be a real failure.

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 27, 2017

I am using Intel Core i7 6700K (Skylake). I will look into the codegen, thanks.

@CarolEidt
Copy link
Copy Markdown

I am using Intel Core i7 6700K (Skylake). I will look into the codegen, thanks.

That's odd; I believe it's the case that the codegen is the same for any AVX2-capable target, and clearly the CI system was an AVX2-capable target, since it's using 256-bit vectors. I'll be curious to see if you can find out why you are unable to duplicate the failure.

@tannergooding
Copy link
Copy Markdown
Member

@fiigii, they are using Azure VM's. @mmitche might know which CPU they are using. Otherwise, I know where to look It up from and can update in a bit.

@mmitche
Copy link
Copy Markdown
Member

mmitche commented Sep 27, 2017

@tannergooding Not sure. We're using D3_v2's, which might have a mix of types.

@fiigii
Copy link
Copy Markdown
Author

fiigii commented Sep 28, 2017

@CarolEidt @tannergooding I checked the manual again that cvtsi2ss/d is different from sqrtss/d(my mistake 😞), so revoked the change on cvtsi2ss/d.
Passed all the tests. Could you please merge this PR?

Copy link
Copy Markdown

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants