[RyuJIT] Fix bad VEX encoding to avoid false register dependency #14225
|
#14134 has the same concern on VEX-encoded |
CarolEidt
left a comment
Thanks for identifying these. I think the naming should be made clearer.
|
@CarolEidt how about |
|
@fiigii, I wonder if this might be the issue I was encountering, on machines with AVX, when using I was actually seeing slightly worse performance than the non-vectorized implementation and, based on a cursory glance, it looked to be poor codegen. |
I like it! Better than either alternative I proposed. |
|
@tannergooding If you mixed vectorized code with the scalar |
ad69974 to 7cf6dee
|
@fiigii, thanks for the tip, I'll definitely try that out. For reference:
for (int i = 0; i < n; i += 4)
{
var tmp = vi * invN;
Unsafe.Write(((byte*)pCrb) + i, tmp - onePtFive);
Unsafe.Write(((byte*)pCib) + i, tmp - onePtZero);
vi += add;
}
Which is generating:
xor edx,edx
test ebx,ebx
jle end
loop:
vmulpd ymm1,ymm0,ymm6
vsubpd ymm2,ymm1,ymm8
mov rcx,qword ptr [rsp+48h]
movsxd rax,edx
vmovupd ymmword ptr [rcx+rax],ymm2
vsubpd ymm1,ymm1,ymm7
mov rcx,qword ptr [rsp+40h]
movsxd rax,edx
vmovupd ymmword ptr [rcx+rax],ymm1
vaddpd ymm0,ymm0,ymm9
add edx,4
cmp edx,ebx
jl loop
end:
xor edx,edx
mov qword ptr [rsp+40h],rdx
mov qword ptr [rsp+48h],rdx
Where:
ymm0 = vi
ymm6 = invN
ymm7 = onePtZero
ymm8 = onePtFive
ymm9 = add |
7cf6dee to f3d3420
|
@tannergooding the codegen of this program looks fine, but there are two |
|
@tannergooding BTW, what was the CPU you used for the benchmark? |
|
I checked on an AMD Ryzen 1800X, an Intel i7-6600U, and an Intel i7-4790 |
|
Test |
What machine are you testing on? Based on the output, it appears that the highest order 64-bits of a 256-bit Vector conversion from |
|
I am using Intel Core i7 6700K (Skylake). I will look into the codegen, thanks. |
That's odd; I believe it's the case that the codegen is the same for any AVX2-capable target, and clearly the CI system was an AVX2-capable target, since it's using 256-bit vectors. I'll be curious to see if you can find out why you are unable to duplicate the failure. |
|
@tannergooding Not sure. We're using D3_v2's, which might have a mix of types. |
f3d3420 to 0bb4642
0bb4642 to 57c4021
|
@CarolEidt @tannergooding I checked the manual again that |
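As background for the PR description that follows: in a VEX prefix, the `vvvv` field holds the extra source register in one's-complement form, so a field left at all ones decodes as xmm0. The sketch below is not RyuJIT's emitter. It is a minimal illustration with a hypothetical helper, `encode_vsqrtsd`, that assumes the register-register form and the 2-byte VEX prefix (so the last operand must be xmm0 to xmm7).

```python
def encode_vsqrtsd(dst: int, src1: int, src2: int) -> bytes:
    """Encode `vsqrtsd xmm<dst>, xmm<src1>, xmm<src2>` (register-register form).

    Illustrative sketch only (hypothetical helper, not RyuJIT code).
    Uses the 2-byte VEX prefix (0xC5), which cannot extend ModRM.rm,
    so src2 is limited to xmm0-xmm7.
    """
    if not (0 <= dst <= 15 and 0 <= src1 <= 15):
        raise ValueError("dst and src1 must be in xmm0-xmm15")
    if not 0 <= src2 <= 7:
        raise ValueError("2-byte VEX sketch only supports src2 in xmm0-xmm7")

    r_inv = 0 if dst >= 8 else 1   # VEX.R is stored inverted
    vvvv_inv = (~src1) & 0xF       # VEX.vvvv is stored in one's complement
    l_bit = 0                      # scalar instruction: vector length ignored
    pp = 0b11                      # implied F2 prefix (scalar double)

    vex_byte = (r_inv << 7) | (vvvv_inv << 3) | (l_bit << 2) | pp
    modrm = 0xC0 | ((dst & 7) << 3) | src2   # mod=11, reg=dst, rm=src2
    return bytes([0xC5, vex_byte, 0x51, modrm])


# vvvv left at all ones (0b1111) names xmm0 as the second operand.
print(encode_vsqrtsd(1, 0, 2).hex())  # vsqrtsd xmm1, xmm0, xmm2 -> c5fb51ca
print(encode_vsqrtsd(1, 1, 2).hex())  # vsqrtsd xmm1, xmm1, xmm2 -> c5f351ca
print(encode_vsqrtsd(1, 2, 2).hex())  # vsqrtsd xmm1, xmm2, xmm2 -> c5eb51ca
```

Only the second byte differs between the three forms, which is why choosing the second operand costs nothing in code size.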

This codegen issue was detected from the SqrtDouble and SqrtSingle benchmarks. Disassembly of a hot loop in SqrtDouble shows the second operand of vsqrtsd always set to xmm0 (the default value of VEX.vvvv in RyuJIT).
This codegen introduces a false register dependency on xmm0 that causes noticeably higher CPI. We recommend keeping the second operand the same as the third one, rather than the same as the destination, for this kind of instruction.
vsqrtsd dst, xmm0, src
vsqrtsd dst, dst, src
vsqrtsd dst, src, src
The codegen after this change