There are quite a few front-end optimizations that RyuJIT can do on SIMD vector types:

- Constant vector propagation.
E.g. the back-end (lowerxarch.cpp) looks to see if an (in)equality operation is against a zero vector, by checking whether op2 is a zero vector. If front-end phases propagate constant vectors and make sure they appear as op2 in a comparison, that will increase the chance of the back-end making this optimization.
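As a concrete illustration, the source-level pattern this targets looks like the following (a minimal C# sketch, not JIT code; the class and method names are illustrative):

```csharp
using System;
using System.Numerics;

static class ZeroCompare
{
    // With the zero vector canonicalized into op2 of the comparison,
    // the back-end can recognize the compare-against-zero idiom instead
    // of materializing the constant vector.
    public static bool IsZero(Vector<int> v)
    {
        return v == Vector<int>.Zero; // zero vector as op2: optimizable
    }
}
```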
- CSE'ing of operations on SIMD types.
Right now the SetEvalOrder() costs need to be updated for SIMD types. Note that depending on the target, the SIMD vector size could be 16 bytes (on SSE2 machines) or 32 bytes (on AVX2 machines). Therefore, these costs cannot be static constants; they should be a function of the vector size.
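For example, the repeated subexpression below is the kind of SIMD CSE candidate in question (a sketch; names are illustrative):

```csharp
using System;
using System.Numerics;

static class SimdCse
{
    // (a * b) occurs twice. Whether CSE stores it in a temp and reuses
    // it depends on the SetEvalOrder() costs, which must scale with the
    // vector width (16 bytes on SSE2, 32 bytes on AVX2).
    public static Vector<float> Compute(Vector<float> a, Vector<float> b)
    {
        Vector<float> x = a * b + a;
        Vector<float> y = a * b - a; // same subexpression as above
        return x + y;                // algebraically 2 * a * b
    }
}
```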
When two successive indexed accesses of a SIMD vector take place, why doesn't RyuJIT optimize the vextractf/shift operations it generates? For example:

```csharp
offset = 2;
if (_vectorUlongSpan < 4 || (u = vector64[2]) == 0)
{
    offset = 3;
    if (_vectorUlongSpan < 4 || (u = vector64[3]) == 0)
    {
        // ...
    }
}
```
And the assembly it generates:

```asm
vextractf128 xmm1,ymm0,1
vmovd        rdi,xmm1
test         rdi,rdi
jne          00007FFF07225841
mov          esi,3
vextractf128 xmm1,ymm0,1
vpsrldq      ymm1,ymm1,8
vmovd        rdi,xmm1
```
One possible route is to expand SIMD vector index operations into Extract and Shift operations early in a front-end phase. The CSE phase could then evaluate the Extract operation into a temp and replace all further occurrences with that temp.
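Until such a phase exists, one source-level workaround (a sketch, assuming the Span-based CopyTo overload available on newer runtimes; `v` is a stand-in for the vector being indexed) is to spill the vector to a stack buffer once, so every element read becomes a plain memory load:

```csharp
using System;
using System.Numerics;

static class IndexTwice
{
    // Copying the vector to a stack buffer once means subsequent element
    // reads are ordinary loads, rather than each one re-doing
    // vextractf128 + vpsrldq.
    public static int FirstZeroElement(Vector<ulong> v)
    {
        Span<ulong> tmp = stackalloc ulong[Vector<ulong>.Count];
        v.CopyTo(tmp);
        for (int i = 0; i < tmp.Length; ++i)
        {
            if (tmp[i] == 0)
                return i;
        }
        return -1;
    }
}
```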
- Loop unrolling.
To iterate over individual vector elements, one uses:

```csharp
for (int i = 0; i < Vector<int>.Count; ++i)
{
    // ... = V[i]
}
```
- Elimination of GT_SIMD_CHK when the index is non-constant, in loops like the above.
Vector indexed access with a non-constant index results in writing the vector to memory and reading the required element back from memory. When the loop is small enough, it might be beneficial to unroll it so that SIMD vectors are indexed using constant indices. One such opportunity is in the Kestrel server's FindFirstEqualByte() method.
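As an illustration, a manually unrolled form with constant indices (a sketch assuming the vector has at least four elements; Vector<T>.Count varies by target):

```csharp
using System;
using System.Numerics;

static class Unrolled
{
    // Constant indices allow element reads via extract instructions,
    // instead of spilling the vector to memory and performing a
    // GT_SIMD_CHK range check for a non-constant index.
    public static long SumFirstFour(Vector<int> v)
    {
        // Assumes Vector<int>.Count >= 4, which holds on SSE2 and AVX2.
        return (long)v[0] + v[1] + v[2] + v[3];
    }
}
```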
- Loop hoisting of constant vectors
E.g.:

```csharp
// We can hoist the following constant vectors
// in the below loop: Vector<T>.One, Vector<T>.Zero,
// new Vector<T>(T val)
for (...)
{
    .. = Vector<int>.One + b;
    pi = new Vector<float>(3.1412);
    // ...
    if (x == Vector<long>.Zero)
        // ...
}
```
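The manually hoisted equivalent computes the invariant vector once outside the loop (a sketch; `b` and the loop bound `n` are placeholders):

```csharp
using System;
using System.Numerics;

static class Hoisted
{
    public static Vector<int> Accumulate(Vector<int> b, int n)
    {
        // Vector<int>.One is loop-invariant: materialize it once here
        // instead of re-creating it on every iteration.
        Vector<int> one = Vector<int>.One;
        Vector<int> acc = Vector<int>.Zero;
        for (int i = 0; i < n; ++i)
        {
            acc += one + b;
        }
        return acc;
    }
}
```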
- Promotion of structs containing SIMD type fields.
This is tracked by issue dotnet/coreclr#7508
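A sketch of the kind of struct meant here (a hypothetical type): with promotion, each Vector<T> field could live in its own XMM/YMM register instead of the whole struct living in memory.

```csharp
using System;
using System.Numerics;

// Hypothetical struct with SIMD-typed fields. Without promotion, every
// access to Min or Max goes through memory; with promotion each field
// can be enregistered independently.
struct Bounds
{
    public Vector<float> Min;
    public Vector<float> Max;

    public bool Contains(Vector<float> p)
    {
        return Vector.GreaterThanOrEqualAll(p, Min)
            && Vector.LessThanOrEqualAll(p, Max);
    }
}
```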
category:cq
theme:vector-codegen
skill-level:intermediate
cost:large
impact:medium