Prevent emitting Avx2 instructions with Vector256<T>.AllBitsSet when not supported (#48383)
echesakov merged 2 commits into dotnet:master
Conversation
If Avx2.IsSupported

Vector128<T>.AllBitsSet:

```
vpcmpeqd xmmReg, xmmReg, xmmReg
```

Vector256<T>.AllBitsSet:

```
vpcmpeqd ymmReg, ymmReg, ymmReg
```

If Avx.IsSupported (but not Avx2)

Vector128<T>.AllBitsSet:

```
vpcmpeqd xmmReg, xmmReg, xmmReg
```

Vector256<T>.AllBitsSet:

```
vxorps xmmReg, xmmReg, xmmReg
vcmptruesd ymmReg, ymmReg, ymmReg
```

Otherwise

Vector128<T>.AllBitsSet:

```
pcmpeqd xmm0, xmm0
```

Vector256<T>.AllBitsSet: software fallback
```diff
 {
     assert(op1 == nullptr);
-    if (varTypeIsFloating(baseType) && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
+    if (compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
```
This `if` shouldn't be needed; `emitIns_SIMD_R_R_R` should already be handling VEX vs non-VEX encoding.
```diff
 {
     assert(op1 == nullptr);
-    if (varTypeIsFloating(baseType) && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
+    if (compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
```
Likewise, while the behavior is the same, this changes the instruction used for floating-point: `ins` was not `pcmpeqd` here but instead `cmpps` or `cmppd`.
Both the Intel and AMD optimization manuals have notes similar to:

> Previous microarchitectures (before Intel Core microarchitecture) do not have explicit restrictions on mixing integer and floating-point (FP) operations on XMM registers. For Intel Core microarchitecture, mixing integer and floating-point operations on the content of an XMM register can degrade performance. Software should avoid mixed-use of integer/FP operation on XMM registers. Specifically:
> - Use SIMD integer operations to feed SIMD integer operations. Use PXOR for idiom.
> - Use SIMD floating-point operations to feed SIMD floating-point operations. Use XORPS for idiom.
> - When floating-point operations are bitwise equivalent, use PS data type instead of PD data type. MOVAPS and MOVAPD do the same thing, but MOVAPS takes one less byte to encode the instruction.
and
> Avoid mixing packed integer operations with packed floating point operations
Which implies that we should try to use the matching floating-point comparisons over the integer comparisons when the target type is floating-point.
> Which implies that we should try to use the matching floating-point comparisons over the integer comparisons when the target type is floating-point.
@tannergooding But do we expect the value of Vector128/256<double>.AllBitsSet to be used by a floating-point operation in some user scenarios? Do you have some examples in mind?
BTW, have you had a chance to read the SO posts I mentioned?
> But do we expect the value of Vector128/256<double>.AllBitsSet to be used by a floating-point operation in some user scenarios?
Yes. AllBitsSet is a fairly central value for masking, comparing, and blending because it is the value returned when a given element is true.
```
else
{
    assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX));
    emit->emitIns_SIMD_R_R_R(INS_xorps, EA_16BYTE, targetReg, targetReg, targetReg);
```
We shouldn't need to zero the target register when using 8 as the immediate.
8 is VCMPEQ_UQSS, which is "compare equal, unordered, non-signalling": if left and right are the same, or either is NaN, return true, and do not signal for a NaN operand even if floating-point exceptions are enabled.
Since left and right are both target, it means it will always return true (the same is true for 15).
You can confirm this with a program such as:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class Program
{
    static void Main(string[] args)
    {
        var nan = Vector128.Create(float.NaN);
        var one = Vector128.Create(1.0f);
        var inf = Vector128.Create(float.PositiveInfinity);

        Test(nan, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(nan, FloatComparisonMode.UnorderedTrueNonSignaling);
        Test(one, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(one, FloatComparisonMode.UnorderedTrueNonSignaling);
        Test(inf, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(inf, FloatComparisonMode.UnorderedTrueNonSignaling);
    }

    static void Test(Vector128<float> value, FloatComparisonMode mode)
    {
        Console.WriteLine($"{value,-41}, {mode,-28}: {Avx.Compare(value, value, mode)}");
    }
}
```

which prints:
```
<NaN, NaN, NaN, NaN>                     , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<NaN, NaN, NaN, NaN>                     , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
<1, 1, 1, 1>                             , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<1, 1, 1, 1>                             , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
<Infinity, Infinity, Infinity, Infinity> , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<Infinity, Infinity, Infinity, Infinity> , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
```
But why choose VCMPEQ_UQSS over VCMPTRUESS, when the latter does exactly what we need (setting all the bits to ones)?
The idea behind emitting xorps before vcmptruess was also explained in one of the links I mentioned earlier - it is there to break dependency on the prior value in the register.
Also see https://www.agner.org/optimize/optimizing_assembly.pdf page 65
> But why choose VCMPEQ_UQSS over VCMPTRUESS, when the latter does exactly what we need (setting all the bits to ones)?
For the scenario being considered, they both appear to operate identically. I don't recall why VCMPEQ_UQSS was originally chosen over VCMPTRUESS, but it was, and it is what we emit today.
I don't have a strong preference on which of the two is emitted, provided they are always identical for `tgt, tgt, tgt`.
> The idea behind emitting xorps before vcmptruess was also explained in one of the links I mentioned earlier - it is there to break dependency on the prior value in the register.
We aren't doing this for any of the other comparisons, or for intrinsics in general (outside LZCNT and POPCNT, IIRC, because older Intel CPUs had a bug where the destination was always treated as having a dependency). Likewise, I don't think we should be unconditionally inserting a dependency-breaking idiom here; this would be better handled by some general-purpose check that identifies possible dependencies and therefore where inserting it would be beneficial.
Hardware intrinsics are used in performance-oriented code paths, and even if `xorps` is elided in the decoding step, it still takes up space in the decoder and will change the overall throughput of small- to medium-sized loops.
Additionally, Clang itself isn't emitting this idiom on haswell or newer (essentially any machine with AVX2+) and is not emitting it on skylake (which is what most of our benchmarks and even estimated cycle counts are based on).
> Additionally, Clang itself isn't emitting this idiom on haswell or newer (essentially any machine with AVX2+) and is not emitting it on skylake (which is what most of our benchmarks and even estimated cycle counts are based on).
Right, for AVX2+ Clang emits `vpcmpeqd ymmReg, ymmReg, ymmReg`, as I pointed out in the first post and implemented for `NI_Vector256_get_AllBitsSet`.
…int type. Don't break dependency chain with xorps
@tannergooding Updated PR based on your feedback - would you take another look?
If Avx2.IsSupported

Vector128<T>.AllBitsSet
Vector256<T>.AllBitsSet

If Avx.IsSupported (but not Avx2)

Vector128<T>.AllBitsSet
Vector256<T>.AllBitsSet

Otherwise

Vector128<T>.AllBitsSet
Vector256<T>.AllBitsSet: software fallback
Clang generates a similar instruction sequence for 256-bit vectors - https://godbolt.org/z/erd5sf
References:
Fixes #48283
@dotnet/jit-contrib @tannergooding PTAL