
Prevent emitting Avx2 instructions with Vector256<T>.AllBitsSet when not supported #48383

Merged
echesakov merged 2 commits into dotnet:master from echesakov:Runtime_48283
Feb 19, 2021

Conversation

@echesakov
Contributor

If Avx2.IsSupported

Vector128<T>.AllBitsSet:
```
 vpcmpeqd xmmReg, xmmReg, xmmReg
```

Vector256<T>.AllBitsSet:
```
 vpcmpeqd ymmReg, ymmReg, ymmReg
```

If Avx.IsSupported (but not Avx2)

Vector128<T>.AllBitsSet:
```
 vpcmpeqd xmmReg, xmmReg, xmmReg
```

Vector256<T>.AllBitsSet:
```
 vxorps xmmReg, xmmReg, xmmReg
 vcmptruesd ymmReg, ymmReg, ymmReg
```

Otherwise

Vector128<T>.AllBitsSet:
```
 pcmpeqd xmm0, xmm0
```

Vector256<T>.AllBitsSet:
  Software fallback

Clang generates a similar instruction sequence for 256-bit vectors - https://godbolt.org/z/erd5sf

References:

Fixes #48283

@dotnet/jit-contrib @tannergooding PTAL

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 17, 2021
@echesakov echesakov self-assigned this Feb 17, 2021
@echesakov echesakov changed the title Prevent emitting Avx2 instructions with Vector256<Integer>.AllBitsSet when not supported Prevent emitting Avx2 instructions with Vector256<T>.AllBitsSet when not supported Feb 17, 2021
```
{
    assert(op1 == nullptr);
-   if (varTypeIsFloating(baseType) && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
+   if (compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
```
Member

This if shouldn't be needed - emitIns_SIMD_R_R_R should already be handling VEX vs. non-VEX encoding.

```
{
    assert(op1 == nullptr);
-   if (varTypeIsFloating(baseType) && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
+   if (compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
```
Member

Likewise, while the behavior is the same, this is changing the instruction used for floating-point. ins was not pcmpeqd here but instead was cmpps or cmppd.

Both the Intel and AMD optimization manuals have notes similar to:

Previous microarchitectures (before Intel Core microarchitecture) do not have explicit restrictions on
mixing integer and floating-point (FP) operations on XMM registers. For Intel Core microarchitecture,
mixing integer and floating-point operations on the content of an XMM register can degrade performance. Software should avoid mixed-use of integer/FP operation on XMM registers. Specifically:
• Use SIMD integer operations to feed SIMD integer operations. Use PXOR for idiom.
• Use SIMD floating-point operations to feed SIMD floating-point operations. Use XORPS for idiom.
• When floating-point operations are bitwise equivalent, use PS data type instead of PD data type.
MOVAPS and MOVAPD do the same thing, but MOVAPS takes one less byte to encode the instruction.

and

Avoid mixing packed integer operations with packed floating point operations

Which implies that we should try to use the matching floating-point comparisons over the integer comparisons when the target type is floating-point.

Contributor Author

> Which implies that we should try to use the matching floating-point comparisons over the integer comparisons when the target type is floating-point.

@tannergooding But do we expect the value of Vector128/256<double>.AllBitsSet to be used by a floating-point operation in some user scenarios? Do you have some examples in mind?

BTW, have you had a chance to read the SO posts I mentioned?

Member
@tannergooding Feb 17, 2021

> But do we expect the value of Vector128/256.AllBitsSet

Yes. AllBitsSet is a fairly central value for masking, comparing, and blending because it is the value returned when a given element is true.

```
else
{
    assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX));
    emit->emitIns_SIMD_R_R_R(INS_xorps, EA_16BYTE, targetReg, targetReg, targetReg);
```
Member
@tannergooding Feb 17, 2021

We shouldn't need to zero the target register when using 8 as the immediate.

8 is VCMPEQ_UQSS, which is "compare equal, unordered, non-signalling": if left and right are equal or either is NaN, return true, and do not raise an exception for a signalling NaN even if floating-point exceptions are enabled.

Since left and right are both target, it means it will always return true (the same is true for 15).

You can confirm this with a program such as:

```
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class Program
{
    static void Main(string[] args)
    {
        var nan = Vector128.Create(float.NaN);
        var one = Vector128.Create(1.0f);
        var inf = Vector128.Create(float.PositiveInfinity);

        Test(nan, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(nan, FloatComparisonMode.UnorderedTrueNonSignaling);

        Test(one, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(one, FloatComparisonMode.UnorderedTrueNonSignaling);

        Test(inf, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(inf, FloatComparisonMode.UnorderedTrueNonSignaling);
    }

    static void Test(Vector128<float> value, FloatComparisonMode mode)
    {
        Console.WriteLine($"{value,-41}, {mode,-28}: {Avx.Compare(value, value, mode)}");
    }
}
```

which prints:

```
<NaN, NaN, NaN, NaN>                     , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<NaN, NaN, NaN, NaN>                     , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
<1, 1, 1, 1>                             , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<1, 1, 1, 1>                             , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
<Infinity, Infinity, Infinity, Infinity> , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<Infinity, Infinity, Infinity, Infinity> , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
```

Contributor Author

But why choose VCMPEQ_UQSS over VCMPTRUESS when the latter does exactly what we need (setting all the bits to ones)?

Contributor Author
@echesakov Feb 17, 2021

The idea behind emitting xorps before vcmptruess was also explained in one of the links I mentioned earlier - it is there to break the dependency on the prior value of the register.

Also see https://www.agner.org/optimize/optimizing_assembly.pdf page 65

Member

> But why choose VCMPEQ_UQSS over VCMPTRUESS when the latter does exactly what we need (setting all the bits to ones)?

For the scenario being considered, they both appear to operate identically. I don't recall why VCMPEQ_UQSS was originally chosen over VCMPTRUESS, but it was, and it is what we emit today. I don't have a strong preference for which of the two is emitted, provided they are always identical for tgt, tgt, tgt.

> The idea behind emitting xorps before vcmptruess was also explained in one of the links I mentioned earlier - it is there to break dependency on the prior value in the register.

We aren't doing this for any of the other comparisons, or for intrinsics in general (outside LZCNT and POPCNT, IIRC, because older Intel CPUs had a bug where the destination was always treated as having a dependency). Likewise, I don't think we should be unconditionally inserting a dependency-breaking idiom here; this would be better handled by a general-purpose check that identifies possible dependencies, and therefore where inserting one would be beneficial.

Hardware intrinsics are used in performance-oriented code paths, and even if the xorps is elided in the decoding step, it still takes up space in the decoder and will change the overall throughput of small-to-medium-sized loops.
Additionally, Clang itself isn't emitting this idiom on haswell or newer (essentially any machine with AVX2+) and is not emitting it on skylake (which is what most of our benchmarks and even estimated cycle counts are based on).

Contributor Author

> Additionally, Clang itself isn't emitting this idiom on haswell or newer (essentially any machine with AVX2+) and is not emitting it on skylake (which is what most of our benchmarks and even estimated cycle counts are based on).

Right, for AVX2+ Clang emits vpcmpeqd ymmReg, ymmReg, ymmReg, as I pointed out in the first post and implemented for NI_Vector256_get_AllBitsSet.

…int type. Don't break dependency chain with xorps
@echesakov
Contributor Author

echesakov commented Feb 18, 2021

@tannergooding Updated PR based on your feedback - would you take another look?

Member
@tannergooding left a comment

LGTM. Thanks!

@echesakov echesakov merged commit e730cd8 into dotnet:master Feb 19, 2021
@echesakov echesakov deleted the Runtime_48283 branch February 19, 2021 02:31
@ghost ghost locked as resolved and limited conversation to collaborators Mar 21, 2021

Labels

arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Development

Successfully merging this pull request may close these issues.

Vector256<Integer>.AllBitsSet emits illegal instructions on a machine without AVX2

3 participants