Prevent emitting Avx2 instructions with Vector256<T>.AllBitsSet when not supported (#48383)
echesakov merged 2 commits into dotnet:master
Conversation
If Avx2.IsSupported

Vector128<T>.AllBitsSet:

```
vpcmpeqd xmmReg, xmmReg, xmmReg
```

Vector256<T>.AllBitsSet:

```
vpcmpeqd ymmReg, ymmReg, ymmReg
```

If Avx.IsSupported (but not Avx2)

Vector128<T>.AllBitsSet:

```
vpcmpeqd xmmReg, xmmReg, xmmReg
```

Vector256<T>.AllBitsSet:

```
vxorps xmmReg, xmmReg, xmmReg
vcmptruesd ymmReg, ymmReg, ymmReg
```

Otherwise

Vector128<T>.AllBitsSet:

```
pcmpeqd xmm0, xmm0
```

Vector256<T>.AllBitsSet: software fallback
```diff
 {
     assert(op1 == nullptr);
-    if (varTypeIsFloating(baseType) && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
+    if (compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
```
This `if` shouldn't be needed; `emitIns_SIMD_R_R_R` should already be handling VEX vs non-VEX encoding.
```diff
 {
     assert(op1 == nullptr);
-    if (varTypeIsFloating(baseType) && compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
+    if (compiler->compOpportunisticallyDependsOn(InstructionSet_AVX))
```
Likewise, while the behavior is the same, this changes the instruction used for floating-point: `ins` was not `pcmpeqd` here but instead `cmpps` or `cmppd`.
Both the Intel and AMD optimization manuals have notes similar to:

> Previous microarchitectures (before Intel Core microarchitecture) do not have explicit restrictions on mixing integer and floating-point (FP) operations on XMM registers. For Intel Core microarchitecture, mixing integer and floating-point operations on the content of an XMM register can degrade performance. Software should avoid mixed-use of integer/FP operation on XMM registers. Specifically:
> - Use SIMD integer operations to feed SIMD integer operations. Use PXOR for idiom.
> - Use SIMD floating-point operations to feed SIMD floating-point operations. Use XORPS for idiom.
> - When floating-point operations are bitwise equivalent, use PS data type instead of PD data type. MOVAPS and MOVAPD do the same thing, but MOVAPS takes one less byte to encode the instruction.
and
> Avoid mixing packed integer operations with packed floating point operations
Which implies that we should try to use the matching floating-point comparisons over the integer comparisons when the target type is floating-point.
> Which implies that we should try to use the matching floating-point comparisons over the integer comparisons when the target type is floating-point.
@tannergooding But do we expect the value of Vector128/256<double>.AllBitsSet to be used by a floating-point operation in some user scenarios? Do you have some examples in mind?
BTW, have you had a chance to read the SO posts I mentioned?
> But do we expect the value of Vector128/256<double>.AllBitsSet to be used by a floating-point operation in some user scenarios?
Yes. AllBitsSet is a fairly central value for masking, comparing, and blending because it is the value returned when a given element is true.
```
else
{
    assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX));
    emit->emitIns_SIMD_R_R_R(INS_xorps, EA_16BYTE, targetReg, targetReg, targetReg);
```
We shouldn't need to zero the target register when using 8 as the immediate.
8 is VCMPEQ_UQSS, which is "compare equal, unordered, non-signalling": if left and right are the same, or either is NaN, return true, and do not signal for a NaN operand even if floating-point exceptions are enabled.
Since left and right are both target, it means it will always return true (the same is true for 15).
You can confirm this with a program such as:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class Program
{
    static void Main(string[] args)
    {
        var nan = Vector128.Create(float.NaN);
        var one = Vector128.Create(1.0f);
        var inf = Vector128.Create(float.PositiveInfinity);

        Test(nan, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(nan, FloatComparisonMode.UnorderedTrueNonSignaling);
        Test(one, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(one, FloatComparisonMode.UnorderedTrueNonSignaling);
        Test(inf, FloatComparisonMode.UnorderedEqualNonSignaling);
        Test(inf, FloatComparisonMode.UnorderedTrueNonSignaling);
    }

    static void Test(Vector128<float> value, FloatComparisonMode mode)
    {
        Console.WriteLine($"{value,-41}, {mode,-28}: {Avx.Compare(value, value, mode)}");
    }
}
```

which prints:
```
<NaN, NaN, NaN, NaN>                     , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<NaN, NaN, NaN, NaN>                     , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
<1, 1, 1, 1>                             , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<1, 1, 1, 1>                             , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
<Infinity, Infinity, Infinity, Infinity> , UnorderedEqualNonSignaling  : <NaN, NaN, NaN, NaN>
<Infinity, Infinity, Infinity, Infinity> , UnorderedTrueNonSignaling   : <NaN, NaN, NaN, NaN>
```
But why choose VCMPEQ_UQSS over VCMPTRUESS, when the latter does exactly what we need (setting all the bits to ones)?
The idea behind emitting xorps before vcmptruess was also explained in one of the links I mentioned earlier - it is there to break dependency on the prior value in the register.
Also see https://www.agner.org/optimize/optimizing_assembly.pdf page 65
> But why choose VCMPEQ_UQSS over VCMPTRUESS, when the latter does exactly what we need (setting all the bits to ones)?
For the scenario being considered, they both appear to operate identically. I don't recall why VCMPEQ_UQSS was originally chosen over VCMPTRUESS, but it was, and it is what we emit today.
I don't have a strong preference on which of the two is emitted, provided they are always identical for `tgt, tgt, tgt`.
> The idea behind emitting xorps before vcmptruess was also explained in one of the links I mentioned earlier - it is there to break dependency on the prior value in the register.
We aren't doing this for any of the other comparisons, or for intrinsics in general (outside LZCNT and POPCNT, IIRC, because older Intel CPUs had a bug where the destination was always treated as having a dependency). Likewise, I don't think we should be unconditionally inserting a dependency-breaking idiom here; this would be better handled by some general-purpose check that identifies possible dependencies and therefore where inserting it would be beneficial.
Hardware intrinsics are used in performance-oriented code paths, and even if `xorps` is elided in the decoding step, it still takes up space in the decoder and will change the overall throughput of small- to medium-sized loops.
Additionally, Clang itself isn't emitting this idiom on haswell or newer (essentially any machine with AVX2+) and is not emitting it on skylake (which is what most of our benchmarks and even estimated cycle counts are based on).
> Additionally, Clang itself isn't emitting this idiom on haswell or newer (essentially any machine with AVX2+) and is not emitting it on skylake (which is what most of our benchmarks and even estimated cycle counts are based on).
Right, for AVX2+ Clang emits `vpcmpeqd ymmReg, ymmReg, ymmReg`, as I pointed out in the first post and implemented for `NI_Vector256_get_AllBitsSet`.
…int type. Don't break dependency chain with xorps
@tannergooding Updated PR based on your feedback - would you take another look?
If Avx2.IsSupported

Vector128<T>.AllBitsSet
Vector256<T>.AllBitsSet

If Avx.IsSupported (but not Avx2)

Vector128<T>.AllBitsSet
Vector256<T>.AllBitsSet

Otherwise

Vector128<T>.AllBitsSet
Vector256<T>.AllBitsSet: software fallback
Clang generates a similar instruction sequence for 256-bit vectors - https://godbolt.org/z/erd5sf
References:
Fixes #48283
@dotnet/jit-contrib @tannergooding PTAL