This repository was archived by the owner on Aug 2, 2023. It is now read-only.
Merged
2 changes: 1 addition & 1 deletion scripts/PerfHarness/PerfHarness.csproj
@@ -1,7 +1,7 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
- <TargetFramework>netcoreapp1.0</TargetFramework>
+ <TargetFramework>netcoreapp1.1</TargetFramework>
<DebugType>portable</DebugType>
<AssemblyName>PerfHarness</AssemblyName>
<OutputType>Exe</OutputType>
@@ -13,5 +13,6 @@
</ItemGroup>
<ItemGroup>
<PackageReference Include="System.Buffers" Version="4.3.0" />
<PackageReference Include="System.Numerics.Vectors" Version="4.3.0" />
</ItemGroup>
</Project>
103 changes: 103 additions & 0 deletions src/System.Buffers.Experimental/System/Buffers/BufferExtensions.cs
@@ -3,6 +3,7 @@

using System.Collections.Sequences;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.CompilerServices;

namespace System.Buffers
@@ -117,5 +118,107 @@ internal static int IndexOfStraddling(this ReadOnlySpan<byte> first, IReadOnlyMe

return -1;
}

static readonly int s_longSize = Vector<ulong>.Count;
static readonly int s_byteSize = Vector<byte>.Count;

public static int IndexOfVectorized(this Span<byte> buffer, byte value)
{
Debug.Assert(s_longSize == 4 || s_longSize == 2);
Member

With AVX-512 this could go to s_longSize == 8. Going above 64 bytes is unlikely in the near term: the cache line is 64 bytes, and changing that would probably break a lot of assumptions in software.


var byteSize = s_byteSize;
Member

The JIT team told me off for not using Vector<byte>.Count directly, because otherwise the JIT can't recognize the intrinsic for optimizations like automatic loop unrolling (dotnet/coreclr#8001).

Member Author

@KrzysztofCwalina KrzysztofCwalina Feb 16, 2017

Ah yeah. I think @sivarv told me about it. I will change.


if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
Member

Test !Vector.IsHardwareAccelerated first so the JIT can eliminate the rest of the function when acceleration is unavailable. Might be worth adding an indirection shim that is guaranteed to be inlined?

public static int IndexOfVectorized(this Span<byte> buffer, byte value)
{
    if (Vector.IsHardwareAccelerated && buffer.Length >= Vector<byte>.Count)
    {
        return buffer.IndexOfVectorizedImpl(value);
    }

    return buffer.IndexOf(value);
}

Member Author

Good point. Will do


Vector<byte> match = new Vector<byte>(value);
Member

Needs JIT fix dotnet/coreclr#7683 to be performant; until then, use:

Vector<byte> match = Vector.AsVectorByte(new Vector<uint>(value * 0x01010101u));

Member Author

Yeah, I know about this issue. The question is: is it worth doing the workaround above? It's going to be slower when the fix is in.

Member

Could #ifdef it for current versions and above? (> Desktop 4.6.3? + CoreCLR 1.2?) I don't really know the version numbers 😄
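
The workaround discussed in this thread relies on multiplying a byte by 0x01010101 to copy it into all four bytes of a uint, so that a Vector<uint> reinterpreted as bytes holds the same value in every lane. The sketch below illustrates that equivalence only; the class and method names (BroadcastSketch, Replicate, BroadcastViaUInt, BroadcastDirect) are hypothetical and not part of this PR.

```csharp
using System;
using System.Numerics;

// Illustrative sketch only; names here are assumptions, not code from the PR.
static class BroadcastSketch
{
    // Copies one byte into all four bytes of a uint, e.g. 0xAB -> 0xABABABAB.
    // The largest product is 255 * 0x01010101 = 0xFFFFFFFF, so it cannot overflow.
    public static uint Replicate(byte value) => value * 0x01010101u;

    // Workaround form from the review comment: broadcast the uint, then
    // reinterpret the lanes as bytes.
    public static Vector<byte> BroadcastViaUInt(byte value) =>
        Vector.AsVectorByte(new Vector<uint>(Replicate(value)));

    // Direct form used in the PR.
    public static Vector<byte> BroadcastDirect(byte value) => new Vector<byte>(value);
}
```

Because every byte of the replicated uint is identical, the two forms should produce the same vector regardless of byte order; which one is faster depends on the JIT version, as the thread notes.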

var vectors = buffer.NonPortableCast<byte, Vector<byte>>();
var zero = Vector<byte>.Zero;

for (int vectorIndex = 0; vectorIndex < vectors.Length; vectorIndex++)
{
var vector = vectors.GetItem(vectorIndex);
var result = Vector.Equals(vector, match);
if (result != zero)
Member

Test against zero directly rather than via a local variable; that is easier for the JIT to pick up (dotnet/coreclr#7367):

!result.Equals(Vector<byte>.Zero)

Member Author

I thought I tested it and local was faster, but you are right that it should be slower. I will retest and possibly change.

{
var longer = Vector.AsVectorUInt64(result);
Debug.Assert(s_longSize == 4 || s_longSize == 2);
Member

Might not be true on AVX-512

Member Author

Thus the assert :-)

Member

@benaadams Will the JIT emit AVX-512 instructions on processors that support it today?

@KrzysztofCwalina Does the corefx(lab) CI run tests in Debug mode? If the JIT changes to support AVX-512, do any of the CI servers run a Xeon Phi processor or whatever it would take for this Debug.Assert to fail?

If we were to ship a System.Buffers package that didn't support a Vector<ulong>.Count of 8 (or greater), could IndexOfVectorized simply skip over matching bytes and continue the for loop? If that were the case, Kestrel couldn't use it. That would be a security issue as that could cause Kestrel to read requests differently than proxies in front of it. Hopefully the server would just fall over instead.

Contributor

Xeon Phi won't help you; it's a co-processor card you have to target specifically (usually with an Intel C++ compiler). I think some of the Sandy Bridge EP Xeons have it, but I suspect it won't be a straight exposure of the registers because the AVX-512 spec is a bit all over the shop.


var candidate = longer[0];
Member

Change to a for loop using Vector<ulong>.Count as the limit (dotnet/coreclr#7912)

Member Author

@KrzysztofCwalina KrzysztofCwalina Feb 16, 2017

I had it as such a loop, and it was 10% slower. This is also due to a missing feature in the JIT. Once we fully run on 2.0, the loop (as you say) will be auto-unrolled.

if (candidate != 0) return vectorIndex * byteSize + IndexOf(candidate);
Member

Use break and continue, as inline returns make the codegen nasty, and use a loop so it can be unrolled as above (dotnet/coreclr#7912):

ulong candidate = 0;
int longIndex = 0;
for (; longIndex < Vector<ulong>.Count; longIndex++)
{
    // Assign the outer variable; redeclaring it with var here would not compile.
    candidate = longer[longIndex];
    if (candidate == 0) continue;
    break;
}
return 8 * longIndex + vectorIndex * Vector<byte>.Count + IndexOf(candidate);

candidate = longer[1];
if (candidate != 0) return 8 + vectorIndex * byteSize + IndexOf(candidate);
if (s_longSize == 4)
{
candidate = longer[2];
if (candidate != 0) return 16 + vectorIndex * byteSize + IndexOf(candidate);
candidate = longer[3];
if (candidate != 0) return 24 + vectorIndex * byteSize + IndexOf(candidate);
}
}
}

var processed = vectors.Length * byteSize;
var index = buffer.Slice(processed).IndexOf(value);
if (index == -1) return -1;
return index + processed;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static int IndexOfVectorized(this ReadOnlySpan<byte> buffer, byte value)
{
Debug.Assert(s_longSize == 4 || s_longSize == 2);

var byteSize = s_byteSize;

if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
Member

As above


Vector<byte> match = new Vector<byte>(value);
Member

As above

var vectors = buffer.NonPortableCast<byte, Vector<byte>>();
var zero = Vector<byte>.Zero;

for (int vectorIndex = 0; vectorIndex < vectors.Length; vectorIndex++)
{
var vector = vectors[vectorIndex];
var result = Vector.Equals(vector, match);
if (result != zero)
{
var longer = Vector.AsVectorUInt64(result);
var candidate = longer[0];
if (candidate != 0) return vectorIndex * byteSize + IndexOf(candidate);
candidate = longer[1];
if (candidate != 0) return 8 + vectorIndex * byteSize + IndexOf(candidate);
if (s_longSize == 4)
{
candidate = longer[2];
if (candidate != 0) return 16 + vectorIndex * byteSize + IndexOf(candidate);
candidate = longer[3];
if (candidate != 0) return 24 + vectorIndex * byteSize + IndexOf(candidate);
}
}
}

var processed = vectors.Length * byteSize;
var index = buffer.Slice(processed).IndexOf(value);
if (index == -1) return -1;
return index + processed;
}

// used by IndexOfVectorized
static int IndexOf(ulong next)
Member

@benaadams benaadams Feb 16, 2017

Force inline? (as only called once per vector, if loop changed as suggested)

Member Author

How did the attribute disappear from the PR? Seriously :-)

{
// Flag least significant power of two bit
var powerOfTwoFlag = (next ^ (next - 1));
// Shift all powers of two into the high byte and extract
var foundByteIndex = (int)((powerOfTwoFlag * _xorPowerOfTwoToHighByte) >> 57);
return foundByteIndex;
}

const ulong _xorPowerOfTwoToHighByte = (0x07ul |
0x06ul << 8 |
0x05ul << 16 |
0x04ul << 24 |
0x03ul << 32 |
0x02ul << 40 |
0x01ul << 48) + 1;
}
}
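
For readers following the review, here is a minimal, self-contained sketch of how the suggestions above might fit together: an acceleration check up front, Vector<byte>.Count used directly, a direct zero test, and a loop over the ulong lanes. It is not the code that was merged. It works on a plain byte[] instead of Span<byte> so that it needs only System.Numerics, and the names VectorSearchSketch, IndexOfVectorizedImpl and LocateFirstByte are hypothetical. The multiplier is the same constant as _xorPowerOfTwoToHighByte in the diff, and the lane arithmetic assumes a little-endian machine.

```csharp
using System;
using System.Numerics;

// Illustrative sketch only; names are assumptions, not code from the PR.
static class VectorSearchSketch
{
    // Same value as _xorPowerOfTwoToHighByte in the diff (0x0001020304050608).
    const ulong XorPowerOfTwoToHighByte = (0x07ul |
                                           0x06ul << 8 |
                                           0x05ul << 16 |
                                           0x04ul << 24 |
                                           0x03ul << 32 |
                                           0x02ul << 40 |
                                           0x01ul << 48) + 1;

    // Returns the index (0..7) of the lowest matching byte in a comparison mask
    // whose bytes are 0xFF (match) or 0x00. Assumes next != 0 and little-endian.
    static int LocateFirstByte(ulong next)
    {
        unchecked
        {
            // Sets the lowest set bit and every bit below it.
            ulong flag = next ^ (next - 1);
            // The multiply moves the byte index into the top 7 bits.
            return (int)((flag * XorPowerOfTwoToHighByte) >> 57);
        }
    }

    public static int IndexOf(byte[] buffer, byte value)
    {
        // Check acceleration first so the vector path can be skipped entirely.
        if (Vector.IsHardwareAccelerated && buffer.Length >= Vector<byte>.Count)
        {
            return IndexOfVectorizedImpl(buffer, value);
        }
        return Array.IndexOf(buffer, value);
    }

    static int IndexOfVectorizedImpl(byte[] buffer, byte value)
    {
        var match = new Vector<byte>(value);
        int offset = 0;
        for (; offset + Vector<byte>.Count <= buffer.Length; offset += Vector<byte>.Count)
        {
            var result = Vector.Equals(new Vector<byte>(buffer, offset), match);
            if (result.Equals(Vector<byte>.Zero)) continue;

            // At least one lane is non-zero here, so the loop always breaks.
            var longer = Vector.AsVectorUInt64(result);
            ulong candidate = 0;
            int longIndex = 0;
            for (; longIndex < Vector<ulong>.Count; longIndex++)
            {
                candidate = longer[longIndex];
                if (candidate != 0) break;
            }
            return offset + 8 * longIndex + LocateFirstByte(candidate);
        }

        // Scalar scan of the tail that does not fill a whole vector.
        for (int i = offset; i < buffer.Length; i++)
        {
            if (buffer[i] == value) return i;
        }
        return -1;
    }
}
```

Because the lane loop is bounded by Vector<ulong>.Count rather than a hard-coded 2 or 4, this sketch would not silently skip matches on a wider vector, which is the concern raised in the AVX-512 thread. Whether it is faster than the unrolled version depends on the JIT, as discussed above.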
2 changes: 1 addition & 1 deletion tests/Benchmarks/Benchmarks.csproj
@@ -1,6 +1,6 @@
<Project Sdk="Microsoft.NET.Sdk" ToolsVersion="15.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
<PropertyGroup>
- <TargetFramework>netcoreapp1.0</TargetFramework>
+ <TargetFramework>netcoreapp1.1</TargetFramework>
<AllowUnsafeBlocks>False</AllowUnsafeBlocks>
<AssemblyOriginatorKeyFile>../../tools/test_key.snk</AssemblyOriginatorKeyFile>
<SignAssembly>true</SignAssembly>
58 changes: 58 additions & 0 deletions tests/Benchmarks/IndexOf.cs
@@ -0,0 +1,58 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using Microsoft.Xunit.Performance;
using System;
using System.Buffers;
using System.Numerics;
using System.Text;

public class IndexOfBench
{
static int s_bufferLength = 1000;
static byte[] s_buffer = new byte[s_bufferLength];
static int s_loops = 1000;

static IndexOfBench()
{
s_buffer[s_bufferLength - 100] = 255;
}

[Benchmark]
static int SpanIndexOf()
{
Span<byte> buffer = s_buffer;
int index = 0;
foreach (var iteration in Benchmark.Iterations)
{
using (iteration.StartMeasurement())
{
for(int i=0; i<s_loops; i++) {
index += buffer.IndexOf(255);
}
}
}
return index;
}

[Benchmark]
static int VectorizedIndexOf()
{
if(!Vector.IsHardwareAccelerated) return 0;

Span<byte> buffer = s_buffer;
int index = 0;
foreach (var iteration in Benchmark.Iterations)
{
using (iteration.StartMeasurement())
{
for(int i=0; i<s_loops; i++) {
index += buffer.IndexOfVectorized(255);
}
}
}
return index;
}
}

64 changes: 64 additions & 0 deletions tests/System.Buffers.Experimental.Tests/VectorizedOperations.cs
@@ -0,0 +1,64 @@
// Copyright (c) Microsoft. All rights reserved.
// Licensed under the MIT license. See LICENSE file in the project root for full license information.
using System.Buffers;
using Xunit;

namespace System.Buffers.Tests
{
public class VectorizedOperationsTests
{
[Fact]
public void SpanIndexOf()
{
int len = 10000;
byte[] buffer = new byte[len];
buffer[0] = 1;
buffer[len / 2] = 2;
buffer[len - 1] = 3;

Span<byte> span = buffer;
Assert.Equal(0, span.IndexOfVectorized(1));
Assert.Equal(len/2, span.IndexOfVectorized(2));
Assert.Equal(len-1, span.IndexOfVectorized(3));
Assert.Equal(-1, span.IndexOfVectorized(4));
}

[Fact]
public void ReadOnlySpanIndexOf()
{
int len = 10000;
byte[] buffer = new byte[len];
buffer[0] = 1;
buffer[len / 2] = 2;
buffer[len - 1] = 3;

ReadOnlySpan<byte> span = buffer;
Assert.Equal(0, span.IndexOfVectorized(1));
Assert.Equal(len/2, span.IndexOfVectorized(2));
Assert.Equal(len-1, span.IndexOfVectorized(3));
Assert.Equal(-1, span.IndexOfVectorized(4));
}

[Fact]
public void EmptySpanIndexOf()
{
int len = 0;
byte[] buffer = new byte[len];
Span<byte> span = buffer;
Assert.Equal(-1, span.IndexOfVectorized(4));
}

[Fact]
public void EmptyReadOnlySpanIndexOf()
{
int len = 0;
byte[] buffer = new byte[len];
ReadOnlySpan<byte> span = buffer;
Assert.Equal(-1, span.IndexOfVectorized(4));
}
}
}
@@ -16,6 +16,7 @@
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="15.0.0-preview-20170125-04" />
<PackageReference Include="System.Numerics.Vectors" Version="4.3.0" />
<PackageReference Include="xunit" Version="2.2.0-beta5-build3474" />
<PackageReference Include="xunit.runner.visualstudio" Version="2.2.0-beta5-build1225" />
</ItemGroup>