Skip to content
This repository was archived by the owner on Aug 2, 2023. It is now read-only.

Added SIMD-based IndexOf#1222

Merged
KrzysztofCwalina merged 1 commit into
dotnet:masterfrom
KrzysztofCwalina:SimdIndexOf
Feb 16, 2017
Merged

Added SIMD-based IndexOf#1222
KrzysztofCwalina merged 1 commit into
dotnet:masterfrom
KrzysztofCwalina:SimdIndexOf

Conversation

@KrzysztofCwalina
Copy link
Copy Markdown
Member

@KrzysztofCwalina KrzysztofCwalina commented Feb 16, 2017

IndexOf is very commonly used in parsers. This PR is the first stab at implementing vectorized IndexOf operating over Span.

I added a simple benchmark, and it shows that vectorized IndexOf is ~3x faster than the sequential IndexOf for searching items far into the buffer. It is significantly slower if the searched for item is at index 0. The break even point is around index 30.

Test Name / Average [time]
IndexOfBench.VectorizedIndexOf 0.091765931
IndexOfBench.SpanIndexOf 0.29441888

Note, that I tried to implement a hybrid algorithm, i.e. sequential for indexes 0-32 and then vectorized. Unfortunately the overhead of the branch, slicing, etc. was so high that it almost nullified the gains for indexes 0-30 and then made perf slightly worse for other indexes. I might play with this idea a bit more later.

One more thing, there is a small different between Span and ReadOnlySpan IndexOfVectorized implementations. Span has the GetItem method that returns a by ref (i.e. no copy). ReadOnlySpan does not and so I have to use the indexer (which copies a large vector). @jaredpar, are we getting readonly by refs soon? It would be a great match for ReadOnlySpan.

cc: @ahsonkhan, @shiftylogic, @vancem, @jkotas, @joshfree, @benaadams, @davidfowl, @sivarv

@joshfree
Copy link
Copy Markdown
Member

@mellinoe could you also take a quick look?

Copy link
Copy Markdown
Member

@benaadams benaadams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes to work with the jit happier


if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);

Vector<byte> match = new Vector<byte>(value);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above


if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);

Vector<byte> match = new Vector<byte>(value);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs jit fix dotnet/coreclr#7683 to be performant; else use

Vector<byte> match = Vector.AsVectorByte(new Vector<uint>(value * 0x01010101u));

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know about this issue. The question is: is it worth doing the workaround above? It's going to be slower when the fix is in.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could #ifdef for current and above? (> Desktop 4.6.3? + Coreclr 1.2?) Don't really know the version numbers 😄


var byteSize = s_byteSize;

if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test !Vector.IsHardwareAccelerated first so the Jit eliminates everything in the function. Might be worth adding a definitely inlined indirection shim?

public static int IndexOfVectorized(this Span<byte> buffer, byte value)
{
    if (Vector.IsHardwareAccelerated && buffer.Length >= Vector<byte>.Count)
    {
        return buffer.IndexOfVectorizedImpl(value);
    }

    return buffer.IndexOf(value);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Will do


var byteSize = s_byteSize;

if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above


public static int IndexOfVectorized(this Span<byte> buffer, byte value)
{
Debug.Assert(s_longSize == 4 || s_longSize == 2);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With AVX-512 this could go to s_longSize == 8; going above 64 bytes is probably unlikely in near term as cache line is 64 bytes which changing would probably break lots of assumptions in software

if (result != zero)
{
var longer = Vector.AsVectorUInt64(result);
Debug.Assert(s_longSize == 4 || s_longSize == 2);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not be true on AVX -512

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thus the assert :-)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benaadams Will the JIT emit AVX-512 instructions on processors that support it today?

@KrzysztofCwalina Does the corefx(lab) CI run tests in Debug mode? If the JIT changes to support AVX-512, do any of the CI servers run a Xeon Phi processor or whatever it would take for this Debug.Assert to fail?

If we were to ship a System.Buffers package that didn't support a Vector<ulong>.Count of 8 (or greater), could IndexOfVectorized simply skip over matching bytes and continue the for loop? If that were the case, Kestrel couldn't use it. That would be a security issue as that could cause Kestrel to read requests differently than proxies in front of it. Hopefully the server would just fall over instead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xeon Phi won't help you, it's a co-processor card you have to specifically target (usually with an intel c++ complier). I think some of the Sandy Bridge EP Xeon's have it, but I suspect it will be not a straight exposure of the registers because the AVX512 spec is a bit all over the shop.

var longer = Vector.AsVectorUInt64(result);
Debug.Assert(s_longSize == 4 || s_longSize == 2);

var candidate = longer[0];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to for loop using Vector<ulong>.Count as limit dotnet/coreclr#7912

Copy link
Copy Markdown
Member Author

@KrzysztofCwalina KrzysztofCwalina Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it as such loop. Was 10% slower. This is also due to a missing feature in JIT. Once we fully run on 2.0, the loop (as you say) will be auto unrolled.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug.Assert(s_longSize == 4 || s_longSize == 2);

var candidate = longer[0];
if (candidate != 0) return vectorIndex * byteSize + IndexOf(candidate);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break and continue as inline returns make codegen nasty; and loop for unrolling as above dotnet/coreclr#7912

ulong candidate = 0;
int longIndex = 0;
for (; longIndex < Vector<ulong>.Count; longIndex++)
{
    var candidate = longer[longIndex];
    if (candidate == 0) continue;
    break;
}
return 8 * longIndex + vectorIndex * Vector<byte>.Count + IndexOf(candidate);

}

// used by IndexOfVectorized
static int IndexOf(ulong next)
Copy link
Copy Markdown
Member

@benaadams benaadams Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force inline? (as only called once per vector, if loop changed as suggested)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did the attribute disappear from the PR? Seriously :-)

{
var vector = vectors.GetItem(vectorIndex);
var result = Vector.Equals(vector, match);
if (result != zero)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero test directly rather than via variable; easier for Jit to pick up dotnet/coreclr#7367

!result.Equals(Vector<byte>.Zero)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I tested it and local was faster, but you are right that it should be slower. I will retest and possibly change.

@KrzysztofCwalina
Copy link
Copy Markdown
Member Author

Hmmm, I made most of changes and the code got slower. Let me merge as is and I will try your suggestions one by one after I am back from vacations (next Tuesday).

@KrzysztofCwalina KrzysztofCwalina merged commit 51e0c8e into dotnet:master Feb 16, 2017
@benaadams
Copy link
Copy Markdown
Member

benaadams commented Feb 20, 2017

I think its easier just to fix it rather than doing a NonPortableCast? dotnet/corefx@1055cf3

@benaadams
Copy link
Copy Markdown
Member

Added PR with changes in referenced corefx PR #1231 - not sure how the xunit benchmark works; its seems to generate very very large csv files without much context, that than need to be interpreted?

@KrzysztofCwalina KrzysztofCwalina deleted the SimdIndexOf branch March 7, 2017 22:22
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants