Added SIMD-based IndexOf by KrzysztofCwalina · Pull Request #1222 · dotnet/corefxlab

KrzysztofCwalina · 2017-02-16T00:09:31Z

IndexOf is very commonly used in parsers. This PR is the first stab at implementing vectorized IndexOf operating over Span.

I added a simple benchmark, and it shows that vectorized IndexOf is ~3x faster than the sequential IndexOf when the item being searched for is far into the buffer. It is significantly slower if the item is at index 0. The break-even point is around index 30.

Test Name / Average [time]
IndexOfBench.VectorizedIndexOf 0.091765931
IndexOfBench.SpanIndexOf 0.29441888

Note that I tried to implement a hybrid algorithm, i.e. sequential for indexes 0-32 and then vectorized. Unfortunately, the overhead of the branch, slicing, etc. was so high that it almost nullified the gains for indexes 0-30 and made perf slightly worse for other indexes. I might play with this idea a bit more later.
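For readers unfamiliar with the technique, here is a minimal sketch of a vectorized search, assuming a plain `byte[]` rather than `Span<byte>`. The class and method names are hypothetical and this is not the code from the PR: it only illustrates comparing one `Vector<byte>` block at a time and falling back to a scalar scan.

```csharp
using System;
using System.Numerics;

public static class VectorizedSearchSketch
{
    // Hypothetical helper (not the PR's implementation): returns the index of
    // the first occurrence of value in buffer, or -1 if it is absent.
    public static int IndexOfVectorized(byte[] buffer, byte value)
    {
        int width = Vector<byte>.Count;
        int i = 0;

        if (Vector.IsHardwareAccelerated && buffer.Length >= width)
        {
            // Broadcast the searched-for byte into every lane.
            var match = new Vector<byte>(value);

            for (; i <= buffer.Length - width; i += width)
            {
                var block = new Vector<byte>(buffer, i);

                // Equal lanes become 0xFF, all other lanes 0x00.
                if (!Vector.Equals(block, match).Equals(Vector<byte>.Zero))
                {
                    break; // a match lies somewhere in this block
                }
            }
        }

        // Scalar scan: pinpoints the match inside the flagged block,
        // or covers the tail that did not fill a whole vector.
        for (; i < buffer.Length; i++)
        {
            if (buffer[i] == value) return i;
        }

        return -1;
    }
}
```

The sketch locates the exact position with a scalar loop for simplicity; the PR instead reinterprets the comparison result as `ulong` lanes to find the position faster.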

One more thing: there is a small difference between the Span and ReadOnlySpan IndexOfVectorized implementations. Span has the GetItem method that returns a by ref (i.e. no copy). ReadOnlySpan does not, so I have to use the indexer (which copies a large vector). @jaredpar, are we getting readonly by refs soon? It would be a great match for ReadOnlySpan.

cc: @ahsonkhan, @shiftylogic, @vancem, @jkotas, @joshfree, @benaadams, @davidfowl, @sivarv

joshfree · 2017-02-16T00:16:00Z

@mellinoe could you also take a quick look?

benaadams

Some changes to make the JIT happier

benaadams · 2017-02-16T00:18:26Z

+
+            if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
+
+            Vector<byte> match = new Vector<byte>(value);


benaadams · 2017-02-16T00:19:08Z

+
+            if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
+
+            Vector<byte> match = new Vector<byte>(value);


Needs jit fix dotnet/coreclr#7683 to be performant; else use

Vector<byte> match = Vector.AsVectorByte(new Vector<uint>(value * 0x01010101u));

Yeah, I know about this issue. The question is: is it worth doing the workaround above? It's going to be slower when the fix is in.

Could #ifdef for current and above? (> Desktop 4.6.3? + Coreclr 1.2?) Don't really know the version numbers 😄
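To make the workaround concrete, the following sketch shows why multiplying by `0x01010101u` is equivalent to the direct broadcast: the multiplication copies the byte into all four bytes of a `uint`, and reinterpreting a `Vector<uint>` filled with that value yields the same bits as `new Vector<byte>(value)`. The class and method names are hypothetical.

```csharp
using System.Numerics;

public static class BroadcastSketch
{
    // Replicates one byte into all four bytes of a uint,
    // e.g. 0x2A becomes 0x2A2A2A2A.
    public static uint Replicate(byte value) => value * 0x01010101u;

    // Compares the direct broadcast with the uint-based workaround.
    public static bool WorkaroundMatchesDirect(byte value)
    {
        Vector<byte> direct = new Vector<byte>(value);
        Vector<byte> viaUInt = Vector.AsVectorByte(new Vector<uint>(Replicate(value)));
        return direct.Equals(viaUInt);
    }
}
```

Because every byte of the replicated `uint` is identical, the result does not depend on the machine's byte order.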

benaadams · 2017-02-16T00:27:48Z

+
+            var byteSize = s_byteSize;
+
+            if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);


Test !Vector.IsHardwareAccelerated first so the Jit eliminates everything in the function. Might be worth adding a definitely inlined indirection shim?

    public static int IndexOfVectorized(this Span<byte> buffer, byte value)
    {
        if (Vector.IsHardwareAccelerated && buffer.Length >= Vector<byte>.Count)
        {
            return buffer.IndexOfVectorizedImpl(value);
        }
        return buffer.IndexOf(value);
    }

Good point. Will do

benaadams · 2017-02-16T00:28:13Z

+
+            var byteSize = s_byteSize;
+
+            if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);


benaadams · 2017-02-16T00:31:09Z

+
+        public static int IndexOfVectorized(this Span<byte> buffer, byte value)
+        {
+            Debug.Assert(s_longSize == 4 || s_longSize == 2);


With AVX-512 this could go to s_longSize == 8. Going above 64 bytes is unlikely in the near term: the cache line is 64 bytes, and changing that would probably break lots of assumptions in software.

benaadams · 2017-02-16T00:40:38Z

+                if (result != zero)
+                {
+                    var longer = Vector.AsVectorUInt64(result);
+                    Debug.Assert(s_longSize == 4 || s_longSize == 2);


Might not be true on AVX-512

Thus the assert :-)

@benaadams Will the JIT emit AVX-512 instructions on processors that support it today?

@KrzysztofCwalina Does the corefx(lab) CI run tests in Debug mode? If the JIT changes to support AVX-512, do any of the CI servers run a Xeon Phi processor or whatever it would take for this Debug.Assert to fail?

If we were to ship a System.Buffers package that didn't support a Vector<ulong>.Count of 8 (or greater), could IndexOfVectorized simply skip over matching bytes and continue the for loop? If that were the case, Kestrel couldn't use it. That would be a security issue as that could cause Kestrel to read requests differently than proxies in front of it. Hopefully the server would just fall over instead.

Xeon Phi won't help you; it's a co-processor card you have to specifically target (usually with an Intel C++ compiler). I think some of the Sandy Bridge EP Xeons have it, but I suspect it will not be a straight exposure of the registers, because the AVX-512 spec is a bit all over the shop.

benaadams · 2017-02-16T00:41:55Z

+                    var longer = Vector.AsVectorUInt64(result);
+                    Debug.Assert(s_longSize == 4 || s_longSize == 2);
+
+                    var candidate = longer[0];


Change to for loop using Vector<ulong>.Count as limit dotnet/coreclr#7912

I had it as such a loop. It was 10% slower. This is also due to a missing feature in the JIT. Once we fully run on 2.0, the loop (as you say) will be auto-unrolled.

/cc @JosephTremoulet

benaadams · 2017-02-16T00:50:15Z

+                    Debug.Assert(s_longSize == 4 || s_longSize == 2);
+
+                    var candidate = longer[0];
+                    if (candidate != 0) return vectorIndex * byteSize + IndexOf(candidate);


Use break and continue, as inline returns make codegen nasty; and use a loop for unrolling, as above dotnet/coreclr#7912

    ulong candidate = 0;
    int longIndex = 0;
    for (; longIndex < Vector<ulong>.Count; longIndex++)
    {
        // Assign to the outer variable; redeclaring it here would not compile.
        candidate = longer[longIndex];
        if (candidate == 0) continue;
        break;
    }
    return 8 * longIndex + vectorIndex * Vector<byte>.Count + IndexOf(candidate);

benaadams · 2017-02-16T00:50:57Z

+        }
+
+        // used by IndexOfVectorized
+        static int IndexOf(ulong next)


Force inline? (as only called once per vector, if loop changed as suggested)

How did the attribute disappear from the PR? Seriously :-)
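The body of the `IndexOf(ulong)` helper is not shown in this thread. As a rough illustration only, here is one way such a helper could find the first matching byte within a 64-bit lane, assuming a little-endian layout where the lowest byte of the `ulong` corresponds to the lowest buffer index. The names are hypothetical, and the PR's actual helper may use a different (likely faster) technique.

```csharp
using System.Runtime.CompilerServices;

public static class LaneSearchSketch
{
    // Hypothetical stand-in for the PR's IndexOf(ulong) helper: returns the
    // position (0-7) of the lowest non-zero byte in word, or -1 if word is 0.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int FirstNonZeroByte(ulong word)
    {
        for (int i = 0; i < 8; i++)
        {
            // Inspect the lowest byte, then shift the next byte into place.
            if ((word & 0xFF) != 0) return i;
            word >>= 8;
        }
        return -1;
    }
}
```

`MethodImplOptions.AggressiveInlining` is the attribute usually meant by "force inline" in .NET; it is a hint to the JIT rather than a guarantee.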

benaadams · 2017-02-16T00:55:04Z

+            {
+                var vector = vectors.GetItem(vectorIndex);
+                var result = Vector.Equals(vector, match);
+                if (result != zero)


Zero test directly rather than via variable; easier for Jit to pick up dotnet/coreclr#7367

!result.Equals(Vector<byte>.Zero)

I thought I tested it and local was faster, but you are right that it should be slower. I will retest and possibly change.

KrzysztofCwalina · 2017-02-16T01:25:22Z

Hmmm, I made most of the changes and the code got slower. Let me merge as is, and I will try your suggestions one by one after I am back from vacation (next Tuesday).

benaadams · 2017-02-20T14:56:41Z

I think it's easier just to fix it rather than doing a NonPortableCast? dotnet/corefx@1055cf3

benaadams · 2017-02-21T06:23:50Z

Added PR with changes in referenced corefx PR #1231 - not sure how the xunit benchmark works; it seems to generate very, very large CSV files without much context, that then need to be interpreted?

Added SIMD-based IndexOf

20e7852

dnfclas added the cla-already-signed label Feb 16, 2017

benaadams suggested changes Feb 16, 2017


KrzysztofCwalina merged commit 51e0c8e into dotnet:master Feb 16, 2017

KrzysztofCwalina deleted the SimdIndexOf branch March 7, 2017 22:22



6 participants
