Added SIMD-based IndexOf #1222
@mellinoe could you also take a quick look?
benaadams
left a comment
Some changes to make the JIT happier
| if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
| Vector<byte> match = new Vector<byte>(value);
Needs JIT fix dotnet/coreclr#7683 to be performant; otherwise use
Vector<byte> match = Vector.AsVectorByte(new Vector<uint>(value * 0x01010101u));
Yeah, I know about this issue. The question is: is it worth doing the workaround above? It's going to be slower when the fix is in.
Could #ifdef for current and above? (> Desktop 4.6.3? + Coreclr 1.2?) Don't really know the version numbers 😄
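The multiplication workaround discussed in this thread can be sketched as follows. This is only an illustration of why `value * 0x01010101u` broadcasts a byte; the class and method names (`BroadcastSketch`, `Replicate`, `Broadcast`) are hypothetical and not part of the PR.

```csharp
using System;
using System.Numerics;

public static class BroadcastSketch
{
    // Multiplying by 0x01010101 copies the byte into all four bytes of a uint,
    // e.g. 0xAB becomes 0xABABABAB. The largest input (0xFF) yields 0xFFFFFFFF,
    // so the product never overflows.
    public static uint Replicate(byte value) => value * 0x01010101u;

    // Broadcasts the replicated uint across the vector, then reinterprets the
    // lanes as bytes. Every byte lane ends up equal to `value`.
    public static Vector<byte> Broadcast(byte value) =>
        Vector.AsVectorByte(new Vector<uint>(Replicate(value)));
}
```

Because all four bytes of each `uint` lane are identical, the reinterpreted vector is the same regardless of endianness, and it should compare equal to `new Vector<byte>(value)`.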
| var byteSize = s_byteSize;
| if (buffer.Length < byteSize * 2 || !Vector.IsHardwareAccelerated) return buffer.IndexOf(value);
Test !Vector.IsHardwareAccelerated first so the JIT eliminates everything in the function. Might be worth adding a definitely inlined indirection shim?
public static int IndexOfVectorized(this Span<byte> buffer, byte value)
{
    if (Vector.IsHardwareAccelerated && buffer.Length >= Vector<byte>.Count)
    {
        return buffer.IndexOfVectorizedImpl(value);
    }
    return buffer.IndexOf(value);
}
Good point. Will do
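A self-contained sketch of the shim pattern suggested above, assuming a modern .NET runtime where `Vector<byte>` can be constructed from a `Span<byte>` and `Span<byte>.IndexOf` comes from `MemoryExtensions`. The body of `IndexOfVectorizedImpl` here is a simplified stand-in written for illustration, not the implementation from this PR, and the class name `IndexOfShimSketch` is hypothetical.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class IndexOfShimSketch
{
    // Small, always-inlined entry point. Vector.IsHardwareAccelerated is tested
    // first so the JIT can treat it as a constant and drop the dead branch.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int IndexOfVectorized(this Span<byte> buffer, byte value)
    {
        if (Vector.IsHardwareAccelerated && buffer.Length >= Vector<byte>.Count)
        {
            return IndexOfVectorizedImpl(buffer, value);
        }
        return buffer.IndexOf(value);
    }

    // Simplified stand-in: scan whole vectors until one contains a match,
    // then let the scalar IndexOf locate the exact position (or scan the tail).
    private static int IndexOfVectorizedImpl(Span<byte> buffer, byte value)
    {
        var match = new Vector<byte>(value);
        int i = 0;
        int lastStart = buffer.Length - Vector<byte>.Count;
        for (; i <= lastStart; i += Vector<byte>.Count)
        {
            var chunk = new Vector<byte>(buffer.Slice(i));
            if (!Vector.Equals(chunk, match).Equals(Vector<byte>.Zero))
            {
                break;
            }
        }
        int rest = buffer.Slice(i).IndexOf(value);
        return rest < 0 ? -1 : i + rest;
    }
}
```

Keeping the shim tiny means callers with short buffers, or on hardware without SIMD support, pay only for the check and the scalar path.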
| public static int IndexOfVectorized(this Span<byte> buffer, byte value)
| {
| Debug.Assert(s_longSize == 4 || s_longSize == 2);
With AVX-512 this could go to s_longSize == 8; going above 64 bytes is probably unlikely in the near term, as the cache line is 64 bytes and changing that would probably break lots of assumptions in software.
| if (result != zero)
| {
| var longer = Vector.AsVectorUInt64(result);
| Debug.Assert(s_longSize == 4 || s_longSize == 2);
Might not be true on AVX-512
Thus the assert :-)
@benaadams Will the JIT emit AVX-512 instructions on processors that support it today?
@KrzysztofCwalina Does the corefx(lab) CI run tests in Debug mode? If the JIT changes to support AVX-512, do any of the CI servers run a Xeon Phi processor or whatever it would take for this Debug.Assert to fail?
If we were to ship a System.Buffers package that didn't support a Vector<ulong>.Count of 8 (or greater), could IndexOfVectorized simply skip over matching bytes and continue the for loop? If that were the case, Kestrel couldn't use it. That would be a security issue as that could cause Kestrel to read requests differently than proxies in front of it. Hopefully the server would just fall over instead.
Xeon Phi won't help you; it's a co-processor card you have to specifically target (usually with an Intel C++ compiler). I think some of the Sandy Bridge EP Xeons have it, but I suspect it will not be a straight exposure of the registers because the AVX-512 spec is a bit all over the shop.
| var candidate = longer[0];
Change to a for loop using Vector<ulong>.Count as the limit (dotnet/coreclr#7912)
I had it as such a loop. It was 10% slower. This is also due to a missing feature in the JIT. Once we fully run on 2.0, the loop (as you say) will be auto-unrolled.
| if (candidate != 0) return vectorIndex * byteSize + IndexOf(candidate);
Use break and continue, as inline returns make the codegen nasty; and use a loop for unrolling as above (dotnet/coreclr#7912)
ulong candidate = 0;
int longIndex = 0;
for (; longIndex < Vector<ulong>.Count; longIndex++)
{
    // Assign to the outer variable; redeclaring it with var would not compile.
    candidate = longer[longIndex];
    if (candidate == 0) continue;
    break;
}
return 8 * longIndex + vectorIndex * Vector<byte>.Count + IndexOf(candidate);
| }
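The lane scan suggested above can be sketched as a standalone helper. This is a hedged illustration, not the PR's code: `LaneScanSketch` and `FirstMatchOffset` are hypothetical names, `BitOperations.TrailingZeroCount` is used in place of the PR's `IndexOf(ulong)` helper (whose body is not shown here), and the byte ordering assumes a little-endian machine.

```csharp
using System;
using System.Numerics;

public static class LaneScanSketch
{
    // Given the result of Vector.Equals (0xFF in matching byte lanes, 0 elsewhere),
    // returns the offset of the first matching byte within the vector, or -1.
    public static int FirstMatchOffset(Vector<byte> result)
    {
        Vector<ulong> longer = Vector.AsVectorUInt64(result);
        ulong candidate = 0;
        int longIndex = 0;
        // Loop bounded by Vector<ulong>.Count, so it works for 2, 4 or 8 lanes.
        for (; longIndex < Vector<ulong>.Count; longIndex++)
        {
            candidate = longer[longIndex];
            if (candidate == 0) continue;
            break;
        }
        if (candidate == 0) return -1;
        // Assumes little-endian: the lowest set byte of the ulong is the earliest
        // byte in memory. Each byte spans 8 bits, hence the division.
        return sizeof(ulong) * longIndex + BitOperations.TrailingZeroCount(candidate) / 8;
    }
}
```

Bounding the loop by `Vector<ulong>.Count` rather than hard-coding lanes avoids the silent-skip concern raised earlier in the thread if the vector width ever grows.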
| // used by IndexOfVectorized
| static int IndexOf(ulong next)
Force inline? (as only called once per vector, if loop changed as suggested)
How did the attribute disappear from the PR? Seriously :-)
| {
| var vector = vectors.GetItem(vectorIndex);
| var result = Vector.Equals(vector, match);
| if (result != zero)
Test against zero directly rather than via a variable; easier for the JIT to pick up (dotnet/coreclr#7367)
!result.Equals(Vector<byte>.Zero)
I thought I tested it and local was faster, but you are right that it should be slower. I will retest and possibly change.
Hmmm, I made most of the changes and the code got slower. Let me merge as is, and I will try your suggestions one by one after I am back from vacation (next Tuesday).
I think it's easier just to fix it rather than doing a
Added PR with changes in referenced corefx PR #1231. Not sure how the xunit benchmark works; it seems to generate very, very large CSV files without much context, which then need to be interpreted?
IndexOf is very commonly used in parsers. This PR is the first stab at implementing vectorized IndexOf operating over Span.
I added a simple benchmark, and it shows that vectorized IndexOf is ~3x faster than the sequential IndexOf for searching items far into the buffer. It is significantly slower if the searched for item is at index 0. The break even point is around index 30.
Test Name / Average [time]
IndexOfBench.VectorizedIndexOf 0.091765931
IndexOfBench.SpanIndexOf 0.29441888
Note, that I tried to implement a hybrid algorithm, i.e. sequential for indexes 0-32 and then vectorized. Unfortunately the overhead of the branch, slicing, etc. was so high that it almost nullified the gains for indexes 0-30 and then made perf slightly worse for other indexes. I might play with this idea a bit more later.
One more thing: there is a small difference between the Span and ReadOnlySpan IndexOfVectorized implementations. Span has the GetItem method that returns a by ref (i.e. no copy). ReadOnlySpan does not, so I have to use the indexer (which copies a large vector). @jaredpar, are we getting readonly by refs soon? It would be a great match for ReadOnlySpan.
cc: @ahsonkhan, @shiftylogic, @vancem, @jkotas, @joshfree, @benaadams, @davidfowl, @sivarv