Improve Adler32 vectorization #125191

base: main

Changes from all commits: d70bfc1, 9735c7e, e379f9f, f35db78, d411032, b5dc5c2, 2a9af33
@@ -0,0 +1,389 @@

```csharp
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

namespace System.IO.Hashing;

public sealed partial class Adler32
{
    private static bool IsVectorizable(ReadOnlySpan<byte> source)
        => Vector128.IsHardwareAccelerated && source.Length >= Vector128<byte>.Count;

    private static uint UpdateVectorized(uint adler, ReadOnlySpan<byte> source)
        => Adler32Simd.UpdateVectorized(adler, source);
}

file static class Adler32Simd
{
    // VMax represents the maximum number of 16-byte vectors we can process before reducing
    // mod 65521. This is analogous to NMax in the scalar code; however, because the accumulated
    // values are distributed across vector elements, we can process more bytes before possible
    // overflow in any individual element. For this implementation, the max is actually 460
    // vectors, but we choose 448 because it divides evenly by any reasonable block size.
    public const uint VMax = 448;
```
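As context for the bound above, the analogous scalar limit (zlib's `NMAX`) can be derived by brute force: start both sums at their worst-case value of 65520 and feed 0xFF bytes until the `s2` accumulator would exceed a 32-bit unsigned integer. This is an illustrative sketch, not part of the PR; the vectorized bound (460, rounded down to 448) follows from a similar but per-lane analysis.

```python
# Illustrative sketch (not part of the PR): derive the scalar NMax bound,
# i.e. how many bytes Adler32 can accumulate in u32 sums before s2 can
# overflow. Worst case: s1 and s2 start at MOD - 1 and every byte is 0xFF.
MOD = 65521
U32_MAX = 2**32 - 1

def scalar_nmax():
    n, s1, s2 = 0, MOD - 1, MOD - 1
    while True:
        s1 += 255              # worst-case byte value
        if s2 + s1 > U32_MAX:  # next s2 update would overflow a u32
            return n
        s2 += s1
        n += 1

print(scalar_nmax())  # 5552, the NMAX constant used by zlib
```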
```csharp
    private static ReadOnlySpan<byte> MaskBytes => [
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    ];

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static uint UpdateVectorized(uint adler, ReadOnlySpan<byte> source)
    {
        if (Vector256.IsHardwareAccelerated && Avx2.IsSupported)
```
**Member:** What hardware were you testing AVX512 on? I wouldn't expect it to be slower than Vector256 at all, and at worst the same speed here.

**Author:** I'm on an AMD Zen5 (see full benchmark details in the PR description). The AVX-512 implementation adds some extra high-latency calculations to the inner loop, so it's expected to be slower. It can't be made to match the perf of AVX2 using the same logic widened, because a fast …

**Member:** You should be able to treat it as 2x256 instead of as 1x512 to avoid the wider multiplier issue and still get the perf gains.

**Author:** Yes, that's what this PR does to improve 2x over main. Main uses 1x256 or 1x512. 2x256 is faster than either.

**Member:** I would expect that actual 2x256 (i.e. effectively unrolling) should be slightly slower than using actual 512 and treating it as 2x256, namely due to the denser code and not needing to manually pipeline the instructions. I would not expect the V512 path to be slower.

**Author:** This already gets instruction-level parallelism. At best, you could match the perf with V512, but that's a lot of extra complexity for nothing.
```csharp
        {
            return UpdateCore<AdlerVector256, AccumulateX86, DotProductX86>(adler, source);
        }

        if (Ssse3.IsSupported)
        {
            return UpdateCore<AdlerVector128, AccumulateX86, DotProductX86>(adler, source);
        }

        if (AdvSimd.Arm64.IsSupported)
        {
            if (Dp.IsSupported)
            {
                return UpdateCore<AdlerVector128, AccumulateArm64, DotProductArm64Dp>(adler, source);
            }

            return UpdateCore<AdlerVector128, AccumulateArm64, DotProductArm64>(adler, source);
```
**Member** (on lines +51 to +56): What is the perf difference between these two paths? Is it worth the additional complexity here?

**Author:** The …
```csharp
        }

        return UpdateCore<AdlerVector128, AccumulateXplat, DotProductXplat>(adler, source);
```
**Member:** What is the perf difference of the above code paths with the xplat path (all platforms)?

**Author:** Xplat is around 1/3 the speed of native on both x64 (if restricted to Vector128) and Arm64 (if restricted to AdvSimd base).
```csharp
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static uint UpdateCore<TSimdStrategy, TAccumulate, TDotProduct>(uint adler, ReadOnlySpan<byte> source)
        where TSimdStrategy : struct, ISimdStrategy
        where TAccumulate : struct, ISimdAccumulate
        where TDotProduct : struct, ISimdDotProduct
    {
        Debug.Assert(source.Length >= Vector128<byte>.Count);

        ref byte bufRef = ref MemoryMarshal.GetReference(source);
        uint totalLength = (uint)source.Length;
        uint totalVectors = totalLength / (uint)Vector128<byte>.Count;

        uint loopVectors = totalVectors & ~1u;
        uint tailVectors = totalVectors - loopVectors;
        uint tailLength = totalLength - totalVectors * (uint)Vector128<byte>.Count;

        uint s1 = (ushort)adler;
        uint s2 = adler >>> 16;

        Vector128<uint> vs1 = Vector128.CreateScalar(s1);
        Vector128<uint> vs2 = Vector128.CreateScalar(s2);

        (vs1, vs2) = TSimdStrategy.VectorLoop<TAccumulate, TDotProduct>(vs1, vs2, ref bufRef, loopVectors);
        bufRef = ref Unsafe.Add(ref bufRef, loopVectors * (uint)Vector128<byte>.Count);

        Vector128<byte> weights = Vector128.CreateSequence((byte)16, unchecked((byte)-1));

        if (tailVectors != 0)
        {
            Debug.Assert(tailVectors == 1);

            Vector128<byte> bytes = Vector128.LoadUnsafe(ref bufRef);
            bufRef = ref Unsafe.Add(ref bufRef, (uint)Vector128<byte>.Count);

            Vector128<uint> vps = vs1;

            vs1 = TAccumulate.Accumulate(vs1, bytes);
            vs2 = TDotProduct.DotProduct(vs2, bytes, weights);

            vs2 += vps << 4;
        }

        if (tailLength != 0)
        {
            Debug.Assert(tailLength < (uint)Vector128<byte>.Count);

            Vector128<byte> bytes = Vector128.LoadUnsafe(ref Unsafe.Subtract(ref bufRef, (uint)Vector128<byte>.Count - tailLength));
            bytes &= Vector128.LoadUnsafe(ref MemoryMarshal.GetReference(MaskBytes), tailLength);

            Vector128<uint> vps = vs1;

            vs1 = TAccumulate.Accumulate(vs1, bytes);
            vs2 = TDotProduct.DotProduct(vs2, bytes, weights);

            vs2 += vps * Vector128.Create(tailLength);
        }
```
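The tail handling above loads the final 16 bytes with an overlapping read and masks off the prefix that was already processed, by loading `MaskBytes` at offset `tailLength`. A hedged scalar model of that overlap-and-mask trick (illustrative only, not the SIMD code):

```python
# Model of the overlapped tail load + mask. The real code loads a sliding
# window of a 16-zeros/16-ones table at offset tail_len; here we build the
# equivalent mask directly.
data = bytes(range(1, 41))   # 40 bytes: 2 full 16-byte vectors + 8-byte tail
tail_len = len(data) % 16

window = data[-16:]          # overlapped load of the final 16 bytes
mask = bytes([0x00] * (16 - tail_len) + [0xFF] * tail_len)
masked = bytes(b & m for b, m in zip(window, mask))

# already-processed bytes are zeroed; only the true tail survives
assert all(b == 0 for b in masked[:16 - tail_len])
assert masked[-tail_len:] == data[-tail_len:]
```

Zeroed lanes contribute nothing to either sum, which is why the same accumulate/dot-product kernels can be reused for the partial vector.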
```csharp
        s1 = Vector128.Sum(vs1) % Adler32.ModBase;
        s2 = Vector128.Sum(vs2) % Adler32.ModBase;

        return s1 | (s2 << 16);
    }
}

file struct AdlerVector128 : ISimdStrategy
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Vector128<uint> QuickModBase(Vector128<uint> values)
    {
        // Calculating the residual mod 65521 is impractical in SIMD; however, we can reduce by
        // enough to prevent overflow without changing the final result of a modulo performed later.
        //
        // Essentially, the high word of the accumulator represents the number of times it has
        // wrapped past 65536.
        // 65536 % 65521 = 15, which is what would be carried forward from the high word.
        // We can simply multiply the high word by 15 and add that to the low word to perform
        // the reduction, resulting in a maximum possible residual of 0xFFFF0.
        //
        // This is further optimized to: `high * 16 - high + low`
        // and implemented as: `(high << 4) + (low - high)`.

        Vector128<uint> vlo = values & (Vector128<uint>.AllBitsSet >>> 16);
```
**Member:** Why not …?

**Author:** I actually wrote the code as I wanted it to be interpreted by JIT, i.e. …
```csharp
        Vector128<uint> vhi = values >>> 16;
        return (vhi << 4) + (vlo - vhi);
    }
```
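The folding identity behind `QuickModBase` is easy to check in scalar form: since `v = hi * 65536 + lo` and `65536 % 65521 == 15`, the value `hi * 15 + lo` is congruent to `v` mod 65521 and bounded by `0xFFFF0`. An illustrative sketch (not part of the PR):

```python
import random

MOD = 65521

def quick_mod_base(v):
    lo = v & 0xFFFF
    hi = v >> 16
    # hi * 15 + lo, written as (hi << 4) + (lo - hi) as in the C# code;
    # Python ints don't wrap, but any uint32 wraparound of (lo - hi)
    # cancels out in the subsequent addition
    return (hi << 4) + (lo - hi)

for _ in range(1000):
    v = random.getrandbits(32)
    r = quick_mod_base(v)
    assert r % MOD == v % MOD   # reduction preserves the residue
    assert r <= 0xFFFF0         # maximum residual noted in the comment
```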
```csharp
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static (Vector128<uint> vs1, Vector128<uint> vs2) VectorLoop<TAccumulate, TDotProduct>(Vector128<uint> vs1, Vector128<uint> vs2, ref byte sourceRef, uint vectors)
        where TAccumulate : struct, ISimdAccumulate
        where TDotProduct : struct, ISimdDotProduct
    {
        Debug.Assert(uint.IsEvenInteger(vectors));

        const uint blockSize = 2;

        Vector128<byte> weights1 = Vector128.CreateSequence((byte)32, unchecked((byte)-1));
        Vector128<byte> weights2 = Vector128.CreateSequence((byte)16, unchecked((byte)-1));

        while (vectors >= blockSize)
        {
            Vector128<uint> vs3 = Vector128<uint>.Zero;
            Vector128<uint> vps = Vector128<uint>.Zero;

            uint blocks = uint.Min(vectors, Adler32Simd.VMax) / blockSize;
            vectors -= blocks * blockSize;

            do
            {
                Vector128<byte> bytes1 = Vector128.LoadUnsafe(ref sourceRef);
                Vector128<byte> bytes2 = Vector128.LoadUnsafe(ref sourceRef, (uint)Vector128<byte>.Count);
                sourceRef = ref Unsafe.Add(ref sourceRef, (uint)Vector128<byte>.Count * 2);
```
**Member:** We're looking at ways to reduce or otherwise remove unsafe code like this. While we can't really remove …

**Author:** This is the exact same reference math done in the current implementation. What's changed between last week, when that was approved, and now?

**Member:** A comment was given then too. Last week was namely just taking the existing code "as is" and extending it for the parameterization. This is touching the code with a slightly more significant and non-critical rewrite (even with quite a lot of it being the same and just moved down for sharing). Since we're actively doing work to reduce unsafe usage where feasible, ideally we fix this up rather than continuing to persist the problematic code.

**Author:** I think you've mixed up the Adler32 and parameterized CRC32/64 PRs. Vectorized Adler32 was 100% new in #124409. It's certainly possible to move to a buffer offset, but I think any code using …
```csharp
                vps += vs1;

                vs1 = TAccumulate.Accumulate(vs1, bytes1, bytes2);
                vs2 = TDotProduct.DotProduct(vs2, bytes1, weights1);
                vs3 = TDotProduct.DotProduct(vs3, bytes2, weights2);
            }
            while (--blocks != 0);

            vs2 += vps << 5;
            vs2 += vs3;

            vs1 = QuickModBase(vs1);
            vs2 = QuickModBase(vs2);
        }

        return (vs1, vs2);
    }
}
```
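The inner-loop algebra can be checked against the scalar recurrence: for a 32-byte block, `s2` gains 32 times the previous `s1` (the `vps << 5` term) plus a dot product of the bytes with descending weights 32..1 (the `weights1`/`weights2` products, split across the two 16-byte loads). An illustrative scalar sketch, not the vector code itself:

```python
MOD = 65521

def adler_scalar(s1, s2, data):
    # reference definition: per-byte recurrence
    for b in data:
        s1 = (s1 + b) % MOD
        s2 = (s2 + s1) % MOD
    return s1, s2

def adler_block32(s1, s2, block):
    # blockwise form used by the vector loop: s2 += 32 * old s1
    # plus a weighted sum with weights 32, 31, ..., 1
    assert len(block) == 32
    s2 += 32 * s1 + sum(w * b for w, b in zip(range(32, 0, -1), block))
    s1 += sum(block)
    return s1 % MOD, s2 % MOD

data = bytes((i * 37 + 11) % 256 for i in range(32))
assert adler_block32(1, 0, data) == adler_scalar(1, 0, data)
```

Deferring the modulo to the end of the block is valid because the reduction is a congruence; the vector code only needs `QuickModBase` often enough to avoid 32-bit overflow.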
```csharp
file struct AdlerVector256 : ISimdStrategy
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<uint> QuickModBase(Vector256<uint> values)
    {
        Vector256<uint> vlo = values & (Vector256<uint>.AllBitsSet >>> 16);
        Vector256<uint> vhi = values >>> 16;
        return (vhi << 4) + (vlo - vhi);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<uint> Accumulate(Vector256<uint> sums, Vector256<byte> bytes)
        => Avx2.SumAbsoluteDifferences(bytes, Vector256<byte>.Zero).AsUInt32() + sums;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<uint> Accumulate(Vector256<uint> sums, Vector256<byte> bytes1, Vector256<byte> bytes2)
    {
        Vector256<byte> zero = Vector256<byte>.Zero;
        Vector256<uint> sad = Avx2.SumAbsoluteDifferences(bytes1, zero).AsUInt32();
        return sad + Avx2.SumAbsoluteDifferences(bytes2, zero).AsUInt32() + sums;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector256<uint> DotProduct(Vector256<uint> addend, Vector256<byte> left, Vector256<byte> right)
    {
        Vector256<short> mad = Avx2.MultiplyAddAdjacent(left, right.AsSByte());
        return Avx2.MultiplyAddAdjacent(mad, Vector256<short>.One).AsUInt32() + addend;
```
**Member:** Shouldn't this second one be …?

**Author:** It's a widen + add pairwise. There's actually a single dot product instruction that does it all in …

**Member:** Right, but the whole setup here is effectively just doing …

**Author:** Right, and that's a dot product. The only widening dot product instructions I'm familiar with for x86 are VNNI. What exact instruction sequence are you thinking of?
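As an aside on the x86 sequence discussed here, a scalar model of `pmaddubsw` (`MultiplyAddAdjacent` on u8 x s8, pairwise into s16) followed by `pmaddwd` against ones (s16 pairwise into s32) shows why the pair acts as a widening dot product, provided the intermediate pair sums stay below the s16 saturation limit (true for these weights, since 255 * (64 + 63) = 32385 < 32767). A sketch, not the intrinsics themselves:

```python
def widening_dot(left, right):
    # pmaddubsw: multiply adjacent u8*s8 pairs, sum each pair into an s16 lane
    mad16 = [left[i] * right[i] + left[i + 1] * right[i + 1]
             for i in range(0, len(left), 2)]
    # pmaddwd with a vector of ones: sum adjacent s16 pairs into s32 lanes
    return [mad16[i] + mad16[i + 1] for i in range(0, len(mad16), 2)]

left = list(range(100, 132))      # 32 "byte" values
right = list(range(32, 0, -1))    # descending weights, as in the loop
lanes = widening_dot(left, right)

# summed across lanes, this equals the plain dot product
assert sum(lanes) == sum(l * r for l, r in zip(left, right))
```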
```csharp
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static (Vector128<uint> vs1, Vector128<uint> vs2) VectorLoop<TAccumulate, TDotProduct>(Vector128<uint> vs1, Vector128<uint> vs2, ref byte sourceRef, uint vectors)
        where TAccumulate : struct, ISimdAccumulate
        where TDotProduct : struct, ISimdDotProduct
    {
        Debug.Assert(uint.IsEvenInteger(vectors));

        const uint blockSize = 4;

        Vector256<byte> weights1 = Vector256.CreateSequence((byte)64, unchecked((byte)-1));
        Vector256<byte> weights2 = Vector256.CreateSequence((byte)32, unchecked((byte)-1));

        Vector256<uint> ws1 = vs1.ToVector256Unsafe();
        Vector256<uint> ws2 = vs2.ToVector256Unsafe();

        while (vectors >= blockSize)
        {
            Vector256<uint> ws3 = Vector256<uint>.Zero;
            Vector256<uint> wps = Vector256<uint>.Zero;

            uint blocks = uint.Min(vectors, Adler32Simd.VMax) / blockSize;
            vectors -= blocks * blockSize;

            do
            {
                Vector256<byte> bytes1 = Vector256.LoadUnsafe(ref sourceRef);
                Vector256<byte> bytes2 = Vector256.LoadUnsafe(ref sourceRef, (uint)Vector256<byte>.Count);
                sourceRef = ref Unsafe.Add(ref sourceRef, (uint)Vector256<byte>.Count * 2);

                wps += ws1;

                ws1 = Accumulate(ws1, bytes1, bytes2);
```
**Member** (on lines +251 to +253): Why do we need to be doing a full reduction every loop iteration here for …? It seems like a simple …

**Author:** These are different accumulators that move at different rates. The only easy thing to factor out is the multiplication of the previous sum by the number of bytes it would be added to each iteration, and that's already done.
```csharp
                ws2 = DotProduct(ws2, bytes1, weights1);
                ws3 = DotProduct(ws3, bytes2, weights2);
```
**Member** (on lines +254 to +255): Similar here. This is effectively just doing … It isn't clear why the sum at this point is actually needed every inner loop iteration and why it couldn't be hoisted "out" so that it's done only in the outer loop.

**Author:** It could be hoisted out, but then you still have to widen each element before accumulating, which is still expensive. See the first attempt at Arm64 acceleration in Stephen's PR for an idea of what that looks like. It's not faster.

**Member:** Widening is significantly cheaper and more pipelineable (and at least on AVX512 has single-instruction, single-cycle versions that go to wider registers). I would expect decent savings if we were only widening and not doing the reductions per inner loop iteration; in particular, that would simplify the algorithm and allow better code sharing across all these platforms.

**Author:** I gave that a quick try again (tried it before a long time ago but didn't keep the result, as it wasn't worthwhile). It's still slower. If you think you can do better than this implementation, be my guest. This was the best perf I could get, on every platform.

**Member:** We're notably not strictly looking for the "best perf" on every platform. Rather, we're looking for good-enough perf given the complexity, expected payload sizes, and real-world impact (not every function is going to be a bottleneck or matter if it's taking 200ns vs 400ns). So part of what's being considered here is whether the extra code complexity, generics, impact to NAOT image size, or other scenarios, etc. are meaningful enough compared to just having the simpler and very slightly slower code. This is a case where I expect we can remove most of the per-ISA customization and still get "close enough" or even matching on most hardware, particularly when not simply looking at the "latest" Intel or AMD but rather at the broad range of typical hardware targets, which tend to be a bit older (Haswell, Skylake, Ice Lake, Zen 2, etc.).

**Author:** Although this is more lines of code, I'd argue it's less complex, simply because it's more consistent. The current implementation uses different logic for different vector widths, inconsistent variable names, etc. The new implementation uses the same skeleton for all, with only very small kernels abstracted away per platform and with names that are easy to follow. It should be noted that this generic approach is an improvement for NAOT code size, because e.g. on x64, instead of dynamically dispatching between up to 3 different implementations depending on ISA support and input size, this will choose exactly 1, which will always be used for any input >= 16 bytes. In the case of Arm64, it will compile up to 2 potential versions of the core method, but it moves the ISA check out to the dispatcher, avoiding dynamic checks in the inner loop. And if 2x performance isn't good enough to justify a second copy, why are we bothering to implement …
```csharp
            }
            while (--blocks != 0);

            ws2 += wps << 6;
            ws2 += ws3;

            ws1 = QuickModBase(ws1);
            ws2 = QuickModBase(ws2);
        }

        if (vectors != 0)
        {
            Debug.Assert(vectors == 2);

            Vector256<byte> bytes = Vector256.LoadUnsafe(ref sourceRef);
            Vector256<uint> wps = ws1;

            ws1 = Accumulate(ws1, bytes);
            ws2 = DotProduct(ws2, bytes, weights2);

            ws2 += wps << 5;
        }

        vs1 = ws1.GetLower() + ws1.GetUpper();
        vs2 = ws2.GetLower() + ws2.GetUpper();

        return (vs1, vs2);
    }
}

file struct AccumulateX86 : ISimdAccumulate
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes)
        => Sse2.SumAbsoluteDifferences(bytes, Vector128<byte>.Zero).AsUInt32() + sums;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes1, Vector128<byte> bytes2)
    {
        Vector128<byte> zero = Vector128<byte>.Zero;
        Vector128<uint> sad = Sse2.SumAbsoluteDifferences(bytes1, zero).AsUInt32();
        return sad + Sse2.SumAbsoluteDifferences(bytes2, zero).AsUInt32() + sums;
    }
}

file struct AccumulateArm64 : ISimdAccumulate
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes)
        => AdvSimd.Arm64.AddAcrossWidening(bytes).AsUInt32().ToVector128Unsafe() + sums;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes1, Vector128<byte> bytes2)
        => AdvSimd.AddPairwiseWideningAndAdd(sums, AdvSimd.AddPairwiseWideningAndAdd(AdvSimd.AddPairwiseWidening(bytes1), bytes2));
}

file struct AccumulateXplat : ISimdAccumulate
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes)
    {
        (Vector128<ushort> bl, Vector128<ushort> bh) = Vector128.Widen(bytes);
        (Vector128<uint> sl, Vector128<uint> sh) = Vector128.Widen(bl + bh);
        return sums + sl + sh;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes1, Vector128<byte> bytes2)
    {
        (Vector128<ushort> b1l, Vector128<ushort> b1h) = Vector128.Widen(bytes1);
        (Vector128<ushort> b2l, Vector128<ushort> b2h) = Vector128.Widen(bytes2);
        (Vector128<uint> sl, Vector128<uint> sh) = Vector128.Widen((b1l + b1h) + (b2l + b2h));
        return sums + sl + sh;
    }
}

file struct DotProductX86 : ISimdDotProduct
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> DotProduct(Vector128<uint> addend, Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<short> mad = Ssse3.MultiplyAddAdjacent(left, right.AsSByte());
        return Sse2.MultiplyAddAdjacent(mad, Vector128<short>.One).AsUInt32() + addend;
    }
}

file struct DotProductArm64 : ISimdDotProduct
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> DotProduct(Vector128<uint> addend, Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<ushort> mad = AdvSimd.MultiplyWideningLower(left.GetLower(), right.GetLower());
        mad = AdvSimd.MultiplyWideningUpperAndAdd(mad, left, right);
        return AdvSimd.AddPairwiseWideningAndAdd(addend, mad);
    }
}

file struct DotProductArm64Dp : ISimdDotProduct
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> DotProduct(Vector128<uint> addend, Vector128<byte> left, Vector128<byte> right)
        => Dp.DotProduct(addend, left, right);
}
```
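For reference, the Arm `udot`-style operation behind `Dp.DotProduct` can be modeled per 32-bit lane as four u8 products accumulated into that lane. A sketch of the semantics (illustrative only, not the intrinsic itself):

```python
def udot(acc, left, right):
    # each u32 lane accumulates the dot product of its four u8 elements
    out = list(acc)
    for lane in range(len(acc)):
        for k in range(4):
            out[lane] += left[lane * 4 + k] * right[lane * 4 + k]
    return out

acc = udot([0, 0, 0, 0], list(range(16)), [1] * 16)
assert acc == [6, 22, 38, 54]      # four-element group sums of 0..15
assert sum(acc) == sum(range(16))
```

This single instruction replaces the widen-multiply-pairwise-add chain in `DotProductArm64`, which is why the `Dp.IsSupported` path is dispatched separately.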
```csharp
file struct DotProductXplat : ISimdDotProduct
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<uint> DotProduct(Vector128<uint> addend, Vector128<byte> left, Vector128<byte> right)
    {
        (Vector128<ushort> ll, Vector128<ushort> lh) = Vector128.Widen(left);
        (Vector128<ushort> rl, Vector128<ushort> rh) = Vector128.Widen(right);
        (Vector128<uint> ml, Vector128<uint> mh) = Vector128.Widen(ll * rl + lh * rh);
        return addend + ml + mh;
    }
}

file interface ISimdAccumulate
{
    static abstract Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes);

    static abstract Vector128<uint> Accumulate(Vector128<uint> sums, Vector128<byte> bytes1, Vector128<byte> bytes2);
}

file interface ISimdDotProduct
{
    static abstract Vector128<uint> DotProduct(Vector128<uint> addend, Vector128<byte> left, Vector128<byte> right);
}

file interface ISimdStrategy
{
    static abstract (Vector128<uint> vs1, Vector128<uint> vs2) VectorLoop<TAccumulate, TDotProduct>(Vector128<uint> vs1, Vector128<uint> vs2, ref byte sourceRef, uint vectors)
        where TAccumulate : struct, ISimdAccumulate
        where TDotProduct : struct, ISimdDotProduct;
}
```
**Member:** Why separate it out like this? The JIT tends to special-case 1 level of inlining differently from 2+ levels of inlining, and so simple forwarders like this can hurt things more than help.

**Author:** The SIMD implementation is all in file-scoped types, so it has to be called from something in this file. I could make those types nested private, but since there are so many, I was trying to keep them entirely local. If you prefer the nested approach, I can easily change it, though I don't foresee any issues with inlining limits here given the core method is intentionally marked `NoInlining`.