-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Clean up Latin1Encoding implementation and vectorize its logic #32994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Vectorizes Latin1 narrowing / widening code paths - Re-plumbs Latin1Encoding to use refactored Encoding workhorses - Removes unused EncodingNLS type - Removes unused DecoderBestFitFallback type - Uses "? replacement" behavior for all Encoding subclassed types by default, except Latin1Encoding which still uses best-fit
| /// use best-fit substitution instead" semantics. This type allows for devirtualization of calls made directly | ||
| /// off of <see cref="Encoding.Latin1"/>. | ||
| /// </summary> | ||
| internal sealed class Latin1EncodingSealed : Latin1Encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we call it sealed? Can't we call it BestFitLatine1Encoding (including the file name of course)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uses same naming pattern as UTF8EncodingSealed and ASCIIEncodingSealed. (The sealed part is the interesting part, since it allows JIT devirtualization.)
src/libraries/System.Private.CoreLib/src/System/Text/DecoderFallback.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/DecoderFallback.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/EncoderLatin1BestFitFallback.cs
Outdated
Show resolved
Hide resolved
|
I can back out the workaround for #33002 from this PR if this isn't the right place for it. But it is giving fairly significant improvements. I'm going to rerun the UTF-8 benchmarks as well as they should be able to benefit from this. |
|
The The workaround is pretty fragile. It is exploiting deficiency in CSE optimization. If/once somebody fixes CSE for shifts and multiplications, it won't work anymore. Is the or |
|
I'm guessing (and this is just a guess) that the |
|
It may be something like that. And changing the distance between where this value is produced and consumed may make the problem go away. |
|
I experimented a bit in the After multiple benchmark runs, not using the temporary register at all still outperforms using a temporary register. But the delta between the two is fairly small now. For reference, here's the runtime/src/libraries/System.Private.CoreLib/src/System/Text/ASCIIUtility.cs Lines 1650 to 1670 in 2898e1e
; TEST A - Move the 'lea' instruction to the beginning of the loop
00007ffe`3993d1f5 4c8d1442 lea r10,[rdx+rax*2]
00007ffe`3993d1f9 c5fa6f0401 vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3993d1fe c579d7c8 vpmovmskb r9d,xmm0
00007ffe`3993d202 4585c9 test r9d,r9d
00007ffe`3993d205 7520 jne System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c49db7 (00007ffe`3993d227) [br=0]
00007ffe`3993d207 c5f960d1 vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3993d20b c4c1797f12 vmovdqa xmmword ptr [r10],xmm2
00007ffe`3993d210 4983c210 add r10,10h
00007ffe`3993d214 c5f968c1 vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3993d218 c4c1797f02 vmovdqa xmmword ptr [r10],xmm0
00007ffe`3993d21d 4883c010 add rax,10h
00007ffe`3993d221 493bc0 cmp rax,r8
00007ffe`3993d224 76cf jbe System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c49d85 (00007ffe`3993d1f5)
; TEST B - Move the 'lea' instruction outside the loop, increment both pointers when loop continues
00007ffe`3945d1f5 4c8d1442 lea r10,[rdx+rax*2]
00007ffe`3945d1f9 c5fa6f0401 vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3945d1fe c579d7c8 vpmovmskb r9d,xmm0
00007ffe`3945d202 4585c9 test r9d,r9d
00007ffe`3945d205 7521 jne System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9db8 (00007ffe`3945d228)
00007ffe`3945d207 c5f960d1 vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3945d20b c4c1797f12 vmovdqa xmmword ptr [r10],xmm2
00007ffe`3945d210 c5f968c1 vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3945d214 c4c1797f4210 vmovdqa xmmword ptr [r10+10h],xmm0
00007ffe`3945d21a 4883c010 add rax,10h
00007ffe`3945d21e 4983c220 add r10,20h
00007ffe`3945d222 493bc0 cmp rax,r8
00007ffe`3945d225 76d2 jbe System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9d89 (00007ffe`3945d1f9)
; TEST C - Same as (B), but keeps the vpunpck[h|l]bw and vmovdqa instructions together
00007ffe`3945d1f5 4c8d1442 lea r10,[rdx+rax*2]
00007ffe`3945d1f9 c5fa6f0401 vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3945d1fe c579d7c8 vpmovmskb r9d,xmm0
00007ffe`3945d202 4585c9 test r9d,r9d
00007ffe`3945d205 7521 jne System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9da8 (00007ffe`3945d228)
00007ffe`3945d207 c5f960d1 vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3945d20b c5f968c1 vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3945d20f c4c1797f12 vmovdqa xmmword ptr [r10],xmm2
00007ffe`3945d214 c4c1797f4210 vmovdqa xmmword ptr [r10+10h],xmm0
00007ffe`3945d21a 4883c010 add rax,10h
00007ffe`3945d21e 4983c220 add r10,20h
00007ffe`3945d222 493bc0 cmp rax,r8
00007ffe`3945d225 76d2 jbe System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9d79 (00007ffe`3945d1f9)
; TEST D - Remove the 'lea' entirely, use the memory addressing trick mentioned in this PR
00007ffe`3945d1f5 c5fa6f0401 vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3945d1fa c579d7c8 vpmovmskb r9d,xmm0
00007ffe`3945d1fe 4585c9 test r9d,r9d
00007ffe`3945d201 751d jne System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9da0 (00007ffe`3945d220)
00007ffe`3945d203 c5f960d1 vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3945d207 c5f97f1442 vmovdqa xmmword ptr [rdx+rax*2],xmm2
00007ffe`3945d20c c5f968c1 vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3945d210 c5f97f444210 vmovdqa xmmword ptr [rdx+rax*2+10h],xmm0
00007ffe`3945d216 4883c010 add rax,10h
00007ffe`3945d21a 493bc0 cmp rax,r8
00007ffe`3945d21d 76d6 jbe System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9d75 (00007ffe`3945d1f5)
; TEST E - Same as (D), but keeps the vpunpck[h|l]bw and vmovdqa instructions together
00007ffe`3995d1f5 c5fa6f0401 vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3995d1fa c579d7c8 vpmovmskb r9d,xmm0
00007ffe`3995d1fe 4585c9 test r9d,r9d
00007ffe`3995d201 751d jne System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c69da0 (00007ffe`3995d220)
00007ffe`3995d203 c5f960d1 vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3995d207 c5f968c1 vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3995d20b c5f97f1442 vmovdqa xmmword ptr [rdx+rax*2],xmm2
00007ffe`3995d210 c5f97f444210 vmovdqa xmmword ptr [rdx+rax*2+10h],xmm0
00007ffe`3995d216 4883c010 add rax,10h
00007ffe`3995d21a 493bc0 cmp rax,r8
00007ffe`3995d21d 76d6 jbe System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c69d75 (00007ffe`3995d1f5) |
| // Narrows a vector of words [ w0 w1 w2 w3 ] to a vector of bytes | ||
| // [ b0 b1 b2 b3 b0 b1 b2 b3 ], then writes 4 bytes (32 bits) to the destination. | ||
|
|
||
| Vector128<short> vecWide = Sse2.X64.ConvertScalarToVector128UInt64(value).AsInt16(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have LoadScalarVector128(ulong*) for 32-bit and I believe value is unlikely to be split between 2 registers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the time this method is called we already have a ulong (as a value), not a ulong*. I tried using &value hoping that the JIT would optimize it since the method is marked as aggressive inlining, but it still results in a stack spill.
Is there a similar method that takes a ulong directly as a value? For context, this method is currently hit in x64 and arm64 code paths, but not x86 or arm32. Though there's nothing that would inherently prevent it from working if called by a 32-bit process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it still results in a stack spill.
Is this true for x86 as well? I would guess there is a higher probability of it already being on the stack due to it being 64-bits and therefore not fitting into register.
| { | ||
| Debug.Assert(AllCharsInUInt32AreLatin1(value)); | ||
|
|
||
| if (BitConverter.IsLittleEndian) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was vectorizing this one not worthwhile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The current implementation compiles to three instructions:
mov byte ptr [outputBuffer], value
shr value, 16
mov byte ptr [outputBuffer + 1], valueIt was hard to beat that with SIMD.
| // this method is running. | ||
|
|
||
| return (Sse2.IsSupported) | ||
| ? GetIndexOfFirstNonLatin1Char_Sse2(pBuffer, bufferLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were these split due to method size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was done to make the method more friendly to human readers. (AsciiUtility reads largely the same way.) Did I inadvertently defeat JIT or AOT optimizations?
| // Before we drain off char-by-char, try a generic vectorized loop. | ||
| // Only run the loop if we have at least two vectors we can pull out. | ||
|
|
||
| if (Vector.IsHardwareAccelerated && bufferLength >= 2 * (uint)Vector<ushort>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This uses Vector<T> and I noticed you called out wanting to use instructions like pmovmskb.
Do you still get similar wins if you utilize the Vector<T> to/from Vector128/256<T> conversion APIs for just when those instructions are needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't try this, but I'm not quite sure how to do this cleanly. It's not just pmovmskb. I'm also using ptest (if allowed) and paddusw (if ptest is unavailable). I also have different alignment requirements between the two methods: the SSE2-specific code path uses only 128-bit instructions because I didn't see any perf gains (and in fact saw a loss for small inputs) when switching to 256-bit instructions. This also means that the SSE2-specific code paths only require 128-bit alignment, whereas the generalized code path has no particular alignment requirements.
src/libraries/System.Private.CoreLib/src/System/Text/Latin1Utility.cs
Outdated
Show resolved
Hide resolved
|
Most recent commit responds to most of the PR feedback. I also changed both the ASCII and the Latin-1 code paths to be a little less sensitive to CSE optimizations, as per #32994 (comment) this was a bit fragile. The reworked logic retains most of the perf gains that the earlier more fragile logic was trying to achieve. |
tannergooding
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vectorization logic looks good
|
@GrabYourPitchforks, other than the conflict, can this be merged? |
4eeaf6b to
98d366c
Compare
In preparation for exposing
Encoding.Latin1publicly (see #31549), this PR cleans up theLatin1Encodingimplementation. It removes from theLatin1Encodingtype all of the logic dealing with fallback behavior and persisting state across calls toEncoderNLS.Convert, and it instead leverages the shared infrastructure already used by theASCIIEncodingandUTF8Encodingclasses (see file Encoding.Internal.cs). Using the shared infrastructure also fixes #415.This PR also removes the
EncodingNLStype as it's no longer needed. Additionally, theInternalDecoderBestFitFallbacktype is not needed since no built-in framework type provides abyte-to-charbest fit mapping, so in effect it was a glorified (and expensive) equivalent toEncoderFallback.ReplacementFallback.The infrastructure for encodings to provide their own
char-to-bytebest fit mapping is also removed, as nobody aside fromLatin1Encodingever used it. So that logic was all moved into the self-containedInternalEncoderBestFitFallbacktype, which in this PR is renamed toEncoderLatin1BestFitFallback.These best-fit changes allow us to remove knowledge of best-fit mappings from the
Encodingbase class. Now the base class can leverage the built-inEncoderFallback.ReplacementFallbackandDecoderFallback.ReplacementFallbacksingletons. I also converted these properties from lazily initialized to eagerly initialized since they're automatically dereferenced by theEncodingbase class ctor, which every application who needs anEncoderFallbackinstance already uses.The new logic in the
Latin1Encodingclass is largely copied from the existingASCIIEncodinglogic. There are some minor tweaks (such as parameter names) to provide compatibility with Full Framework. But the patterns should look largely the same.The new logic in the
Latin1Utilityclass is largely copied from the existingAsciiUtilitylogic, with conditional statements modified as needed to account for the fact that we care about[ 00..FF ]instead of[ 00..7F ]. Notably, there's noGetIndexOfFirstInvalidLatin1Byteequivalent since such a method would be meaningless. Additionally, theWidenLatin1ToUtf16logic becomes simpler (and returns void) since no input validation need be performed. So the method looks a little tighter than its ASCII equivalent./cc @tannergooding to review the SIMD logic. Since as mentioned above it's largely copied from
AsciiUtilityI don't think it'll be too contentious.Perf measurements:
For empty inputs,
GetCharsis slower than the baseline since the new logic doesn't have an early-exit optimization. I anticipate passing empty buffers to these APIs would be uncommon so I'm not too worried about this.Passing a buffer of size 1 byte is essentially on-par with the old implementation: we're within 1 ns on my box. Anything beyond that sees significant gains from being able to take advantage of SIMD operations, eventually hitting a limit of 10x throughput compared to baseline for large inputs.
Like above,
GetBytesis slower than the baseline for empty inputs. All other input sizes (including 1 byte) see throughput improvements. Eventually the SIMD code paths hit a limit of 30x throughput compared to baseline for large inputs.The table below is for the case where all data in the
char[]is Latin-1 (U+0000..U+00FF), so we never need to invoke the fallback mechanism.