@GrabYourPitchforks GrabYourPitchforks commented Feb 29, 2020

In preparation for exposing Encoding.Latin1 publicly (see #31549), this PR cleans up the Latin1Encoding implementation. It removes from the Latin1Encoding type all of the logic dealing with fallback behavior and persisting state across calls to EncoderNLS.Convert, and it instead leverages the shared infrastructure already used by the ASCIIEncoding and UTF8Encoding classes (see file Encoding.Internal.cs). Using the shared infrastructure also fixes #415.

This PR also removes the EncodingNLS type as it's no longer needed. Additionally, the InternalDecoderBestFitFallback type is not needed since no built-in framework type provides a byte-to-char best fit mapping, so in effect it was a glorified (and expensive) equivalent to DecoderFallback.ReplacementFallback.

The infrastructure for encodings to provide their own char-to-byte best fit mapping is also removed, as nobody aside from Latin1Encoding ever used it. So that logic was all moved into the self-contained InternalEncoderBestFitFallback type, which in this PR is renamed to EncoderLatin1BestFitFallback.

These best-fit changes allow us to remove knowledge of best-fit mappings from the Encoding base class. Now the base class can leverage the built-in EncoderFallback.ReplacementFallback and DecoderFallback.ReplacementFallback singletons. I also converted these properties from lazily initialized to eagerly initialized since they're automatically dereferenced by the Encoding base class ctor, which every application that needs an EncoderFallback instance already uses.
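
To make the replacement behavior concrete, here is a minimal sketch (not code from this PR) that requests ISO-8859-1 with the replacement fallbacks explicitly. The class and method names are made up for illustration; the encoding and fallback APIs are the public ones.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Hypothetical helper: encodes with an explicit replacement fallback
    // rather than whatever fallback the default Latin-1 instance uses.
    public static byte[] EncodeWithReplacement(string text)
    {
        Encoding latin1 = Encoding.GetEncoding(
            "iso-8859-1",
            EncoderFallback.ReplacementFallback,
            DecoderFallback.ReplacementFallback);
        return latin1.GetBytes(text);
    }

    public static void Main()
    {
        // U+20AC (euro sign) is outside U+0000..U+00FF, so the fallback runs for it
        // and substitutes '?' (0x3F).
        byte[] bytes = EncodeWithReplacement("a\u20ACb");
        Console.WriteLine(BitConverter.ToString(bytes)); // 61-3F-62
    }
}
```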

The new logic in the Latin1Encoding class is largely copied from the existing ASCIIEncoding logic. There are some minor tweaks (such as parameter names) to provide compatibility with Full Framework. But the patterns should look largely the same.

The new logic in the Latin1Utility class is largely copied from the existing AsciiUtility logic, with conditional statements modified as needed to account for the fact that we care about [ 00..FF ] instead of [ 00..7F ]. Notably, there's no GetIndexOfFirstInvalidLatin1Byte equivalent since such a method would be meaningless. Additionally, the WidenLatin1ToUtf16 logic becomes simpler (and returns void) since no input validation need be performed. So the method looks a little tighter than its ASCII equivalent.
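
As a rough scalar illustration of why widening needs no validation: every byte value in [ 00..FF ] maps directly to the UTF-16 code unit with the same numeric value. The helper below is a hypothetical sketch, not the vectorized Latin1Utility code from this PR.

```csharp
using System;

public static class Latin1WidenSketch
{
    // Hypothetical scalar helper: zero-extends each Latin-1 byte to a UTF-16 code unit.
    // There is no failure path because no byte value is invalid in Latin-1.
    public static void WidenLatin1ToUtf16(ReadOnlySpan<byte> latin1, Span<char> utf16)
    {
        if (utf16.Length < latin1.Length)
        {
            throw new ArgumentException("Destination is too small.", nameof(utf16));
        }

        for (int i = 0; i < latin1.Length; i++)
        {
            utf16[i] = (char)latin1[i];
        }
    }

    public static void Main()
    {
        byte[] input = { 0x48, 0xE9, 0xFF };
        char[] output = new char[input.Length];
        WidenLatin1ToUtf16(input, output);
        Console.WriteLine((int)output[1]); // 233
    }
}
```

An ASCII equivalent would have to stop and report the first byte above 0x7F, which is why the ASCII version returns a count and this one can return void.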

/cc @tannergooding to review the SIMD logic. Since as mentioned above it's largely copied from AsciiUtility I don't think it'll be too contentious.

Perf measurements:

For empty inputs, GetChars is slower than the baseline since the new logic doesn't have an early-exit optimization. I anticipate passing empty buffers to these APIs would be uncommon so I'm not too worried about this.

Passing a buffer of size 1 byte is essentially on par with the old implementation: we're within 1 ns on my box. Anything beyond that sees significant gains from being able to take advantage of SIMD operations, eventually hitting a limit of 10x throughput compared to baseline for large inputs.

| Method | Toolchain | Size | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---:|---:|---:|---:|---:|---:|
| GetChars | latin1 | 0 | 8.659 ns | 7.090 ns | 0.3886 ns | 2.24 | 0.06 |
| GetChars | master | 0 | 3.875 ns | 5.148 ns | 0.2822 ns | 1.00 | 0.00 |
| GetChars | latin1 | 1 | 9.186 ns | 8.921 ns | 0.4890 ns | 1.13 | 0.10 |
| GetChars | master | 1 | 8.173 ns | 6.035 ns | 0.3308 ns | 1.00 | 0.00 |
| GetChars | latin1 | 7 | 9.590 ns | 2.612 ns | 0.1432 ns | 0.79 | 0.02 |
| GetChars | master | 7 | 12.206 ns | 4.062 ns | 0.2227 ns | 1.00 | 0.00 |
| GetChars | latin1 | 12 | 8.993 ns | 8.262 ns | 0.4529 ns | 0.54 | 0.04 |
| GetChars | master | 12 | 16.658 ns | 8.118 ns | 0.4450 ns | 1.00 | 0.00 |
| GetChars | latin1 | 84 | 16.029 ns | 8.952 ns | 0.4907 ns | 0.24 | 0.01 |
| GetChars | master | 84 | 66.176 ns | 81.287 ns | 4.4556 ns | 1.00 | 0.00 |
| GetChars | latin1 | 128 | 16.041 ns | 13.964 ns | 0.7654 ns | 0.16 | 0.00 |
| GetChars | master | 128 | 99.205 ns | 42.752 ns | 2.3434 ns | 1.00 | 0.00 |
| GetChars | latin1 | 4096 | 275.788 ns | 457.864 ns | 25.0971 ns | 0.10 | 0.01 |
| GetChars | master | 4096 | 2,737.505 ns | 231.807 ns | 12.7061 ns | 1.00 | 0.00 |

As above, GetBytes is slower than the baseline for empty inputs. All other input sizes (including 1 byte) see throughput improvements. Eventually the SIMD code paths hit a limit of 30x throughput compared to baseline for large inputs.

The table below is for the case where all data in the char[] is Latin-1 (U+0000..U+00FF), so we never need to invoke the fallback mechanism.

| Method | Toolchain | Size | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---:|---:|---:|---:|---:|---:|
| GetBytes | latin1 | 0 | 11.409 ns | 17.830 ns | 0.9773 ns | 3.31 | 0.23 |
| GetBytes | master | 0 | 3.450 ns | 5.833 ns | 0.3197 ns | 1.00 | 0.00 |
| GetBytes | latin1 | 1 | 12.165 ns | 7.398 ns | 0.4055 ns | 0.80 | 0.04 |
| GetBytes | master | 1 | 15.307 ns | 9.890 ns | 0.5421 ns | 1.00 | 0.00 |
| GetBytes | latin1 | 7 | 13.099 ns | 9.813 ns | 0.5379 ns | 0.46 | 0.04 |
| GetBytes | master | 7 | 28.792 ns | 25.198 ns | 1.3812 ns | 1.00 | 0.00 |
| GetBytes | latin1 | 12 | 13.985 ns | 9.297 ns | 0.5096 ns | 0.35 | 0.01 |
| GetBytes | master | 12 | 40.504 ns | 37.528 ns | 2.0570 ns | 1.00 | 0.00 |
| GetBytes | latin1 | 84 | 20.343 ns | 14.989 ns | 0.8216 ns | 0.09 | 0.01 |
| GetBytes | master | 84 | 219.465 ns | 67.354 ns | 3.6919 ns | 1.00 | 0.00 |
| GetBytes | latin1 | 128 | 23.795 ns | 14.981 ns | 0.8212 ns | 0.08 | 0.01 |
| GetBytes | master | 128 | 317.786 ns | 314.992 ns | 17.2658 ns | 1.00 | 0.00 |
| GetBytes | latin1 | 4096 | 276.385 ns | 59.697 ns | 3.2722 ns | 0.03 | 0.00 |
| GetBytes | master | 4096 | 10,225.770 ns | 18,583.546 ns | 1,018.6269 ns | 1.00 | 0.00 |

- Vectorizes Latin1 narrowing / widening code paths
- Re-plumbs Latin1Encoding to use refactored Encoding workhorses
- Removes unused EncodingNLS type
- Removes unused DecoderBestFitFallback type
- Uses "? replacement" behavior for all Encoding subclassed types by default, except Latin1Encoding which still uses best-fit
/// use best-fit substitution instead" semantics. This type allows for devirtualization of calls made directly
/// off of <see cref="Encoding.Latin1"/>.
/// </summary>
internal sealed class Latin1EncodingSealed : Latin1Encoding
Member

Why do we call it sealed? Can't we call it BestFitLatin1Encoding (including the file name of course)?

Member Author

@GrabYourPitchforks GrabYourPitchforks Apr 13, 2020

This uses the same naming pattern as UTF8EncodingSealed and ASCIIEncodingSealed. (The sealed part is the interesting part, since it allows JIT devirtualization.)
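
For readers unfamiliar with the pattern, a minimal sketch of the general idea follows. All type and member names here are hypothetical stand-ins, not the runtime's types: when the JIT knows a value's exact type is a sealed class, it can bind a virtual call directly because no further override can exist.

```csharp
using System;

// Hypothetical stand-in for an unsealed, publicly derivable encoding type.
public class DemoEncoding
{
    public virtual int GetMaxByteCount(int charCount) => charCount;
}

// Hypothetical sealed variant, mirroring the "...Sealed" naming pattern.
public sealed class DemoEncodingSealed : DemoEncoding
{
    public override int GetMaxByteCount(int charCount) => charCount + 1;
}

public static class SealedDemo
{
    // The field is typed as the sealed class, so a call through it cannot be
    // overridden further and is a candidate for devirtualization.
    private static readonly DemoEncodingSealed s_instance = new DemoEncodingSealed();

    public static int Measure(int charCount) => s_instance.GetMaxByteCount(charCount);

    public static void Main()
    {
        Console.WriteLine(Measure(4)); // 5
    }
}
```

Whether a given call site is actually devirtualized depends on the JIT version and on what it can prove about the receiver; the sketch only shows the shape of the pattern.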

@GrabYourPitchforks
Member Author

I can back out the workaround for #33002 from this PR if this isn't the right place for it. But it is giving fairly significant improvements. I'm going to rerun the UTF-8 benchmarks as well as they should be able to benefit from this.

@jkotas
Member

jkotas commented Feb 29, 2020

The lea instruction is the result of CSE optimization. The JIT sees it as a good CSE opportunity. And it is not that bad a guess: the code size with the CSE is not bigger. Unfortunately, the extra lea seems to have a disproportionate performance impact due to some micro-architecture reasons.

The workaround is pretty fragile. It is exploiting a deficiency in CSE optimization. If/once somebody fixes CSE for shifts and multiplications, it won't work anymore.

Is the lea performance impact the same regardless of where it is relative to the other instructions? It may be useful to measure performance of different instruction orderings:

byte* pStore = (byte*)(pUtf16Buffer + currentOffset);
Sse2.StoreAligned(pStore, Sse2.UnpackLow(asciiVector, zeroVector));
Sse2.StoreAligned(pStore + 0x10, Sse2.UnpackHigh(asciiVector, zeroVector));
currentOffset += SizeOfVector128;

or

byte* pStore = (byte*)(pUtf16Buffer + currentOffset);
currentOffset += SizeOfVector128;
Sse2.StoreAligned(pStore, Sse2.UnpackLow(asciiVector, zeroVector));
Sse2.StoreAligned(pStore + 0x10, Sse2.UnpackHigh(asciiVector, zeroVector));

@GrabYourPitchforks
Member Author

I'm guessing (and this is just a guess) that the lea instruction introduces rax into the dependency chain for the subsequent write instructions. Removing the lea breaks the dependency chain. Does this seem like a reasonable guess? Perhaps I'm way off here?

@jkotas
Member

jkotas commented Feb 29, 2020

It may be something like that. And changing the distance between where this value is produced and consumed may make the problem go away.

@GrabYourPitchforks
Member Author

I experimented a bit in the AsciiUtility code base (which this is based on) with changing the ordering of instructions. If we must use a temp register to hold the intermediate memory address calculations, it seems the best perf comes from moving that as far away from the write as possible. This lends evidence to the earlier theory that it's a dependency chain issue.

After multiple benchmark runs, not using the temporary register at all still outperforms using a temporary register. But the delta between the two is fairly small now.
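
To show what the two source shapes under discussion look like, here is a minimal sketch, assuming SSE2 and a project compiled with unsafe code enabled. The method names are hypothetical, it uses unaligned stores so it runs on any buffer, and it is not the exact workaround from this PR. Whether the JIT emits a separate lea for either form depends on the JIT version.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class WidenStoreSketch
{
    // Form 1: compute the destination address once into a temporary pointer.
    public static unsafe void WidenWithTemp(byte* src, char* dst, int offset)
    {
        Vector128<byte> v = Sse2.LoadVector128(src + offset);
        Vector128<byte> zero = Vector128<byte>.Zero;
        byte* pStore = (byte*)(dst + offset);
        Sse2.Store(pStore, Sse2.UnpackLow(v, zero));       // chars 0..7
        Sse2.Store(pStore + 16, Sse2.UnpackHigh(v, zero)); // chars 8..15
    }

    // Form 2: no temporary; each store recomputes base + index from the same operands.
    public static unsafe void WidenWithoutTemp(byte* src, char* dst, int offset)
    {
        Vector128<byte> v = Sse2.LoadVector128(src + offset);
        Vector128<byte> zero = Vector128<byte>.Zero;
        Sse2.Store((byte*)(dst + offset), Sse2.UnpackLow(v, zero));
        Sse2.Store((byte*)(dst + offset + 8), Sse2.UnpackHigh(v, zero)); // 8 chars = 16 bytes
    }

    public static unsafe void Main()
    {
        if (!Sse2.IsSupported)
        {
            Console.WriteLine("SSE2 not available; sketch skipped.");
            return;
        }

        byte[] input = new byte[16];
        for (int i = 0; i < input.Length; i++)
        {
            input[i] = (byte)(0xF0 + i);
        }

        char[] a = new char[16];
        char[] b = new char[16];
        fixed (byte* pIn = input)
        fixed (char* pA = a, pB = b)
        {
            WidenWithTemp(pIn, pA, 0);
            WidenWithoutTemp(pIn, pB, 0);
        }

        // Both forms produce identical output; only the generated addressing may differ.
        Console.WriteLine(new string(a) == new string(b));
    }
}
```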

For reference, here's the AsciiUtility hot loop which served as the baseline:

do
{
    // In a loop, perform an unaligned read, widen to two vectors, then aligned write the two vectors.
    asciiVector = Sse2.LoadVector128(pAsciiBuffer + currentOffset); // unaligned load
    mask = (uint)Sse2.MoveMask(asciiVector);
    if (mask != 0)
    {
        // non-ASCII byte somewhere
        goto NonAsciiDataSeenInInnerLoop;
    }
    byte* pStore = (byte*)(pUtf16Buffer + currentOffset);
    Sse2.StoreAligned(pStore, Sse2.UnpackLow(asciiVector, zeroVector));
    pStore += SizeOfVector128;
    Sse2.StoreAligned(pStore, Sse2.UnpackHigh(asciiVector, zeroVector));
    currentOffset += SizeOfVector128;
} while (currentOffset <= finalOffsetWhereCanRunLoop);

| Method | Toolchain | Corpus | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---:|---:|---:|---:|---:|
| GetChars | ascii_base | 11.txt | 1.840 ms | 0.0365 ms | 0.0910 ms | 1.00 | 0.00 |
| GetChars | ascii_testa | 11.txt | 1.769 ms | 0.0352 ms | 0.0494 ms | 0.91 | 0.04 |
| GetChars | ascii_testb | 11.txt | 1.708 ms | 0.0250 ms | 0.0222 ms | 0.88 | 0.03 |
| GetChars | ascii_testc | 11.txt | 1.719 ms | 0.0318 ms | 0.0281 ms | 0.88 | 0.03 |
| GetChars | ascii_testd | 11.txt | 1.724 ms | 0.0338 ms | 0.0485 ms | 0.89 | 0.03 |
| GetChars | ascii_teste | 11.txt | 1.696 ms | 0.0315 ms | 0.0279 ms | 0.87 | 0.04 |

; TEST A - Move the 'lea' instruction to the beginning of the loop

00007ffe`3993d1f5 4c8d1442        lea     r10,[rdx+rax*2]
00007ffe`3993d1f9 c5fa6f0401      vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3993d1fe c579d7c8        vpmovmskb r9d,xmm0
00007ffe`3993d202 4585c9          test    r9d,r9d
00007ffe`3993d205 7520            jne     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c49db7 (00007ffe`3993d227) [br=0]
00007ffe`3993d207 c5f960d1        vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3993d20b c4c1797f12      vmovdqa xmmword ptr [r10],xmm2
00007ffe`3993d210 4983c210        add     r10,10h
00007ffe`3993d214 c5f968c1        vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3993d218 c4c1797f02      vmovdqa xmmword ptr [r10],xmm0
00007ffe`3993d21d 4883c010        add     rax,10h
00007ffe`3993d221 493bc0          cmp     rax,r8
00007ffe`3993d224 76cf            jbe     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c49d85 (00007ffe`3993d1f5)

; TEST B - Move the 'lea' instruction outside the loop, increment both pointers when loop continues

00007ffe`3945d1f5 4c8d1442        lea     r10,[rdx+rax*2]
00007ffe`3945d1f9 c5fa6f0401      vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3945d1fe c579d7c8        vpmovmskb r9d,xmm0
00007ffe`3945d202 4585c9          test    r9d,r9d
00007ffe`3945d205 7521            jne     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9db8 (00007ffe`3945d228)
00007ffe`3945d207 c5f960d1        vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3945d20b c4c1797f12      vmovdqa xmmword ptr [r10],xmm2
00007ffe`3945d210 c5f968c1        vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3945d214 c4c1797f4210    vmovdqa xmmword ptr [r10+10h],xmm0
00007ffe`3945d21a 4883c010        add     rax,10h
00007ffe`3945d21e 4983c220        add     r10,20h
00007ffe`3945d222 493bc0          cmp     rax,r8
00007ffe`3945d225 76d2            jbe     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9d89 (00007ffe`3945d1f9)

; TEST C - Same as (B), but keeps the vpunpck[h|l]bw and vmovdqa instructions together

00007ffe`3945d1f5 4c8d1442        lea     r10,[rdx+rax*2]
00007ffe`3945d1f9 c5fa6f0401      vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3945d1fe c579d7c8        vpmovmskb r9d,xmm0
00007ffe`3945d202 4585c9          test    r9d,r9d
00007ffe`3945d205 7521            jne     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9da8 (00007ffe`3945d228)
00007ffe`3945d207 c5f960d1        vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3945d20b c5f968c1        vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3945d20f c4c1797f12      vmovdqa xmmword ptr [r10],xmm2
00007ffe`3945d214 c4c1797f4210    vmovdqa xmmword ptr [r10+10h],xmm0
00007ffe`3945d21a 4883c010        add     rax,10h
00007ffe`3945d21e 4983c220        add     r10,20h
00007ffe`3945d222 493bc0          cmp     rax,r8
00007ffe`3945d225 76d2            jbe     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9d79 (00007ffe`3945d1f9)

; TEST D - Remove the 'lea' entirely, use the memory addressing trick mentioned in this PR

00007ffe`3945d1f5 c5fa6f0401      vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3945d1fa c579d7c8        vpmovmskb r9d,xmm0
00007ffe`3945d1fe 4585c9          test    r9d,r9d
00007ffe`3945d201 751d            jne     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9da0 (00007ffe`3945d220)
00007ffe`3945d203 c5f960d1        vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3945d207 c5f97f1442      vmovdqa xmmword ptr [rdx+rax*2],xmm2
00007ffe`3945d20c c5f968c1        vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3945d210 c5f97f444210    vmovdqa xmmword ptr [rdx+rax*2+10h],xmm0
00007ffe`3945d216 4883c010        add     rax,10h
00007ffe`3945d21a 493bc0          cmp     rax,r8
00007ffe`3945d21d 76d6            jbe     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a1bb9d75 (00007ffe`3945d1f5)

; TEST E - Same as (D), but keeps the vpunpck[h|l]bw and vmovdqa instructions together

00007ffe`3995d1f5 c5fa6f0401      vmovdqu xmm0,xmmword ptr [rcx+rax]
00007ffe`3995d1fa c579d7c8        vpmovmskb r9d,xmm0
00007ffe`3995d1fe 4585c9          test    r9d,r9d
00007ffe`3995d201 751d            jne     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c69da0 (00007ffe`3995d220)
00007ffe`3995d203 c5f960d1        vpunpcklbw xmm2,xmm0,xmm1
00007ffe`3995d207 c5f968c1        vpunpckhbw xmm0,xmm0,xmm1
00007ffe`3995d20b c5f97f1442      vmovdqa xmmword ptr [rdx+rax*2],xmm2
00007ffe`3995d210 c5f97f444210    vmovdqa xmmword ptr [rdx+rax*2+10h],xmm0
00007ffe`3995d216 4883c010        add     rax,10h
00007ffe`3995d21a 493bc0          cmp     rax,r8
00007ffe`3995d21d 76d6            jbe     System_Private_CoreLib!System.Text.ASCIIUtility.WidenAsciiToUtf16_Sse2(Byte*, Char*, UInt64)+0xffffffff`a0c69d75 (00007ffe`3995d1f5)

// Narrows a vector of words [ w0 w1 w2 w3 ] to a vector of bytes
// [ b0 b1 b2 b3 b0 b1 b2 b3 ], then writes 4 bytes (32 bits) to the destination.

Vector128<short> vecWide = Sse2.X64.ConvertScalarToVector128UInt64(value).AsInt16();
Member

We also have LoadScalarVector128(ulong*) for 32-bit and I believe value is unlikely to be split between 2 registers.

Member Author

By the time this method is called we already have a ulong (as a value), not a ulong*. I tried using &value hoping that the JIT would optimize it since the method is marked as aggressive inlining, but it still results in a stack spill.

Is there a similar method that takes a ulong directly as a value? For context, this method is currently hit in x64 and arm64 code paths, but not x86 or arm32. Though there's nothing that would inherently prevent it from working if called by a 32-bit process.

Member

but it still results in a stack spill.

Is this true for x86 as well? I would guess there is a higher probability of it already being on the stack due to it being 64 bits wide and therefore not fitting into a register.

{
Debug.Assert(AllCharsInUInt32AreLatin1(value));

if (BitConverter.IsLittleEndian)
Member

Was vectorizing this one not worthwhile?

Member Author

No. The current implementation compiles to three instructions:

mov byte ptr [outputBuffer], value
shr value, 16
mov byte ptr [outputBuffer + 1], value

It was hard to beat that with SIMD.
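
For context, the scalar source that could compile down to a sequence like that might look as follows. This is a hedged sketch for the little-endian case only: the body of AllCharsInUInt32AreLatin1 is an assumed implementation, and NarrowTwoUtf16CharsToLatin1 is a hypothetical name.

```csharp
using System;
using System.Diagnostics;

public static class NarrowTwoCharsSketch
{
    // Assumed implementation: true when both UTF-16 code units packed
    // into 'value' are <= U+00FF (their high bytes are zero).
    public static bool AllCharsInUInt32AreLatin1(uint value) => (value & 0xFF00FF00u) == 0;

    // Little-endian layout assumed: the first char sits in the low 16 bits,
    // the second char in the high 16 bits.
    public static void NarrowTwoUtf16CharsToLatin1(Span<byte> output, uint value)
    {
        Debug.Assert(AllCharsInUInt32AreLatin1(value));
        output[0] = (byte)value;         // low byte of the first char
        output[1] = (byte)(value >> 16); // low byte of the second char
    }

    public static void Main()
    {
        uint packed = 'A' | ((uint)'\u00E9' << 16);
        byte[] buffer = new byte[2];
        NarrowTwoUtf16CharsToLatin1(buffer, packed);
        Console.WriteLine($"{buffer[0]:X2} {buffer[1]:X2}"); // 41 E9
    }
}
```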

// this method is running.

return (Sse2.IsSupported)
? GetIndexOfFirstNonLatin1Char_Sse2(pBuffer, bufferLength)
Member

Were these split due to method size?

Member Author

It was done to make the method more friendly to human readers. (AsciiUtility reads largely the same way.) Did I inadvertently defeat JIT or AOT optimizations?

// Before we drain off char-by-char, try a generic vectorized loop.
// Only run the loop if we have at least two vectors we can pull out.

if (Vector.IsHardwareAccelerated && bufferLength >= 2 * (uint)Vector<ushort>.Count)
Member

This uses Vector<T> and I noticed you called out wanting to use instructions like pmovmskb.

Do you still get similar wins if you utilize the Vector<T> to/from Vector128/256<T> conversion APIs for just when those instructions are needed?

Member Author

I didn't try this, but I'm not quite sure how to do this cleanly. It's not just pmovmskb. I'm also using ptest (if allowed) and paddusw (if ptest is unavailable). I also have different alignment requirements between the two methods: the SSE2-specific code path uses only 128-bit instructions because I didn't see any perf gains (and in fact saw a loss for small inputs) when switching to 256-bit instructions. This also means that the SSE2-specific code paths only require 128-bit alignment, whereas the generalized code path has no particular alignment requirements.
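
As a point of comparison, a generalized Vector<T> scan with no alignment requirements could be sketched like this. It is a simplified, hypothetical version: it processes one vector at a time rather than requiring two, and it does not use pmovmskb, ptest, or paddusw.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class Latin1ScanSketch
{
    // Returns the index of the first char above U+00FF, or text.Length if there is none.
    public static int GetIndexOfFirstNonLatin1Char(ReadOnlySpan<char> text)
    {
        ReadOnlySpan<ushort> units = MemoryMarshal.Cast<char, ushort>(text);
        int i = 0;

        if (Vector.IsHardwareAccelerated)
        {
            Vector<ushort> maxLatin1 = new Vector<ushort>((ushort)0x00FF);
            for (; i <= units.Length - Vector<ushort>.Count; i += Vector<ushort>.Count)
            {
                Vector<ushort> block = new Vector<ushort>(units.Slice(i));
                if (Vector.GreaterThanAny(block, maxLatin1))
                {
                    break; // the scalar loop below pinpoints the exact index
                }
            }
        }

        // Drain the remainder (or the offending block) one char at a time.
        for (; i < units.Length; i++)
        {
            if (units[i] > 0x00FF)
            {
                return i;
            }
        }

        return units.Length;
    }

    public static void Main()
    {
        string text = new string('a', 20) + "\u0100" + "b";
        Console.WriteLine(GetIndexOfFirstNonLatin1Char(text)); // 20
    }
}
```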

@GrabYourPitchforks
Member Author

The most recent commit responds to most of the PR feedback. I also changed both the ASCII and the Latin-1 code paths to be a little less sensitive to CSE optimizations since, as noted in #32994 (comment), the earlier approach was a bit fragile. The reworked logic retains most of the perf gains that the earlier, more fragile logic was trying to achieve.

Member

@tannergooding tannergooding left a comment

Vectorization logic looks good

@stephentoub
Member

@GrabYourPitchforks, other than the conflict, can this be merged?

@GrabYourPitchforks GrabYourPitchforks merged commit 775bbf0 into dotnet:master Jun 11, 2020
@GrabYourPitchforks GrabYourPitchforks deleted the latin1 branch June 11, 2020 06:44
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020

Successfully merging this pull request may close these issues.

Calling Latin1 / iso-8859-1 Encoder Convert method iteratively produces different results than Encoding GetBytes on "naughty" string

6 participants