Speed up Convert.ToBase64String()#69884
Conversation
|
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label. |
|
Tagging subscribers to this area: @dotnet/area-system-text-encoding Issue DetailsIt is 30% performance optimization: Benchmarked on 1000 random byte arrays of lengths 0..999 Used tricks:
|
Co-authored-by: Jeff Handley <jeffhandley@users.noreply.github.com>
| j += 4; | ||
| a = *inData++; | ||
| b = *inData++; | ||
| *outPairs++ = base64Pairs[(a << 4) | (b >> 4)]; |
There was a problem hiding this comment.
@tannergooding
Why is the codegen soo different to LLVM? https://godbolt.org/z/T8eqE9chv
LLVM
shr sil, 4
shl dil, 4
or dil, sil
movzx eax, dil
ret.NET JIT
movzx rax, dil
shl eax, 4
movzx rdi, sil
sar edi, 4
or eax, edi
ret There was a problem hiding this comment.
@deeprobin feel free to file an issue. Related one is #13816 (you can use PR that closed it as a foundation for your PR if you want to contribute 🙂)
|
There was a PR to use SSE for this API - dotnet/coreclr#21833 |
Why didn't we finish it? |
It got lost during coreclr->runtime migration 🙂 I can port it to crossplat intrinsics after(if) this PR lands to avoid conflicts |
A quick skim of that PR suggests it doesn't require a 16K lookup table? If that's the case, with appreciation to Sergei, I'd prefer we just start with that PR as something that's faster and smaller. |
I agree, vector intrinsics are always preferable. This 16K-price implementation may be fallback for platforms without SSE/AVX-like instructions. |
|
@EgorBo, are you still going to look at bringing that PR back to life? |
Sure, will take a look in an hour |
So I took a look - there was a small bug in the impl, but the problem that it won't be trivial to extend it to support Arm - it has to many non-shared intrinsics, I'll try to port it in coming days |
|
@EgorBo, @SergeiPavlov, can this be closed now that @EgorBo's change went in? |
|
Yes. The function is optimized by a959c3e |
|
Thanks, @SergeiPavlov. |
It is 30% performance optimization:
Benchmarked on 1000 random byte arrays of lengths 0..999
Used tricks:
intto process two char values simultaneously.inDataarray twice.*ptr++instead ofptrBase[index]; ++index;. We are savingaddinstruction.offsetparameter fromConvertToBase64Array().inDataisarray base + offset.insertLineBreaksimpacts on number of loop rounds before next complex condition check.BitConverter.IsLittleEndianfor correct work on Big-endian platforms.