In dotnet/coreclr#18398, the JIT gained support for the bswap intrinsic. A comment in that issue notes that it would also be nice if the JIT could fold a "read-then-bswap" or a "bswap-then-write" instruction pair into a single movbe instruction on supported platforms. For application code that performs such reads and writes at high frequency, this could noticeably reduce the amount of generated code. See in particular the codegen at https://github.com/dotnet/coreclr/issues/26729#issuecomment-531862356, where each "mov / bswap" pair could be a single "movbe".
There are a few ways to go about this. We could add a movbe intrinsic and have callers (like BinaryPrimitives) special-case that intrinsic when it is available. It would probably be more convenient, though, to leave all of the call sites unchanged and have the JIT special-case these instruction pairs.
This could be particularly helpful for writes, as it would reduce not just code size but also the number of registers that need to be allocated. Consider the following:
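To make the target pattern concrete, here is a rough sketch of what a big-endian read and write reduce to at the C# level on a little-endian machine: a native-endian load followed by a byte swap, and a byte swap followed by a native-endian store. The class name `EndianSketch` and its method names are hypothetical and exist only for illustration; `BitConverter.ToInt32`, `BitConverter.TryWriteBytes`, and `BinaryPrimitives.ReverseEndianness` are existing APIs. This is not necessarily how BinaryPrimitives is implemented internally; it only shows the shape the JIT would have to recognize.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helpers, for illustration only.
public static class EndianSketch
{
    // "read-then-bswap": a native-endian load followed by a byte swap.
    public static int ReadInt32BigEndianSketch(ReadOnlySpan<byte> source)
    {
        int raw = BitConverter.ToInt32(source);              // roughly: mov
        return BitConverter.IsLittleEndian
            ? BinaryPrimitives.ReverseEndianness(raw)        // roughly: bswap
            : raw;
    }

    // "bswap-then-write": a byte swap followed by a native-endian store.
    public static void WriteInt32BigEndianSketch(Span<byte> destination, int value)
    {
        int swapped = BitConverter.IsLittleEndian
            ? BinaryPrimitives.ReverseEndianness(value)      // roughly: bswap
            : value;
        if (!BitConverter.TryWriteBytes(destination, swapped)) // roughly: mov
            throw new ArgumentException("Destination is too small.", nameof(destination));
    }
}
```

If the JIT recognized these two shapes directly, callers of the existing BinaryPrimitives methods would not need to change.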
int i = GetInt();
BinaryPrimitives.WriteInt32BigEndian(theSpan, i);
// keep using 'i' here
Since the bswap instruction mutates its register, the JIT currently needs to make a temporary copy of the value if i is intended to be used elsewhere in the method.
mov eax, ebx ; ebx = 'i', eax = copy of 'i'
bswap eax ; eax = swapped(i)
mov dword ptr [rcx], eax ; theSpan[0] = swapped(i)
All three of those instructions could be folded into a single movbe (here, movbe dword ptr [rcx], ebx) with no temporary register required.
One complication is that there's no 16-bit bswap (a 16-bit byte swap is instead a 16-bit rol by 8), but there is a 16-bit movbe, so the JIT would likely need to special-case that particular pattern.
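As a small illustration of why the 16-bit case looks different, swapping the two bytes of a 16-bit value is the same as rotating it by 8 bits. The sketch below checks that equivalence against the existing `BinaryPrimitives.ReverseEndianness(ushort)` overload; the class name `Swap16Sketch` and its method names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper, for illustration only.
public static class Swap16Sketch
{
    // Rotating a 16-bit value left by 8 exchanges its high and low bytes.
    public static ushort RotateBy8(ushort value)
        => unchecked((ushort)((value << 8) | (value >> 8)));

    // Returns true if the rotate matches ReverseEndianness for every 16-bit input.
    public static bool MatchesReverseEndianness()
    {
        for (int v = 0; v <= ushort.MaxValue; v++)
        {
            if (RotateBy8((ushort)v) != BinaryPrimitives.ReverseEndianness((ushort)v))
                return false;
        }
        return true;
    }
}
```

A pattern matcher for movbe would therefore presumably have to look for a load or store combined with this rotate form, not only with bswap.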
category:cq
theme:intrinsics
skill-level:expert
cost:medium