Add support for BSWAP intrinsic by GrabYourPitchforks · Pull Request #18398 · dotnet/coreclr

GrabYourPitchforks · 2018-06-09T08:13:16Z

Addresses https://github.com/dotnet/coreclr/issues/14122

The general idea is that we expose APIs (for internal consumption) which are JIT intrinsics, replaced with BSWAP instructions instead of method calls at runtime. I've put them on BitConverter for now but haven't exposed them via the reference API. The System.Memory package already exposes BinaryPrimitives.ReverseEndianness APIs, and those can call through to these implementations due to the special relationship between corefx packages and System.Private.CoreLib.

Here's the x64 codegen of several test_bswap routines that I compiled and tested.

test_bswap(UInt32):
00007ffd`ac470080 8bc1            mov     eax,ecx
00007ffd`ac470082 0fc8            bswap   eax
00007ffd`ac470084 c3              ret

test_bswap(UInt64):
00007ffd`ab6a0080 488bc1          mov     rax,rcx
00007ffd`ab6a0083 480fc8          bswap   rax
00007ffd`ab6a0086 c3              ret

test_bswap(UInt16):
00007ffd`ab6d0080 0fb7c1          movzx   eax,cx
00007ffd`ab6d0083 66c1c808        ror     ax,8
00007ffd`ab6d0087 c3              ret

Open issues:

How would I even begin to test this in a sane fashion? Right now I tested this by using a DynamicMethod to reference these intrinsic APIs, then compiling and observing the codegen in windbg. It works as far as seeing what's going on, but I can't use this for things like perf runs.
An alternative implementation would be to recognize byte swap IL patterns directly, similar to how rotation IL patterns are recognized directly. I didn't pursue this because I thought it would be more error-prone than using special intrinsic methods, but perhaps somebody knows a more foolproof way of doing this.
I didn't attempt to make this work on ARM because I don't have a test box. The ARM equivalents are REV / REV16 / REV64. I may need to suppress the intrinsic entirely on ARM for now in order to avoid JIT breaks.
Calling ReverseEndianness(<some const uint64>) works but isn't optimized properly. I think I need to update morph but I'll need assistance knowing where to look.
Ideally if we detect a GT_BSWAP instruction immediately before or after a memory access, we can replace both operations with movbe on compatible architectures. I don't think this would have much impact on runtime, but it could save a scratch register and help avoid stack spillage in some scenarios.

GrabYourPitchforks · 2018-06-09T08:52:39Z

I managed to get some perf numbers from before and after the change. The baselines here are the existing implementations of BinaryPrimitives.ReverseEndianness. Measurements are total time taken to call the method 1 billion times in a loop. I confirmed via code gen inspection that no intermediate calculations were being discarded.

Before bswap enablement

ushort: 915 ms
uint: 1120 ms
ulong: 2120 ms

After bswap enablement

ushort: 330 ms
uint: 400 ms
ulong: 590 ms

mikedn · 2018-06-09T09:58:49Z

+//
+void CodeGen::genCodeForBswap(GenTree* tree)
+{
+    // TODO: If we're swapping immediately after a read from memory or immediately before


Presumably that's why you didn't just use genCodeForNegNot?

i.e. because the handling of a memory operand (containment) is different in this case because MOVBE has both a source memory location and a register destination (unlike neg and not which take a single operand that's modified).

To handle the case where the operand is in memory, you would add support in Lowering to allow the operand to be contained (or regOptional) if it is the right kind of node (local or GT_IND with an address of a form that movbe can handle), which would cause it to be used directly from memory. Although this is a unary operation, the containment analysis is different from GT_NEG and GT_NOT because the memory form has the result in a register (so the source and dest don't have to be the same to do it from memory, which we don't really support for GT_NEG/GT_NOT currently). So it would be more like the check for whether the/a source operand of a binary op can be contained in Lowering::ContainCheckBinary().

@mikedn Per your other comment, I lifted knowledge of the INS_bswap instruction directly to this method, so now I think the logic is different enough that it's worth keeping negnot and bswap split. Is there still a preference to merge the two?

@CarolEidt This is useful feedback, thanks! I'm going to open a different issue for movbe support since I think it's complex enough to warrant a different PR. (It requires a cpuid capability check so it affects AoT scenarios, and I haven't yet created a C++ benchmark app demonstrating that it's a worthwhile optimization.)

mikedn · 2018-06-09T09:59:08Z

+    }
+
+    instruction ins = genGetInsForOper(tree->OperGet(), targetType);
+    inst_RV(ins, targetReg, targetType);


You should just use INS_bswap here.

mikedn · 2018-06-09T09:59:58Z

+                }
+
+                // The BSWAP instruction is undefined for 16-bit inputs. Instead, use ROR 8 on the
+                // low word register bits. Additionally, BSWAP is undefined for 64-bit inputs on


We could probably make decomposition generate 2 BSWAPs from a long BSWAP on x86.

That's how this works internally. If the developer targets x86 but calls swap(ulong), the call site remains a normal method call. The implementation of swap(ulong) is written in terms of swap(uint), which gets correctly flagged as an intrinsic and converted to bswap.

The flip side is that this is the same reason swap(<const ulong>) doesn't currently get optimized properly, unfortunately. :(

I see. I might make more sense to rely on JIT's decomposition on x86 though, if only because it avoids such ifdefs and inconsistencies between architectures. But it's not obvious what the best solution is, we'll see what the team thinks about it.

I agree with @mikedn that it might be cleaner to handle the "longs on 32-bit target" in the JIT. However, I'm interested in what the perf issues are that you're seeing - is it because it interferes with inlining? Or that it generates extra copies? Or something else? The decomposition approach means that the front-end (importer and optimizations) of the JIT will treat these as operations on 64-bit values. Then, the backend "decomposes" these into the appropriate sequence of operations on 32-bit values. In general, this allows the JIT to do a better job of optimizing these in the front-end (though sometimes the decomposition itself can expose code sequences that might have been optimized if they had been expanded earlier; the classic "phase ordering" problem).

The perf issue I mentioned was due to not enlightning morph's const folding logic to account for GT_BSWAP. This has been resolved with the latest iteration of the PR.

If you both still believe the best design is to have the JIT fully handle 64-bit byte swaps rather than falling back to the managed decomposition implementation, I can investigate that, but I need some assistance being pointed in the right direction. Is CodeGen::genCodeForShiftLong a good source of inspiration to turn to?

mikedn · 2018-06-09T10:02:51Z

+                {
+                    case TYP_SHORT:
+                    case TYP_USHORT:
+                        retNode = gtNewOperNode(GT_ROR, callType, impPopStack().val, gtNewIconNode(8));


That's probably risky business. With some exceptions, the JIT does not use short/ushort nodes and evaluation is usually done using int/long.

Can you elaborate a bit on the risk? I believe I'm switching on the target method's return type, not the local value node type.

You're switching on call type but calls are one of those few nodes that may use small integer types. The other are indirs and lclvar nodes (with indirs being the most common). And those end up producing int values by automatically extending from small int to int.

The net effect is that you may end up with bugs for various reasons:

ror codegen uses the "actual" node type and emits a 32 bit instruction

ror codegen use the node type and emits a 16 bit instruction but the consuming node expects a 32 bit value. The upper 16 bits of the 32 bit value may be garbage.

constant folding may mishandle small int nodes (assuming it pays attention to ror)

I think the biggest danger is that this may appear to work in most cases and fail in some rare and odd circumstances.

I was hesitant to add this particular overload in the first place, but it turns out that byte swapping 16-bit values is fairly common in networking stacks. So it seems like this scenario might deserve first-class support in some manner. I used the ror reg_low16, 8 idiom here because it's the way MSVC++ implements byte swaps over WORD data types. In my experiments, there was always a movzx reg_32, reg_low16 either immediately before or immediately after the ror reg_low16, 8 instruction, so that should account for the garbage bits in the high WORD.

But admittedly this is outside my area of expertise, so I'll defer to your judgment here. If you have suggestions for a safer way to do this, I'm all ears. :)

If you have suggestions for a safer way to do this, I'm all ears. :)

Right now, the only idea that comes to mind is to have a GT_BSWAP16 that is defined as "swaps the lowest 2 bytes of an integer value". Though it's kind of peculiar, I don't think there's any similar case in the JIT today.

cc @CarolEidt Any thoughts one how a 16 bit byte swap could be represented in the JIT's IR? It's somewhat problematic due to JIT's tendency to do everything with 32/64 bit integers.

I also think that the right way to deal with the 16-bit case is to use something like GT_BSWAP16. The question is whether it should implicitly clear the upper 16 bits, or whether to have the JIT always generate an explicit cast to cause those bits to be cleared. In the latter case, the cast could theoretically be eliminated if the consumer uses only the low bits.

jkotas · 2018-06-10T00:41:43Z

System.Memory package already exposes BinaryPrimitives.ReverseEndianness APIs, and those can call through to these implementation

We have tried similar hops in the past (e.g. with Span), but they never survived for long. It may be better to recognize BinaryPrimitives.ReverseEndianness as intrinsic directly - perhaps move BinaryPrimitives to CoreLib.

GrabYourPitchforks · 2018-07-12T21:22:39Z

Rebased and responded to some PR comments. This should address the majority of the reliability feedback that reviews have given (thanks all!), but it's still WIP.

Summary of changes:

Condition the bswap intrinsic logic on a check for x86 / x64
Removed GT_BSWAP logic from genGetInsForOper and moved it all into genCodeForBswap
Use GT_BSWAP16 instead of GT_BSWAP for 16-bit byte swap operations
Enlighten morph's const folding to understand GT_BSWAP / GT_BSWAP16

This fixed a handful of problems the original prototype had, including mishandling of 64-bit const values. I tested both consts and non-consts across all three overloads on both x86 and x64, and the behavior was correct in all cases. If a const is provided, regardless of arch, constant folding will take over. If a non-const is provided, it emits a ror reg.16, 8 or bswap reg instruction as appropriate, except for the 64-bit overload on 32-bit archs, which gets decomposed into two bswap reg instructions.

To-do:

Not expose these methods on BitConverter; instead consider moving BinaryPrimitives into corelib (needs API review?)
Add tests (how?)
Add support for movbe (probably not as part of this PR? significantly impacts AoT compilation.)

jkotas · 2018-07-13T06:09:07Z

Not expose these methods on BitConverter; instead consider moving BinaryPrimitives into corelib (needs API review?)

Moving implementations of existing APIs to CoreLib does not need API review. It should be done in separate PR that just moves the implementation, and does not have any other changes.

GrabYourPitchforks · 2018-10-09T01:44:39Z

I believe all open issues are resolved as of the latest iteration of the PR. Here's the current status:

This causes the JIT to special-case BinaryPrimitives.ReverseEndianness instead of introducing a new internal method BitConverter.ByteSwap.
On 32-bit x86, the 16-bit and 32-bit overloads of ReverseEndianness are replaced with intrinsics. The 64-bit overload calls into the 32-bit overloads as helper methods.
On x64, the 16-bit, 32-bit, and 64-bit overloads of ReverseEndianness are treated as intrinsics.
On arm, none of the overloads are treated as intrinsics, and the runtime falls back to the existing software implementation.
I verified that constant folding is handled as expected by the changes in gentree.cpp.
There are tests both here and in corefx verifying the correct behavior of these APIs.

The only thing I wasn't able to get to was providing the 64-bit fallback implementation entirely in the VM. If this is still important, I hope it's acceptable to allow this PR to go through as-is (modulo any new feedback) and to open a new work item for providing the VM fallback.

Removing WIP from title.

AndyAyersMS · 2018-10-09T18:17:59Z

cc @dotnet/jit-contrib

GrabYourPitchforks · 2018-10-10T21:16:04Z

I rebased the PR since there were some recent commits into master that addressed flaky CI + unit tests.

JIT recognizes calls to BinaryPrimitives.ReverseEndianness and emits a bswap instruction - Only works for x86 / x64; ARM still uses fallback logic

GrabYourPitchforks · 2018-11-01T23:25:23Z

@mikedn I also found this code in morph.cpp's Compiler::fgMorphSmpOp routine.

        case GT_NOT:
        case GT_NEG:

            /* Any constant cases should have been folded earlier */
            noway_assert(!op1->OperIsConst() || !opts.OptEnabled(CLFLG_CONSTANTFOLD) || optValnumCSE_phase);
            break;

Should this be enlightened with GT_BSWAP[16] support? It seems like it would fit the right pattern - we'd want the assert to fire if constant folding wasn't properly performed.

GrabYourPitchforks · 2018-11-05T19:07:57Z

@dotnet/jit-contrib Any final feedback on this? I have an open question re: Compiler::fgMorphSmpOp but it doesn't seem to affect correctness.

mikedn · 2018-11-05T19:13:17Z

Sorry, I missed your earlier question. I don't think that the fgMorphSmpOp is really useful. It's not like having a constant op node is breaking some kind of rule. It's surprising that a noway_assert is used there.

AndyAyersMS

Just one small comment on a comment.

GrabYourPitchforks · 2018-11-06T00:10:57Z

Thanks @AndyAyersMS @CarolEidt @mikedn for your guidance over the past few months! Much appreciated :)

With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

adamsitnik · 2018-11-06T14:32:13Z

+                Console.WriteLine($"Expected: 0x{nonConstUInt64Expected:X16}");
+            }
+
+            return Pass;


@GrabYourPitchforks I might be wrong, but it looks like this test never fails, even if something goes wrong. The code prints to the output but anyway returns Pass and Fail is never used.

Besides that this is a great PR and a great example for introducing new Intrinsics!

With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures.

With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Commit migrated from dotnet/coreclr@f72025c

mikedn reviewed Jun 9, 2018

View reviewed changes

GrabYourPitchforks force-pushed the bswap branch from f37b843 to e1453e0 Compare July 12, 2018 21:14

GrabYourPitchforks commented Jul 12, 2018

View reviewed changes

Comment thread src/jit/codegenxarch.cpp Outdated

benaadams mentioned this pull request Aug 7, 2018

Create optimized TryRead<T>. dotnet/corefxlab#2419

Merged

GrabYourPitchforks mentioned this pull request Oct 5, 2018

Move BinaryPrimitives into coreclr shared source #20288

Merged

GrabYourPitchforks force-pushed the bswap branch from 78e2133 to 39b3a47 Compare October 9, 2018 01:38

GrabYourPitchforks changed the title ~~[WIP] Add support for BSWAP intrinsic~~ Add support for BSWAP intrinsic Oct 9, 2018

GrabYourPitchforks force-pushed the bswap branch from 39b3a47 to c9aefc2 Compare October 10, 2018 21:15

mikedn reviewed Oct 11, 2018

View reviewed changes

Comment thread src/jit/gentree.cpp

GrabYourPitchforks added 2 commits November 1, 2018 14:29

Add JIT intrinsic support for x86 'bswap' instruction

31af895

JIT recognizes calls to BinaryPrimitives.ReverseEndianness and emits a bswap instruction - Only works for x86 / x64; ARM still uses fallback logic

Commit changed file missing from last commit

5ea4ac3

Add missing switches to compiler.hpp and optcse.cpp

80a0115

GrabYourPitchforks force-pushed the bswap branch from 11cd79d to 80a0115 Compare November 2, 2018 00:10

AndyAyersMS approved these changes Nov 5, 2018

View reviewed changes

Comment thread src/jit/importer.cpp Outdated

PR feedback - fix comment

62a7dbb

GrabYourPitchforks merged commit f72025c into dotnet:master Nov 6, 2018

GrabYourPitchforks deleted the bswap branch November 6, 2018 00:09

adamsitnik reviewed Nov 6, 2018

View reviewed changes

This was referenced Feb 28, 2019

Remove unsafe code from UuidHandler npgsql/npgsql#2310

Merged

Use BinaryPrimitives to leverage JIT intrinsics npgsql/npgsql#2344

Merged

GrabYourPitchforks mentioned this pull request Jan 31, 2020

JIT should emit "movbe" instead of "mov / bswap" on compatible hardware dotnet/runtime#13427

Closed

Conversation

GrabYourPitchforks commented Jun 9, 2018

Uh oh!

GrabYourPitchforks commented Jun 9, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkotas commented Jun 10, 2018

Uh oh!

GrabYourPitchforks commented Jul 12, 2018

Uh oh!

Uh oh!

jkotas commented Jul 13, 2018

Uh oh!

GrabYourPitchforks commented Oct 9, 2018

Uh oh!

AndyAyersMS commented Oct 9, 2018

Uh oh!

GrabYourPitchforks commented Oct 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

GrabYourPitchforks commented Nov 1, 2018

Uh oh!

GrabYourPitchforks commented Nov 5, 2018

Uh oh!

mikedn commented Nov 5, 2018

Uh oh!

AndyAyersMS left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

GrabYourPitchforks commented Nov 6, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

GrabYourPitchforks commented Oct 10, 2018 •

edited

Loading