Add support for BSWAP intrinsic#18398
Conversation
|
I managed to get some perf numbers from before and after the change. The baselines here are the existing implementations of Before
After
|
| // | ||
| void CodeGen::genCodeForBswap(GenTree* tree) | ||
| { | ||
| // TODO: If we're swapping immediately after a read from memory or immediately before |
There was a problem hiding this comment.
Presumably that's why you didn't just use genCodeForNegNot?
There was a problem hiding this comment.
i.e. because the handling of a memory operand (containment) is different in this case because MOVBE has both a source memory location and a register destination (unlike neg and not which take a single operand that's modified).
There was a problem hiding this comment.
To handle the case where the operand is in memory, you would add support in Lowering to allow the operand to be contained (or regOptional) if it is the right kind of node (local or GT_IND with an address of a form that movbe can handle), which would cause it to be used directly from memory. Although this is a unary operation, the containment analysis is different from GT_NEG and GT_NOT because the memory form has the result in a register (so the source and dest don't have to be the same to do it from memory, which we don't really support for GT_NEG/GT_NOT currently). So it would be more like the check for whether the/a source operand of a binary op can be contained in Lowering::ContainCheckBinary().
There was a problem hiding this comment.
@mikedn Per your other comment, I lifted knowledge of the INS_bswap instruction directly to this method, so now I think the logic is different enough that it's worth keeping negnot and bswap split. Is there still a preference to merge the two?
@CarolEidt This is useful feedback, thanks! I'm going to open a different issue for movbe support since I think it's complex enough to warrant a different PR. (It requires a cpuid capability check so it affects AoT scenarios, and I haven't yet created a C++ benchmark app demonstrating that it's a worthwhile optimization.)
| } | ||
|
|
||
| instruction ins = genGetInsForOper(tree->OperGet(), targetType); | ||
| inst_RV(ins, targetReg, targetType); |
| } | ||
|
|
||
| // The BSWAP instruction is undefined for 16-bit inputs. Instead, use ROR 8 on the | ||
| // low word register bits. Additionally, BSWAP is undefined for 64-bit inputs on |
There was a problem hiding this comment.
We could probably make decomposition generate 2 BSWAPs from a long BSWAP on x86.
There was a problem hiding this comment.
That's how this works internally. If the developer targets x86 but calls swap(ulong), the call site remains a normal method call. The implementation of swap(ulong) is written in terms of swap(uint), which gets correctly flagged as an intrinsic and converted to bswap.
The flip side is that this is the same reason swap(<const ulong>) doesn't currently get optimized properly, unfortunately. :(
There was a problem hiding this comment.
I see. I might make more sense to rely on JIT's decomposition on x86 though, if only because it avoids such ifdefs and inconsistencies between architectures. But it's not obvious what the best solution is, we'll see what the team thinks about it.
There was a problem hiding this comment.
I agree with @mikedn that it might be cleaner to handle the "longs on 32-bit target" in the JIT. However, I'm interested in what the perf issues are that you're seeing - is it because it interferes with inlining? Or that it generates extra copies? Or something else? The decomposition approach means that the front-end (importer and optimizations) of the JIT will treat these as operations on 64-bit values. Then, the backend "decomposes" these into the appropriate sequence of operations on 32-bit values. In general, this allows the JIT to do a better job of optimizing these in the front-end (though sometimes the decomposition itself can expose code sequences that might have been optimized if they had been expanded earlier; the classic "phase ordering" problem).
There was a problem hiding this comment.
The perf issue I mentioned was due to not enlightning morph's const folding logic to account for GT_BSWAP. This has been resolved with the latest iteration of the PR.
If you both still believe the best design is to have the JIT fully handle 64-bit byte swaps rather than falling back to the managed decomposition implementation, I can investigate that, but I need some assistance being pointed in the right direction. Is CodeGen::genCodeForShiftLong a good source of inspiration to turn to?
| { | ||
| case TYP_SHORT: | ||
| case TYP_USHORT: | ||
| retNode = gtNewOperNode(GT_ROR, callType, impPopStack().val, gtNewIconNode(8)); |
There was a problem hiding this comment.
That's probably risky business. With some exceptions, the JIT does not use short/ushort nodes and evaluation is usually done using int/long.
There was a problem hiding this comment.
Can you elaborate a bit on the risk? I believe I'm switching on the target method's return type, not the local value node type.
There was a problem hiding this comment.
You're switching on call type but calls are one of those few nodes that may use small integer types. The other are indirs and lclvar nodes (with indirs being the most common). And those end up producing int values by automatically extending from small int to int.
The net effect is that you may end up with bugs for various reasons:
- ror codegen uses the "actual" node type and emits a 32 bit instruction
- ror codegen use the node type and emits a 16 bit instruction but the consuming node expects a 32 bit value. The upper 16 bits of the 32 bit value may be garbage.
- constant folding may mishandle small int nodes (assuming it pays attention to ror)
I think the biggest danger is that this may appear to work in most cases and fail in some rare and odd circumstances.
There was a problem hiding this comment.
I was hesitant to add this particular overload in the first place, but it turns out that byte swapping 16-bit values is fairly common in networking stacks. So it seems like this scenario might deserve first-class support in some manner. I used the ror reg_low16, 8 idiom here because it's the way MSVC++ implements byte swaps over WORD data types. In my experiments, there was always a movzx reg_32, reg_low16 either immediately before or immediately after the ror reg_low16, 8 instruction, so that should account for the garbage bits in the high WORD.
But admittedly this is outside my area of expertise, so I'll defer to your judgment here. If you have suggestions for a safer way to do this, I'm all ears. :)
There was a problem hiding this comment.
If you have suggestions for a safer way to do this, I'm all ears. :)
Right now, the only idea that comes to mind is to have a GT_BSWAP16 that is defined as "swaps the lowest 2 bytes of an integer value". Though it's kind of peculiar, I don't think there's any similar case in the JIT today.
cc @CarolEidt Any thoughts one how a 16 bit byte swap could be represented in the JIT's IR? It's somewhat problematic due to JIT's tendency to do everything with 32/64 bit integers.
There was a problem hiding this comment.
I also think that the right way to deal with the 16-bit case is to use something like GT_BSWAP16. The question is whether it should implicitly clear the upper 16 bits, or whether to have the JIT always generate an explicit cast to cause those bits to be cleared. In the latter case, the cast could theoretically be eliminated if the consumer uses only the low bits.
We have tried similar hops in the past (e.g. with Span), but they never survived for long. It may be better to recognize |
f37b843 to
e1453e0
Compare
|
Rebased and responded to some PR comments. This should address the majority of the reliability feedback that reviews have given (thanks all!), but it's still WIP. Summary of changes:
This fixed a handful of problems the original prototype had, including mishandling of 64-bit const values. I tested both consts and non-consts across all three overloads on both x86 and x64, and the behavior was correct in all cases. If a const is provided, regardless of arch, constant folding will take over. If a non-const is provided, it emits a To-do:
|
Moving implementations of existing APIs to CoreLib does not need API review. It should be done in separate PR that just moves the implementation, and does not have any other changes. |
78e2133 to
39b3a47
Compare
|
I believe all open issues are resolved as of the latest iteration of the PR. Here's the current status:
The only thing I wasn't able to get to was providing the 64-bit fallback implementation entirely in the VM. If this is still important, I hope it's acceptable to allow this PR to go through as-is (modulo any new feedback) and to open a new work item for providing the VM fallback. Removing WIP from title. |
|
cc @dotnet/jit-contrib |
39b3a47 to
c9aefc2
Compare
|
I rebased the PR since there were some recent commits into master that addressed flaky CI + unit tests. |
JIT recognizes calls to BinaryPrimitives.ReverseEndianness and emits a bswap instruction - Only works for x86 / x64; ARM still uses fallback logic
|
@mikedn I also found this code in morph.cpp's Compiler::fgMorphSmpOp routine. case GT_NOT:
case GT_NEG:
/* Any constant cases should have been folded earlier */
noway_assert(!op1->OperIsConst() || !opts.OptEnabled(CLFLG_CONSTANTFOLD) || optValnumCSE_phase);
break;Should this be enlightened with |
11cd79d to
80a0115
Compare
|
@dotnet/jit-contrib Any final feedback on this? I have an open question re: |
|
Sorry, I missed your earlier question. I don't think that the |
AndyAyersMS
left a comment
There was a problem hiding this comment.
Just one small comment on a comment.
|
Thanks @AndyAyersMS @CarolEidt @mikedn for your guidance over the past few months! Much appreciated :) |
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
| Console.WriteLine($"Expected: 0x{nonConstUInt64Expected:X16}"); | ||
| } | ||
|
|
||
| return Pass; |
There was a problem hiding this comment.
@GrabYourPitchforks I might be wrong, but it looks like this test never fails, even if something goes wrong. The code prints to the output but anyway returns Pass and Fail is never used.
Besides that this is a great PR and a great example for introducing new Intrinsics!
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures.
With this change, the JIT will recognize a call to BinaryPrimitives.ReverseEndianness and will emit a bswap instruction. This logic is currently only hooked up for x86 and x64; ARM still uses fallback logic. If the JIT can't emit a bswap instruction (for example, trying to emit a 64-bit bswap in a 32-bit process), it will fall back to a software implementation, so the APIs will work across all architectures. Commit migrated from dotnet/coreclr@f72025c
Addresses https://github.com/dotnet/coreclr/issues/14122
The general idea is that we expose APIs (for internal consumption) which are JIT intrinsics, replaced with
BSWAPinstructions instead of method calls at runtime. I've put them onBitConverterfor now but haven't exposed them via the reference API. The System.Memory package already exposesBinaryPrimitives.ReverseEndiannessAPIs, and those can call through to these implementations due to the special relationship between corefx packages and System.Private.CoreLib.Here's the x64 codegen of several
test_bswaproutines that I compiled and tested.Open issues:
How would I even begin to test this in a sane fashion? Right now I tested this by using a
DynamicMethodto reference these intrinsic APIs, then compiling and observing the codegen in windbg. It works as far as seeing what's going on, but I can't use this for things like perf runs.An alternative implementation would be to recognize byte swap IL patterns directly, similar to how rotation IL patterns are recognized directly. I didn't pursue this because I thought it would be more error-prone than using special intrinsic methods, but perhaps somebody knows a more foolproof way of doing this.
I didn't attempt to make this work on ARM because I don't have a test box. The ARM equivalents are
REV/REV16/REV64. I may need to suppress the intrinsic entirely on ARM for now in order to avoid JIT breaks.Calling
ReverseEndianness(<some const uint64>)works but isn't optimized properly. I think I need to update morph but I'll need assistance knowing where to look.Ideally if we detect a
GT_BSWAPinstruction immediately before or after a memory access, we can replace both operations withmovbeon compatible architectures. I don't think this would have much impact on runtime, but it could save a scratch register and help avoid stack spillage in some scenarios.