[Wasm RyuJIT]: Fix LIR Semantics in Stackifier Output #127412
adamperlin wants to merge 7 commits into dotnet:main from
Conversation
By LIR semantics, we can't always reuse temporaries that appear to be available due to interference between nodes which share the same root tree.
Pull request overview
Fixes a Wasm RyuJIT stackifier correctness issue where temporaries introduced during stackification could be released and then reused too early (within the same root tree), producing invalid LIR due to store/read interference.
Changes:
- Introduce a “pending release” bitset to defer releasing stackifier temporaries until a full root tree finishes processing.
- Replace the immediate temporary release with `AddTemporariesForPendingRelease` + `RemovePendingTemporaries` at the end of root processing.
- Add dynamic growth logic for the pending-release bitset capacity.
```cpp
Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
if (local == nullptr)
JITDUMP("Stackifier pending release of lclNum: %d temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
```
lclNum is an unsigned, but the JITDUMP uses %d in the format string. Using %u avoids incorrect output if the value is large and matches the variable's type.
```diff
- JITDUMP("Stackifier pending release of lclNum: %d temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
+ JITDUMP("Stackifier pending release of lclNum: %u temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
```
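To see why the specifier matters, here is a small standalone C++ sketch (illustrative only, not JIT code; `formatWith` is a hypothetical helper) showing the divergence for values above `INT_MAX`:

```cpp
#include <cstdio>
#include <string>

// Format an unsigned lclNum with the given printf-style specifier
// (hypothetical helper, for illustration only).
static std::string formatWith(const char* fmt, unsigned lclNum)
{
    char buf[32];
    std::snprintf(buf, sizeof(buf), fmt, lclNum);
    return buf;
}

// For lclNum values above INT_MAX, "%u" prints the intended value, while
// "%d" reads the argument as a signed int -- formally undefined behavior,
// which on typical two's-complement platforms shows up as a negative number.
```

For example, `formatWith("%u", 4000000000u)` yields `"4000000000"`, whereas `"%d"` would typically print a negative value for the same input.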
```cpp
// However, we don't know precisely where the liftime ends here, because uses of locals happen at their position
// in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
```
Typo in comment: "liftime" should be "lifetime" (and there appears to be trailing whitespace on these comment lines, which is worth removing to keep diffs clean).
```diff
- // However, we don't know precisely where the liftime ends here, because uses of locals happen at their position
- // in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
+ // However, we don't know precisely where the lifetime ends here, because uses of locals happen at their position
+ // in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
```
```cpp
constexpr int tmpToLvaNum(unsigned tmpNum)
{
    assert(tmpNum >= 0);
```
tmpToLvaNum takes an unsigned tmpNum, so assert(tmpNum >= 0) is always true and doesn't add value. Consider removing it or changing the parameter type if negative values are meaningful here.
```diff
- assert(tmpNum >= 0);
```
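For illustration, a standalone snippet (hypothetical names, not the JIT's code) showing why the check is vacuous for an unsigned parameter:

```cpp
#include <cassert>

// An unsigned value compared >= 0 is always true, so an assert on it can
// never fire; compilers such as GCC flag this with -Wtype-limits
// ("comparison of unsigned expression >= 0 is always true").
constexpr bool checkNonNegative(unsigned tmpNum)
{
    return tmpNum >= 0; // tautology: holds for every possible input
}

static_assert(checkNonNegative(0u), "holds at the minimum");
static_assert(checkNonNegative(~0u), "holds even at the maximum value");
```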
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
AndyAyersMS left a comment
This seems plausible.
@SingleAccretion please take a look.
```cpp
, m_compiler(lower->m_compiler)
, m_stack(m_compiler->getAllocator(CMK_Lower))
, m_minimumTempLclNum(m_compiler->lvaCount)
// initially allocate 32 temp local slots for "pending release"
```
You might as well make this 64, since we'll be on a 64 bit host and the bitwise operations cost the same at 64 as they do at 32.
```cpp
void EnsurePendingReleaseCapacity(unsigned needed)
{
    if (needed < BitVecTraits::GetSize(&m_pendingReleaseTempTraits))
```
How often do we come anywhere near needing 32 simultaneously live store temps?
(or 64, with my suggested change above).
Anecdotally I'd say very rarely, though I don't have hard numbers on this! This was a pretty generous upper bound. I do think Single's suggestion of removing all temporaries at root boundaries would work, and that would avoid the need for this kind of tracking, so we may not need to track live temps in the end for a conservative approach.
It is unfortunate we have to compromise CQ a bit to retain this LIR invariant, even though it doesn't correspond to codegen constraints (stack operands really are used at the LIR position, unlike register operands).
But I wonder if we can simplify the fix to just do:

```cpp
if (initialDepth == 0)
    ReleaseAllTemps();
```

I.e. only release the temporaries at statement boundaries.
Have you thought about what a "precise" fix would look like? A temporary can be used if it doesn't have refs between the current 'prev' position and 'use's parent. Tracking the parent on the stack is easy enough, tracking 'busy' temps considering the shifting position of both 'prev' and 'parent' seems trickier.
I do think this approach would work and this would remove the need for tracking, so I'm going to give this approach a try.
I haven't given this much thought since it seemed tricky to get right, as you mention! I do think this would be nice to have if it turns out not to be too difficult. If you have any thoughts on how we might do this efficiently, I'd definitely be interested!
No, not really. It seems it would require quite careful tracking for what in the end is still going to be a suboptimal result.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/coreclr/jit/lowerwasm.cpp:604
- This change addresses a subtle LIR correctness issue in the wasm stackifier; it would be good to add a targeted regression test (likely under src/tests/JIT/Regression) that produces a call with a non-movable target/arg ordering similar to the motivating example, and validates the method compiles/runs correctly under wasm RyuJIT. As-is, the fix is not protected against future refactors.
```cpp
GenTree* StackifyTree(GenTree* root)
{
    int initialDepth = m_stack.Height();

    // Simple greedy algorithm working backwards. The invariant is that the stack top must be placed right next
    // to (in normal linear order - before) the node we last stackified.
    m_stack.Push(&root);

    GenTree* lastStackified = root->gtNext;
    while (m_stack.Height() != initialDepth)
    {
        GenTree** use  = m_stack.Pop();
        GenTree*  node = *use;
        GenTree*  prev = (lastStackified != nullptr) ? lastStackified->gtPrev : root;

        while (node != prev)
        {
            // Maybe this is an intervening void-equivalent node that we can also just stackify.
            if (IsDataFlowRoot(prev))
            {
                prev = StackifyTree(prev);
                continue;
            }

            // At this point, we'll have to modify the IR in some way. In general, these cases should be quite
            // rare, introduced in lowering only. All HIR-induced cases (such as from "gtSetEvalOrder") should
            // instead be ifdef-ed out for WASM.
            INDEBUG(const char* reason);
            if (CanMoveForward(node DEBUGARG(&reason)))
            {
                MoveForward(node, prev DEBUGARG(reason));
            }
            else
            {
                node = ReplaceWithTemporary(use, prev);
            }

            m_anyChanges = true;
```
```diff
@@ -540,6 +541,7 @@ void Lowering::AfterLowerBlocks()
     , m_compiler(lower->m_compiler)
     , m_stack(m_compiler->getAllocator(CMK_Lower))
     , m_minimumTempLclNum(m_compiler->lvaCount)
+    , m_maximumTempLclNum(m_compiler->lvaCount)
     {
```
m_maximumTempLclNum is introduced and initialized but never used. This is dead state and may trigger unused-private-field warnings on some toolchains; either remove it or use it to bound/restrict which temps are released/recycled (as originally intended).
```cpp
node = StackifyTree(node);

// We've finished processing the current root tree, so
// we can release any temps used in stackification of the tree,
// since there is no more risk of interference between tree operands.
ReleaseTemporaries();
}
```
ReleaseTemporaries() is called after every dataflow root, but it rebuilds the entire available-temp lists by iterating from m_minimumTempLclNum up to lvaCount each time. This changes the algorithm from freeing a single temp to O(totalTemps) work per root and could regress JIT throughput on large methods; consider tracking only temps used/created while stackifying the current root (e.g., with the intended bitset / min-max range) and releasing just those.
```diff
     void ReleaseTemporaries()
     {
-        // We rely in this function on the lifetime of temporaries beginning (recall this is backwards traversal)
-        // at exactly "node"'s position, and not shrinking or extending after this call. This is currently true
-        // because we never move dataflow roots, and we only begin processing them after all subsequent nodes
-        // have already been stackified and thus won't move either.
-        assert(IsDataFlowRoot(node));
-        if (!node->OperIs(GT_STORE_LCL_VAR))
+        if (m_minimumTempLclNum == m_compiler->lvaCount)
         {
+            // No temporaries were created
             return;
         }
+        assert(m_minimumTempLclNum < m_compiler->lvaCount);

-        unsigned lclNum = node->AsLclVar()->GetLclNum();
-        if (lclNum < m_minimumTempLclNum)
+        // Recycle all available temporaries as unused nodes
+        for (int i = 0; i < TYP_COUNT; i++)
         {
-            return;
+            while (m_availableTemps[i] != nullptr)
+            {
+                Temporary* temp = Remove(&m_availableTemps[i]);
+                Append(&m_unusedTempNodes, temp);
+            }
         }

-        Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
-        if (local == nullptr)
+        for (unsigned lclNum = m_minimumTempLclNum; lclNum < m_compiler->lvaCount; lclNum++)
         {
-            local = new (m_compiler, CMK_Lower) Temporary();
-        }
-        local->LclNum = lclNum;
+            Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
+            if (local == nullptr)
+            {
+                local = new (m_compiler, CMK_Lower) Temporary();
+            }
+            local->LclNum = lclNum;

-        JITDUMP("Temporary V%02u is now free and can be re-used\n", lclNum);
-        Append(&m_availableTemps[genActualType(node->TypeGet())], local);
+            JITDUMP("Temporary V%02u is now free and can be re-used\n", lclNum);
+            Append(&m_availableTemps[genActualType(m_compiler->lvaGetDesc(lclNum)->TypeGet())], local);
         }
     }
```
PR description says this fix “adds a bit set which tracks temporaries that can be freed after tree processing completes”, but the current implementation doesn’t add such a bitset and instead recycles all temps in [m_minimumTempLclNum, lvaCount) on every root. Either update the description to match the implementation, or implement the described per-tree tracking to avoid unintended behavior/perf costs.
This is a fix for an issue that came up in #126778, and is probably easiest to explain with a motivating example.
Consider the following case, where NOMOVE is a gentree operation we aren't allowed to move.
The stackifier will first introduce a store to put `t0` after `t1`:

And then recursively stackify the new STORE to tmp0, since it is a dataflow root.

The stackifier then marks tmp0 as free here, since it IS free in linear data flow order. Then, when the next operands to the call are stackified, the stackifier introduces a temporary again, but reuses t0 because we freed it.
This produces invalid LIR; there is a store to tmp0 before one of its reads (t2) is consumed.
The simplest fix is to not release temporaries for reuse until all operands of a root tree have been processed, so this PR adds a bit set which tracks temporaries that can be freed after tree processing completes.
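The deferred-release idea can be modeled with a toy temp pool, where releases requested while a root tree is still being processed are parked and only flushed at root boundaries. This is a minimal sketch under assumed names (`TempPool`, `MarkPendingRelease`, `FlushPendingReleases` are illustrative, not the actual JIT API):

```cpp
#include <vector>

// Toy model of the fix: temporaries "released" while a root tree is still
// being stackified go onto a pending list, and only become reusable once
// the whole root has been processed.
class TempPool
{
    std::vector<int> m_free;    // temps safe to hand out again
    std::vector<int> m_pending; // temps awaiting the root-tree boundary
    int              m_next = 0;

public:
    int Acquire()
    {
        if (!m_free.empty())
        {
            int temp = m_free.back();
            m_free.pop_back();
            return temp;
        }
        return m_next++; // no reusable temp: allocate a fresh one
    }

    // Within a root tree: defer instead of releasing immediately.
    void MarkPendingRelease(int temp)
    {
        m_pending.push_back(temp);
    }

    // At a root-tree boundary: everything pending becomes reusable.
    void FlushPendingReleases()
    {
        m_free.insert(m_free.end(), m_pending.begin(), m_pending.end());
        m_pending.clear();
    }
};
```

With immediate release, the second `Acquire()` inside the same root would hand back temp 0 while its store/read pair is still in flight, reproducing the invalid LIR above; deferring until `FlushPendingReleases()` forces a fresh temp instead, at the cost of slightly higher peak temp usage.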