[Wasm RyuJIT]: Fix LIR Semantics in Stackifier Output #127412

Open
adamperlin wants to merge 7 commits into dotnet:main from adamperlin:adamperlin/wasm-fix-stackify-lir-semantics

Contributor

@adamperlin adamperlin commented Apr 24, 2026

This is a fix for an issue that came up in #126778, and is probably easiest to explain with a motivating example.

Consider the following case, where NOMOVE marks a GenTree node that we aren't allowed to move.

t2 = ... NOMOVE OP
t3 = ... OP
t0 = ... NOMOVE OP
t1 = ... OP
 * t3 (arg1)
 * t2 (arg2)
 * t1 (arg3)
 * t0 (target)
 CALL

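The stackifier works backwards from the CALL. The call target t0 has to sit immediately before the CALL, but t1 is in the way and t0 cannot be moved, so the stackifier first stores t0 to a temporary and reloads it after t1: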

t2 = ... NOMOVE OP
t3 = ... OP
t0 = ... OP
    +** STORE_LCL_VAR tmp0
t1 = ... OP
t0 = LCL_VAR tmp0
 * t3 (arg1)
 * t2 (arg2)
 * t1 (arg3)
 * t0 (call target)
 CALL

It then recursively stackifies the new STORE_LCL_VAR to tmp0, since the store is a dataflow root.
At that point the stackifier marks tmp0 as free, because it IS free in linear dataflow order. When the remaining operands of the call are
stackified, the stackifier needs a temporary again (this time for t2) and reuses tmp0,
because we just freed it:

t2 = ... OP
    +** STORE_LCL_VAR tmp0
t3 = ... OP
t2 = LCL_VAR tmp0
t0 = ... OP
    +** STORE_LCL_VAR tmp0
t1 = ... OP
t0 = LCL_VAR tmp0
 * t3
 * t2
 * t1
 * t0 (target)
 CALL

This produces invalid LIR: the second store to tmp0 (the one holding t0) occurs before the earlier read of tmp0 (t2) has been consumed by the CALL.

The simplest fix is to not release temporaries for reuse until all operands of a root tree have been processed, so this PR adds a bit set which tracks temporaries that can be freed after tree processing completes.

Under LIR semantics, we can't always reuse temporaries that appear to be
available, because nodes that share the same root tree can interfere with each other.
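
To make the hazard concrete, here is a minimal toy model of a temporary pool. It is only a sketch: `TempPool`, `Acquire`, `Release`, `FinishRoot`, and `SpillsCollide` are hypothetical names, not the stackifier's real data structures. It contrasts releasing a temporary as soon as its store has been stackified with deferring the release until the whole root tree is done.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy model; not the JIT's actual temporary management.
class TempPool
{
public:
    explicit TempPool(bool deferRelease)
        : m_deferRelease(deferRelease)
    {
    }

    // Hands out a free temporary if one exists, otherwise creates a new one.
    unsigned Acquire()
    {
        if (!m_free.empty())
        {
            unsigned tmp = m_free.back();
            m_free.pop_back();
            return tmp;
        }
        return m_nextTemp++;
    }

    // Called once the store that defines the temporary has been stackified.
    void Release(unsigned tmp)
    {
        assert(tmp < m_nextTemp);
        if (m_deferRelease)
        {
            m_pending.push_back(tmp); // not reusable until the root is finished
        }
        else
        {
            m_free.push_back(tmp); // immediately reusable (the problematic behavior)
        }
    }

    // Called after all operands of the current root tree have been processed.
    void FinishRoot()
    {
        m_free.insert(m_free.end(), m_pending.begin(), m_pending.end());
        m_pending.clear();
    }

private:
    bool                  m_deferRelease;
    unsigned              m_nextTemp = 0;
    std::vector<unsigned> m_free;
    std::vector<unsigned> m_pending;
};

// Spills two operands of the same root (t0, then t2, as in the example above)
// and reports whether both spills were given the same temporary.
bool SpillsCollide(bool deferRelease)
{
    TempPool pool(deferRelease);
    unsigned forT0 = pool.Acquire();
    pool.Release(forT0);
    unsigned forT2 = pool.Acquire();
    pool.Release(forT2);
    pool.FinishRoot();
    return forT0 == forT2;
}
```

With immediate release, `SpillsCollide(false)` reports a collision, which corresponds to the invalid LIR above. With deferred release, the two spills get distinct temporaries, at the cost of one extra local for this root.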
Copilot AI review requested due to automatic review settings April 24, 2026 22:54
@adamperlin adamperlin added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-VM-coreclr labels Apr 24, 2026
@adamperlin adamperlin added this to the 11.0.0 milestone Apr 24, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes a Wasm RyuJIT stackifier correctness issue where temporaries introduced during stackification could be released and then reused too early (within the same root tree), producing invalid LIR due to store/read interference.

Changes:

  • Introduce a “pending release” bitset to defer releasing stackifier temporaries until a full root tree finishes processing.
  • Replace immediate temporary release with AddTemporariesForPendingRelease + RemovePendingTemporaries at the end of root processing.
  • Add dynamic growth logic for the pending-release bitset capacity.
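
As a rough illustration of the "pending release" set with on-demand growth described in the list above, the following sketch uses `std::vector<uint64_t>` as backing storage. The class name `PendingReleaseSet` and its methods are hypothetical, and the real change uses the JIT's own bit vector types rather than the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a "pending release" set keyed by temp index.
class PendingReleaseSet
{
public:
    // Marks a temp as pending release, growing the storage if needed.
    void Add(unsigned tmpNum)
    {
        const std::size_t word = tmpNum / 64;
        if (word >= m_words.size())
        {
            m_words.resize(word + 1, 0);
        }
        assert(word < m_words.size());
        m_words[word] |= std::uint64_t{1} << (tmpNum % 64);
    }

    bool Contains(unsigned tmpNum) const
    {
        const std::size_t word = tmpNum / 64;
        return (word < m_words.size()) && (((m_words[word] >> (tmpNum % 64)) & 1) != 0);
    }

    // Returns all pending temps in ascending order and empties the set.
    std::vector<unsigned> Drain()
    {
        std::vector<unsigned> released;
        for (std::size_t word = 0; word < m_words.size(); word++)
        {
            for (unsigned bit = 0; bit < 64; bit++)
            {
                if (((m_words[word] >> bit) & 1) != 0)
                {
                    released.push_back(static_cast<unsigned>(word * 64 + bit));
                }
            }
            m_words[word] = 0;
        }
        return released;
    }

private:
    std::vector<std::uint64_t> m_words;
};
```

In this model, `Add` would be called where a temporary used to be released immediately, and `Drain` once the root tree has been fully processed.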

Comment thread src/coreclr/jit/lowerwasm.cpp Outdated

Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
if (local == nullptr)
JITDUMP("Stackifier pending release of lclNum: %d temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));

Copilot AI Apr 24, 2026

lclNum is an unsigned, but the JITDUMP uses %d in the format string. Using %u avoids incorrect output if the value is large and matches the variable's type.

Suggested change
- JITDUMP("Stackifier pending release of lclNum: %d temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
+ JITDUMP("Stackifier pending release of lclNum: %u temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
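
A small standalone sketch of the point about format specifiers, using `std::snprintf` rather than `JITDUMP`. The helper name `FormatLclNum` is hypothetical, and the example assumes a 32-bit `unsigned`, as on mainstream platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a local number with the specifier that matches its type.
std::string FormatLclNum(unsigned lclNum)
{
    char buffer[16];
    int  written = std::snprintf(buffer, sizeof(buffer), "%u", lclNum);
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof(buffer));
    (void)written; // silence unused-variable warnings when asserts are compiled out
    return std::string(buffer);
}
```

Values above `INT_MAX`, such as `3000000000u`, keep their intended decimal form with `%u`, whereas `%d` does not match the argument type for such values.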

Comment thread src/coreclr/jit/lowerwasm.cpp Outdated
Comment on lines +733 to +734
// However, we don't know precisely where the liftime ends here, because uses of locals happen at their position
// in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree

Copilot AI Apr 24, 2026

Typo in comment: "liftime" should be "lifetime" (and there appears to be trailing whitespace on these comment lines, which is worth removing to keep diffs clean).

Suggested change
- // However, we don't know precisely where the liftime ends here, because uses of locals happen at their position
- // in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
+ // However, we don't know precisely where the lifetime ends here, because uses of locals happen at their position
+ // in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree

Comment thread src/coreclr/jit/lowerwasm.cpp Outdated

constexpr int tmpToLvaNum(unsigned tmpNum)
{
assert(tmpNum >= 0);

Copilot AI Apr 24, 2026

tmpToLvaNum takes an unsigned tmpNum, so assert(tmpNum >= 0) is always true and doesn't add value. Consider removing it or changing the parameter type if negative values are meaningful here.

Suggested change
- assert(tmpNum >= 0);
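
To illustrate why a non-negativity check on an unsigned parameter cannot fire, here is a short sketch. `AsTempNumber`, `CheckedTempNumber`, and `kMaxTemps` are hypothetical names introduced only for this example; the bound of 1024 is arbitrary.

```cpp
#include <cassert>
#include <climits>

// What an unsigned parameter receives when a caller hands it an int.
constexpr unsigned AsTempNumber(int value)
{
    return static_cast<unsigned>(value);
}

// A negative int wraps to a huge unsigned value, so "tmpNum >= 0" would still hold.
static_assert(AsTempNumber(-1) == UINT_MAX, "negative values wrap instead of staying negative");

// Hypothetical alternative: an upper-bound check does catch wrapped negatives.
constexpr unsigned kMaxTemps = 1024;

inline unsigned CheckedTempNumber(unsigned tmpNum)
{
    assert(tmpNum < kMaxTemps);
    return tmpNum;
}
```

If catching bad inputs is the goal, a check against a known upper bound is one option; otherwise the assertion can simply be dropped, as suggested.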

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Member

@AndyAyersMS AndyAyersMS left a comment

This seems plausible.

@SingleAccretion please take a look.

Comment thread src/coreclr/jit/lowerwasm.cpp Outdated
, m_compiler(lower->m_compiler)
, m_stack(m_compiler->getAllocator(CMK_Lower))
, m_minimumTempLclNum(m_compiler->lvaCount)
// initially allocate 32 temp local slots for "pending release"
Member

You might as well make this 64, since we'll be on a 64-bit host and the bitwise operations cost the same at 64 as they do at 32.

Comment thread src/coreclr/jit/lowerwasm.cpp Outdated

void EnsurePendingReleaseCapacity(unsigned needed)
{
if (needed < BitVecTraits::GetSize(&m_pendingReleaseTempTraits))
Member

How often do we come anywhere near needing 32 simultaneously live store temps?

Member

(or 64, with my suggested change above).

Contributor Author

Anecdotally I'd say very rarely, though I don't have hard numbers on this! This was a pretty generous upper bound. I do think Single's suggestion of removing all temporaries at root boundaries would work, and that would avoid the need for this kind of tracking, so we may not need to track live temps in the end for a conservative approach.

Contributor

@SingleAccretion SingleAccretion left a comment

It is unfortunate we have to compromise CQ a bit to retain this LIR invariant, even though it doesn't correspond to codegen constraints (stack operands really are used at the LIR position, unlike register operands).

But I wonder if we can simplify the fix to just do:

if (initialDepth == 0)
   ReleaseAllTemps();

I.e., only release the temporaries at statement boundaries.

Have you thought about what a "precise" fix would look like? A temporary can be used if it doesn't have refs between the current 'prev' position and 'use's parent. Tracking the parent on the stack is easy enough, tracking 'busy' temps considering the shifting position of both 'prev' and 'parent' seems trickier.
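
The simplified approach above can be sketched with a toy model. Everything here is hypothetical (`ToyStackifier`, `AllocateTemp`, the `nestedRoots` parameter); in particular, it assumes that the outer tree still has an entry on the stack while a nested dataflow root is being processed, which may not hold at every point in the real stackifier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: the names mirror the discussion, but none of this is the JIT's code.
struct ToyStackifier
{
    std::vector<int>      stack;            // stands in for m_stack
    std::vector<unsigned> inUse;            // temps handed out since the last statement boundary
    std::vector<unsigned> available;        // temps that may be reused
    unsigned              nextTemp     = 0;
    unsigned              releaseCalls = 0;

    unsigned AllocateTemp()
    {
        unsigned tmp;
        if (!available.empty())
        {
            tmp = available.back();
            available.pop_back();
        }
        else
        {
            tmp = nextTemp++;
        }
        inUse.push_back(tmp);
        return tmp;
    }

    void ReleaseAllTemps()
    {
        available.insert(available.end(), inUse.begin(), inUse.end());
        inUse.clear();
        releaseCalls++;
    }

    // Pretends every tree needs one spill and contains "nestedRoots" nested dataflow roots.
    void StackifyTree(int nestedRoots)
    {
        const std::size_t initialDepth = stack.size();
        stack.push_back(nestedRoots);
        AllocateTemp();
        if (nestedRoots > 0)
        {
            StackifyTree(nestedRoots - 1); // assumed to run while the outer entry is still on the stack
        }
        assert(!stack.empty());
        stack.pop_back();
        if (initialDepth == 0)
        {
            ReleaseAllTemps(); // only at the outermost level, i.e. a statement boundary
        }
    }
};
```

In this model, a statement with two nested roots uses three distinct temporaries and releases them in a single call, after which the next statement can reuse them. No per-temporary tracking is needed, which is the simplification being proposed.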

@adamperlin
Contributor Author

> But I wonder if we can simplify the fix to just do:
>
> if (initialDepth == 0)
>    ReleaseAllTemps();
>
> I.e., only release the temporaries at statement boundaries.

I do think this approach would work and this would remove the need for tracking, so I'm going to give this approach a try.

> Have you thought about what a "precise" fix would look like? A temporary can be used if it doesn't have refs between the current 'prev' position and 'use's parent. Tracking the parent on the stack is easy enough, tracking 'busy' temps considering the shifting position of both 'prev' and 'parent' seems trickier.

I haven't given this much thought since it seemed tricky to get right as you mention! I do think this would be nice to have if it turns out not to be too difficult. If you have any thoughts on how we might do this efficiently, I'd definitely be interested!

@SingleAccretion
Contributor

> If you have any thoughts on how we might do this efficiently, I'd definitely be interested!

No, not really. It seems it would require quite careful tracking for what in the end is still going to be a suboptimal result.

Copilot AI review requested due to automatic review settings April 28, 2026 20:59
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/coreclr/jit/lowerwasm.cpp:604

  • This change addresses a subtle LIR correctness issue in the wasm stackifier; it would be good to add a targeted regression test (likely under src/tests/JIT/Regression) that produces a call with a non-movable target/arg ordering similar to the motivating example, and validates the method compiles/runs correctly under wasm RyuJIT. As-is, the fix is not protected against future refactors.
        GenTree* StackifyTree(GenTree* root)
        {
            int initialDepth = m_stack.Height();

            // Simple greedy algorithm working backwards. The invariant is that the stack top must be placed right next
            // to (in normal linear order - before) the node we last stackified.
            m_stack.Push(&root);

            GenTree* lastStackified = root->gtNext;
            while (m_stack.Height() != initialDepth)
            {
                GenTree** use  = m_stack.Pop();
                GenTree*  node = *use;
                GenTree*  prev = (lastStackified != nullptr) ? lastStackified->gtPrev : root;
                while (node != prev)
                {
                    // Maybe this is an intervening void-equivalent node that we can also just stackify.
                    if (IsDataFlowRoot(prev))
                    {
                        prev = StackifyTree(prev);
                        continue;
                    }

                    // At this point, we'll have to modify the IR in some way. In general, these cases should be quite
                    // rare, introduced in lowering only. All HIR-induced cases (such as from "gtSetEvalOrder") should
                    // instead be ifdef-ed out for WASM.
                    INDEBUG(const char* reason);
                    if (CanMoveForward(node DEBUGARG(&reason)))
                    {
                        MoveForward(node, prev DEBUGARG(reason));
                    }
                    else
                    {
                        node = ReplaceWithTemporary(use, prev);
                    }
                    m_anyChanges = true;

Comment on lines 532 to 545
@@ -540,6 +541,7 @@ void Lowering::AfterLowerBlocks()
, m_compiler(lower->m_compiler)
, m_stack(m_compiler->getAllocator(CMK_Lower))
, m_minimumTempLclNum(m_compiler->lvaCount)
, m_maximumTempLclNum(m_compiler->lvaCount)
{

Copilot AI Apr 28, 2026

m_maximumTempLclNum is introduced and initialized but never used. This is dead state and may trigger unused-private-field warnings on some toolchains; either remove it or use it to bound/restrict which temps are released/recycled (as originally intended).

Comment on lines 556 to 561
node = StackifyTree(node);
// We've finished processing the current root tree, so
// we can release any temps used in stackification of the tree,
// since there is no more risk of interference between tree operands.
ReleaseTemporaries();
}

Copilot AI Apr 28, 2026

ReleaseTemporaries() is called after every dataflow root, but it rebuilds the entire available-temp lists by iterating from m_minimumTempLclNum up to lvaCount each time. This changes the algorithm from freeing a single temp to O(totalTemps) work per root and could regress JIT throughput on large methods; consider tracking only temps used/created while stackifying the current root (e.g., with the intended bitset / min-max range) and releasing just those.

Comment on lines +710 to 741
void ReleaseTemporaries()
{
// We rely in this function on the lifetime of temporaries beginning (recall this is backwards traversal)
// at exactly "node"'s position, and not shrinking or extending after this call. This is currently true
// because we never move dataflow roots, and we only begin processing them after all subsequent nodes
// have already been stackified and thus won't move either.
assert(IsDataFlowRoot(node));
if (!node->OperIs(GT_STORE_LCL_VAR))
if (m_minimumTempLclNum == m_compiler->lvaCount)
{
// No temporaries were created
return;
}
assert(m_minimumTempLclNum < m_compiler->lvaCount);

unsigned lclNum = node->AsLclVar()->GetLclNum();
if (lclNum < m_minimumTempLclNum)
// Recycle all available temporaries as unused nodes
for (int i = 0; i < TYP_COUNT; i++)
{
return;
while (m_availableTemps[i] != nullptr)
{
Temporary* temp = Remove(&m_availableTemps[i]);
Append(&m_unusedTempNodes, temp);
}
}

Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
if (local == nullptr)
for (unsigned lclNum = m_minimumTempLclNum; lclNum < m_compiler->lvaCount; lclNum++)
{
local = new (m_compiler, CMK_Lower) Temporary();
}
local->LclNum = lclNum;
Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
if (local == nullptr)
{
local = new (m_compiler, CMK_Lower) Temporary();
}
local->LclNum = lclNum;

JITDUMP("Temporary V%02u is now free and can be re-used\n", lclNum);
Append(&m_availableTemps[genActualType(node->TypeGet())], local);
JITDUMP("Temporary V%02u is now free and can be re-used\n", lclNum);
Append(&m_availableTemps[genActualType(m_compiler->lvaGetDesc(lclNum)->TypeGet())], local);
}
}

Copilot AI Apr 28, 2026

PR description says this fix “adds a bit set which tracks temporaries that can be freed after tree processing completes”, but the current implementation doesn’t add such a bitset and instead recycles all temps in [m_minimumTempLclNum, lvaCount) on every root. Either update the description to match the implementation, or implement the described per-tree tracking to avoid unintended behavior/perf costs.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

4 participants