[Wasm RyuJIT]: Fix LIR Semantics in Stackifier Output #127412
adamperlin wants to merge 7 commits into dotnet:main from
Conversation
By LIR semantics, we can't always reuse temporaries that appear to be available due to interference between nodes which share the same root tree.
Pull request overview
Fixes a Wasm RyuJIT stackifier correctness issue where temporaries introduced during stackification could be released and then reused too early (within the same root tree), producing invalid LIR due to store/read interference.
Changes:
- Introduce a “pending release” bitset to defer releasing stackifier temporaries until a full root tree finishes processing.
- Replace the immediate temporary release with `AddTemporariesForPendingRelease` + `RemovePendingTemporaries` at the end of root processing.
- Add dynamic growth logic for the pending-release bitset capacity.
```cpp
Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
if (local == nullptr)
JITDUMP("Stackifier pending release of lclNum: %d temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
```
lclNum is an unsigned, but the JITDUMP uses %d in the format string. Using %u avoids incorrect output if the value is large and matches the variable's type.
```diff
- JITDUMP("Stackifier pending release of lclNum: %d temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
+ JITDUMP("Stackifier pending release of lclNum: %u temporary defined by [%06u]\n", lclNum, Compiler::dspTreeID(node));
```
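To see why the specifier matters, here is a small standalone C++ sketch (illustrative only, not JIT code; `formatWith` is a hypothetical helper) showing the divergence for values above `INT_MAX`:

```cpp
#include <cstdio>
#include <string>

// Format an unsigned lclNum with the given printf-style specifier
// (hypothetical helper, for illustration only).
static std::string formatWith(const char* fmt, unsigned lclNum)
{
    char buf[32];
    std::snprintf(buf, sizeof(buf), fmt, lclNum);
    return buf;
}

// For lclNum values above INT_MAX, "%u" prints the intended value, while
// "%d" reads the argument as a signed int -- formally undefined behavior,
// which on typical two's-complement platforms shows up as a negative number.
```

For example, `formatWith("%u", 4000000000u)` yields `"4000000000"`, whereas `"%d"` would typically print a negative value for the same input.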
```cpp
// However, we don't know precisely where the liftime ends here, because uses of locals happen at their position
// in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
```
Typo in comment: "liftime" should be "lifetime" (and there appears to be trailing whitespace on these comment lines, which is worth removing to keep diffs clean).
```diff
- // However, we don't know precisely where the liftime ends here, because uses of locals happen at their position
- // in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
+ // However, we don't know precisely where the lifetime ends here, because uses of locals happen at their position
+ // in tree order, and not the LIR stream. So conservatively, we wait until we've processed an entire root gentree
```
```cpp
constexpr int tmpToLvaNum(unsigned tmpNum)
{
    assert(tmpNum >= 0);
```
tmpToLvaNum takes an unsigned tmpNum, so assert(tmpNum >= 0) is always true and doesn't add value. Consider removing it or changing the parameter type if negative values are meaningful here.
```diff
- assert(tmpNum >= 0);
```
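For illustration, a standalone snippet (hypothetical names, not the JIT's code) showing why the check is vacuous for an unsigned parameter:

```cpp
#include <cassert>

// An unsigned value compared >= 0 is always true, so an assert on it can
// never fire; compilers such as GCC flag this with -Wtype-limits
// ("comparison of unsigned expression >= 0 is always true").
constexpr bool checkNonNegative(unsigned tmpNum)
{
    return tmpNum >= 0; // tautology: holds for every possible input
}

static_assert(checkNonNegative(0u), "holds at the minimum");
static_assert(checkNonNegative(~0u), "holds even at the maximum value");
```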
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
AndyAyersMS left a comment
This seems plausible.
@SingleAccretion please take a look.
```cpp
, m_compiler(lower->m_compiler)
, m_stack(m_compiler->getAllocator(CMK_Lower))
, m_minimumTempLclNum(m_compiler->lvaCount)
// initially allocate 32 temp local slots for "pending release"
```
You might as well make this 64, since we'll be on a 64 bit host and the bitwise operations cost the same at 64 as they do at 32.
```cpp
void EnsurePendingReleaseCapacity(unsigned needed)
{
    if (needed < BitVecTraits::GetSize(&m_pendingReleaseTempTraits))
```
How often do we come anywhere near needing 32 simultaneously live store temps?
(or 64, with my suggested change above).
Anecdotally I'd say very rarely, though I don't have hard numbers on this! This was a pretty generous upper bound. I do think Single's suggestion of removing all temporaries at root boundaries would work, and that would avoid the need for this kind of tracking, so we may not need to track live temps in the end for a conservative approach.
It is unfortunate we have to compromise CQ a bit to retain this LIR invariant, even though it doesn't correspond to codegen constraints (stack operands really are used at the LIR position, unlike register operands).
But I wonder if we can simplify the fix to just do:

```cpp
if (initialDepth == 0)
    ReleaseAllTemps();
```

I.e. only release the temporaries at statement boundaries.
Have you thought about what a "precise" fix would look like? A temporary can be used if it doesn't have refs between the current 'prev' position and 'use's parent. Tracking the parent on the stack is easy enough, tracking 'busy' temps considering the shifting position of both 'prev' and 'parent' seems trickier.
I do think this approach would work and this would remove the need for tracking, so I'm going to give this approach a try.
I haven't given this much thought since it seemed tricky to get right, as you mention! I do think this would be nice to have if it turns out not to be too difficult. If you have any thoughts on how we might do this efficiently, I'd definitely be interested!
No, not really. It seems it would require quite careful tracking for what in the end is still going to be a suboptimal result.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/coreclr/jit/lowerwasm.cpp:604
- This change addresses a subtle LIR correctness issue in the wasm stackifier; it would be good to add a targeted regression test (likely under src/tests/JIT/Regression) that produces a call with a non-movable target/arg ordering similar to the motivating example, and validates the method compiles/runs correctly under wasm RyuJIT. As-is, the fix is not protected against future refactors.
```cpp
GenTree* StackifyTree(GenTree* root)
{
    int initialDepth = m_stack.Height();

    // Simple greedy algorithm working backwards. The invariant is that the stack top must be placed right next
    // to (in normal linear order - before) the node we last stackified.
    m_stack.Push(&root);

    GenTree* lastStackified = root->gtNext;
    while (m_stack.Height() != initialDepth)
    {
        GenTree** use  = m_stack.Pop();
        GenTree*  node = *use;
        GenTree*  prev = (lastStackified != nullptr) ? lastStackified->gtPrev : root;

        while (node != prev)
        {
            // Maybe this is an intervening void-equivalent node that we can also just stackify.
            if (IsDataFlowRoot(prev))
            {
                prev = StackifyTree(prev);
                continue;
            }

            // At this point, we'll have to modify the IR in some way. In general, these cases should be quite
            // rare, introduced in lowering only. All HIR-induced cases (such as from "gtSetEvalOrder") should
            // instead be ifdef-ed out for WASM.
            INDEBUG(const char* reason);
            if (CanMoveForward(node DEBUGARG(&reason)))
            {
                MoveForward(node, prev DEBUGARG(reason));
            }
            else
            {
                node = ReplaceWithTemporary(use, prev);
            }

            m_anyChanges = true;
```
```diff
@@ -540,6 +541,7 @@ void Lowering::AfterLowerBlocks()
     , m_compiler(lower->m_compiler)
     , m_stack(m_compiler->getAllocator(CMK_Lower))
     , m_minimumTempLclNum(m_compiler->lvaCount)
+    , m_maximumTempLclNum(m_compiler->lvaCount)
     {
```
m_maximumTempLclNum is introduced and initialized but never used. This is dead state and may trigger unused-private-field warnings on some toolchains; either remove it or use it to bound/restrict which temps are released/recycled (as originally intended).
```cpp
node = StackifyTree(node);

// We've finished processing the current root tree, so
// we can release any temps used in stackification of the tree,
// since there is no more risk of interference between tree operands.
ReleaseTemporaries();
}
```
ReleaseTemporaries() is called after every dataflow root, but it rebuilds the entire available-temp lists by iterating from m_minimumTempLclNum up to lvaCount each time. This changes the algorithm from freeing a single temp to O(totalTemps) work per root and could regress JIT throughput on large methods; consider tracking only temps used/created while stackifying the current root (e.g., with the intended bitset / min-max range) and releasing just those.
```diff
     void ReleaseTemporaries()
     {
-        // We rely in this function on the lifetime of temporaries beginning (recall this is backwards traversal)
-        // at exactly "node"'s position, and not shrinking or extending after this call. This is currently true
-        // because we never move dataflow roots, and we only begin processing them after all subsequent nodes
-        // have already been stackified and thus won't move either.
-        assert(IsDataFlowRoot(node));
-        if (!node->OperIs(GT_STORE_LCL_VAR))
+        if (m_minimumTempLclNum == m_compiler->lvaCount)
         {
+            // No temporaries were created
             return;
         }
+        assert(m_minimumTempLclNum < m_compiler->lvaCount);

-        unsigned lclNum = node->AsLclVar()->GetLclNum();
-        if (lclNum < m_minimumTempLclNum)
+        // Recycle all available temporaries as unused nodes
+        for (int i = 0; i < TYP_COUNT; i++)
         {
-            return;
+            while (m_availableTemps[i] != nullptr)
+            {
+                Temporary* temp = Remove(&m_availableTemps[i]);
+                Append(&m_unusedTempNodes, temp);
+            }
         }

-        Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
-        if (local == nullptr)
+        for (unsigned lclNum = m_minimumTempLclNum; lclNum < m_compiler->lvaCount; lclNum++)
         {
-            local = new (m_compiler, CMK_Lower) Temporary();
-        }
-        local->LclNum = lclNum;
+            Temporary* local = Remove(&m_unusedTempNodes); // See if we have any free nodes in the pool.
+            if (local == nullptr)
+            {
+                local = new (m_compiler, CMK_Lower) Temporary();
+            }
+            local->LclNum = lclNum;

-        JITDUMP("Temporary V%02u is now free and can be re-used\n", lclNum);
-        Append(&m_availableTemps[genActualType(node->TypeGet())], local);
+            JITDUMP("Temporary V%02u is now free and can be re-used\n", lclNum);
+            Append(&m_availableTemps[genActualType(m_compiler->lvaGetDesc(lclNum)->TypeGet())], local);
         }
     }
```
PR description says this fix “adds a bit set which tracks temporaries that can be freed after tree processing completes”, but the current implementation doesn’t add such a bitset and instead recycles all temps in [m_minimumTempLclNum, lvaCount) on every root. Either update the description to match the implementation, or implement the described per-tree tracking to avoid unintended behavior/perf costs.
This is a fix for an issue that came up in #126778, and is probably easiest to explain with a motivating example.
Consider the following case, where NOMOVE is a gentree operation we aren't allowed to move.
The stackifier will first introduce a store to put `t0` after `t1`:

And then recursively stackify the new STORE to tmp0, since it is a dataflow root.

The stackifier then marks tmp0 as free here, since it IS free in linear data flow order. Then, when the next operands to the call are stackified, the stackifier introduces a temporary again, but reuses t0 because we freed it.
This produces invalid LIR; there is a store to tmp0 before one of its reads (t2) is consumed.
The simplest fix is to not release temporaries for reuse until all operands of a root tree have been processed, so this PR adds a bit set which tracks temporaries that can be freed after tree processing completes.
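The deferred-release idea can be modeled with a toy temp pool, where releases requested while a root tree is still being processed are parked and only flushed at root boundaries. This is a minimal sketch under assumed names (`TempPool`, `MarkPendingRelease`, `FlushPendingReleases` are illustrative, not the actual JIT API):

```cpp
#include <vector>

// Toy model of the fix: temporaries "released" while a root tree is still
// being stackified go onto a pending list, and only become reusable once
// the whole root has been processed.
class TempPool
{
    std::vector<int> m_free;    // temps safe to hand out again
    std::vector<int> m_pending; // temps awaiting the root-tree boundary
    int              m_next = 0;

public:
    int Acquire()
    {
        if (!m_free.empty())
        {
            int temp = m_free.back();
            m_free.pop_back();
            return temp;
        }
        return m_next++; // no reusable temp: allocate a fresh one
    }

    // Within a root tree: defer instead of releasing immediately.
    void MarkPendingRelease(int temp)
    {
        m_pending.push_back(temp);
    }

    // At a root-tree boundary: everything pending becomes reusable.
    void FlushPendingReleases()
    {
        m_free.insert(m_free.end(), m_pending.begin(), m_pending.end());
        m_pending.clear();
    }
};
```

With immediate release, the second `Acquire()` inside the same root would hand back temp 0 while its store/read pair is still in flight, reproducing the invalid LIR above; deferring until `FlushPendingReleases()` forces a fresh temp instead, at the cost of slightly higher peak temp usage.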