[WIP] Stop using LIST nodes for SIMD operand lists #1141

mikedn · 2019-12-24T08:34:37Z

Contributes to https://github.com/dotnet/coreclr/issues/19876

These 2 intrinsics don't have a lot in common.

mikedn · 2019-12-26T20:38:51Z

This completes the removal of LIST by changing the way SIMD and HWINTRINSIC operands are stored. I still have a few things to double check and this also has to be split into multiple PRs (I'd say 3 - one for SIMD, one for HWINTRINSIC and one for LIST leftovers removal) but I'm interested in some early feedback about the way operands are stored.

For CALL, PHI and FIELD_LIST I continued using linked lists to store operands, partly because sometimes we need to insert new operands and partly because accessing operands by index is uncommon. For SIMD/HWINTRINSIC the situation is exactly the opposite - no need to insert new operands and operand access by index is rather common - so an array of operands looks like the better choice.

We have space for 3 operands inside the SIMD node itself, which is good because it covers 99% of the intrinsic needs. For 4 operands or more (e.g. new Vector4(1, 2, 3, 4)) all the operands are stored in a separately allocated array, and a pointer to this array is stored inside the node in place of the 3-operand "inline" array:

union {
    Use  m_inlineUses[3];
    Use* m_uses;
};

For some HW intrinsics the number of operands isn't fixed and in the current implementation it is computed by counting the number of nodes in the list. That doesn't work in the array implementation so we also need to store the number of operands in the node.

uint16_t m_numOps;

With these, the operands can be accessed using the following API:

unsigned GetNumOps();
GenTree* GetOp(unsigned index);
void SetOp(unsigned index, GenTree* node);
UseArray Uses(); // range-based for loop support
Use& GetUse(unsigned index);
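
The storage scheme and accessors described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: `Node` stands in for GenTree, raw `Node*` pointers stand in for `Use`, the class name `OperandList` is hypothetical, and `new`/`delete` replace whatever allocator the JIT actually uses.

```cpp
#include <cassert>
#include <cstdint>

struct Node { int id; }; // hypothetical stand-in for GenTree

// Hypothetical sketch: up to 3 operands are stored inline, 4 or more are
// stored in a separately allocated array that shares the same storage.
class OperandList
{
    static constexpr unsigned InlineCount = 3;

    union {
        Node*  m_inlineOps[InlineCount];
        Node** m_ops;
    };
    uint16_t m_numOps;

    bool HasInlineOps() const { return m_numOps <= InlineCount; }

    Node** Ops()
    {
        if (HasInlineOps()) return m_inlineOps;
        return m_ops;
    }

    Node* const* Ops() const
    {
        if (HasInlineOps()) return m_inlineOps;
        return m_ops;
    }

public:
    explicit OperandList(unsigned numOps) : m_numOps(static_cast<uint16_t>(numOps))
    {
        assert(numOps <= UINT16_MAX);
        if (HasInlineOps())
        {
            for (unsigned i = 0; i < InlineCount; i++) m_inlineOps[i] = nullptr;
        }
        else
        {
            m_ops = new Node*[numOps](); // value-initialized to nullptr
        }
    }

    ~OperandList()
    {
        if (!HasInlineOps()) delete[] m_ops;
    }

    OperandList(const OperandList&) = delete;
    OperandList& operator=(const OperandList&) = delete;

    unsigned GetNumOps() const { return m_numOps; }

    // 0-based access; an out-of-range index asserts instead of returning null.
    Node* GetOp(unsigned index) const
    {
        assert(index < m_numOps);
        return Ops()[index];
    }

    void SetOp(unsigned index, Node* node)
    {
        assert(index < m_numOps);
        Ops()[index] = node;
    }

    // Range-based for loop support.
    Node** begin() { return Ops(); }
    Node** end() { return Ops() + m_numOps; }
};
```

The stored operand count is what lets the accessors decide which union member is active, which is why the count cannot be derived by walking a list anymore.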

So:

Having to use a separate array is a bit unfortunate for just 4 operands. It might be interesting to relax node sizing and allow node size between TREE_NODE_SZ_SMALL and TREE_NODE_SZ_LARGE. Not sure how feasible that is and not sure how far we can go with it - if we want to allow a Create intrinsic with 32 operands do we really want to place all 32 inside the node?
Operands are accessed by a 0-based index, so it's GetOp(0) & GetOp(1) instead of gtGetOp1() & gtGetOp2(). That may be a bit confusing at first. I suppose I can try to use 1-based indices but I'm not convinced that doing so is without drawbacks.
GenTreeSIMD & GenTreeHWIntrinsic are no longer GenTreeOp. That means that attempts at using gtGetOp1() & gtGetOp2() will fail.
GetOp(1) asserts rather than returning null like gtGetOp2 does when the node is unary. I think that's the right way to do this but unfortunately the current HW intrinsic codegen is poorly structured - code is grouped by ISA rather than by intrinsic shape/arity and that complicates things. SIMD codegen is better in this regard, with its genSIMDIntrinsicUnOp and genSIMDIntrinsicBinOp.

Other issues worth mentioning:
Unlike PHI and other previous LIST users, SIMD and HWINTRINSIC do support GTF_REVERSE_OPS. This needs to continue to work in the new operand representation so some logic needs to be copied. Not a big deal, and maybe it's for the best. gtSetEvalOrder support for intrinsics is minimal anyway.
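
As a rough illustration of what a reverse-ops flag means for an operand array, here is a minimal sketch. It assumes that reversal only applies to nodes with exactly 2 operands; `Node` and `OperandsInExecutionOrder` are hypothetical names, not JIT APIs.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Node { int id; }; // hypothetical stand-in for GenTree

// Returns the operands in the order they would be evaluated.
// Assumption: the reverse flag is only meaningful for binary nodes, where
// it means "evaluate the second operand before the first".
std::vector<Node*> OperandsInExecutionOrder(Node* const* ops, unsigned numOps, bool reverseOps)
{
    std::vector<Node*> order(ops, ops + numOps);
    if (reverseOps)
    {
        assert(numOps == 2);
        std::swap(order[0], order[1]);
    }
    return order;
}
```

Any traversal that cares about execution order needs an equivalent of this check, which is the logic that has to be copied once the nodes stop being GenTreeOp.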

Biggest remaining issue:
All the operand logic is duplicated in GenTreeSIMD and GenTreeHWIntrinsic. It really should be common (in GenTreeJitIntrinsic base class) but I can't put it there due to the fact that they use different intrinsic enumerations and alignment holes prevent placing some data members in the base class and some in the derived class.
I think that the only way out of this is to put all data members in the base class but thanks to the widespread bad practice of making data members public this is more difficult than it needs to be.

Comments?

CarolEidt · 2019-12-27T18:07:14Z

General Comments (will do more detailed review next):

It might be interesting to relax node sizing and allow node size between TREE_NODE_SZ_SMALL and TREE_NODE_SZ_LARGE.

I'm not sure how much simplification this would buy us, and it would certainly cost some non-trivial implementation work, if not additional complexity.

On the issue of 0-based vs 1-based indexing, I would strongly favor maintaining 0-based indexing. I'm sure that I would be more confused by having to remember that the indexed form is 1-based to match the operand names. But I'd be interested in others' opinions on this.

GenTreeSIMD & GenTreeHWIntrinsic are no longer GenTreeOp. That means that attempts at using gtGetOp1() & gtGetOp2() will fail.

The only thing that's unfortunate about this is more conceptual than practical. That is, one would like to consider (most of?) these to have functional operator semantics, which to me seems to be implied by GenTreeOp but since that is already more of a structural characteristic than a semantic one I'm not sure it's even conceptually useful.

On the surface, the idea of sharing more between GenTreeSIMD and GenTreeHWIntrinsic is appealing, but I've not considered it in great depth.

I think that the only way out of this is to put all data members in the base class but thanks to the widespread bad practice of making data members public this is more difficult than it needs to be.

Agreed. I'm sure I'm one of the guilty parties in getting us here.

mikedn · 2019-12-27T19:04:13Z

The only thing that's unfortunate about this is more conceptual than practical. That is, one would like to consider (most of?) these to have functional operator semantics, which to me seems to be implied by GenTreeOp but since that is already more of a structural characteristic than a semantic one I'm not sure it's even conceptually useful.

I think it would have been useful to represent true unary/binary intrinsics using GenTreeUnOp/GenTreeOp but the only reasonable way to do that seems to be to add GT_SIMD_UNARY/GT_SIMD_BINARY to avoid the current weirdness - hey, this is a GTK_BINOP but the second operand is actually null and the first is a list of 3 operands!?! And we still need to deal with intrinsics with 3 operands and more somehow. I pondered this for a while but for some reason attempting to throw GT_SIMD_BINARY into the mix seems like a far crazier change than the current one.

What's there to lose by not using GenTreeOp/GTK_BINOP? The obvious problem is that we need to duplicate the reverse ops logic. But I don't think it's a big deal, mainly because this logic isn't very good even today - for many 3 operand HWINTRINSICs the last operand is a constant so for Sethi–Ullman numbering they're really 2 operand nodes. But the current implementation basically treats these as unary and is unable to reverse the first 2 operands if needed.

mikedn · 2019-12-27T19:15:09Z

Agreed. I'm sure I'm one of the guilty parties in getting us here.

Ha ha, everyone is, me included :). Sometimes it's difficult to figure out when to move away from existing code base practices. Doing so can make new code better at the cost of becoming inconsistent with the old code.

CarolEidt

Overall the direction looks good to me, with some comments, suggestions and questions.

CarolEidt · 2019-12-27T18:39:48Z

src/coreclr/src/jit/gentree.h

In order to reduce confusion it might be good to give this a different name such as GetIndexedOp or GetOpAtIndex. Even though they're verbose, it would at least reduce confusion with GetOp1 and GetOp2.

Makes sense but I'm a bit concerned about the longer names as these are commonly used functions.

Right, I get that - I'd be interested in others' thoughts on this. @dotnet/jit-contrib

src/coreclr/src/jit/codegencommon.cpp

src/coreclr/src/jit/gentree.cpp

src/coreclr/src/jit/lsraxarch.cpp

src/coreclr/src/jit/rationalize.cpp

CarolEidt · 2019-12-28T00:18:42Z

What's there to lose by not using GenTreeOp/GTK_BINOP?

I agree; there's not really a lot of actual value there.

All the operand logic is duplicated in GenTreeSIMD and GenTreeHWIntrinsic. It really should be common (in GenTreeJitIntrinsic base class) but I can't put it there due to the fact that they use different intrinsic enumerations and alignment holes prevent placing some data members in the base class and some in the derived class.

The fact that the SIMD and HWIntrinsic enums are distinct seems like something that could be distinguished based on opcode. That is, if I have a GT_Intrinsic then it uses CorInfoIntrinsics, if it is GT_SIMD then it uses SIMDIntrinsicID and if it's GT_HWINTRINSIC then it uses NamedIntrinsic (though IMO there should be a different enum for the HWIntrinsics).

mikedn · 2019-12-28T06:20:20Z

The fact that the SIMD and HWIntrinsic enums are distinct seems like something that could be distinguished based on opcode

Well, yes, this problem is solvable:

struct GenTreeJitIntrinsic {
private:
    union {
        Use  m_inlineUses[3];
        Use* m_uses;
    };
    uint16_t m_numOps;
protected:
    uint16_t m_intrinsic;
    uint32_t m_unused;
};

struct GenTreeSIMD : public GenTreeJitIntrinsic {
    SIMDIntrinsicID GetIntrinsic() const {
        return static_cast<SIMDIntrinsicID>(m_intrinsic);
    }

    unsigned GetSIMDSize() const {
        return (m_unused & 0xFFFF);
    }
    
    var_types GetSIMDBaseType() const {
        return static_cast<var_types>((m_unused >> 16) & 0xFF);
    }
};

and a similar GenTreeHWIntrinsic class but with NamedIntrinsic instead of SIMDIntrinsicID and an extra GetIndexBaseType function.
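
The getters above imply a layout where the SIMD size sits in the low 16 bits of the 32-bit field and the base type in bits 16 to 23. A minimal sketch of matching setters, with a round trip, could look like this. `PackedSimdInfo` and `BaseType` are hypothetical stand-ins (the latter for var_types), and the setters are an assumption: the comment only shows the getters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for var_types.
enum class BaseType : uint8_t { Int = 1, Float = 2 };

// Hypothetical sketch of packing SIMD size and base type into one 32-bit field.
class PackedSimdInfo
{
    uint32_t m_bits = 0;

public:
    void SetSIMDSize(unsigned size)
    {
        assert(size <= 0xFFFF);
        m_bits = (m_bits & ~0xFFFFu) | size; // low 16 bits
    }

    void SetSIMDBaseType(BaseType type)
    {
        m_bits = (m_bits & ~(0xFFu << 16)) | (static_cast<uint32_t>(type) << 16); // bits 16..23
    }

    unsigned GetSIMDSize() const { return m_bits & 0xFFFF; }

    BaseType GetSIMDBaseType() const
    {
        return static_cast<BaseType>((m_bits >> 16) & 0xFF);
    }
};
```

Each setter clears only its own bit range before writing, so updating one field leaves the other intact.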

The only problem is that I need to change a couple hundred more lines to replace gtSIMDIntrinsicID & co. with GetIntrinsic() & co. But if that avoids a bunch of duplication in the many IR traversal functions that only care about operands and not the intrinsic then this might be a win for the size of the change.

mikedn · 2019-12-28T06:46:13Z

That is, if I have a GT_Intrinsic then it uses CorInfoIntrinsics

Speaking of GT_INTRINSIC - another potential problem with this change is that it uses up all the available space in a small node. At best we can steal some bits from gtSIMDSize, gtSIMDIntrinsicID and even gtSIMDBaseType but there's no room left for a method handle like GenTreeIntrinsic has. Or we could limit the "inline" uses to 2 instead of 3 but that doesn't look so great in the case of HWINTRINSIC where ternary operations are somewhat common.

Also, I don't like the current intrinsic situation very much:

There are 3 different kinds of intrinsics
That would probably make sense if the split was done along characteristics such as SIMD vs. scalar but that's not the case. SIMDIntrinsicAdd and NI_SSE2_Add represent the exact same operation in different ways. This already results in a ton of code duplication (import, lowering, LSRA, codegen) and the problem will get worse if we try to implement some SIMD optimizations (e.g. folding, CSE etc).
General purpose scalar operations are represented as intrinsics for no good reason, only because of the "if all you have is a hammer, everything looks like a nail" approach. Trivial/common operations such as ANDN or POPCNT should probably be normal genTreeOps so they can easily participate in existing optimizations such as constant folding.

I've no idea if and when this situation could be improved. In the meantime we should ensure we're not making it worse somehow.

src/coreclr/src/jit/lowerarmarch.cpp

mikedn · 2019-12-28T07:01:11Z

src/coreclr/src/jit/gentree.cpp

Need to double check this, it's not equivalent to the old gtSetListOrder code. On the other hand it's not clear if what gtSetListOrder did makes sense for SIMDIntrinsicInitN. Also, when looking more closely at the way SIMDIntrinsicInitN and gtSetEvalOrder work it seems that things are quite messy:

SIMDIntrinsicInitN has horrible register requirements - up to 5 registers - because it needs to first evaluate all operands and then it stitches everything together. This approach will never fly if we try to make intrinsics for Create methods like Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15)...

gtSetEvalOrder runs before other optimizations so it has no idea that some trees may turn into constants or variables thus significantly reducing register requirements. This includes SIMDIntrinsicInitN becoming a pseudo-constant in lowering...

We can't control the evaluation order for nodes with more than 2 operands unless we add more information to GenTreeSIMD. And we're more or less out of space for any new information (e.g. an array of integers that would indicate the execution order of every operand).

I tweaked "level" and costs a bit but "level" is still different from what gtSetListOrder does. But what gtSetListOrder does is dubious anyway:

runtime/src/coreclr/src/jit/gentree.cpp

Lines 2522 to 2533 in c1a51e2

if (lvl < 1)
{
    level = nxtlvl;
}
else if (lvl == nxtlvl)
{
    level = lvl + 1;
}
else
{
    level = lvl;
}

Sethi-Ullman number is basically (l1 == l2) ? (l1 + 1) : max(l1, l2) and there's no trace of max anywhere in gtSetListOrder. I suspect max was supposed to be the result of setting GTF_REVERSE_OPS and swapping levels:

runtime/src/coreclr/src/jit/gentree.cpp

Lines 2501 to 2508 in c1a51e2

if (list->gtFlags & GTF_REVERSE_OPS)
{
    unsigned tmpl;

    tmpl   = lvl;
    lvl    = nxtlvl;
    nxtlvl = tmpl;
}

Except that GTF_REVERSE_OPS is never set on list nodes.
Whatever. Doesn't really matter and gtSetEvalOrder is a mess anyway.
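
To make the difference concrete, here is a small sketch that places the classic formula next to a direct transcription of the first gtSetListOrder fragment quoted above. The function names are hypothetical; only the two level computations come from the discussion.

```cpp
#include <algorithm>
#include <cassert>

// Classic Sethi-Ullman combination of two operand levels.
unsigned SethiUllmanLevel(unsigned l1, unsigned l2)
{
    return (l1 == l2) ? (l1 + 1) : std::max(l1, l2);
}

// Transcription of the quoted gtSetListOrder fragment: note that there is
// no max, so a larger nxtlvl is ignored whenever lvl >= 1 and lvl != nxtlvl.
unsigned ListOrderLevel(unsigned lvl, unsigned nxtlvl)
{
    unsigned level;
    if (lvl < 1)
    {
        level = nxtlvl;
    }
    else if (lvl == nxtlvl)
    {
        level = lvl + 1;
    }
    else
    {
        level = lvl;
    }
    return level;
}
```

For levels (2, 2) both return 3, but for (1, 3) the classic formula returns 3 while the list fragment returns 1. The two only agree in that case if the levels are swapped first, which is what the GTF_REVERSE_OPS branch would have done had the flag ever been set.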

mikedn · 2019-12-30T07:59:02Z

I think that for now I'm going to ignore the SIMD/HWINTRINSIC duplication. It is primarily a pre-existing issue and the amount of new duplicated code is not that great, especially when compared with all the import/lower/lsra/codegen SIMD/HWINTRINSIC code.

Instead I'm going to work on removing the far worse pre-existing duplication produced by the bajillion custom tree traversals that exist today. With LIST gone it's easier to clean this up because I no longer have to deal with LIST related inconsistencies - included or not included in traversal. For example I included 2 commits that show how fgSetTreeSeq and fgGetFirstNode can be rewritten to use existing traversal machinery, getting rid of custom GTF_REVERSE_OPS handling in the process.

CarolEidt · 2019-12-30T20:36:09Z

I think that for now I'm going to ignore the SIMD/HWINTRINSIC duplication.

Seems reasonable for now.

I included 2 commits that show how fgSetTreeSeq and fgGetFirstNode can be rewritten to use existing traversal machinery, getting rid of custom GTF_REVERSE_OPS handling in the process.

Those look quite promising, though when you're ready for final review it would be nice to limit this PR to a more minimal set of changes.

sandreenko · 2020-01-07T20:21:26Z

GenTreeSIMD & GenTreeHWIntrinsic are no longer GenTreeOp. That means that attempts at using gtGetOp1() & gtGetOp2() will fail.

Speaking of GT_INTRINSIC - another potential problem with this change is that it uses up all the available space in a small node. At best we can steal some bits from gtSIMDSize, gtSIMDIntrinsicID and even gtSIMDBaseType but there's no room left for a method handle like GenTreeIntrinsic has. Or we could limit the "inline" uses to 2 instead of 3 but that doesn't look so great in the case of HWINTRINSIC where ternary operations are somewhat common.

With the fact that GenTreeHWIntrinsic is no longer a GenTreeOp, can't we forbid bashing and changing them to other tree operands completely? It will allow us to allocate the exact size for each HWIntrinsic node and we will be able to make a template parameter that specifies the number of arguments, up to 32.

Having to use a separate array is a bit unfortunate for just 4 operands. It might be interesting to relax node sizing and allow node size between TREE_NODE_SZ_SMALL and TREE_NODE_SZ_LARGE. Not sure how feasible that is and not sure how far we can go with it - if we want to allow a Create intrinsic with 32 operands do we really want to place all 32 inside the node?

I am not sure why the new sizes will be between SMALL and LARGE, can they be smaller than SMALL and larger than LARGE?

mikedn · 2020-01-07T20:59:15Z

With the fact that GenTreeHWIntrinsic is no longer a GenTreeOp, can't we forbid bashing and changing them to other tree operands completely? It will allow us to allocate the exact size for each HWIntrinsic node and we will be able to make a template parameter that specifies the number of arguments, up to 32.

I suppose we could but I'm not sure we want to go to such an extreme. Currently SIMD/HWINTRINSIC trees are not optimized but perhaps in the future we will want to do something about that. At a minimum, we should be able to convert such nodes to LCL_VAR and a hypothetical CNS_SIMD to support CSE and constant folding.

I am not sure why the new sizes will be between SMALL and LARGE, can they be smaller than SMALL and larger than LARGE?

Hmm, I don't think there's any real reason to limit the node size to LARGE, not sure what I was thinking. I think all nodes should be at least SMALL size so we can convert anything to LCL_VAR/CNS_XYZ as mentioned above.

sandreenko · 2020-01-07T21:02:34Z

At a minimum, we should be able to convert such nodes to LCL_VAR and a hypothetical CNS_SIMD to support CSE and constant folding.

We can always do a replacement instead of bashing.

mikedn · 2020-01-07T21:09:23Z

We can always do a replacement instead of bashing.

It depends. The JIT IR wasn't really designed for that; there are cases where you'll need to use gtGetParent/TryGetUse to perform such replacements, which isn't ideal. Bashing avoids that.
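
The trade-off can be illustrated with a toy tree. This is only a sketch with hypothetical types and helper names (`Node`, `BashToConst`, `FindUse`, `ReplaceWith`), not the JIT's actual API: bashing mutates the node in place, while replacement first has to locate the edge in the parent that points at the node.

```cpp
#include <cassert>

enum class Oper { Add, LclVar, Const };

// Hypothetical toy IR node with at most two operands.
struct Node
{
    Oper  oper;
    int   value;
    Node* op1;
    Node* op2;
};

// "Bashing": change the node in place. No parent lookup is needed because
// every existing pointer to the node now sees the new contents.
void BashToConst(Node* node, int value)
{
    node->oper  = Oper::Const;
    node->value = value;
    node->op1   = nullptr;
    node->op2   = nullptr;
}

// Replacement: we must first find the use edge in the parent.
Node** FindUse(Node* parent, Node* child)
{
    if (parent->op1 == child) return &parent->op1;
    if (parent->op2 == child) return &parent->op2;
    return nullptr;
}

bool ReplaceWith(Node* parent, Node* oldNode, Node* newNode)
{
    Node** use = FindUse(parent, oldNode);
    if (use == nullptr) return false;
    *use = newNode;
    return true;
}
```

In this toy version the parent is passed in directly. When the parent is not at hand it has to be searched for, which is the cost the comment above refers to.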

Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 24, 2019

mikedn added 3 commits December 25, 2019 00:10

Delete GenTreeJitIntrinsic

568cde9

Make SIMDIntrinsicID uint16_t

7260ac5

Better pack GenTreeSIMD and GenTreeHWIntrinsic

e23aed5

mikedn force-pushed the simd-no-list branch from b88c4fb to 154ec9f Compare December 26, 2019 09:26

mikedn added 4 commits December 26, 2019 14:09

Reorder gtNewSIMDNode parameters and eliminate null operands

c8eabbc

Split SIMDIntrinsicInit/SIMDIntrinsicInitN import code

82250a8

These 2 intrinsics don't have a lot in common.

Cleanup SIMDIntrinsicInit/SIMDIntrinsicInitN import code

b634843

Simplify SIMDIntrinsicInitN import

c569337

mikedn force-pushed the simd-no-list branch 2 times, most recently from 782b105 to 7c30d0d Compare December 26, 2019 20:08

mikedn force-pushed the simd-no-list branch from 7c30d0d to 197080f Compare December 27, 2019 06:09

Change GenTreeSIMD operand storage

a5f0827

mikedn force-pushed the simd-no-list branch 2 times, most recently from 958fa15 to 2c1feee Compare December 27, 2019 15:56

CarolEidt reviewed Dec 28, 2019

mikedn commented Dec 28, 2019

mikedn commented Dec 28, 2019

mikedn force-pushed the simd-no-list branch from edbabbb to 6b13de9 Compare December 29, 2019 21:28

mikedn added 2 commits January 3, 2020 20:28

Make GenTreeUse noncopyable

43f1204

Fix AdvanceSIMDReverseOp

d06ffdd

mikedn added 2 commits January 3, 2020 20:29

Cleanup SIMDIntrinsicInitArray LEA generation

7c74ee1

Delete broken byte handling from BuildRMWUsesSIMD

059dd21

mikedn force-pushed the simd-no-list branch from 6b13de9 to 04dd8e7 Compare January 3, 2020 19:00

mikedn added 2 commits January 4, 2020 11:38

Add SIMD RMW comments

45d9986

Tweak SIMD costs

00ad22d

mikedn force-pushed the simd-no-list branch from 04dd8e7 to 00ad22d Compare January 4, 2020 09:39

mikedn closed this Jan 8, 2020

mikedn mentioned this pull request Jan 31, 2020

Improve/cleanup RyuJIT's IR handling of operand lists #11058

Closed

6 tasks

mikedn deleted the simd-no-list branch August 30, 2020 08:10

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
