-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Allow O2K_CONST_VEC to handle some TYP_SIMD32/64 #127390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth the complixity to detect SIMD32/SIMD64 that we can fit into simd16_t or we just add
simd_t*to this union and allocate it on heap? Less changes + will support all kinds of constantsUh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume it is and may even be worthwhile to do for
VEC_CNSas well. -- This is based on past throughput and allocation considerations we saw for VN and other phases.Particularly on xarch, we are most typically hitting the
TYP_SIMD32path for hardware and so this would mean a lot more constants are heap allocating when they don't need to be.We'd have to have a
if (size > 16) { speciallyHandleTheHeapAlloc }anyways with such a scenario, so this approach shouldn't be adding extra overhead. Rather, it just optimizes and avoids the allocation for the most common cases. -- and I'd actually expect assertion prop doesn't care about most constants, particularly "arbitrary" ones, so it may be unimportant to ever add the heap allocation support.We may even want to consider doing the same for
GT_CNS_VECso that we aren't forced to have a large node for all constants and to avoid needing to do repeat checks like "is this a broadcast".Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding I am not sure I follow. Your PR tries to fit CNS_VEC for SIMD32 and SIMD64 into
simd16_tinAssertionDsc2for things where 16 bytes components are the same. I just don't see why if we can support all kinds of constants naturally, just that SIMD32 and SIMD64 will require a small arena allocation because 99% of assertions are used for non-SIMD stuff and we don't want to increase the struct size for this nieche case (pay to play). Basically, I suggest this change: https://github.com/EgorBo/runtime-1/pull/new/jit/wide-simd-assertions (just vibe-coded, probably can be simplified). I doubt we will see any noticieable regression anywhere (both TP and alloc size).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example, my commit can fold this:
while your cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work too. The downside is that it's forcing an allocation for the most common cases when we can trivially avoid it for the majority scenario (most SIMD constants are broadcastable or repeating the V128 value).
Based on what we had seen in VN, there were enough simd32/64 constants that this added up and pessimized throughput. We fixed that by splitting it out into separate maps. We had similarly seen even in
GenTreeVecConand other areas that the cost of the size check and doing different things for each vector size was significantly less than always touching a full cache line.I would expect that the "optimization" this PR is doing is still beneficial even with your full fix and that we may even want to do something similar for
GenTreeVecConitself to reduce the amount of data that has to be checked for the most common patterns/optimizations.I'm fine with waiting for the "full fix" to go in and then comparing SPMI metrics or factoring it into this PR so we can compare and choose. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's small "now" but may matter more as this expands. It can be useful for optimizing out a lot of edge cases for floating-point, conversions, and other scenarios when we know a valid is within a given constraint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have numbers on the throughput (I'm guessing near 0%) and allocation impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feel free to submit it as a PR to grab real numbers, i'm 100% sure TP is 0, alloc rate might be visible (>0) but not big enough to care IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi: this pr produced
-3,260diff on win-x64 andsimd_t*is-8,949There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract it to #127401 to see the metrics and get full differences. We can take either or both, depending on what the numbers report.