Skip to content

egraph: Ensure eclass members are all available in the same block#8006

Merged
jameysharp merged 5 commits intobytecodealliance:mainfrom
jameysharp:egraph-available-block
Mar 7, 2024
Merged

egraph: Ensure eclass members are all available in the same block#8006
jameysharp merged 5 commits intobytecodealliance:mainfrom
jameysharp:egraph-available-block

Conversation

@jameysharp
Copy link
Contributor

@jameysharp jameysharp commented Feb 28, 2024

During elaboration, we want the property that it's safe to pick any member of an eclass to implement that eclass. To establish that property, this commit defines a notion of the "available block" for every value, and enforces that all values in an eclass have the same available block.

Closes #7891 where we devised this plan.

TODO: this needs more comments, especially documenting the available_block structure and the linear-time filtering loop in optimize_pure_enode. If somebody else wants to write those I would really appreciate it.

Future work: immediately after optimized_values.push(orig_value), we should try optimized_values.sort_unstable() and then optimized_values.dedup(). That avoids creating union nodes for duplicate values, which should save a little time downstream in the compiler pipeline, but causes values to get renumbered in some filetests so they need more inspection. It also means we can delete the if optimized_value == orig_value test in the loop afterward. But it's an O(n log n) pass over a potentially large list so we should benchmark the change too.

cc: @elliottt @cfallin @lpereira

During elaboration, we want the property that it's safe to pick any
member of an eclass to implement that eclass. To establish that
property, this commit defines a notion of the "available block" for
every value, and enforces that all values in an eclass have the same
available block.

Closes bytecodealliance#7891 where we devised this plan.

Co-authored-by: L Pereira <l.pereira@fastly.com>
@jameysharp jameysharp force-pushed the egraph-available-block branch from f410c1f to 211c5f9 Compare February 28, 2024 02:25
@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Feb 28, 2024
// maybe just one new value marked as "subsuming" the
// original, if present.)
let mut union_value = orig_value;
let mut union_value = optimized_values.pop().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than the indirect pushing and then popping, this could be

Suggested change
let mut union_value = optimized_values.pop().unwrap();
let mut union_value = optimized_values.pop().unwrap_or(orig_value);

Then, as a reader, I don't have to wonder if the push is going to trigger a realloc or not (probably not because it is a smallvec and we limit the number of returns from ISLE in ISLE, but I think it is possible if we return the max ISLE returns and then push orig_vlaue?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the ability to filter out the original value, not just the optimized values. We could add an extra check after the loop to compare the original value's available block against best_block, but that means pretty much duplicating the body of the loop, and I think the cleanest thing to do then would still involve pushing the original value sometimes.

Besides, this smallvec is allocated with size MATCHES_LIMIT, which is 5, but ISLE truncates at MAX_ISLE_RETURNS, which is 8, so we already might heap-allocate. Now that I've learned that I have a suggestion for simplifying ISLE and fixing that discrepancy at the same time, but that should be a separate issue.

The simplest way to ensure that this push can't cause a realloc is to push before calling constructor_simplify, which does not require that the provided vector is empty on entry. I can do that if you prefer.

Comment on lines +537 to +538
let mut available_block: SecondaryMap<Value, Block> =
SecondaryMap::with_default(Block::reserved_value());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth pre-reserving space here, since IIUC we are going to end up filling this in for every single value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-allocating space for this map is not totally trivial for two reasons.

  • We add more values to the data-flow graph while the egraph pass is running and I don't know what bound to use if we want to avoid reallocating later.
  • SecondaryMap has with_default and with_capacity constructors, but no constructor that does both, and types like Block don't implement Default. This is mostly just an annoyance since we can call resize after construction, but resize initializes all of the backing memory, while with_capacity doesn't, so it might make a difference.

I'll go ahead and do an initial resize to the number of values present before the start of the egraph pass, but if you have any suggestions for estimating a reasonable larger size, let me know.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice overall! Just some comment nits below.

Also, could we have a test case that triggers the trimming behavior, perhaps with our subsume-for-correctness disabled? Actually, a thought on that: we could (with this PR, not before!) add an option disable_egraph_subsume to Cranelift; keep it off by default, but add it to a test we know triggers this auto-subsume; then fuzz with the option; then eventually turn it on and remove our careful subsume cases.

if self.domtree.dominates(x, y) {
Ordering::Less
} else {
debug_assert!(self.domtree.dominates(y, x));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here about the SSA property we're relying on? Something like: "Note that the def-point of all arguments to an instruction in SSA lie on a line of direct ancestors in the domtree, and so do their available-blocks. This means that either x dom y or y dom x; they cannot be unordered wrt each other."


optimized_values.push(orig_value);

let mut best_block = self.available_block[*optimized_values.last().unwrap()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop looks correct to me (very slick in-place compaction!) but could we have a comment describing what it's doing? It took me a little bit, with knowledge of what it's supposed to be doing, to grok it. Perhaps something like: "Remove any values from optimized_values that do not have the highest possible available block in the domtree." at the top, and "This value's available block is higher; all higher indices we've already scanned have lower blocks and need to be removed, so truncate after this element." in the first case, and "This value's available block is lower, and we want to skip it; remove it from the list and keep going (order does not matter so swap_remove is OK)." in the second?

Copy link
Contributor

@lpereira lpereira Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the things I'm not sure here is that, when swap_remove() is called, the element that was at the end of optimized_values is put in the idx-th position, but then it's never used again to lookup this_block for the next iteration. Is this the desirable behavior, or am I missing something here?

Edit: yeah, this seems fine because the loop goes in reverse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Re: your edit) yep, indeed; I think it might be good to add a comment with the loop invariant here, for clarity. Something like: "Loop scans in reverse; all values at indices >= idx have the same available block, which is the best (highest in domtree) available block seen so far."

- add documentation comments
- push orig_value onto optimized_values before calling simplify so it
  can't cause a heap reallocation
Copy link
Contributor Author

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've addressed all the current review comments and added comments in all the places I intended to add them. I expect this will still fail CI as I haven't investigated why it failed before yet.

// maybe just one new value marked as "subsuming" the
// original, if present.)
let mut union_value = orig_value;
let mut union_value = optimized_values.pop().unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the ability to filter out the original value, not just the optimized values. We could add an extra check after the loop to compare the original value's available block against best_block, but that means pretty much duplicating the body of the loop, and I think the cleanest thing to do then would still involve pushing the original value sometimes.

Besides, this smallvec is allocated with size MATCHES_LIMIT, which is 5, but ISLE truncates at MAX_ISLE_RETURNS, which is 8, so we already might heap-allocate. Now that I've learned that I have a suggestion for simplifying ISLE and fixing that discrepancy at the same time, but that should be a separate issue.

The simplest way to ensure that this push can't cause a realloc is to push before calling constructor_simplify, which does not require that the provided vector is empty on entry. I can do that if you prefer.

Comment on lines +537 to +538
let mut available_block: SecondaryMap<Value, Block> =
SecondaryMap::with_default(Block::reserved_value());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-allocating space for this map is not totally trivial for two reasons.

  • We add more values to the data-flow graph while the egraph pass is running and I don't know what bound to use if we want to avoid reallocating later.
  • SecondaryMap has with_default and with_capacity constructors, but no constructor that does both, and types like Block don't implement Default. This is mostly just an annoyance since we can call resize after construction, but resize initializes all of the backing memory, while with_capacity doesn't, so it might make a difference.

I'll go ahead and do an initial resize to the number of values present before the start of the egraph pass, but if you have any suggestions for estimating a reasonable larger size, let me know.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM wrt my comments at least; thanks very much for working on this!

Copy link
Contributor Author

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about Chris' request for some new test cases, so I guess I haven't actually addressed all review comments yet.

And I think I don't want to address Nick's request, to avoid a possible realloc, in this PR. It's an easy change but it changes a bunch of filetests unnecessarily because it means that the optimized_values and orig_value are visited in a different order.

Also, and more importantly, there's something wrong with GVN.

Comment on lines +172 to +175
debug_assert_eq!(
self.get_available_block(inst),
self.available_block[orig_result]
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug-assert is failing in a few of the wasm spec tests and I don't understand why yet.

In every case that's failed, the GVN map says the instruction we looked up is equivalent to an existing instruction, but the available-block for the existing instruction is the entry block, while the available-block for the new instruction is somewhere deeper in the dominator tree.

I believed that GVN would always yield the same available-block for pure instructions so this is concerning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reasoning through this. For GVN to match, both instructions must be identical, except that each operand is only required to be in the same set in the union-find rather than actually equal. The assertion failing means that for some operand in the instruction, the two instructions consume values which are in the same union-find set but have different available-blocks.

I had been concerned about this specific situation but Chris and I reasoned that it was okay (#7891 (comment)) and now I'm confused about whether we were right and the assertion is overly strict. If so, what is the correct available-block to use here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect what is happening here is that the instruction gets rewritten the first time it is seen and the GVN-map entry points to the rewritten result (which could be e.g. an iconst available everywhere), even though the instruction per its arguments is only available in one block. So I think the assertion that we want is that the available block of orig_result (which we're going to use as our value here) dominates the available block of inst, rather than is exactly the same. I think that resolves the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does make the assert pass in all the tests we have. I think you're correct but the reason why was obscured by the surrounding code.

First off, the GVN-hit case did not always return orig_result. If we were looking up a NewOrExistingInst::Existing instruction, insert_pure_enode actually returned the result of that instruction. But that case only happens when this function was called from remove_pure_and_optimize, which ignores the return value. The return value only matters for make_inst_ctor, which always passes NewOrExistingInst::New, in which case we were returning orig_result.

Therefore it's safe to change this function to always return orig_result, which I think is what we probably intended in the first place, and based on your comment is what you thought we were already doing. 😁

The next observation is that if the result of inst never appears in the values of the value_to_opt_value map, then it will never get looked up in available_block (or in the union-find, for that matter). So we don't actually need an answer to the question of what block this instruction should be available in. In fact it's better to leave it at the "reserved" default value so we'll get assertion failures if we accidentally do look it up.

I've made these changes; could you review again when you have a chance? Hopefully CI will pass now too.

Because the duplicate instruction is never used, it never matters what
block it might be available in. For the same reason it also doesn't need
to be in the union-find.
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. OK to merge with a test-case added that exercises the "auto-subsume" behavior (or if you've seen that it's already covered by one of the existing tests, note that here and that's fine too). Thanks!

@cfallin
Copy link
Member

cfallin commented Mar 6, 2024

(Or, I guess that could be slightly tricky with our current rules around subsume; OK to punt test to followup removal of the subsumes if you like.)

@jameysharp jameysharp marked this pull request as ready for review March 7, 2024 17:37
@jameysharp jameysharp requested a review from a team as a code owner March 7, 2024 17:37
@jameysharp jameysharp requested review from elliottt and removed request for a team March 7, 2024 17:37
@jameysharp jameysharp enabled auto-merge March 7, 2024 17:39
@jameysharp jameysharp added this pull request to the merge queue Mar 7, 2024
Merged via the queue into bytecodealliance:main with commit 8341738 Mar 7, 2024
@jameysharp jameysharp deleted the egraph-available-block branch March 7, 2024 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cranelift Issues related to the Cranelift code generator

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants