Store stack maps in an ELF section#10404

Merged
alexcrichton merged 5 commits into bytecodealliance:main from alexcrichton:stack-maps-out-of-line
Mar 20, 2025

Conversation

@alexcrichton (Member):


This commit moves the storage of stack maps from being embedded within
serde-encoded information to instead being stored in a separate ELF
section in the final executable. The motivation for this is to make this
more easily debuggable with a `wasmtime objdump` command in the future
but this additionally should have the nice side effect of making
non-stack-maps modules have smaller encoded information (no need to
encode an empty list) and additionally make stack-maps-using-modules
faster to decode (no serde decoding, it's already "decoded").

This implements a scheme similar to the address map section where
there's a "builder" for the section and then a separate half to decode
the section. The same basic encoding, a bit map, is used. This is likely
going to make accessing stack maps slightly slower, but if that's an
issue we can tweak the representation and align things and/or use
`usize` or such.
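The builder half described above can be sketched roughly as follows; `StackMapBuilder` and its methods are hypothetical names for illustration, not the actual `wasmtime-environ` API. The key property is that every field is appended as a fixed-width little-endian `u32`, so the encoded section is byte-identical regardless of the host's `usize`:

```rust
// Hypothetical sketch of the section-builder half (illustrative names,
// not the real API). Each stack map is appended as little-endian u32s:
// a frame size, a word count `M`, then `M` bitmap words.
struct StackMapBuilder {
    data: Vec<u8>,
}

impl StackMapBuilder {
    fn new() -> Self {
        StackMapBuilder { data: Vec::new() }
    }

    /// Appends one stack map and returns its byte offset in the section,
    /// which an index entry for some PC would later point at.
    fn push(&mut self, frame_size: u32, bitmap_words: &[u32]) -> u32 {
        let offset = self.data.len() as u32;
        self.data.extend_from_slice(&frame_size.to_le_bytes());
        self.data
            .extend_from_slice(&(bitmap_words.len() as u32).to_le_bytes());
        for word in bitmap_words {
            self.data.extend_from_slice(&word.to_le_bytes());
        }
        offset
    }
}
```

Because each entry is variably sized, the returned offset is the only handle back into the data blob, which is why the section also carries the sorted pc/offset index described in the review below.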
@alexcrichton alexcrichton requested review from a team as code owners March 15, 2025 15:42
@alexcrichton alexcrichton requested review from abrown and pchickey and removed request for a team March 15, 2025 15:42
@github-actions github-actions bot added the cranelift (Issues related to the Cranelift code generator), wasmtime:api (Related to the API of the `wasmtime` crate itself), and winch (Winch issues or pull requests) labels Mar 15, 2025

@abrown (Member) left a comment:

Makes sense to me! I just noted some doubts I had while reading this for those who are more familiar with stack maps and their usage.

// was: bits: CompoundBitSet,
pub struct StackMap<'a> {
    frame_size: u32,
    data: &'a [U32Bytes<LittleEndian>],
@abrown (Member):

Are we sure we want to re-export U32Bytes in the public API? I suspect the answer is "yes" but just wanted to note this.

@alexcrichton (Member, Author):

Yeah, it's fine here in that wasmtime-environ is a "private crate" where we don't consider it part of Wasmtime's public API, just an internal implementation detail. We can probably do more to document that in the crate itself, though.

Co-authored-by: Andrew Brown <andrew.brown@intel.com>
@fitzgen (Member) left a comment:

Overall, I really like this approach. I have opinions on some of the details, though.

Comment on lines +48 to 51
pub struct CompoundBitSet<T = usize> {
    elems: Box<[ScalarBitSet<T>]>,
    max: Option<u32>,
}
@fitzgen (Member):

The idea here is that you want to use something of fixed size, rather than usize, for the bitsets serialized into the stack map section?

It shouldn't actually matter, since we can only ever run .cwasms that have the same size usize as the host (whether native or pulley) right? Given that, is it worth the complications that become necessary here to support non-usize scalars inside a compound bit set?

@fitzgen (Member):

Or alternatively, should we just fix the compound bit set's internal scalar bit set to u64 or u32 scalars, instead of usize?

@alexcrichton (Member, Author):

This is an attempt to multiplex the (IMO correct) default behavior of pointer-sized-by-default with the cross-compilation behavior of "always use fixed-size things as it's easier". I don't want to use usize in the compiled ELF as that makes cross-compilation and debugging cross-compiled artifacts a bit harder.

Comment on lines +18 to +23
/// The "offset" is an offset into the "stack map data" field which are encoded
/// as:
///
/// * A 4-byte little-endian frame size
/// * A 4-byte little-endian count of remaining bits: `M`
/// * `M` 4-byte little-endian integers as a bit map.
@fitzgen (Member):

Ah, so the `* Stack map data` above is just a bag of bytes at the end of the section, and when you get an offset into that bag of bytes, it is guaranteed to point at one of these? Could you clarify in these binary format docs that the trailing data is a bunch of variably-sized entries and that the initial PC and offset fields are essentially an index for searching through it without decoding every single stack map?

@alexcrichton (Member, Author):

I've tried to document this better with some more art/boxes, but let me know if I can improve anything
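As a standalone sketch of how a reader could consume the layout documented above (sorted `pcs` index, parallel `offsets`, and a data blob of 4-byte LE frame size, 4-byte LE word count `M`, then `M` LE bitmap words); the names here are illustrative, not the actual decoder API:

```rust
// Illustrative decoder for the documented layout. `pcs` and `offsets`
// are the parallel index arrays; `data` is the trailing stack-map blob.
// These names are hypothetical, not the real wasmtime-environ types.
fn read_u32(data: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes(data[pos..pos + 4].try_into().unwrap())
}

/// Looks up the stack map for `pc`, returning its frame size and bitmap
/// words, or `None` if `pc` has no stack map.
fn lookup(pcs: &[u32], offsets: &[u32], data: &[u8], pc: u32) -> Option<(u32, Vec<u32>)> {
    // The builder pushes functions in order, so `pcs` is sorted and an
    // exact-match binary search suffices.
    let i = pcs.binary_search(&pc).ok()?;
    let mut pos = offsets[i] as usize;
    let frame_size = read_u32(data, pos);
    let count = read_u32(data, pos + 4) as usize;
    pos += 8;
    let words = (0..count).map(|j| read_u32(data, pos + j * 4)).collect();
    Some((frame_size, words))
}
```

The exact-match binary search is only valid because the builder asserts that PCs are pushed in increasing order, as the quoted code below shows.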

// Sanity-check to ensure that functions are pushed in-order, otherwise
// the `pcs` array won't be sorted which is our goal.
assert!(pc >= self.last_offset);
self.last_offset = pc;
@fitzgen (Member):

Something to consider for the future: if we frequently have multiple sequential entries for different PCs but which have the same stack slots, e.g.

...
0x1dc: offset of [8, 12]
0x124: offset of [8, 12, 24] copy 1
0x142: offset of [8, 12, 24] copy 2
0x15a: offset of [8, 12, 24] copy 3
0x166: offset of [8]
...

then it may make sense for each entry in the index to store non-overlapping PC ranges, rather than exact PCs, and we could effectively dedupe the index entries and the stack map data. That is, the previous example would become

...
0x1dc..0x1dd: offset of [8, 12]
0x124..0x15b: offset of [8, 12, 24] (only copy)
0x166..0x167: offset of [8]
...

The downsides are that

  1. We would need to change Cranelift to actually emit empty stack maps for safepoints without any live GC refs, otherwise if we have (pc=0x1234, [8]); (pc=0x1238, []); (pc=0x123b, [8]) and we don't see that middle entry in this builder, then we risk using [8] as our stack map at pc 0x1238, which is extending a dead gc ref's lifetime at best and is giving the collector uninitialized data at worst.
  2. Relatedly, we lose our ability to catch bugs where the return address PC we are tracing isn't an exact match for a stack map entry.

These are actually pretty scary, so maybe we don't want to do this, even if it would let us make these binary search indices much smaller.


All that said, we can actually already dedupe the stack map data if we want to, and have multiple index entries point to the same stack map data (even if they aren't contiguous!) with the encoding scheme already in use in this PR. We just need to hash cons and cache stack-map-data to encoded offset in this builder. This doesn't have any of the downsides from above. Seems like it would be a pure win.

@alexcrichton (Member, Author):

I like your latter idea, having a "pool of sets" and when we insert a stack map we shove it in a hash map and dedupe based on that. Agreed it's a pure win with no real degree of complexity overhead. Are you thinking that should be done here? (I'd sort of prefer to defer that to analyze a "real world module" with a lot of stack maps)

@fitzgen (Member):

Deduping the stack map data can definitely be a separate PR.
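The hash-consing idea discussed above could be sketched like this (hypothetical names; the real builder would also maintain the pc/offset index): identical payloads are encoded once and later insertions reuse the cached offset, so multiple index entries point at the same bytes.

```rust
use std::collections::HashMap;

// Sketch of deduping stack-map data via hash-consing. Names are
// illustrative, not the actual wasmtime-environ builder.
struct DedupingBuilder {
    data: Vec<u8>,
    // Maps an encoded (frame size, count, words) payload to its offset.
    cache: HashMap<Vec<u8>, u32>,
}

impl DedupingBuilder {
    fn new() -> Self {
        DedupingBuilder { data: Vec::new(), cache: HashMap::new() }
    }

    /// Encodes one stack map, reusing an existing copy when possible,
    /// and returns its byte offset in the section.
    fn intern(&mut self, frame_size: u32, words: &[u32]) -> u32 {
        let mut payload = Vec::with_capacity(8 + words.len() * 4);
        payload.extend_from_slice(&frame_size.to_le_bytes());
        payload.extend_from_slice(&(words.len() as u32).to_le_bytes());
        for w in words {
            payload.extend_from_slice(&w.to_le_bytes());
        }
        if let Some(&offset) = self.cache.get(&payload) {
            return offset;
        }
        let offset = self.data.len() as u32;
        self.data.extend_from_slice(&payload);
        self.cache.insert(payload, offset);
        offset
    }
}
```

As noted above, this shrinks the data blob without any of the correctness risks of range-based index entries, since every PC still maps to an exact stack map.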

Comment on lines +70 to +75
let count = bits.iter_words().count();
self.stack_map_data
    .push(U32Bytes::new(LittleEndian, count as u32));
for word in bits.iter_words() {
    self.stack_map_data.push(U32Bytes::new(LittleEndian, word));
}
@fitzgen (Member):

So back to iter_words: the usage here makes me think we should instead just expose a method like

impl CompoundBitSet {
    pub fn as_raw(&self) -> &[u32] { ... }
}

method or something, have it always operate on u32, and maybe (I haven't looked at the updated usage yet) add something like the following to allow zero-copy usage:

impl CompoundBitSet {
    pub fn ref_from_raw<F, T>(raw: &[u32], f: F) -> T
    where
        F: for<'a> FnOnce(&'a CompoundBitSet) -> T,
    {
        // unsafely create a CompoundBitSet wrapping the raw data,
        // pass a ref to the bitset into the closure,
        // mem::forget the bitset to avoid Vec dtors
    }
}

rather than making it generic over the internal storage and doing a whole iter_words thing.

@fitzgen (Member) Mar 17, 2025:

Could avoid the branded lifetime (so that the thing could be stored in the usage site's iterator) by doing a std::cell::Ref-esque thing instead:

pub struct BorrowedCompoundBitSet<'a> {
    bitset: core::mem::ManuallyDrop<CompoundBitSet>,
    _borrow: core::marker::PhantomData<&'a [u32]>,
}

impl core::ops::Deref for BorrowedCompoundBitSet<'_> {
    type Target = CompoundBitSet;
    // ...
}

@alexcrichton (Member, Author):

I realize I can probably use ScalarBitSet::iter to simplify a lot of this. I'd prefer to avoid too wonky of an abstraction to use CompoundBitSet literally somehow, but do you think that's an acceptable complexity tradeoff?

@alexcrichton (Member, Author):

Er, sorry.

To answer your original thoughts on this comment, I'd sort of defer to my response above. My literal response just now above this one is actually a response to your thought below

Comment on lines +134 to +151
// Here `self.data` is a bit set of offsets divided by 4, so iterate
// over all the bits in `self.data` and multiply their position by 4.
let bit_positions = self.data.iter().enumerate().flat_map(|(i, word)| {
    let i = i as u32;
    let mut remaining = word.get(LittleEndian);
    core::iter::from_fn(move || {
        if remaining == 0 {
            None
        } else {
            let pos = remaining.trailing_zeros();
            let bit = 1 << pos;
            remaining ^= bit;
            Some(i * 32 + pos)
        }
    })
});

bit_positions.map(|pos| pos * 4)
@fitzgen (Member):

The CompoundBitSet::ref_from_raw thing would help tidy up some of this stuff too, so that it doesn't have to have a copy of the bitset logic inline here.
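For reference, the bit-position walk quoted above can be exercised in isolation like so (a self-contained sketch without the `U32Bytes` wrapper; `frame_offsets` is a hypothetical name):

```rust
// Standalone version of the inline bit iteration: each u32 word
// contributes the positions of its set bits, and each position is
// multiplied by 4 to recover a byte offset into the frame.
fn frame_offsets(words: &[u32]) -> impl Iterator<Item = u32> + '_ {
    words.iter().enumerate().flat_map(|(i, word)| {
        let i = i as u32;
        let mut remaining = *word;
        core::iter::from_fn(move || {
            if remaining == 0 {
                None
            } else {
                // Peel off the lowest set bit and report its position.
                let pos = remaining.trailing_zeros();
                remaining ^= 1 << pos;
                Some((i * 32 + pos) * 4)
            }
        })
    })
}
```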

@alexcrichton (Member, Author):

@fitzgen did you want to review this over again after the recent changes? Or have more thoughts?

@fitzgen (Member) left a comment:

Thanks

@fitzgen fitzgen added this pull request to the merge queue Mar 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2025
@alexcrichton alexcrichton enabled auto-merge March 20, 2025 19:31
@alexcrichton alexcrichton added this pull request to the merge queue Mar 20, 2025
Merged via the queue into bytecodealliance:main with commit 452086f Mar 20, 2025
40 checks passed
@alexcrichton alexcrichton deleted the stack-maps-out-of-line branch March 20, 2025 20:08
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Mar 20, 2025
github-merge-queue bot pushed a commit that referenced this pull request Mar 20, 2025
* Add `--stack-maps` to `wasmtime objdump`

Follow-up to #10404 and #10405

* Enable traps/stack maps by default in objdump