Store stack maps in an ELF section#10404

Merged
alexcrichton merged 5 commits into bytecodealliance:main from alexcrichton:stack-maps-out-of-line
Mar 20, 2025

Conversation

@alexcrichton (Member):


This commit moves the storage of stack maps from being embedded within
serde-encoded information to instead being stored in a separate ELF
section in the final executable. The motivation for this is to make this
more easily debuggable with a `wasmtime objdump` command in the future
but this additionally should have the nice side effect of making
non-stack-maps modules have smaller encoded information (no need to
encode an empty list) and additionally make stack-maps-using-modules
faster to decode (no serde decoding, it's already "decoded").

This implements a scheme similar to the address map section where
there's a "builder" for the section and then a separate half to decode
the section. The same basic encoding, a bit map, is used. This is likely
going to make accessing stack maps slightly slower, but if that's an
issue we can tweak the representation and align things and/or use
`usize` or such.
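The builder half described above can be sketched roughly as follows; `StackMapBuilder` and its methods are hypothetical names for illustration, not the actual `wasmtime-environ` API. The key property is that every field is appended as a fixed-width little-endian `u32`, so the encoded section is byte-identical regardless of the host's `usize`:

```rust
// Hypothetical sketch of the section-builder half (illustrative names,
// not the real API). Each stack map is appended as little-endian u32s:
// a frame size, a word count `M`, then `M` bitmap words.
struct StackMapBuilder {
    data: Vec<u8>,
}

impl StackMapBuilder {
    fn new() -> Self {
        StackMapBuilder { data: Vec::new() }
    }

    /// Appends one stack map and returns its byte offset in the section,
    /// which an index entry for some PC would later point at.
    fn push(&mut self, frame_size: u32, bitmap_words: &[u32]) -> u32 {
        let offset = self.data.len() as u32;
        self.data.extend_from_slice(&frame_size.to_le_bytes());
        self.data
            .extend_from_slice(&(bitmap_words.len() as u32).to_le_bytes());
        for word in bitmap_words {
            self.data.extend_from_slice(&word.to_le_bytes());
        }
        offset
    }
}
```

Because each entry is variably sized, the returned offset is the only handle back into the data blob, which is why the section also carries the sorted pc/offset index described in the review below.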
@alexcrichton alexcrichton requested review from a team as code owners March 15, 2025 15:42
@alexcrichton alexcrichton requested review from abrown and pchickey and removed request for a team March 15, 2025 15:42
@github-actions github-actions bot added the cranelift (Issues related to the Cranelift code generator), wasmtime:api (Related to the API of the `wasmtime` crate itself), and winch (Winch issues or pull requests) labels Mar 15, 2025

@abrown (Member) left a comment:

Makes sense to me! I just noted some doubts I had while reading this for those who are more familiar with stack maps and their usage.

// was: bits: CompoundBitSet,
pub struct StackMap<'a> {
    frame_size: u32,
    data: &'a [U32Bytes<LittleEndian>],
@abrown (Member):

Are we sure we want to re-export U32Bytes in the public API? I suspect the answer is "yes" but just wanted to note this.

@alexcrichton (Member, Author):

Yeah, it's fine here in that wasmtime-environ is a "private crate" where we don't consider it part of Wasmtime's public API, just an internal implementation detail. We can probably do more to document that in the crate itself, though.

Co-authored-by: Andrew Brown <andrew.brown@intel.com>
@fitzgen (Member) left a comment:

Overall, I really like this approach. I have opinions on some of the details, though.

Comment on lines +48 to 51
pub struct CompoundBitSet<T = usize> {
    elems: Box<[ScalarBitSet<T>]>,
    max: Option<u32>,
}
@fitzgen (Member):

The idea here is that you want to use something of fixed size, rather than usize, for the bitsets serialized into the stack map section?

It shouldn't actually matter, since we can only ever run .cwasms that have the same size usize as the host (whether native or pulley) right? Given that, is it worth the complications that become necessary here to support non-usize scalars inside a compound bit set?

@fitzgen (Member):

Or alternatively, should we just fix the compound bit set's internal scalar bit set to u64 or u32 scalars, instead of usize?

@alexcrichton (Member, Author):

This is an attempt to multiplex the (IMO correct) default behavior of pointer-sized-by-default with the cross-compilation behavior of "always use fixed-size things as it's easier". I don't want to use usize in the compiled ELF as that makes cross-compilation and debugging cross-compiled artifacts a bit harder.

Comment on lines +18 to +23
/// The "offset" is an offset into the "stack map data" field which are encoded
/// as:
///
/// * A 4-byte little-endian frame size
/// * A 4-byte little-endian count of remaining bits: `M`
/// * `M` 4-byte little-endian integers as a bit map.
@fitzgen (Member):

Ah, so the `* Stack map data` above is just a bag of bytes at the end of the section, and when you get an offset into that bag of bytes, it is guaranteed to point at one of these? Could you clarify in these binary format docs that the trailing data is a bunch of variably-sized entries and that the initial PC and offset fields are essentially an index for searching through it without decoding every single stack map?

@alexcrichton (Member, Author):

I've tried to document this better with some more art/boxes, but let me know if I can improve anything
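As a standalone sketch of how a reader could consume the layout documented above (sorted `pcs` index, parallel `offsets`, and a data blob of 4-byte LE frame size, 4-byte LE word count `M`, then `M` LE bitmap words); the names here are illustrative, not the actual decoder API:

```rust
// Illustrative decoder for the documented layout. `pcs` and `offsets`
// are the parallel index arrays; `data` is the trailing stack-map blob.
// These names are hypothetical, not the real wasmtime-environ types.
fn read_u32(data: &[u8], pos: usize) -> u32 {
    u32::from_le_bytes(data[pos..pos + 4].try_into().unwrap())
}

/// Looks up the stack map for `pc`, returning its frame size and bitmap
/// words, or `None` if `pc` has no stack map.
fn lookup(pcs: &[u32], offsets: &[u32], data: &[u8], pc: u32) -> Option<(u32, Vec<u32>)> {
    // The builder pushes functions in order, so `pcs` is sorted and an
    // exact-match binary search suffices.
    let i = pcs.binary_search(&pc).ok()?;
    let mut pos = offsets[i] as usize;
    let frame_size = read_u32(data, pos);
    let count = read_u32(data, pos + 4) as usize;
    pos += 8;
    let words = (0..count).map(|j| read_u32(data, pos + j * 4)).collect();
    Some((frame_size, words))
}
```

The exact-match binary search is only valid because the builder asserts that PCs are pushed in increasing order, as the quoted code below shows.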

// Sanity-check to ensure that functions are pushed in-order, otherwise
// the `pcs` array won't be sorted which is our goal.
assert!(pc >= self.last_offset);
self.last_offset = pc;
@fitzgen (Member):

Something to consider for the future: if we frequently have multiple sequential entries for different PCs but which have the same stack slots, e.g.

...
0x1dc: offset of [8, 12]
0x124: offset of [8, 12, 24] copy 1
0x142: offset of [8, 12, 24] copy 2
0x15a: offset of [8, 12, 24] copy 3
0x166: offset of [8]
...

then it may make sense for each entry in the index to store non-overlapping PC ranges, rather than exact PCs, and we could effectively dedupe the index entries and the stack map data. That is, the previous example would become

...
0x1dc..0x1dd: offset of [8, 12]
0x124..0x15b: offset of [8, 12, 24] (only copy)
0x166..0x167: offset of [8]
...

The downsides are that

  1. We would need to change Cranelift to actually emit empty stack maps for safepoints without any live GC refs, otherwise if we have (pc=0x1234, [8]); (pc=0x1238, []); (pc=0x123b, [8]) and we don't see that middle entry in this builder, then we risk using [8] as our stack map at pc 0x1238, which is extending a dead gc ref's lifetime at best and is giving the collector uninitialized data at worst.
  2. Relatedly, we lose our ability to catch bugs where the return address PC we are tracing isn't an exact match for a stack map entry.

These are actually pretty scary, so maybe we don't want to do this, even if it would let us make these binary search indices much smaller.


All that said, we can actually already dedupe the stack map data if we want to, and have multiple index entries point to the same stack map data (even if they aren't contiguous!) with the encoding scheme already in use in this PR. We just need to hash cons and cache stack-map-data to encoded offset in this builder. This doesn't have any of the downsides from above. Seems like it would be a pure win.

@alexcrichton (Member, Author):

I like your latter idea, having a "pool of sets" and when we insert a stack map we shove it in a hash map and dedupe based on that. Agreed it's a pure win with no real degree of complexity overhead. Are you thinking that should be done here? (I'd sort of prefer to defer that to analyze a "real world module" with a lot of stack maps)

@fitzgen (Member):

Deduping the stack map data can definitely be a separate PR.
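The hash-consing idea discussed above could be sketched like this (hypothetical names; the real builder would also maintain the pc/offset index): identical payloads are encoded once and later insertions reuse the cached offset, so multiple index entries point at the same bytes.

```rust
use std::collections::HashMap;

// Sketch of deduping stack-map data via hash-consing. Names are
// illustrative, not the actual wasmtime-environ builder.
struct DedupingBuilder {
    data: Vec<u8>,
    // Maps an encoded (frame size, count, words) payload to its offset.
    cache: HashMap<Vec<u8>, u32>,
}

impl DedupingBuilder {
    fn new() -> Self {
        DedupingBuilder { data: Vec::new(), cache: HashMap::new() }
    }

    /// Encodes one stack map, reusing an existing copy when possible,
    /// and returns its byte offset in the section.
    fn intern(&mut self, frame_size: u32, words: &[u32]) -> u32 {
        let mut payload = Vec::with_capacity(8 + words.len() * 4);
        payload.extend_from_slice(&frame_size.to_le_bytes());
        payload.extend_from_slice(&(words.len() as u32).to_le_bytes());
        for w in words {
            payload.extend_from_slice(&w.to_le_bytes());
        }
        if let Some(&offset) = self.cache.get(&payload) {
            return offset;
        }
        let offset = self.data.len() as u32;
        self.data.extend_from_slice(&payload);
        self.cache.insert(payload, offset);
        offset
    }
}
```

As noted above, this shrinks the data blob without any of the correctness risks of range-based index entries, since every PC still maps to an exact stack map.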

Comment on lines +70 to +75
let count = bits.iter_words().count();
self.stack_map_data
    .push(U32Bytes::new(LittleEndian, count as u32));
for word in bits.iter_words() {
    self.stack_map_data.push(U32Bytes::new(LittleEndian, word));
}
@fitzgen (Member):

So back to iter_words: the usage here makes me think we should instead just expose a method like

impl CompoundBitSet {
    pub fn as_raw(&self) -> &[u32] { ... }
}

method or something, have it always operate on u32, and maybe (I haven't looked at the updated usage yet) add something like the following to allow zero-copy usage:

impl CompoundBitSet {
    pub fn ref_from_raw<F, T>(raw: &[u32], f: F) -> T
    where
        F: for<'a> FnOnce(&'a CompoundBitSet) -> T,
    {
        // unsafely create a CompoundBitSet wrapping the raw data,
        // pass a ref to the bitset into the closure,
        // mem::forget the bitset to avoid Vec dtors
    }
}

rather than making it generic over the internal storage and doing a whole iter_words thing.

@fitzgen (Member) Mar 17, 2025:

Could avoid the branded lifetime (so that the thing could be stored in the usage site's iterator) by doing a std::cell::Ref-esque thing instead:

pub struct BorrowedCompoundBitSet<'a> {
    bitset: core::mem::ManuallyDrop<CompoundBitSet>,
    _borrow: core::marker::PhantomData<&'a [u32]>,
}

impl core::ops::Deref for BorrowedCompoundBitSet<'_> {
    type Target = CompoundBitSet;
    // ...
}

@alexcrichton (Member, Author):

I realize I can probably use ScalarBitSet::iter to simplify a lot of this. I'd prefer to avoid too wonky of an abstraction to use CompoundBitSet literally somehow, but do you think that's an acceptable complexity tradeoff?

@alexcrichton (Member, Author):

Er, sorry.

To answer your original thoughts on this comment, I'd sort of defer to my response above. My literal response just now above this one is actually a response to your thought below

Comment on lines +134 to +151
// Here `self.data` is a bit set of offsets divided by 4, so iterate
// over all the bits in `self.data` and multiply their position by 4.
let bit_positions = self.data.iter().enumerate().flat_map(|(i, word)| {
    let i = i as u32;
    let mut remaining = word.get(LittleEndian);
    core::iter::from_fn(move || {
        if remaining == 0 {
            None
        } else {
            let pos = remaining.trailing_zeros();
            let bit = 1 << pos;
            remaining ^= bit;
            Some(i * 32 + pos)
        }
    })
});

bit_positions.map(|pos| pos * 4)
@fitzgen (Member):

The CompoundBitSet::ref_from_raw thing would help tidy up some of this stuff too, so that it doesn't have to have a copy of the bitset logic inline here.
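For reference, the bit-position walk quoted above can be exercised in isolation like so (a self-contained sketch without the `U32Bytes` wrapper; `frame_offsets` is a hypothetical name):

```rust
// Standalone version of the inline bit iteration: each u32 word
// contributes the positions of its set bits, and each position is
// multiplied by 4 to recover a byte offset into the frame.
fn frame_offsets(words: &[u32]) -> impl Iterator<Item = u32> + '_ {
    words.iter().enumerate().flat_map(|(i, word)| {
        let i = i as u32;
        let mut remaining = *word;
        core::iter::from_fn(move || {
            if remaining == 0 {
                None
            } else {
                // Peel off the lowest set bit and report its position.
                let pos = remaining.trailing_zeros();
                remaining ^= 1 << pos;
                Some((i * 32 + pos) * 4)
            }
        })
    })
}
```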

@alexcrichton (Member, Author):

@fitzgen did you want to review this over again after the recent changes? Or have more thoughts?

@fitzgen (Member) left a comment:

Thanks

@fitzgen fitzgen added this pull request to the merge queue Mar 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2025
@alexcrichton alexcrichton enabled auto-merge March 20, 2025 19:31
@alexcrichton alexcrichton added this pull request to the merge queue Mar 20, 2025
Merged via the queue into bytecodealliance:main with commit 452086f Mar 20, 2025
40 checks passed
@alexcrichton alexcrichton deleted the stack-maps-out-of-line branch March 20, 2025 20:08
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Mar 20, 2025
github-merge-queue bot pushed a commit that referenced this pull request Mar 20, 2025
* Add `--stack-maps` to `wasmtime objdump`

Follow-up to #10404 and #10405

* Enable traps/stack maps by default in objdump