Skip to content

Conversation

@sirtimid
Copy link
Contributor

@sirtimid sirtimid commented May 13, 2025

Closes: #333

This change updates our syscall-handling plumbing so that any kernel-side failure during a liveslots syscall is no longer silently ignored. Instead, we:

  • Detect syscall failures
  • Roll back the database to the savepoint taken at the start of the current crank on failure.
  • Terminate vats: For now, vat errors both rewind changes and terminate the vat.

@sirtimid sirtimid changed the title feat(kernel): Rollback on crank failure feat(kernel): Handle syscall failures May 21, 2025
@sirtimid sirtimid force-pushed the sirtimid/rollback-on-delivery-failures branch from 4508691 to 1ce1511 Compare May 21, 2025 17:28
@sirtimid sirtimid marked this pull request as ready for review May 21, 2025 17:29
@sirtimid sirtimid requested a review from a team as a code owner May 21, 2025 17:29
FUDCo
FUDCo previously requested changes May 21, 2025
Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first read this looks very good.

I've clicked the Github checkbox that says "Request changes" but I'm not actually requesting changes per se, but just that you take another look with the following concern in mind (which I fear requires some long winded explanation):

A key difference between our implementation and SwingSet is the way we handle the vatstore. Rather than updating the vatstore via syscalls, all vatstore changes are delivered as a single unit as part of the crank completion record that gets sent back to the kernel after the vat supervisor has processed a delivery.

The main job that SwingSet's transaction rollback machinery did was to prevent vatstore changes (caused by vatstore syscalls) from becoming permanent if there was a crank failure, but our crank completion mechanism already does that simply by not sending the changes if there's a failure. Nevertheless, we still have vat-specific kernel state changes that happen on the kernel side as consequences of both syscall handling and the the act of delivery itself, e.g., clist changes, kernel promise resolutions, run queue updates, etc. The rollback stuff implemented in this PR should take care of those.

However, this means we now have two different kinds of "abandon persistent state in the event of error" mechanism that operate side by side. It's worth double checking to make sure that they aren't doing anything to interfere with each other's correct operation (in particular, due to the order of operations at the end of a crank). I tried to follow the logic of this through myself and I think it's all OK, but the changes here are sufficiently extensive that I'm not 100% confident of my analysis. Given that you presumably have a more complete and accurate mental model of what you've done here than I'm able to readily swap into my brain, I'd appreciate it if you'd take a look back at things with the above described concern in mind.

If your analysis turns up any problems we can deal with them then. Otherwise, if after checking this over you conclude that everything is copacetic (the likely outcome imho), let me know and I'll approve (if somebody else doesn't beat me to it).

@sirtimid sirtimid force-pushed the sirtimid/rollback-on-delivery-failures branch from 1ce1511 to f5ad6db Compare May 21, 2025 22:27
@sirtimid
Copy link
Contributor Author

sirtimid commented May 22, 2025

On first read this looks very good.

The main job that SwingSet's transaction rollback machinery did was to prevent vatstore changes (caused by vatstore syscalls) from becoming permanent if there was a crank failure, but our crank completion mechanism already does that simply by not sending the changes if there's a failure. Nevertheless, we still have vat-specific kernel state changes that happen on the kernel side as consequences of both syscall handling and the the act of delivery itself, e.g., clist changes, kernel promise resolutions, run queue updates, etc. The rollback stuff implemented in this PR should take care of those.

@FUDCo This was a vert spot on comment. With this change we now prevent the vat kvstore from being updated on syscall failures. Let me know if you think this is sufficient.

@sirtimid sirtimid requested a review from FUDCo May 22, 2025 09:04
@sirtimid sirtimid requested a review from rekmarks May 28, 2025 14:48
Copy link
Member

@rekmarks rekmarks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few minor things, including some items that could / should be handled in a follow-up.

Comment on lines +306 to 309
async waitForSyscallsToComplete(): Promise<void> {
while (this.pendingSyscalls > 0) {
await delay(10);
}
Copy link
Member

@rekmarks rekmarks May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up issue: Thinking about the possibilities of interleaving here makes me a little bit queasy. As a follow-up, we should create a device that stores a promise which resolves whenever the pending syscall count reaches 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the possibilities of interleaving? What about if there aren't any syscalls? Whatever I can think of is still relying on that pendingSyscalls value but feel free to suggest another solution.

@sirtimid sirtimid requested a review from rekmarks May 28, 2025 19:48
Copy link
Member

@rekmarks rekmarks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sirtimid sirtimid dismissed FUDCo’s stale review May 28, 2025 21:08

Dismissing to merge it :)

@sirtimid sirtimid merged commit 0a3b0bd into main May 28, 2025
22 checks passed
@sirtimid sirtimid deleted the sirtimid/rollback-on-delivery-failures branch May 28, 2025 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make vats handle syscall failures better (or at all)

4 participants