What's the deal with partial cancellation #889

@njsmith

Recently we were discussing what happens if a subprocess call gets cancelled, but you want to at least find out what the subprocess said before it got killed. And @oremanj wrote:

I appreciate the simplicity of Trio's stance that "a cancelled operation didn't happen", but it doesn't necessarily compose very well -- if an operation is built out of multiple other underlying operations that can't readily be rolled back, either the "cancelled = didn't happen" rule has to break or the entire higher-level operation has to be uncancellable once started. I don't think we want to propose the latter, so maybe we should think about a user-friendly way to talk about the circumstances in which the rule gets bent?

It's a fair point! The "cancelled operation didn't happen" thing was only ever supposed to apply to low-level, primitive operations. In that context, it's a pretty important rule, because without it you can't ever hope to build anything sensible on top. But it's never made any sense for higher-level operations (i.e., the ones that working programmers are actually interacting with 99.99% of the time). Of course, at the time the initial docs were being written, I was struggling to figure out how to get the primitive operations to work at all and there were no higher-level operations. So that rule probably gets more prominence than it should :-). But things have changed and we should have a better story here.

Recently in a discussion of how to talk about cancellation in the docs, @smurfix wrote:

The result of cancelling something is either (a) the called code didn't do anything, raising a Cancelled exception, or (b) the called code did what it was supposed to do, returning its result normally. Of course there's also the possibility of (c) the called code got part way through and left whatever it tried to accomplish in an inconsistent state.

It's probably out of Trio's scope to signal that state to the caller; there should be an attribute "is this object still usable", and/or the object should raise an InconsistentStateError when it's used again. We might want to document that as best practice, and maybe add that exception to Trio as a sensible default for trio-using libraries to raise.

So that's one idea for how Trio could provide concrete advice to users about how to work with partial cancellation.

I don't have any organized thoughts here, so I'm just going to dump a bunch of unorganized ones.


There were two concrete proposals that @oremanj made in the subprocess discussion (unless there were more and I'm forgetting some :-)):

  • Add timeout and deadline arguments to trio.run_process. These would have a similar effect to wrapping a cancel scope around run_process, except that if the timeout expires, then run_process wouldn't raise Cancelled, it would raise CalledProcessError, which would be a special exception with attributes recording whatever partial output, return code, etc., we got from the process.

    The downside of this is that it's extremely specific to subprocesses, which feels weird. The problem is really "what do you do if an operation times out and you want partial results?" – I actually have no idea what makes subprocesses special here, as compared to, I don't know, calling some docker API or something. So a solution that's specific to subprocesses doesn't feel natural. OTOH it would work, and maybe there's some reason that people need partial results from subprocesses a lot, and don't in other cases, so something simple and specific is fine.

  • Give run_process a special (optional) semantics, where if, while running, it saw a Cancelled exception materialize, it would automatically replace it with CalledProcessError.

    This is a really intriguing idea, but makes me uncomfortable because we have no idea where that Cancelled is coming from – in particular, we don't know whether the code that was going to process the partial results is also cancelled, or not.

I don't actually know why @oremanj is so eager to get at partial results in this case; I gather he has some use case where he needs this feature, but I don't know what it is.


Another notorious example where cancellation loses information in an important way is Stream.send_all. Right now, if send_all gets cancelled, you effectively have to throw away that stream and give up, because you have no idea what data you have or haven't sent.

It wasn't always like this: originally, if send_all was cancelled, there was a hack where we'd attach an attribute to the Cancelled exception recording how many bytes we'd sent, and a sufficiently clever caller could potentially use that to reconstruct the state of the stream.

Then I added SSLStream and it quickly became clear that this design was no good. There are two major issues:

  1. Exceptions may start out in some nice well-defined operation like SocketStream.send_all, but they propagate. That's what exceptions do! Right across abstraction boundaries. So, for example, if you called SSLStream.send_all, and it called SocketStream.send_all, then if you weren't careful you could get an exception out of SSLStream.send_all that has metadata attached saying how many bytes SocketStream.send_all sent, which is catastrophically misleading.

  2. SSLStream actually has some pretty complicated internal state, because, well, you know. Cryptography. In particular, cancellation is very different: with something like SocketStream, if send_all is cancelled in the middle, that's pretty simple: you sent the first N bytes, but not the rest. With SSLStream, though, send_all immediately commits to sending all the bytes, before it sends any of them. So if it gets cancelled, then we're in this weird state where it's sent some of the bytes, and it's committed to sending the rest of the bytes, but it hasn't yet. Oh, and we don't even know how many user-level bytes have actually been transmitted in a way that the other side can read them. (Like, we might know we sent 500 bytes on the underlying socket, but maybe 100 of those are protocol framing, and then the last 50 are actual application data but it's application data that the other side can't decrypt until we send another 50 bytes to complete that frame, ... it's really messy.) There just is no useful way to communicate the state of an SSLStream after send_all is cancelled, no matter what metadata we attach to what exceptions.
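The first issue can be shown with a toy model. This is a synchronous sketch, not Trio code: SendCancelled, RawTransport and FramedTransport are all hypothetical names, and the 4-byte length prefix just stands in for whatever framing a wrapping layer like SSLStream adds.

```python
class SendCancelled(Exception):
    """Hypothetical stand-in for a Cancelled exception that carries metadata."""

    def __init__(self, bytes_sent):
        super().__init__(f"cancelled after {bytes_sent} bytes")
        self.bytes_sent = bytes_sent


class RawTransport:
    """Accepts up to `budget` bytes in total, then simulates a cancellation."""

    def __init__(self, budget):
        self.budget = budget
        self.wire = b""

    def send_all(self, data):
        accepted = data[: self.budget]
        self.wire += accepted
        self.budget -= len(accepted)
        if len(accepted) < len(data):
            # Correct at this layer: the number of raw bytes put on the wire.
            raise SendCancelled(bytes_sent=len(accepted))


class FramedTransport:
    """Wraps another transport and prefixes each message with a 4-byte length."""

    def __init__(self, inner):
        self.inner = inner

    def send_all(self, data):
        frame = len(data).to_bytes(4, "big") + data
        # No try/except here, so the inner exception escapes unchanged.
        self.inner.send_all(frame)


def demo():
    framed = FramedTransport(RawTransport(budget=6))
    try:
        framed.send_all(b"hello world")
    except SendCancelled as exc:
        return exc.bytes_sent
    return None
```

Here demo() reports 6 bytes sent, but 4 of those were the length prefix, so a caller of FramedTransport.send_all that trusted the number would believe it had delivered 6 bytes of b"hello world" when only 2 went out.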

So, instead, we've been going ahead with the rule that once a send_all is cancelled, your stream is doomed. We haven't done anything to detect this and e.g. raise an error if you try calling send_all again after a cancelled send_all, like in @smurfix's suggestion.... maybe we should?

And then as a consequence, for downstream users, like trio-websocket, what we've been converging on is basically the rule that only one task should "own" a Stream for sending at a time – if you want a stream to survive sending from multiple tasks, then you create a background task that handles the send_all calls, and the other tasks send stuff to that task over some kind of channel. As @mehaase recently pointed out in #328 (comment), we might want to start documenting this more thoroughly? (#328 is generally relevant to these issues – it's ostensibly about send_all and locking, but really it's about sharing a stream between multiple tasks, and cancellation turns out to be a major consideration there.)

This does seem to be working out pretty well. So I guess the moral is that at least in this area, "partial results" just aren't an important case to think about. All the cases we care about are either "leaves the state inconsistent" or "atomic", and you can build the latter on top of the former (!) by using a background task + a channel, b/c the channel's send operation is atomic.


Some of this comment also feels relevant, especially the bit about "what does cancellation mean" near the end: #147 (comment)
