Skip to content

Conversation

@jasnell
Copy link
Collaborator

@jasnell jasnell commented Dec 10, 2025

Scratching a bit of a long term itch here. The bulk of the implementation here was done by claude under heavy supervision and iteration.

Throughout the code base we make use of kj::OneOf for state machines. The typical pattern is something like:

struct Open { /* ... */ };
struct Ended { /* ... */ };
kj::OneOf<Open, Ended, kj::Exception> state;

This pattern is simple and effective but suffers from a lack of guard rails. For instance, state transitions can be made out of order, or can be made while there are still dangling references. Over the years we've had numerous cases of UAF's and other issues due to invalidated references after incorrectly timed state changes, etc. The direct use of kj::OneOf is effective but can be error prone.

This PR introduces a new state-machine.h utility that thinly wraps around kj::OneOf while putting some guard rails in place to protect against state transitions occuring at the wrong time, reducing the likelihood of UAFs and other memory issues, and generally just improving the quality of the code. An example use in compression.c++ is included as a proof point.

  ComposableStateMachine<
      TerminalStates<Ended>,
      ErrorState<kj::Exception>,
      ActiveState<Open>,
      Open,
      Ended,
      kj::Exception> state;

The intention would be to first transition the code in api/streams to use this, then apply usage more generally across the code base.

The specific implementation of the tool was an exercise in using claude. It required a decent amount of back and forth and refinement. Claude definitely did not get everything right in the first, second, or even third pass so the session was highly interactive, requiring quite a bit of back and forth with the tool to generate the right behavior. Test coverage should be complete and the tests detail how the utility should be used. There are also extensive doc comments throughout. See the changes in compression.c++ for an illustration of how existing code would be migrated.

The code comments go into significant detail on how the mechanism works, and even includes a migration guide. A significant portion of the LOC changed in the PR are just comments. The test provides full coverage and plenty of examples on how the API would be used. For review, I recommend starting with the code comments and tests in order to best understand what the code is doing. Then, move on to the actual implementation detail.

Overall: this is part of the streams cleanup effort. One of the issues we've had with the streams implementation are unclear/incorrect streams state transitions leading to things like UAFs. This is part of the effort to clean up that implementation to make it safer overall.

It's a large PR, yes, but there's exceedingly little risk here. The only functional changes are to compression.c++, which already has good test coverage. The changes there are minimal and shouldn't introduce any risk.

@jasnell jasnell requested review from a team as code owners December 10, 2025 18:06
@jasnell jasnell changed the title Mark span time adjustment warning as nosentry. Implement state-machine utility Dec 10, 2025
@jasnell jasnell added technical debt Label used for changes that address technical debt claude Labe used for changes primarily made by claude labels Dec 11, 2025
@jasnell jasnell force-pushed the jasnell/state-machine-utility branch from 2529892 to cd6eb1a Compare December 17, 2025 16:46
anonrig
anonrig previously approved these changes Dec 17, 2025
@jasnell jasnell force-pushed the jasnell/state-machine-utility branch from cd6eb1a to a990c72 Compare December 18, 2025 22:16
Copy link
Collaborator

@harrishancock harrishancock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got partway through state-machine.h. It appears to include a lot of unused code. It looks like the only class we actually use in this PR is ComposableStateMachine. Can we delete everything but that class, and add the other, more tightly-scoped state machines when we have use cases for them? (I have a sneaking suspicion that we won't...) I don't want to review code that won't ever be used, and I don't want to approve the PR without having reviewed the code in it, so that seems like the fastest way to get this merged.

In terms of correctness, the main thing that stood out to me so far is const-correctness of the transition lock.

Copy link
Contributor

@guybedford guybedford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only KJ supported pattern matching I guess! I appreciate how the state machine solves a very real UAF issue, and that is well motivated.

Good decision keeping the compression refactor straightforward enough to verify that it is correct.

It is still a little bit worrying having so much state machine code without it being integrated, but as long as all follow-ons are suitably vetted and tested this approach can work for iteration further.

I wasn't able to fully read all state machine internals to fully understand the transition models - I would also be curious to understand how it handles more highly connected state transitions with constraints.

Agreed with @harrishancock that the naming isn't quite on first glance with withState. Noting for the main functions exactly what they do ("calls the callback only if the state is the given state?") would help readability either way.

Great to see systematic improvements to underlying state models.

@jasnell jasnell requested a review from guybedford December 19, 2025 22:56
@jasnell jasnell dismissed anonrig’s stale review December 19, 2025 22:56

Significant changes since signoff.

@jasnell jasnell force-pushed the jasnell/state-machine-utility branch from 55996ea to 37ee448 Compare December 19, 2025 22:57
@jasnell
Copy link
Collaborator Author

jasnell commented Dec 19, 2025

Significant updates and usages. I recommend starting with the final state of the PR rather than going through commit-by-commit. Will be far less confusing that way.

Copy link
Collaborator

@harrishancock harrishancock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reviewing state-machine.h. I have only nits and little suggestions that can be deferred at this point, but I'm holding off approving until I can read the rest of the changes in the PR.

@jasnell jasnell force-pushed the jasnell/state-machine-utility branch from 54b874f to 9d1860b Compare December 20, 2025 18:52
@jasnell
Copy link
Collaborator Author

jasnell commented Dec 20, 2025

@harrishancock ... added a commit addressing those review comments.

Copy link
Collaborator

@harrishancock harrishancock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through and left some comments, but nothing particularly urgent I think. I'd say let's merge, but maybe try to get this into an internal release with as few other changes as possible.

A few last thoughts:

  • I noticed that state.visitForGc() isn't used everywhere I'd expect it to be. There's a comment mentioning the kj::Own& conundrum. Are there other reasons it couldn't be used everywhere?
  • Some classes use a = StateMachine::create() field initializer, some classes use create() in a constructor initializer list, and still others leave their states uninitialized, then immediately call transitionTo(). An incremental improvement might be to delete StateMachine's default constructor, or make it private, forcing everyone to use the create() factory function. This would get rid of the possibility of uninitialized states, and make sure all states are properly initialized in initializer lists or field initializers. This would remove one barrier to implementing visitForGc() in terms of visit().
  • After seeing how frequently we need to use the getUnsafe() family of functions, I'm leaning more and more toward some sort of scoped accessor macro, maybe WD_IF_STATE, like in my previous comment philosophizing about KJ_IF_SOME integration. The first step would be to give TransitionLock a tryGet() member function, like @guybedford suggested. But if we do that, we'd need to figure out how to defer state transitions to when the TransitionLock is destroyed to support many of the places we call getUnsafe(), which raises the main confusing bit from my perspective: how to delineate the difference between OperationScope and TransitionLock, when to use one versus the other. The best answer isn't super clear to me.
  • Related to the previous point, we lose a little bit of safety by switching away from KJ_SWITCH_ONEOF, in that that macro forced us to handle every possible state. So, adding states now runs a greater risk of forgetting to adjust existing code to handle the new state. The verbosity of KJ_SWITCH_ONEOF is a burden, though, so I'm not saying it's wrong to switch away from it in the general case -- just that in some places we might want to guarantee that we've exhaustively handled all the states. A new WD_SWITCH_STATE macro could hide a TransitionLock, but still let us perform the exhaustive switch.

None of the above points are blocking, they're just suggestions. We can iterate more after we merge.

@jasnell jasnell force-pushed the jasnell/state-machine-utility branch from df69240 to a8899d7 Compare December 22, 2025 18:35
@jasnell
Copy link
Collaborator Author

jasnell commented Dec 22, 2025

Ok... @guybedford and @harrishancock ... I'm going to split this PR up now that it's essentially done. It's too big to land safely as is. My plan is to split it across multiple PRS using autogates as much as possible but that's not always going to be easy and it's likely to force us to roll it out likely over multiple weeks.

jasnell added a commit that referenced this pull request Dec 22, 2025
Separated out from #5670
This was an already existing bug that when piping a readable
stream through a transform stream, and the transform's readable
side is canceled, the source readable would not be correctly
released and marked as closed. With this fix, verified that the
behavior is correct per the spec and matches other implementations.
Should not be a breaking change as this as previously broken and
would error in inconsistent ways (IdentityTransformStream would
error differently than a regular TransformStream), now both have
consistent and correct behavior.
@jasnell jasnell force-pushed the jasnell/state-machine-utility branch from e490fd5 to fbb6fb1 Compare December 23, 2025 16:56
@jasnell jasnell marked this pull request as draft December 23, 2025 16:57
@github-actions
Copy link

The generated output of @cloudflare/workers-types has been changed by this PR. If this is intentional, run just generate-types to update the snapshot. Alternatively, you can download the full generated types:

Full Type Diff

@jasnell
Copy link
Collaborator Author

jasnell commented Dec 23, 2025

@anonrig ... do not bother reviewing this PR. I'm in the process of splitting it into multiple other PRs since we cannot land the bulk of these changes without autogate(s).

@anonrig
Copy link
Member

anonrig commented Dec 23, 2025

@anonrig ... do not bother reviewing this PR. I'm in the process of splitting it into multiple other PRs since we cannot land the bulk of these changes without autogate(s).

Yes, I'm just sharing a funny fact about Github :-)


#include "nbytes.h"

#include <workerd/api/system-streams.h>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by: #5755

@@ -1,10 +1,52 @@
#include "identity-transform-stream.h"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5763

}
}
};

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5767

@@ -1,7 +1,9 @@
#include "common.h"
#include "readable-source.h"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5767

#include "common.h"
#include "writable-sink.h"

#include <workerd/api/util.h>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5767

}
}
};

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5767

@@ -1,6 +1,7 @@
#include "common.h"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5767

@@ -1,29 +1,49 @@
#include "writable-sink.h"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled by #5767

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claude Labe used for changes primarily made by claude technical debt Label used for changes that address technical debt

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants