coordinator: add notion of staleness #1317
Merged
davidweisse
approved these changes
Apr 1, 2025
katexochen
reviewed
Apr 1, 2025
3u13r
previously requested changes
Apr 3, 2025
75bd502 to 6f5d1b3
6f5d1b3 to 36f0193
katexochen
reviewed
Apr 8, 2025
268d39d to b8f362a
katexochen
approved these changes
Apr 9, 2025
b8f362a to addf431
katexochen
approved these changes
Apr 9, 2025
Co-authored-by: davidweisse <98460960+davidweisse@users.noreply.github.com>
Co-authored-by: Leonard Cohnen <lc@edgeless.systems>
Co-authored-by: Paul Meyer <katexochen0@gmail.com>
addf431 to c00f3f6
The Coordinator has an internal in-memory state and an external state managed by a `Store` implementation. There's currently a two-way synchronization between the two:

- `SetManifest` requests result in store updates.
- Store updates are synchronized back into the in-memory state (`syncState`).

The latter mechanism is not suitable for distributed Coordinators as specified in RFC010, because it derives new mesh certificates on persistency changes, although the Coordinator that wrote the update already derived its own mesh cert. Thus, we need to replace the store-to-state update happening in
`syncState` with a peer recovery attempt. It's important to note that `syncState` was misguided to begin with, even for a single Coordinator, because an out-of-band update could not really happen to a running Coordinator.

As a first step towards peer recovery, I'm replacing the `syncState` function with the concept of state staleness, which implements the recovery mode outlined in the RFC. A Coordinator starts out uninitialized, becomes ready through a first manifest set, a user recovery or a peer recovery, and eventually becomes stale because other Coordinators receive manifest updates.

```mermaid
stateDiagram-v2
    state "out-of-band manifest update" as oob
    [*] --> uninit
    uninit --> userapi.Recover
    userapi.Recover --> ready
    uninit --> meshapi.Recover
    meshapi.Recover --> ready
    ready --> oob
    oob --> stale
    stale --> userapi.Recover
    stale --> meshapi.Recover
```

The Coordinator would be in recovery mode (i.e., should attempt to do peer recovery and should accept user recovery) when it is either uninitialized or stale. Staleness is a property of the state itself, and when a state becomes stale it never becomes fresh again (a new state object could be fresh, though). This is why it's safe to track staleness as a boolean field in `State` that is only ever flipped in one direction.
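
To illustrate the one-way flag, here is a minimal sketch, not the code from this PR. The type and method names (`State`, `Coordinator`, `MarkStale`, `IsStale`, `InRecoveryMode`, `Recover`) and the `generation` field are assumptions made up for the example; only the idea of a boolean on the state that flips from fresh to stale is taken from the description above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// State is a hypothetical stand-in for the Coordinator's in-memory state.
type State struct {
	generation int         // illustrative payload only
	stale      atomic.Bool // flipped from false to true, never back
}

// MarkStale flags the state as outdated. There is deliberately no reset method.
func (s *State) MarkStale() { s.stale.Store(true) }

// IsStale reports whether this state object has been marked stale.
func (s *State) IsStale() bool { return s.stale.Load() }

// Coordinator holds a pointer to the current state; nil means uninitialized.
type Coordinator struct {
	state atomic.Pointer[State]
}

// InRecoveryMode reports whether the Coordinator is uninitialized or stale.
func (c *Coordinator) InRecoveryMode() bool {
	s := c.state.Load()
	return s == nil || s.IsStale()
}

// Recover installs a new, fresh state object (models user or peer recovery).
// The old state object stays stale; freshness comes only from replacement.
func (c *Coordinator) Recover(generation int) {
	c.state.Store(&State{generation: generation})
}

func main() {
	c := &Coordinator{}
	fmt.Println("uninitialized, recovery mode:", c.InRecoveryMode())

	c.Recover(1)
	old := c.state.Load()
	fmt.Println("after recovery, recovery mode:", c.InRecoveryMode())

	old.MarkStale() // models an out-of-band manifest update
	fmt.Println("after update elsewhere, recovery mode:", c.InRecoveryMode())

	c.Recover(2)
	fmt.Println("old state still stale:", old.IsStale())
	fmt.Println("new state, recovery mode:", c.InRecoveryMode())
}
```

Because the flag lives on the state object rather than on the Coordinator, a request that captured the old pointer keeps seeing a consistent (if outdated) state, while new requests observe the replacement.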
All API methods verify that their state is fresh before responding. However, state can become stale after the Coordinator started responding to a request. This is perfectly fine for the meshapi, where the client is just unlucky to initialize during a manifest update, but still gets certificates for the existing deployment. In the userapi, this means that there are concurrent requests. `GetManifests` returns a state, but not necessarily the latest committed to storage, which is acceptable. Concurrent calls to `SetManifest` will fail eventually due to the `CompareAndSwap` logic in the store.

Staleness is not checked for meshapi functions: they should work on the state they are invoked with, regardless of that state becoming stale.