Skip to content

coordinator: add notion of staleness#1317

Merged
burgerdev merged 2 commits into
mainfrom
burgerdev/recovery-mode
Apr 9, 2025
Merged

coordinator: add notion of staleness#1317
burgerdev merged 2 commits into
mainfrom
burgerdev/recovery-mode

Conversation

@burgerdev
Copy link
Copy Markdown
Member

@burgerdev burgerdev commented Mar 21, 2025

The Coordinator has an internal in-memory state and an external state managed by a Store implementation. There's currently a two-way synchronization between the two:

  • SetManifest requests result in store updates.
  • Out-of-band store updates result in state refreshes (implemented in syncState).

The latter mechanism is not suitable for distributed Coordinators as specified in RFC010, because it derives new mesh certificates on persistency changes, although the Coordinator that wrote the update already derived its own mesh cert. Thus, we need to replace the store-to-state update happening in syncState with a peer recovery attempt. It's important to note that syncState was misguided to begin with, even for a single coordinator, because an out-of-band update could not really happen to a running Coordinator.

As a first step towards peer recovery, I'm replacing the syncState function with the concept of state staleness, which implements the recovery mode outlined in the RFC. A Coordinator starts out uninitialized, becomes ready through a first manifest set, a user recovery or a peer recovery, and eventually becomes stale because other coordinators receive manifest updates.

stateDiagram-v2
    state "out-of-band manifest update" as oob
    [*] --> uninit
    uninit --> userapi.Recover
    userapi.Recover --> ready
    uninit --> meshapi.Recover
    meshapi.Recover -->  ready
    ready --> oob
    oob --> stale
    stale --> userapi.Recover
    stale --> meshapi.Recover
Loading

The coordinator would be in recovery mode (i.e., should attempt to do peer recovery and should accept user recovery) when it is either uninitialized or stale. Staleness is a property of the state itself, and when a state becomes stale it never becomes fresh again (a new state object could be fresh, though). This is why it's safe to track staleness as a boolean field in State that is only ever flipped in one direction.

All API methods verify that their state is fresh before responding. However, state can become stale after the Coordinator started responding to a request. This is perfectly fine for the meshapi, where the client is just unlucky to initialize during a manifest update, but still gets certificates for the existing deployment. In the userapi, this means that there are concurrent requests. GetManifests returns a state, but not necessarily the latest committed to storage, which is acceptable. Concurrent calls to SetManifest will fail eventually due to the CompareAndSwap logic in the store.

Staleness is not checked for meshapi functions: they should work on the state they are invoked with, regardless of that state becoming stale.

Copy link
Copy Markdown
Member

@davidweisse davidweisse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

Comment thread coordinator/internal/authority/userapi.go Outdated
Comment thread coordinator/internal/authority/authority.go Outdated
Comment thread coordinator/internal/authority/authority.go Outdated
Comment thread coordinator/internal/authority/authority.go Outdated
Comment thread coordinator/internal/authority/authority.go Outdated
@burgerdev burgerdev added the no changelog PRs not listed in the release notes label Apr 2, 2025
3u13r
3u13r previously requested changes Apr 3, 2025
Comment thread coordinator/internal/authority/authority_test.go Outdated
Comment thread coordinator/internal/authority/authority_test.go Outdated
Comment thread coordinator/internal/authority/userapi_test.go Outdated
Comment thread coordinator/internal/authority/userapi_test.go Outdated
@burgerdev burgerdev force-pushed the burgerdev/recovery-mode branch 2 times, most recently from 75bd502 to 6f5d1b3 Compare April 4, 2025 12:34
@burgerdev burgerdev requested review from 3u13r and katexochen April 4, 2025 12:36
@burgerdev burgerdev added this to the v1.8.0 milestone Apr 4, 2025
@burgerdev burgerdev force-pushed the burgerdev/recovery-mode branch from 6f5d1b3 to 36f0193 Compare April 7, 2025 15:34
@katexochen katexochen added the needs: review Someone needs to review to bring this forward label Apr 8, 2025
Comment thread coordinator/main.go Outdated
Comment thread coordinator/internal/authority/authority.go Outdated
Comment thread coordinator/internal/authority/authority.go Outdated
Comment thread coordinator/internal/authority/authority.go
Comment thread coordinator/internal/authority/authority_test.go
Comment thread coordinator/internal/authority/authority_test.go Outdated
@katexochen katexochen added needs: rework PR author needs to work in feedback and removed needs: review Someone needs to review to bring this forward labels Apr 8, 2025
@burgerdev burgerdev force-pushed the burgerdev/recovery-mode branch from 268d39d to b8f362a Compare April 8, 2025 20:21
@katexochen katexochen removed the needs: rework PR author needs to work in feedback label Apr 9, 2025
@burgerdev burgerdev force-pushed the burgerdev/recovery-mode branch from b8f362a to addf431 Compare April 9, 2025 07:55
burgerdev and others added 2 commits April 9, 2025 10:55
Co-authored-by: davidweisse <98460960+davidweisse@users.noreply.github.com>
Co-authored-by: Leonard Cohnen <lc@edgeless.systems>

Co-authored-by: Paul Meyer <katexochen0@gmail.com>
@burgerdev burgerdev force-pushed the burgerdev/recovery-mode branch from addf431 to c00f3f6 Compare April 9, 2025 08:55
@burgerdev burgerdev merged commit 3c8a905 into main Apr 9, 2025
11 checks passed
@burgerdev burgerdev deleted the burgerdev/recovery-mode branch April 9, 2025 09:12
@katexochen katexochen modified the milestones: v1.8.0, v1.10.0, v1.9.0 May 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no changelog PRs not listed in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants