Skip to content

coordinator: distributed deployment with auto-recovery#1373

Merged
burgerdev merged 5 commits into
mainfrom
burgerdev/ha
May 21, 2025
Merged

coordinator: distributed deployment with auto-recovery#1373
burgerdev merged 5 commits into
mainfrom
burgerdev/ha

Conversation

@burgerdev
Copy link
Copy Markdown
Member

@burgerdev burgerdev commented Apr 11, 2025

This PR adds support for deploying multiple replicas of the Coordinator.

Further reading:


Testing:

  • nightly: TODO
  • manual:
    • just e2e openssl AKS-CLH-SNP

@burgerdev burgerdev added the feature Shiny new feature for our users label Apr 11, 2025
@burgerdev burgerdev added this to the v1.8.0 milestone Apr 11, 2025
@katexochen katexochen added enterprise feature and removed feature Shiny new feature for our users labels Apr 15, 2025
@katexochen katexochen modified the milestones: v1.8.0, v1.9.0 May 14, 2025
@burgerdev burgerdev force-pushed the burgerdev/ha branch 2 times, most recently from c2dd477 to 8e18fb3 Compare May 15, 2025 15:19
@burgerdev burgerdev marked this pull request as ready for review May 19, 2025 08:25
@burgerdev burgerdev requested a review from katexochen as a code owner May 19, 2025 08:25
@burgerdev burgerdev force-pushed the burgerdev/ha branch 2 times, most recently from f8f7351 to caff06d Compare May 19, 2025 13:43
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 19, 2025

PR Preview Action v1.6.1
Preview removed because the pull request was closed.
2025-05-21 13:08 UTC

Comment on lines +29 to +44
config, err := rest.InClusterConfig()
if err != nil {
return err
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
return err
}
namespace, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace")
if err != nil {
return err
}
discovery := peerdiscovery.New(clientset, string(namespace))
return peerrecovery.New(c.guard, discovery, c.issuer, c.httpsGetter, c.logger).RunRecovery(ctx)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: logging missing

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a logging statement about the recovery routine starting, hope this is what you meant.

Copy link
Copy Markdown
Contributor

@katexochen katexochen May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant start and error logging, the error logging is still missing.

Comment thread e2e/peerrecovery/peerrecovery_test.go Outdated
Comment thread e2e/peerrecovery/peerrecovery_test.go Outdated
Comment thread e2e/peerrecovery/peerrecovery_test.go
Comment thread docs/docs/howto/coordinator-ha.md Outdated
Comment thread docs/docs/howto/coordinator-ha.md Outdated
Comment thread docs/docs/howto/coordinator-ha.md Outdated
Comment thread docs/docs/howto/coordinator-ha.md
@burgerdev burgerdev force-pushed the burgerdev/ha branch 3 times, most recently from 8a497f8 to 59040aa Compare May 20, 2025 19:23
@burgerdev burgerdev changed the title enterprise: distributed coordinator enterprise: distributed coordinator with auto-recovery May 20, 2025
Comment thread docs/docs/howto/coordinator-ha.md Outdated

## Workflow

<!-- TODO(burgerdev): this how-to anticipates planned changes to readiness which are not yet committed. -->
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in #1467

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, as that is already approved we can delete the comment here.

@katexochen katexochen added feature Shiny new feature for our users and removed enterprise feature labels May 21, 2025
@katexochen katexochen changed the title enterprise: distributed coordinator with auto-recovery coordinator: distributed deployment with auto-recovery May 21, 2025
Comment thread docs/docs/howto/coordinator-ha.md Outdated

```sh
kubectl rollout restart statefulset/coordinator
sleep 2m
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we wait for the rollout to finish here instead of sleeping?

```

This will create additional Coordinator instances, one after another, as described in the [Kubernetes documentation for `StatefulSet`](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees).
After some time, the additional Coordinators should enter the `Ready` state, too.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we add a command to wait for readiness?

Comment thread docs/docs/howto/coordinator-ha.md Outdated

## Workflow

<!-- TODO(burgerdev): this how-to anticipates planned changes to readiness which are not yet committed. -->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, as that is already approved we can delete the comment here.

Co-authored-by: Paul Meyer <katexochen0@gmail.com>
@burgerdev burgerdev merged commit fb96eca into main May 21, 2025
18 checks passed
@burgerdev burgerdev deleted the burgerdev/ha branch May 21, 2025 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Shiny new feature for our users

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants