Updater and restart supervisor can fight each other #2242

@aaronlehmann

  1. Suppose the update order is set to "start, then stop", and restart condition is set to "always". A change to the service starts a rolling update. During this rolling update, one of the existing tasks fails, and gets restarted. When the updater gets to this task, it won't be able to shut down the "old" task, because it was already restarted, but it will have already started a replacement task by this point. The service ends up with one too many replicas.
  2. Suppose the update order is set to "stop, then start", and restart condition is set to "none". A change to the service starts a rolling update. During this rolling update, one of the existing tasks fails. When the updater gets to this task, it fails to shut down the "old" task, so it does not start up a new one. The service ends up with fewer replicas than expected, even though the update should have restored the desired number of replicas.
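Both scenarios can be reproduced in a toy model. The sketch below is not SwarmKit code: `Task`, `restart_supervisor`, and `scenario` are hypothetical names, and the updater is reduced to a loop over a task list that was captured before the failure happened. It assumes the failure hits the last task before the updater reaches it.

```python
from dataclasses import dataclass


@dataclass
class Task:
    slot: int
    spec: str            # "old" or "new" version of the service
    running: bool = True


def restart_supervisor(tasks, failed, condition):
    """Mark a task as failed and, if the restart condition allows, restart it."""
    failed.running = False
    if condition == "always":
        # The restarted task fills the same slot but is unknown to the updater.
        tasks.append(Task(failed.slot, failed.spec))


def scenario(order, condition, replicas=3):
    """Return the number of running tasks after a rolling update with one failure."""
    tasks = [Task(slot, "old") for slot in range(replicas)]
    snapshot = list(tasks)  # the list of tasks handed to the updater

    # The last task fails while the update is in progress.
    restart_supervisor(tasks, snapshot[-1], condition)

    for old in snapshot:
        if order == "start-first":
            tasks.append(Task(old.slot, "new"))  # replacement starts first
            old.running = False                  # no effect if already dead
        else:  # "stop-first"
            if not old.running:
                continue  # shutdown fails, so no replacement is started
            old.running = False
            tasks.append(Task(old.slot, "new"))

    return sum(t.running for t in tasks)


print(scenario("start-first", "always"))  # 4: one replica too many
print(scenario("stop-first", "none"))     # 2: one replica too few
```

In the first case the restarted "old" task survives because the updater only holds a reference to the task that already failed. In the second case the slot is simply skipped.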

This is tricky to fix because the updater is designed around being passed a list of tasks. That design lets it operate on abstract slots, which the orchestrator fills in. I can see a few possible approaches:

  1. Disable task-level reconciliation during a rolling update, since the updater is effectively taking over this function. However, this would mean that a task that fails early on during a rolling update would not be restarted until the rolling update is entirely finished. Also, a rolling update technically lasts past the last task being replaced, because the updater stays around to monitor for failures.
  2. Change the restarter and orchestrator to reconcile the contents of a slot, instead of unconditionally updating or restarting. Note that updates already have this kind of logic, but it lives in the orchestrator rather than the updater so that the updater can operate on abstract slot objects.
  3. If any task in a service fails during an update, first restart it if applicable, then kick off a new rolling update to replace the current one, so it has an up-to-date list of tasks.
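As a rough illustration of the second approach, the sketch below reconciles a slot against its current contents instead of acting on a task reference captured earlier. It is only a model under stated assumptions: `Task` and `reconcile_slot` are hypothetical names, not SwarmKit types, and the function always uses "start, then stop" ordering.

```python
from dataclasses import dataclass


@dataclass
class Task:
    slot: int
    spec: str            # "old" or "new" version of the service
    running: bool = True


def reconcile_slot(tasks, slot, desired_spec):
    """Bring one slot to exactly one running task with the desired spec."""
    # Look at what is in the slot right now, not at a stale snapshot.
    live = [t for t in tasks if t.slot == slot and t.running]
    up_to_date = [t for t in live if t.spec == desired_spec]

    if up_to_date:
        keep = up_to_date[0]
    else:
        # Nothing suitable is running: start a replacement first.
        keep = Task(slot, desired_spec)
        tasks.append(keep)

    # Stop everything else in the slot, including tasks that were
    # restarted behind the updater's back.
    for t in live:
        if t is not keep:
            t.running = False
    return keep


# Slot 0: a failed old task, its restarted copy, and a new task.
# Slot 1: only a failed old task that was never restarted.
tasks = [Task(0, "old", False), Task(0, "old"), Task(0, "new"),
         Task(1, "old", False)]
for slot in (0, 1):
    reconcile_slot(tasks, slot, "new")
```

Because the function only looks at the slot's present state, calling it again after convergence changes nothing, so it would not matter whether the restarter or the updater triggers it first.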

cc @aluzzardi
