Put restarted tasks in READY state #1199
Conversation
Current coverage is 54.99% (diff: 100%)

@@           master     #1199   diff @@
==========================================
  Files          77        77
  Lines       12233     12213    -20
  Methods         0         0
  Messages        0         0
  Branches        0         0
==========================================
- Hits         6726      6717     -9
+ Misses       4581      4579     -2
+ Partials      926       917     -9
LGTM. I'm sure you already did, but this PR needs a bunch of manual testing, since it's in the middle of the critical path so close to GA, and it's a very subtle code base.
Agreed. Here's the manual testing I did before submitting it:
I just repeated the tests with a global service, and things look good in that case as well.
r.mu.Lock()
oldDelay, ok := r.delays[t.ID]
if ok {
	go r.waitRestart(ctx, oldDelay, cluster, t.ID)
If Restart gets called N times for the same task, this would create N goroutines waiting on the select. Is this expected behavior? I mean, if we know a restart is already in progress, can we just return?
I thought this would not be an issue because Restart only gets called when a task transitions to an observed state > RUNNING (i.e., it has failed or completed). But I suppose it's possible that a task could first go to COMPLETED and then SHUTDOWN, for example, and that would trigger two goroutines. So I guess I'll add something here to limit it to one goroutine.
LGTM when @dongluochen's comment gets addressed
We used to put restarted tasks in READY state. This makes sense because then they can go ahead and pull an image while we wait for the restart delay to elapse. However, moby#715 changed the restart supervisor to put restarted tasks into ACCEPTED to work around a tight restart loop when an image doesn't exist. The problem was that the task would fail immediately, leading the orchestrator to request a new restart, which would cancel the ongoing restart delay.

As a better fix for this, put tasks in READY, but when a restart is requested and there is already one in progress for the old task, we wait for that restart to complete before starting the new one.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from f6fc2ed to 303bca8.
Added protection against multiple goroutines being launched for the same task. PTAL
LGTM
cc @aluzzardi @dongluochen