Skip to content

Put restarted tasks in READY state#1199

Merged
dongluochen merged 1 commit into
moby:masterfrom
aaronlehmann:put-restarted-tasks-in-ready
Jul 21, 2016
Merged

Put restarted tasks in READY state#1199
dongluochen merged 1 commit into
moby:masterfrom
aaronlehmann:put-restarted-tasks-in-ready

Conversation

@aaronlehmann
Copy link
Copy Markdown
Collaborator

We used to put restarted tasks in READY state. This makes sense because
then they can go ahead and pull an image while we wait for the restart
delay to elapse. However, #715 changed the restart supervisor to put
restarted tasks into ACCEPTED to work around a tight restart loop when
an image doesn't exist. The problem was that the task would fail
immediately, leading the orchestrator to request a new restart, which
would cancel the ongoing restart delay.

As a better fix for this, put tasks in READY, but when a restart is
requested and there is already one in progress for the old task, we wait
for that restart to complete before starting the new one.

cc @aluzzardi @dongluochen

@codecov-io
Copy link
Copy Markdown

codecov-io commented Jul 21, 2016

Current coverage is 54.99% (diff: 100%)

Merging #1199 into master will increase coverage by 0.01%

@@             master      #1199   diff @@
==========================================
  Files            77         77          
  Lines         12233      12213    -20   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           6726       6717     -9   
+ Misses         4581       4579     -2   
+ Partials        926        917     -9   

Sunburst

Powered by Codecov. Last update 68a7ee6...303bca8

@aluzzardi
Copy link
Copy Markdown
Member

LGTM

I'm sure you already did, but this PR needs a bunch of manual testing since it's in the middle of the critical path so close to GA and it's a very subtle code base.

@aaronlehmann
Copy link
Copy Markdown
Collaborator Author

Agreed. Here's the manual testing I did before submitting it:

  • Created a replicated service which references a nonexistent image. Observed that it respects the restart delay instead of restarting the replicas in a tight loop.
  • Created a replicated service that runs successfully, then did a few rolling updates with good and bad images and commands. The restart delay was always respected.

I just repeated the tests with a global service, and things look good in that case as well.

Comment thread manager/orchestrator/restart.go Outdated
r.mu.Lock()
oldDelay, ok := r.delays[t.ID]
if ok {
go r.waitRestart(ctx, oldDelay, cluster, t.ID)
Copy link
Copy Markdown
Contributor

@dongluochen dongluochen Jul 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Restart gets call N times for the same task, this would create N goroutines waiting for select. Is this expected behavior? I mean if we know restarting is already in progress, can we just return?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this would not be an issue because because Restart only gets when a task transitions to an observed state > RUNNING (i.e., it has failed or completed). But I suppose it's possible that a task could first go to COMPLETED and then SHUTDOWN, for example, and that would trigger two goroutines. So I guess I'll add something here to limit it to one goroutine.

@aluzzardi
Copy link
Copy Markdown
Member

LGTM when @dongluochen comment gets addressed

We used to put restarted tasks in READY state. This makes sense because
then they can go ahead and pull an image while we wait for the restart
delay to elapse. However, moby#715 changed the restart supervisor to put
restarted tasks into ACCEPTED to work around a tight restart loop when
an image doesn't exist. The problem was that the task would fail
immediately, leading the orchestrator to request a new restart, which
would cancel the ongoing restart delay.

As a better fix for this, put tasks in READY, but when a restart is
requested and there is already one in progress for the old task, we wait
for that restart to complete before starting the new one.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann aaronlehmann force-pushed the put-restarted-tasks-in-ready branch from f6fc2ed to 303bca8 Compare July 21, 2016 18:51
@aaronlehmann
Copy link
Copy Markdown
Collaborator Author

Added protection against multiple goroutines being launched for the same task.

PTAL

@dongluochen
Copy link
Copy Markdown
Contributor

LGTM

@dongluochen dongluochen merged commit b3d7a48 into moby:master Jul 21, 2016
@aaronlehmann aaronlehmann deleted the put-restarted-tasks-in-ready branch July 21, 2016 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants