Put restarted tasks in READY state #1199
Conversation
Current coverage is 54.99% (diff: 100%)

@@           master     #1199   diff @@
==========================================
  Files          77        77
  Lines       12233     12213    -20
  Methods         0         0
  Messages        0         0
  Branches        0         0
==========================================
- Hits         6726      6717     -9
+ Misses       4581      4579     -2
+ Partials      926       917     -9
LGTM. I'm sure you already did, but this PR needs a bunch of manual testing, since it's in the middle of the critical path so close to GA, and it's a very subtle code base.
Agreed. Here's the manual testing I did before submitting it:
I just repeated the tests with a global service, and things look good in that case as well.
r.mu.Lock()
oldDelay, ok := r.delays[t.ID]
if ok {
	go r.waitRestart(ctx, oldDelay, cluster, t.ID)
If Restart gets called N times for the same task, this would create N goroutines waiting on the select. Is this expected behavior? I mean, if we know a restart is already in progress, can we just return?
I thought this would not be an issue because Restart only gets called when a task transitions to an observed state > RUNNING (i.e., it has failed or completed). But I suppose it's possible that a task could first go to COMPLETED and then SHUTDOWN, for example, and that would trigger two goroutines. So I guess I'll add something here to limit it to one goroutine.
LGTM when @dongluochen's comment gets addressed
We used to put restarted tasks in READY state. This makes sense because then they can go ahead and pull an image while we wait for the restart delay to elapse. However, moby#715 changed the restart supervisor to put restarted tasks into ACCEPTED to work around a tight restart loop when an image doesn't exist. The problem was that the task would fail immediately, leading the orchestrator to request a new restart, which would cancel the ongoing restart delay.

As a better fix for this, put tasks in READY, but when a restart is requested and there is already one in progress for the old task, we wait for that restart to complete before starting the new one.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed from f6fc2ed to 303bca8.
Added protection against multiple goroutines being launched for the same task. PTAL
LGTM
cc @aluzzardi @dongluochen