Skip to content

restarts: Default delay and desired=accepted.#715

Merged
aluzzardi merged 1 commit into
moby:masterfrom
aluzzardi:orchestrator-default-restart-delay
Jun 3, 2016
Merged

restarts: Default delay and desired=accepted.#715
aluzzardi merged 1 commit into
moby:masterfrom
aluzzardi:orchestrator-default-restart-delay

Conversation

@aluzzardi
Copy link
Copy Markdown
Member

  • Add a default delay of 5 seconds between restarts if not specified
    Otherwise we end up into a restart loop by default.
  • Set the new task's desired state to "ACCEPTED" rather than "READY".
    Ready implies pulling, which means creating a service with an invalid
    image leads to a restart loop.

The second point is quite inconvenient: Moving the desired state to READY was a convenience so that while we were waiting for a restart delay to occur, the new node would at least start pulling the image.

However, with an invalid image, READY can't happen since the agent transitions from ACCEPTED to REJECTED.

/cc @stevvooe @aaronlehmann @dongluochen

@aluzzardi aluzzardi force-pushed the orchestrator-default-restart-delay branch 2 times, most recently from 6179464 to c514fda Compare May 30, 2016 01:33
@docker-codecov-bot
Copy link
Copy Markdown

docker-codecov-bot commented May 30, 2016

Current coverage is 56.88%

Merging #715 into master will increase coverage by 2.01%

  1. 2 files (not in diff) in manager/orchestrator were modified. more
    • Misses -6
    • Partials -2
    • Hits +8
  2. 2 files (not in diff) in manager were modified. more
    • Misses +2
    • Partials +2
    • Hits -7
  3. 3 files (not in diff) in ca were modified. more
    • Misses +3
    • Partials +2
    • Hits +4
  4. 2 files (not in diff) in agent/exec/container were modified. more
    • Misses -16
    • Partials -3
    • Hits -16
  5. 1 files (not in diff) in agent were modified. more
    • Misses -6
  6. 8 files (not in diff) in spec were deleted. more
  7. 2 files (not in diff) in protobuf/plugin were deleted. more
  8. 1 files (not in diff) in protobuf were deleted. more
  9. 9 files (not in diff) in manager/state/store were deleted. more
  10. 5 files (not in diff) in manager/state/raft were deleted. more
@@             master       #715   diff @@
==========================================
  Files            77         46     -31   
  Lines         11057       5700   -5357   
  Methods           0          0           
  Messages          0          0           
  Branches          0          0           
==========================================
- Hits           6066       3242   -2824   
+ Misses         4163       2020   -2143   
+ Partials        828        438    -390   

Sunburst

Powered by Codecov. Last updated by bc63afd...6179464

@aaronlehmann
Copy link
Copy Markdown
Collaborator

There is already a default restart delay, set in the spec parsing code.

By doing this in the restart supervisor directly, doesn't it become impossible to set a restart delay of 0? That's a problem.

Not sure about setting the desired state to accepted. It does seem to work around that issue, but I wonder if we can find a better way.

On May 29, 2016, at 20:56, Andrea Luzzardi notifications@github.com wrote:

Add a default delay of 5 seconds between restarts if not specified
Otherwise we end up into a restart loop by default.

Set the new task's desired state to "ACCEPTED" rather than "READY".
Ready implies pulling, which means creating a service with an invalid
image leads to a restart loop.

The second point is quite inconvenient: Moving the desired state to READY was a convenience so that while we were waiting for a restart delay to occur, the new node would at least start pulling the image.

However, with an invalid image, READY can't happen since the agent transitions from ACCEPTED to REJECTED.

/cc @stevvooe @aaronlehmann @dongluochen

You can view, comment on, or merge this pull request online at:

#715

Commit Summary

restarts: Default delay and desired=accepted.
File Changes

M manager/orchestrator/restart.go (11)
Patch Links:

https://github.com/docker/swarm-v2/pull/715.patch
https://github.com/docker/swarm-v2/pull/715.diff

You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@aluzzardi
Copy link
Copy Markdown
Member Author

@aaronlehmann I don't think we should rely on spec to set defaults - that's just one client. service create would not use that for instance.

@aaronlehmann
Copy link
Copy Markdown
Collaborator

aaronlehmann commented May 31, 2016 via email

@dongluochen
Copy link
Copy Markdown
Contributor

Setting defaultRestartDelay in 2 places is confusing. We should resolve it.

@stevvooe
Copy link
Copy Markdown
Contributor

@aaronlehmann Is there a use for a restart delay smaller than 1ns?

LGTM

@aaronlehmann
Copy link
Copy Markdown
Collaborator

@aaronlehmann Is there a use for a restart delay smaller than 1ns?

In practice, having a minimum delay like this should be fine. But I think it would be very counterintuitive to specify 'RestartDelay: 0' and have a few seconds of delay. If we make the creator of the specs substitute 1 ns for an explicit 0 value, I think that would be an acceptable workaround.

@aluzzardi
Copy link
Copy Markdown
Member Author

@aaronlehmann Or we make restartdelay a pointer

@aaronlehmann
Copy link
Copy Markdown
Collaborator

@aaronlehmann Or we make restartdelay a pointer

Yeah, I prefer that to special-casing 0 literals on the client side.

This would mean that RestartDelay would have its own message type, right?

@aluzzardi
Copy link
Copy Markdown
Member Author

@aaronlehmann I think it just means delay becomes a nullable

@aaronlehmann
Copy link
Copy Markdown
Collaborator

The delay is an int; is it possible to make those nullable? I know proto2 has the optional keyword but I haven't seen that in proto3.

@aluzzardi
Copy link
Copy Markdown
Member Author

Right - that doesn't work

@aluzzardi aluzzardi force-pushed the orchestrator-default-restart-delay branch from c514fda to 5cba65c Compare June 2, 2016 00:53
@aluzzardi
Copy link
Copy Markdown
Member Author

Rebased

@aaronlehmann
Copy link
Copy Markdown
Collaborator

--- FAIL: TestOrchestratorRestartMaxAttempts (0.41s)
    Error Trace:    restart_test.go:485
    Error:      Not equal: 256 (expected)
                    != 384 (actual)

@aluzzardi aluzzardi force-pushed the orchestrator-default-restart-delay branch from 5cba65c to 4567e9c Compare June 2, 2016 03:00
- Add a default delay of 5 seconds between restarts if not specified
Otherwise we end up into a restart loop by default.

- Set the new task's desired state to "ACCEPTED" rather than "READY".
Ready implies pulling, which means creating a service with an invalid
image leads to a restart loop.

Signed-off-by: Andrea Luzzardi <aluzzardi@gmail.com>
@aluzzardi aluzzardi force-pushed the orchestrator-default-restart-delay branch from 4567e9c to 326309d Compare June 2, 2016 03:12
@aluzzardi
Copy link
Copy Markdown
Member Author

Bad rebase. Fixed

@aaronlehmann
Copy link
Copy Markdown
Collaborator

I would prefer to handle 0-delay cases as part of this, but if you are in a hurry to merge, we can merge this now and fix 0-delay in a followup. There is already a ticket filed: #721.

@aluzzardi
Copy link
Copy Markdown
Member Author

@aaronlehmann I'm going to go ahead and merge this. We can fix in a follow up by right now the behavior on master is pretty bad.

@aluzzardi aluzzardi merged commit fa558b1 into moby:master Jun 3, 2016
@aluzzardi aluzzardi deleted the orchestrator-default-restart-delay branch June 3, 2016 05:47
aaronlehmann pushed a commit to aaronlehmann/swarmkit that referenced this pull request Jul 21, 2016
We used to put restarted tasks in READY state. This makes sense because
then they can go ahead and pull an image while we wait for the restart
delay to elapse. However, moby#715 changed the restart supervisor to put
restarted tasks into ACCEPTED to work around a tight restart loop when
an image doesn't exist. The problem was that the task would fail
immediately, leading the orchestrator to request a new restart, which
would cancel the ongoing restart delay.

As a better fix for this, put tasks in READY, but when a restart is
requested and there is already one in progress for the old task, we wait
for that restart to complete before starting the new one.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
aaronlehmann pushed a commit to aaronlehmann/swarmkit that referenced this pull request Jul 21, 2016
We used to put restarted tasks in READY state. This makes sense because
then they can go ahead and pull an image while we wait for the restart
delay to elapse. However, moby#715 changed the restart supervisor to put
restarted tasks into ACCEPTED to work around a tight restart loop when
an image doesn't exist. The problem was that the task would fail
immediately, leading the orchestrator to request a new restart, which
would cancel the ongoing restart delay.

As a better fix for this, put tasks in READY, but when a restart is
requested and there is already one in progress for the old task, we wait
for that restart to complete before starting the new one.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants