TLA+ model for SwarmKit tasks #2613
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2613      +/-   ##
==========================================
- Coverage   62.31%   61.82%   -0.49%
==========================================
  Files         134      134
  Lines       21739    21821      +82
==========================================
- Hits        13546    13491      -55
- Misses       6740     6892     +152
+ Partials     1453     1438      -15
dperny left a comment:
Hopefully these explanations will make some things clearer.
| (* This has every field mentioned in `task_model.md' except for `spec', which
|    it doesn't seem to use for anything.
|
|    XXX: can desired_state be any state, or just some (e.g. {ready, running, shutdown, remove})? *)
I believe that there are no guarantees about this, but the current behavior should be only running, shutdown, and remove. I have also envisioned that one day for single-run tasks, the desired state might be completed. So to model accurately, it may be best to model the desired state as any possible state. And if that doesn't work, then we can constrain states as needed.
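To make this concrete, here is a minimal TLA+ sketch of how the model could allow any state as a desired state while still noting the currently observed behaviour. The names `TaskState`, `tasks`, and the field `desired_state` are assumptions made up for this sketch, not taken from the actual spec:

    (* Sketch only: assume TaskState is the set of all task states used by the model. *)
    DesiredStates == TaskState

    (* The subset the orchestrator is currently observed to request, per the discussion above. *)
    ObservedDesiredStates == {running, shutdown, remove}

    (* Invariant: every task's desired state is drawn from the allowed set. *)
    DesiredStateOK == \A t \in tasks : t.desired_state \in DesiredStates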
| \/ CreateService
| \/ UpdateService
|
| (* XXX: what about removing a service? (not described in any detail in the design docs) *)
Removing a service is possible. The original way it was done was to delete the service object. This cascaded to the Orchestrator, which saw that the service was removed and deleted all of its tasks. The deletion of the tasks caused the Dispatcher to send the workers an updated set of Assignments which did not include the deleted tasks, and in turn the workers shut down the deleted tasks.
This caused an issue because shutting down a task is far from instant, while deleting a task is. When a task is deleted, its network resources are freed by the Allocator and can be reused. However, because the task may take some time to shut down, it was possible for network resources (IP addresses) to be reused while the old tasks' containers were still shutting down.
The solution for this was to add the Remove state, which (being a terminal state) indicates to a worker that the task should be shut down. Then, when the task is shut down, that state is reported back to the manager, which affirmatively knows that the task is terminated and at that point can safely delete the resources it uses.
However, a further extension of this problem is the fact that a service itself can have resources associated with it, specifically Virtual IPs. Deleting the service would free those virtual IPs, creating the same problem. So, to fix that, I believe @nishanttotla was or is working on a change to the service object to flag it for removal instead of removing it outright. That flag would be treated the way a deletion used to be, kicking off the movement of tasks to desired state removed, with the service finally being deleted once all of its tasks were deleted.
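As a rough illustration of the flag-for-removal design described above, here is a hedged TLA+ sketch. The names `services`, `tasks`, `TasksOf`, and the `marked_for_removal` field are assumptions introduced for the sketch, not the actual SwarmKit model:

    (* Assumed helper: the tasks belonging to a service. *)
    TasksOf(sid) == { t \in tasks : t.service_id = sid }

    (* The user flags the service for removal instead of deleting it outright;
       all of its tasks get the desired state `remove'. *)
    FlagServiceForRemoval(sid) ==
        /\ sid \in DOMAIN services
        /\ services' = [services EXCEPT ![sid].marked_for_removal = TRUE]
        /\ tasks' = { [t EXCEPT !.desired_state = remove] : t \in TasksOf(sid) }
                    \union (tasks \ TasksOf(sid))

    (* Once a flagged service has no tasks left, it can finally be deleted. *)
    ReapService(sid) ==
        /\ sid \in DOMAIN services
        /\ services[sid].marked_for_removal
        /\ TasksOf(sid) = {}
        /\ services' = [ s \in (DOMAIN services) \ {sid} |-> services[s] ]
        /\ UNCHANGED tasks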
OK, I'll wait for @nishanttotla to do that (or submit a design for it) and add it to the model then.
To check I understand the current system:
- The service is removed from the system.
- The orchestrator sets the desired state for every task without a service to remove.
- Once a task with no service is shut down, the reaper will remove it.
| are configured to start the new task before stopping the old one.'' - need to implement this
|
| XXX: ``The updater takes care of making sure that each slot converges to having a single running task.''
|    How does it do that?
You'd have to look at the code itself, but the pattern is probably something along the lines of watching the event stream for task updates and, during a rolling update, moving the desired state of older tasks in a slot to shutdown.
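A hedged sketch of what such a convergence step could look like in the model; the helper `RunnableInSlot` and the record fields are assumptions introduced for illustration:

    (* Assumed helper: tasks in one slot of a service whose desired state is `running'. *)
    RunnableInSlot(sid, slot) ==
        { t \in tasks : t.service_id = sid /\ t.slot = slot /\ t.desired_state = running }

    (* If a slot has more than one task that wants to run (e.g. during a rolling update),
       pick one to keep and mark the others for shutdown. *)
    ConvergeSlot(sid, slot) ==
        \E keep \in RunnableInSlot(sid, slot) :
            /\ RunnableInSlot(sid, slot) # {keep}
            /\ tasks' = { IF t \in RunnableInSlot(sid, slot) /\ t # keep
                             THEN [t EXCEPT !.desired_state = shutdown]
                             ELSE t
                          : t \in tasks }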
|
| XXX: `orchestrators.md' says: ``[The Restart supervisor] atomically changes the state of the old task to Shutdown''.
|    However, this conflicts with the claim that only the agent can change the state of a running task.
|    Presumably it means the desired state.
Yes, probably means desired state. Every state after assigned is set by the agent. Feel free to either open a new PR or add to this PR an update to the documentation.
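That ownership rule could also be written down as a small check. A sketch under the assumption that the model defines a set `AgentTransitions` of the transitions the agent performs (the names and the exact set of states are invented for illustration):

    (* Assumed: the states the agent alone is responsible for reaching. *)
    AgentStates == { accepted, preparing, ready, starting, running,
                     complete, failed, rejected, shutdown }

    (* Check: every transition performed by the agent lands in a post-ASSIGNED state. *)
    AgentSetsOnlyLateStates ==
        \A tr \in AgentTransitions : tr[2] \in AgentStates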
| (* XXX: spec says ``Tasks that were removed because of service removal
|    or scale down are not kept around in task history.''
|    Does that include tasks that had already finished for some
|    other reason? *)
Building on my earlier explanation of slots, and the fact that the primary purpose of a slot is to maintain task history: the tasks in a slot have very little relationship to each other beyond the fact that they share a slot number.
When a task is removed in a scale down or service removal, the history for that slot is no longer useful in any meaningful way, so all tasks with the removed slot number in any state are removed. The slot no longer "exists" (as far as the concept of existence is meaningful for a slot) and therefore no longer requires any history.
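A minimal sketch of that behaviour, assuming hypothetical helpers `TasksInSlot` and `WantedSlots` (the latter standing in for however the model records which slot numbers a service still wants):

    (* Assumed helper: every task recorded for one slot of a service, in any state. *)
    TasksInSlot(sid, slot) == { t \in tasks : t.service_id = sid /\ t.slot = slot }

    (* After a scale-down or service removal, a slot's history is useless,
       so every task carrying that slot number is dropped regardless of state. *)
    ReapSlot(sid, slot) ==
        /\ slot \notin WantedSlots(sid)
        /\ tasks' = tasks \ TasksInSlot(sid, slot)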
| (* Every task has a service:
|
|    XXX: The spec says: ``In some cases, there are tasks that exist independent of any service.
|    These do not have a value set in service_id.'' Need an example of one. *)
Network attachment tasks. They're an abomination, an absolutely disgusting hack that should never have been shipped. They are tasks which use the NetworkAttachmentSpec runtime spec, a spec which contains a container ID. These tasks are created through the grpc server implemented in resourceapi, and they are created by the agent. The purpose of Network Attachment tasks is to attach regular docker run containers to a swarmkit overlay network that has been created with the --attachable option.
There have been bugs in the past with attachment tasks because we have no visibility into their lifecycle and their creation and destruction is entirely at the whim of the agent. The end result is that network attachment tasks would remain in the object store after they had been terminated, leaking resources and causing the associated attached networks to be unremovable.
I added a workaround to remove network attachment tasks when a node is removed in #2409, but the underlying issue IIRC remains.
I'm sorry.
Also, @jlhawn is working on a hack week project to run one-off tasks, like docker run-esque tasks, that would exist independently of services.
| Transition == {
|     << null, new >>,
|     << new, pending >>,
|     << pending, assigned >>,
I can't quite remember, I'd have to go check the code, but it may be possible to move from new to a terminal state, if the task cannot be allocated because there aren't enough free resources. I think. I'm not sure.
|     << shutdown, remove >>,
|     << rejected, remove >>,
|     << complete, remove >>,
|     << failed, remove >>,
Tasks will never actually BE in the removed state. They will only have a desired state of removed. The state functions mainly as a flag to the task reaper, to indicate that, once the task is in a terminal state, it and its history can be deleted.
|     << shutdown, orphaned >>,
|
|     << remove, null >>,
|     << orphaned, null >>
Tasks can be deleted (moved to the null state, I think, in terms of the implementation of this model) from any terminal state. For example, when a task history is deleted, the old tasks will all be in terminal states (like SHUTDOWN or FAILED) and will be deleted straight from those states. Once a task is in a terminal state, the resources it had in use are definitely free and it is safe to remove it.
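A hedged sketch of the reaper step this describes; the state names and the `tasks` variable are assumptions matching the quoted transition table, not the actual spec:

    (* Assumed: terminal actual states. Per the discussion above, `remove' normally
       appears only as a desired state, so it is left out here. *)
    TerminalStates == { complete, failed, rejected, shutdown, orphaned }

    (* Deleting a task is modelled as dropping it from `tasks' (i.e. moving it to `null');
       it is only allowed once the task has reached a terminal state. *)
    ReapTask(t) ==
        /\ t \in tasks
        /\ t.state \in TerminalStates
        /\ tasks' = tasks \ {t}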
| =============================================================================
| \* Modification History
| \* Last modified Thu Apr 19 14:55:46 BST 2018 by tal
| \* Created Mon Apr 16 11:31:19 BST 2018 by tal
this is a dumb nit, but wouldn't the modification history just be tracked in git? Or does this history get baked into the PDF that a TLA+ file creates?
It's just some noise the TLA+ toolbox updates automatically when you save. Although it looks like if you delete it then it doesn't create it again, so easily solved!
Thanks for the detailed review @dperny! I've made some updates:
Note: I also updated the set of valid transitions to group them by the component that performs them, and to check this.
| progress through the `ACCEPTED`, `PREPARING`, and `STARTING` states on the
| way to `RUNNING`. If a task exits without an error code, it moves to the
| progress through the `ACCEPTED`, `PREPARING`, `READY', and `STARTING` states on
| the way to `RUNNING`. If a task exits without an error code, it moves to the
"If a task exits without an error code, it moves to the COMPLETE state."
Let's also clarify this by saying that the actual state is changed to COMPLETE by the agent.
Actually, it says the agent handles these states on the line above the bit shown by GitHub's diff, so maybe it's OK as it is.
More accurately, the service is removed from the system, the process of which involves setting the desired state of all of its tasks to remove.
| (* A description of the task model in SwarmKit.
|    Currently, this is mostly based on `design/task_model.md'.
|
|    Note: I am not yet familiar with the SwarmKit code, and this document is likely to
nit: looks like this is no longer true :)
Even more accurately, the task reaper will delete any task which is in a terminal state or in an unassigned state (i.e. < ASSIGNED). Also: it's not every task without a service, but just the tasks for the service that was just deleted.
Thanks for the explanation. I've added service removal. I modelled it as flagging the service and all its tasks for removal and then removing it when it had no tasks left (i.e. the proposed design rather than the current one). I added a comment about it in the model. I also updated the reaper to allow removing tasks that haven't yet been assigned (as explained by @anshulpundir), and added a couple of exceptions to the rule that only one component is in charge of each state to cover that.
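For illustration, a guess at what the relaxed reaper condition could look like, building on the terminal-state sketch above; the state sets and field names are assumptions, not the actual model:

    (* Assumed: states a task can hold before it has ever been assigned to a node. *)
    UnassignedStates == { new, pending }

    (* A task flagged for removal can be reaped once it is terminal, or if it never
       reached a worker at all (so there is nothing to shut down). *)
    CanReap(t) ==
        /\ t.desired_state = remove
        /\ \/ t.state \in TerminalStates
           \/ t.state \in UnassignedStates

    ReapFlaggedTask(t) ==
        /\ t \in tasks
        /\ CanReap(t)
        /\ tasks' = tasks \ {t}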
Please sign your commits following these rules:

$ git clone -b "tla" git@github.com:talex5/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842353920240
(editor opens)
(change each 'pick' to 'edit')
(save the file and quit)
$ git commit --amend -s --no-edit
$ git rebase --continue   # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.
I've now split the spec into multiple files, and created a separate low-level spec for worker nodes. I also added a Makefile so you can build the PDFs and run the model checker without having to install the TLA toolbox (it uses a Docker image). This could also be used in CI.
I think this PR is ready for merging now.
| (* The user removes a service.
|
|    Note: Currently, SwarmKit deletes the service from its records immediately.
|    However, this isn't right because we need to wait for the resources to be freed.
"this isn't right because"
Instead of saying this isn't right, I'd rather say that this is not very elegant, because we still wait for resources to be freed before deleting the tasks, and resources are tied to tasks and not services.
@dperny mentioned that there are also service-level resources that need to be freed. How about:
However, this isn't right because we need to wait for service-level resources such as Virtual IPs to be freed.
| \E id \in UnusedId(sid, slot) :
|     tasks' = tasks \union { NewTask(sid, slot, id, running) }
|
| (* Add a task if a slot exists, contains no runnable tasks, and we weren't trying to remove it.
supernit: remove comma before 'and' ?
I think this is correct. According to https://en.wikipedia.org/wiki/Serial_comma:
A majority of American style guides mandate use of the serial comma
| (* We don't model the updater explicitly, but we allow any task to be restarted (perhaps with
|    a different image) at any time, which should cover the behaviours of the restart supervisor.
|
|    XXX: wait for new task to be ready before shutting down old one?
Is this a TODO? I think some clarification is needed.
I agree. I've updated it to:
TODO: SwarmKit also allows ``start-first'' mode updates where we first get the new task to
`running' and then mark the old task for shutdown. Add this to the model.
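If it helps, here is a rough guess at how a start-first step might be written; the helper and field names are invented for this sketch rather than taken from the model:

    (* Assumed helper: every task recorded for one slot of a service. *)
    TasksInSlot(sid, slot) == { t \in tasks : t.service_id = sid /\ t.slot = slot }

    (* Start-first update: only once the replacement task is actually running
       do we flip the old task's desired state to shutdown. *)
    StartFirstShutdownOld(sid, slot) ==
        \E oldT, newT \in TasksInSlot(sid, slot) :
            /\ oldT # newT
            /\ newT.state = running              \* the replacement has come up
            /\ oldT.desired_state = running      \* old task not yet told to stop
            /\ tasks' = (tasks \ {oldT})
                        \union { [oldT EXCEPT !.desired_state = shutdown] }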
|
| TODO: ``However, there are also cases where a slot may have multiple tasks with the
|    desired state of RUNNING. This can happen during rolling updates when the updates
|    are configured to start the new task before stopping the old one.'' - need to implement this *)
"need to implement this"
By this, do you mean implement the TLA model for this?
|
| (* The allocator rejects the task because there aren't enough resources.
|
|    XXX: Not clear whether SwarmKit actually does this.
Don't think it does. The task just stays in NEW (or READY?) until resources are available. cc @dperny
He said he wasn't sure: #2613 (comment)
Shall I remove this then?
Browsing through the code, I don't see a case where the current allocator fails tasks if there aren't enough resources. @talex5
OK, thanks for checking! I've removed this bit now.
|
| ---------------------------- MODULE Scheduler -------------------------------
|
| \* Actions performed the scheduler
performed the scheduler => performed by the scheduler
| (* Forget about tasks in remove or orphan states.
|
|    Orphaned tasks belong to nodes that we are assuming are lost forever.
|    Or, at the very least, it's likely that a node outage lasting longer than the timeout
has crashed => has happened?
How about:
Orphaned tasks belong to nodes that we are assuming are lost forever (or have crashed and will come up with nothing running, which is an equally fine outcome).
I think this is what @dperny meant in #2613 (comment).
|
| ``Moving a task to the Orphaned state is not desirable,
|    because it's the one case where we break the otherwise invariant
|    that the agent sets all states past ASSIGNED.''
"that the agent sets all states past ASSIGNED."
Clarify that this is the actual state and not the desired state?
These documents use the TLA+ notation to describe the behaviour of SwarmKit more precisely than the prose documentation in the design directory. Run `make pdfs` to render these documents as PDF files. The best one to start with is `SwarmKit.pdf`, which introduces the TLA+ notation and describes the overall components of SwarmKit.

The specifications can also be executed by the TLC model checker to help find mistakes. Use `make check` to run the checks.

Signed-off-by: Thomas Leonard <thomas.leonard@docker.com>
TLA+ model for SwarmKit tasks
- What I did
I want to learn about how SwarmKit works, and extend its documentation. As a first step, I've been reading through the docs in the `design` directory and writing a more formal version of them. If someone familiar with SwarmKit has time to read over and correct my interpretation, that would be great. I marked with `XXX` the things I was particularly unsure about. For example, the docs say very little about how a service is removed. To make it more precise, I've written the document using TLA+ notation, but hopefully it's still readable to people not familiar with that.
So far, this is based only on reading the design docs. I plan to start reading through the Go code soon, but I wanted to get some early feedback first. Once it's correct, it could be added to the existing documentation.
- How to test it
It's easiest to read the PDF rendering of the spec.
You can also load the spec into the TLA toolbox and run the model checker on it to test the claimed properties. The document explains how to do that.
- Description for the changelog
Add a TLA+ spec for SwarmKit's task model