[WIP][17.06 backport] Fix leaking task resources when nodes are deleted #2840

Closed
thaJeztah wants to merge 2 commits into moby:bump_v17.06 from thaJeztah:17.06_backport_fix_leaking_task_resources

@thaJeztah
Member

backport of #2806 for the bump_v17.06 branch

When a node is deleted, its tasks are asked to restart, which involves
putting them into a desired state of Shutdown. However, the Allocator
will not deallocate a task whose actual state is not a terminal
state. Once a node is deleted, the only opportunity for its tasks to
receive updates and be moved to a terminal state is when the function
moving those tasks to TaskStateOrphaned is called, 24 hours after the
node enters the Down state. However, if a leadership change occurs, then
that function will never be called, and the tasks will never be moved to
a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned, allowing those tasks' resources to be cleaned up.
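
The leak described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `TaskState`, `Task`, `canDeallocate` and `orphanTasksOfNode` are simplified, hypothetical stand-ins, not swarmkit's real types or functions, and the only property relied on is that terminal states sort after Running.

```go
package main

// TaskState is a simplified stand-in for swarmkit's task state enum.
// Only the relative ordering matters for this sketch.
type TaskState int

const (
	TaskStateNew TaskState = iota
	TaskStateRunning
	TaskStateShutdown
	TaskStateOrphaned
)

// Task is a hypothetical, stripped-down task record.
type Task struct {
	ID           string
	NodeID       string
	DesiredState TaskState // what the orchestrator asked for
	State        TaskState // actual (observed) state
}

// isTerminal reports whether a state is past Running.
func isTerminal(s TaskState) bool { return s > TaskStateRunning }

// canDeallocate mirrors the condition from the description: the allocator
// looks at the actual state, so a desired state of Shutdown is not enough.
func canDeallocate(t *Task) bool { return isTerminal(t.State) }

// orphanTasksOfNode moves every task assigned to nodeID to Orphaned and
// returns how many tasks were updated.
func orphanTasksOfNode(tasks []*Task, nodeID string) int {
	updated := 0
	for _, t := range tasks {
		if t.NodeID == nodeID {
			t.State = TaskStateOrphaned
			updated++
		}
	}
	return updated
}
```

In this sketch, a task with desired state Shutdown but actual state Running is not eligible for deallocation until `orphanTasksOfNode` is called for its node, which is the step this change performs at node deletion instead of waiting for the 24-hour timer.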

@thaJeztah
Member Author

ping @dperny PTAL

@thaJeztah
Member Author

Why is there no CI running on the 17.06 branch?

@thaJeztah changed the title from "[17.06 backport] Fix leaking task resources when nodes are deleted" to "[WIP][17.06 backport] Fix leaking task resources when nodes are deleted" on Mar 29, 2019
@thaJeztah
Member Author

Making it WIP, because I suspect this would have the same problem as #2841 (comment)

@dperny
Collaborator

dperny commented Mar 29, 2019

17.06 probably never got updated to the newest CircleCI version.

@thaJeztah force-pushed the 17.06_backport_fix_leaking_task_resources branch from 61e2c6e to 2f198de on April 12, 2019 17:52
@thaJeztah
Member Author

Rebased; this will likely fail now (per my above comment), but at least will show that CI is doing its job! 😂

When a node is deleted, its tasks are asked to restart, which involves
putting them into a desired state of Shutdown. However, the Allocator
will not deallocate a task whose actual state is not a terminal
state. Once a node is deleted, the only opportunity for its tasks to
receive updates and be moved to a terminal state is when the function
moving those tasks to TaskStateOrphaned is called, 24 hours after the
node enters the Down state. However, if a leadership change occurs, then
that function will never be called, and the tasks will never be moved to
a terminal state, leaking resources.

With this change, upon node deletion, all of its tasks will be moved to
TaskStateOrphaned, allowing those tasks' resources to be cleaned up.

Additionally, as part of this backport, avoid using the gogo
types.TimestampNow function, which does not exist in the vendored
version.

Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit 8467e6a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
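
The commit message does not say what the backport uses in place of `types.TimestampNow`. As a rough sketch of the general idea, a protobuf-style timestamp (seconds plus nanoseconds) can be derived from the standard library's `time.Now()`. The `Timestamp` struct and both helper names below are local, hypothetical stand-ins and not the gogo API.

```go
package main

import "time"

// Timestamp is a local stand-in with the same two fields as the protobuf
// Timestamp message. It is NOT the gogo types.Timestamp type.
type Timestamp struct {
	Seconds int64 // seconds since the Unix epoch
	Nanos   int32 // nanosecond offset within the second, 0..999999999
}

// timestampFromTime converts a time.Time into the stand-in Timestamp.
func timestampFromTime(t time.Time) *Timestamp {
	return &Timestamp{
		Seconds: t.Unix(),
		Nanos:   int32(t.Nanosecond()),
	}
}

// timestampNow is a hypothetical replacement for a "now" helper that is
// missing from an older vendored dependency.
func timestampNow() *Timestamp {
	return timestampFromTime(time.Now())
}
```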
@dperny force-pushed the 17.06_backport_fix_leaking_task_resources branch from 2f198de to c5e7960 on August 7, 2019 15:33
When a node is removed, its tasks are set in state ORPHANED. This does
not need to be done for tasks that are already in a terminal state, and
if all tasks in all states are updated, the size of the transaction may
grow too large to process, and node removal becomes impossible.

This changes to only set non-terminal tasks to state ORPHANED, and
terminal tasks are left alone.

Cherry pick does not apply cleanly

Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit d5df265)
Signed-off-by: Drew Erny <drew.erny@docker.com>
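
The filtering described in this commit message can be sketched as follows. This is a minimal illustration, not the actual swarmkit change: the state constants, the `Task` struct and `orphanNonTerminal` are hypothetical, and the returned count stands in for the number of writes a store transaction would carry.

```go
package main

// TaskState is a simplified stand-in for swarmkit's task state enum.
// States after StateRunning are treated as terminal in this sketch.
type TaskState int

const (
	StateRunning TaskState = iota
	StateComplete
	StateFailed
	StateOrphaned
)

// Task is a hypothetical, stripped-down task record.
type Task struct {
	ID    string
	State TaskState
}

// orphanNonTerminal sets only non-terminal tasks to StateOrphaned and
// returns the number of tasks it changed. Terminal tasks are skipped, so
// they add nothing to the update.
func orphanNonTerminal(tasks []*Task) int {
	writes := 0
	for _, t := range tasks {
		if t.State > StateRunning {
			continue // already terminal: leave it alone
		}
		t.State = StateOrphaned
		writes++
	}
	return writes
}
```

Skipping terminal tasks keeps the number of writes proportional to the tasks that still hold resources, rather than to the node's full task history.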
@thaJeztah
Member Author

Two linting failures remaining:

/home/circleci/.go_workspace/src/github.com/docker/swarmkit/agent/exec/dockerapi/controller.go:657:3: ineffectual assignment to protocol
/home/circleci/.go_workspace/src/github.com/docker/swarmkit/cmd/swarmctl/service/flagparser/tmpfs.go:67:12: ineffectual assignment to multiplier
make: *** [ineffassign] Error 1
Exited with code 2
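
For readers unfamiliar with the check: ineffassign reports an assignment whose value is never read before it is overwritten or goes out of scope. The sketch below shows one common way to clear such a warning. `parseSize` is a hypothetical helper written for illustration and is not the code in controller.go or tmpfs.go.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize parses strings such as "512", "10k" or "2m" into a byte count.
//
// A version that began with `multiplier := int64(1)` and then reassigned
// multiplier in every switch branch would leave that first value unread,
// which is the kind of assignment ineffassign flags. Declaring the variable
// without an initial value avoids the dead store.
func parseSize(s string) (int64, error) {
	if s == "" {
		return 0, fmt.Errorf("empty size")
	}
	var multiplier int64
	switch suffix := s[len(s)-1]; suffix {
	case 'k':
		multiplier = 1 << 10
	case 'm':
		multiplier = 1 << 20
	default:
		multiplier = 1
	}
	digits := strings.TrimRight(s, "km")
	n, err := strconv.ParseInt(digits, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * multiplier, nil
}
```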

@thaJeztah
Member Author

closing, as 17.06 is EOL

@thaJeztah closed this on Jun 5, 2021
@thaJeztah deleted the 17.06_backport_fix_leaking_task_resources branch on June 5, 2021 21:09