Skip to content

address unassigned task leak when service is removed#2706

Merged
anshulpundir merged 1 commit into
moby:masterfrom
dani-docker:task_leak
Jul 16, 2018
Merged

address unassigned task leak when service is removed#2706
anshulpundir merged 1 commit into
moby:masterfrom
dani-docker:task_leak

Conversation

@dani-docker
Copy link
Copy Markdown
Contributor

@dani-docker dani-docker commented Jul 13, 2018

Signed-off-by: Dani Louca dani.louca@docker.com

- What I did

If a task is not yet assigned (for ex: noSuitableNode ) and its service has been removed, the task stays in the unassignedTasks map until a new leader election.

- How I did it

Before re-adding the task to unassignedTasks map, the fix checks if the task has a valid service in the store.

- How to test it

steps to repro:

  • set daemon.log to debug
  • create a service with a node constraint that does not exist:
    ex: docker service create -d --constraint 'node.labels.type == queue' alpine sleep 10000
  • watch the logs, you should see an error
    no suitable node available for task
  • delete the service and invoke the tick() by creating another service; watch the log and notice the same warning taskID in pops up after service is deleted

- Description for the changelog

Avoid a leak when a service with unassigned tasks is deleted

@dani-docker
Copy link
Copy Markdown
Contributor Author

dani-docker commented Jul 13, 2018

ping @anshulpundir @dperny

I am not sure about the cost of querying the store, but I believe it's negligible as it's in memory.
I will check the unit test once I get some feedback

Comment thread manager/scheduler/scheduler.go Outdated
service = store.GetService(tx, t.ServiceID)
})
if service == nil {
log.G(ctx).WithField("task.id", t.ID).Debug("skipping task, service is deleted")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest adding a more direct message here:

skipping task => removing task from the scheduler.

@@ -705,6 +705,15 @@ func (s *Scheduler) scheduleNTasksOnNodes(ctx context.Context, n int, taskGroup
func (s *Scheduler) noSuitableNode(ctx context.Context, taskGroup map[string]*api.Task, schedulingDecisions map[string]schedulingDecision) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I also request you to add a comment for this function and how its intended to be used ? thx!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

roger that!

@anshulpundir
Copy link
Copy Markdown
Contributor

I am not sure about the cost of querying the store, but I believe it's negligible as it's in memory.
I will check the unit test once I get some feedback

Looks good. Correctness first :)
Lets add a unit-test and ship it @dani-docker

PS Please also open an issue on swarmkit for tracking. thx!

@dani-docker
Copy link
Copy Markdown
Contributor Author

@anshulpundir
A new unit test is added and existing ones updated

}

// noSuitableNode checks unassigned tasks and make sure they have an existing service in the store before
// updating the task status and adding it back to: schedulingDecisions, unassignedTasks and allTasks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe also say how tasks ends up in noSuitableNode state ?

assert.Regexp(t, assignment4.NodeID, "(node1|node2)")
}

func TestSchedulerUnassignedMap(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment on what you're testing and how you're doing it.


// delete the service of an unassigned task
err = s.Update(func(tx store.Tx) error {
assert.NoError(t, store.DeleteService(tx, service1.ID))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be much simpler test by directly calling tick()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was testing the full stack call and let the scheduler does its work, but in this case, make sure to limit the change to unassigned map and the tick() call

@dani-docker dani-docker force-pushed the task_leak branch 2 times, most recently from 1025825 to ec1d9b6 Compare July 16, 2018 20:32
Signed-off-by: Dani Louca <dani.louca@docker.com>

Add a comment describing the function and adjust the log message

Signed-off-by: Dani Louca <dani.louca@docker.com>

Fixing existing unit tests

Signed-off-by: Dani Louca <dani.louca@docker.com>

Adding a test case to verify the leak fix

Signed-off-by: Dani Louca <dani.louca@docker.com>

simplifying the test

Signed-off-by: Dani Louca <dani.louca@docker.com>

comment

Signed-off-by: Dani Louca <dani.louca@docker.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Jul 16, 2018

Codecov Report

Merging #2706 into master will increase coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2706      +/-   ##
==========================================
+ Coverage   61.84%   61.89%   +0.04%     
==========================================
  Files         134      134              
  Lines       21764    21771       +7     
==========================================
+ Hits        13461    13476      +15     
+ Misses       6853     6836      -17     
- Partials     1450     1459       +9

@anshulpundir anshulpundir requested a review from cyli July 16, 2018 21:28
service = store.GetService(tx, t.ServiceID)
})
if service == nil {
log.G(ctx).WithField("task.id", t.ID).Debug("removing task from the scheduler")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably remove the task from the taskGroup ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to, this map which is originated from the tick -> tasksByCommonSpec and built out of the unassignedTasks , is reset on every tick, by skipping it here we are not adding it back to unassignedTasks

Unless you mean to delete in the current scope of noSuitableNode ? If this is the case, I am not sure what would be the benefit.

Copy link
Copy Markdown
Contributor

@cyli cyli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: Would it make sense to check for the existence of the service at the top of scheduleTaskGroup, as opposed to at the end of that function where noSuitableNode is called?

If the service no longer exists, then no task in that task group needs to be handled (since it seems like the tasks are grouped by service and spec version), and we can skip all the computational processing attempting to schedule all the tasks there entirely.

LGTM otherwise if we need this in right away, since it seems to do the correct thing, but it may be less efficient since in this implementation we are looping over every task in the same task group to check if the service exists, before deciding whether or not to re-enqueue it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants