address unassigned task leak when service is removed by dani-docker · Pull Request #2706 · moby/swarmkit

dani-docker · 2018-07-13T17:03:46Z

Signed-off-by: Dani Louca <dani.louca@docker.com>

- What I did

If a task is not yet assigned (for example, because it went through noSuitableNode) and its service has been removed, the task stays in the unassignedTasks map until a new leader election.

- How I did it

Before re-adding the task to the unassignedTasks map, the fix checks whether the task's service still exists in the store.
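
To illustrate the check described above, here is a minimal, self-contained sketch. It does not use the real swarmkit types: `Task`, `Service`, `Store`, and `requeueUnassigned` are hypothetical stand-ins invented for this example, and the store is reduced to a plain map. The only point is the shape of the logic: look up the task's service, and skip the re-add when the lookup returns nil.

```go
package main

import "fmt"

// Task and Service are simplified stand-ins for the swarmkit API types;
// only the fields needed for the illustration are kept (assumption).
type Task struct {
	ID        string
	ServiceID string
}

type Service struct{ ID string }

// Store is a hypothetical in-memory store keyed by service ID.
type Store struct{ services map[string]*Service }

// GetService returns nil when the service is not in the store,
// for example because it has been removed.
func (s *Store) GetService(id string) *Service { return s.services[id] }

// requeueUnassigned puts tasks that could not be placed back into the
// unassigned map, but only if their service still exists in the store.
func requeueUnassigned(store *Store, taskGroup, unassigned map[string]*Task) {
	for _, t := range taskGroup {
		if store.GetService(t.ServiceID) == nil {
			// The service was deleted: drop the task instead of re-adding it.
			continue
		}
		unassigned[t.ID] = t
	}
}

func main() {
	store := &Store{services: map[string]*Service{"svc-live": {ID: "svc-live"}}}
	group := map[string]*Task{
		"t1": {ID: "t1", ServiceID: "svc-live"},
		"t2": {ID: "t2", ServiceID: "svc-deleted"},
	}
	unassigned := map[string]*Task{}
	requeueUnassigned(store, group, unassigned)
	fmt.Println("tasks re-queued:", len(unassigned))
}
```

In this sketch only `t1` is re-queued, because `svc-deleted` is absent from the store.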

- How to test it

Steps to reproduce:

1. Set daemon.log to debug.
2. Create a service with a node constraint that no node satisfies, for example:
   docker service create -d --constraint 'node.labels.type == queue' alpine sleep 10000
3. Watch the logs; you should see the error:
   no suitable node available for task
4. Delete the service and trigger tick() by creating another service. Watch the log and notice that the same warning, with the same task ID, still shows up after the service is deleted.

- Description for the changelog

Avoid a leak when a service with unassigned tasks is deleted

dani-docker · 2018-07-13T17:04:32Z

ping @anshulpundir @dperny

I am not sure about the cost of querying the store, but I believe it's negligible as it's in memory.
I will check the unit test once I get some feedback

anshulpundir · 2018-07-13T17:48:49Z

+			service = store.GetService(tx, t.ServiceID)
+		})
+		if service == nil {
+			log.G(ctx).WithField("task.id", t.ID).Debug("skipping task, service is deleted")


I would suggest adding a more direct message here:

skipping task => removing task from the scheduler.

anshulpundir · 2018-07-13T17:49:19Z

@@ -705,6 +705,15 @@ func (s *Scheduler) scheduleNTasksOnNodes(ctx context.Context, n int, taskGroup
 func (s *Scheduler) noSuitableNode(ctx context.Context, taskGroup map[string]*api.Task, schedulingDecisions map[string]schedulingDecision) {


Can I also request you to add a comment for this function and how it's intended to be used? thx!

roger that!

anshulpundir · 2018-07-13T17:50:48Z

I am not sure about the cost of querying the store, but I believe it's negligible as it's in memory.
I will check the unit test once I get some feedback

Looks good. Correctness first :)
Let's add a unit test and ship it @dani-docker

PS Please also open an issue on swarmkit for tracking. thx!

dani-docker · 2018-07-14T03:27:50Z

@anshulpundir
A new unit test is added and existing ones updated

anshulpundir · 2018-07-16T18:50:25Z

 }

+// noSuitableNode checks unassigned tasks and make sure they have an existing service in the store before
+// updating the task status and adding it back to: schedulingDecisions, unassignedTasks and allTasks


nit: maybe also say how tasks end up in the noSuitableNode state?

anshulpundir · 2018-07-16T18:51:45Z

 	assert.Regexp(t, assignment4.NodeID, "(node1|node2)")
 }

+func TestSchedulerUnassignedMap(t *testing.T) {


please add a comment on what you're testing and how you're doing it.

anshulpundir · 2018-07-16T18:54:06Z

+
+	// delete the service of an unassigned task
+	err = s.Update(func(tx store.Tx) error {
+		assert.NoError(t, store.DeleteService(tx, service1.ID))


It might be a much simpler test if you call tick() directly.

I was testing the full-stack call and letting the scheduler do its work, but in this case I will make sure to limit the change to the unassigned map and the tick() call.

Signed-off-by: Dani Louca <dani.louca@docker.com>
Add a comment describing the function and adjust the log message
Signed-off-by: Dani Louca <dani.louca@docker.com>
Fixing existing unit tests
Signed-off-by: Dani Louca <dani.louca@docker.com>
Adding a test case to verify the leak fix
Signed-off-by: Dani Louca <dani.louca@docker.com>
simplifying the test
Signed-off-by: Dani Louca <dani.louca@docker.com>
comment
Signed-off-by: Dani Louca <dani.louca@docker.com>

codecov · 2018-07-16T20:40:22Z

Codecov Report

Merging #2706 into master will increase coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2706      +/-   ##
==========================================
+ Coverage   61.84%   61.89%   +0.04%     
==========================================
  Files         134      134              
  Lines       21764    21771       +7     
==========================================
+ Hits        13461    13476      +15     
+ Misses       6853     6836      -17     
- Partials     1450     1459       +9

anshulpundir · 2018-07-16T21:30:33Z

+			service = store.GetService(tx, t.ServiceID)
+		})
+		if service == nil {
+			log.G(ctx).WithField("task.id", t.ID).Debug("removing task from the scheduler")


we should probably remove the task from the taskGroup ?

No need to. This map originates in tick() -> tasksByCommonSpec, is built from unassignedTasks, and is reset on every tick. By skipping the task here, we simply do not add it back to unassignedTasks.

Unless you mean deleting it within the current scope of noSuitableNode? If that is the case, I am not sure what the benefit would be.
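
The argument in this reply can be sketched as follows. This is a simplified model, not the swarmkit scheduler: the `Scheduler` type here, its `services` map (standing in for the store), and the grouping by service ID alone are all assumptions made for the example, and the sketch pretends that no node ever fits, so every group ends up in `noSuitableNode`. It shows why skipping a task is enough: the per-tick groups are rebuilt from the unassigned map each time, so a task that is never re-added cannot reappear.

```go
package main

import "fmt"

// Task is a simplified stand-in for the real task type (assumption).
type Task struct {
	ID        string
	ServiceID string
}

// Scheduler is a hypothetical, heavily reduced scheduler.
type Scheduler struct {
	unassigned map[string]*Task
	services   map[string]bool // stand-in for the store: present means the service exists
}

// tick drains the unassigned map into fresh per-service groups.
// The groups are local to this call, so nothing survives to the next tick
// unless noSuitableNode puts it back into s.unassigned.
func (s *Scheduler) tick() {
	groups := make(map[string]map[string]*Task)
	for id, t := range s.unassigned {
		if groups[t.ServiceID] == nil {
			groups[t.ServiceID] = make(map[string]*Task)
		}
		groups[t.ServiceID][id] = t
		delete(s.unassigned, id) // deleting during range is safe in Go
	}
	// Assumption for the sketch: no node is suitable for any group.
	for _, group := range groups {
		s.noSuitableNode(group)
	}
}

// noSuitableNode re-adds only tasks whose service still exists.
func (s *Scheduler) noSuitableNode(group map[string]*Task) {
	for _, t := range group {
		if !s.services[t.ServiceID] {
			continue // skipped tasks are simply never re-added
		}
		s.unassigned[t.ID] = t
	}
}

func main() {
	s := &Scheduler{
		unassigned: map[string]*Task{
			"t1": {ID: "t1", ServiceID: "svc1"},
			"t2": {ID: "t2", ServiceID: "svc2"},
		},
		services: map[string]bool{"svc1": true, "svc2": true},
	}
	s.tick()
	fmt.Println("pending after first tick:", len(s.unassigned))
	delete(s.services, "svc2") // simulate removing the service
	s.tick()
	fmt.Println("pending after second tick:", len(s.unassigned))
}
```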

cyli

Non-blocking: Would it make sense to check for the existence of the service at the top of scheduleTaskGroup, as opposed to at the end of that function where noSuitableNode is called?

If the service no longer exists, then no task in that task group needs to be handled (since it seems like the tasks are grouped by service and spec version), and we can skip all the computational processing attempting to schedule all the tasks there entirely.

LGTM otherwise, if we need this in right away, since it seems to do the correct thing. It may be less efficient, though, because this implementation loops over every task in the task group to check whether the service exists before deciding whether or not to re-enqueue it.
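
A rough sketch of the alternative suggested here, under the assumption stated in the comment that every task in a group shares one service. The names `serviceExists`, `anyTask`, and this version of `scheduleTaskGroup` are hypothetical and are not the merged implementation; placement is omitted and the sketch assumes no task can be placed. The service is looked up once per group, and the whole group is skipped when it is gone.

```go
package main

import "fmt"

// Task is a simplified stand-in for the real task type (assumption).
type Task struct {
	ID        string
	ServiceID string
}

// serviceExists is a hypothetical stand-in for a store lookup.
type serviceExists func(serviceID string) bool

// anyTask returns an arbitrary task from the group, or nil if it is empty.
func anyTask(group map[string]*Task) *Task {
	for _, t := range group {
		return t
	}
	return nil
}

// scheduleTaskGroup checks the service once, at the top, and skips all
// work for the group when the service is gone. It returns the number of
// tasks put back into the unassigned map.
func scheduleTaskGroup(exists serviceExists, group, unassigned map[string]*Task) int {
	first := anyTask(group)
	if first == nil || !exists(first.ServiceID) {
		return 0 // empty group or deleted service: nothing to schedule
	}
	// Placement attempts would go here. For the sketch, assume none
	// succeeds, so every task is re-enqueued.
	requeued := 0
	for _, t := range group {
		unassigned[t.ID] = t
		requeued++
	}
	return requeued
}

func main() {
	lookups := 0
	exists := func(id string) bool {
		lookups++
		return id == "svc-live"
	}
	unassigned := map[string]*Task{}
	dead := map[string]*Task{
		"t1": {ID: "t1", ServiceID: "svc-deleted"},
		"t2": {ID: "t2", ServiceID: "svc-deleted"},
	}
	fmt.Println("re-enqueued:", scheduleTaskGroup(exists, dead, unassigned))
	fmt.Println("store lookups:", lookups)
}
```

The trade-off is the one raised in the comment: a single lookup per group instead of one per task, at the cost of relying on the grouping assumption holding for every group.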

anshulpundir reviewed Jul 13, 2018

dani-docker force-pushed the task_leak branch from b5249e5 to 88807f8 Compare July 13, 2018 19:34

dani-docker mentioned this pull request Jul 13, 2018

no suitable node available for task #2707

Closed

dani-docker force-pushed the task_leak branch from 88807f8 to 5e8fc3e Compare July 14, 2018 03:26

anshulpundir reviewed Jul 16, 2018

dani-docker force-pushed the task_leak branch 2 times, most recently from 1025825 to ec1d9b6 Compare July 16, 2018 20:32

dani-docker force-pushed the task_leak branch from ec1d9b6 to 9d977ce Compare July 16, 2018 20:35

anshulpundir requested a review from cyli July 16, 2018 21:28

anshulpundir reviewed Jul 16, 2018

cyli approved these changes Jul 16, 2018

anshulpundir approved these changes Jul 16, 2018

anshulpundir merged commit 6826639 into moby:master Jul 16, 2018
