address unassigned task leak when service is removed #2706
Conversation
ping @anshulpundir @dperny I am not sure about the cost of querying the store, but I believe it's negligible as it's in memory.
| service = store.GetService(tx, t.ServiceID)
| })
| if service == nil {
| log.G(ctx).WithField("task.id", t.ID).Debug("skipping task, service is deleted")
I would suggest adding a more direct message here:
skipping task => removing task from the scheduler.
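A minimal sketch of the guard being discussed, with the reviewer's suggested wording for the log message. The types and the store here are simplified stand-ins, not swarmkit's actual `api.Task`/`store.View` API; since the real store is in memory, the per-task lookup is essentially a map read.

```go
package main

import "fmt"

// Task and Service are simplified stand-ins for swarmkit's api.Task and
// api.Service; the real scheduler reads these from an in-memory store.
type Task struct{ ID, ServiceID string }
type Service struct{ ID string }

// Store approximates the in-memory store: a lookup is a cheap map read.
type Store map[string]*Service

// keepTask reports whether an unassigned task should be re-enqueued:
// a task whose service was deleted is dropped from the scheduler.
func keepTask(s Store, t *Task) bool {
	if s[t.ServiceID] == nil {
		fmt.Printf("task.id=%s: removing task from the scheduler\n", t.ID)
		return false
	}
	return true
}

func main() {
	s := Store{"svc1": {ID: "svc1"}}
	fmt.Println(keepTask(s, &Task{ID: "t1", ServiceID: "svc1"})) // service exists: keep
	fmt.Println(keepTask(s, &Task{ID: "t2", ServiceID: "gone"})) // service deleted: drop
}
```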
| @@ -705,6 +705,15 @@ func (s *Scheduler) scheduleNTasksOnNodes(ctx context.Context, n int, taskGroup
| func (s *Scheduler) noSuitableNode(ctx context.Context, taskGroup map[string]*api.Task, schedulingDecisions map[string]schedulingDecision) {
Can I also request you to add a comment for this function and how it's intended to be used? thx!
Looks good. Correctness first :) PS: Please also open an issue on swarmkit for tracking. thx!
@anshulpundir
| }
|
| // noSuitableNode checks unassigned tasks and makes sure they have an existing service in the store before
| // updating the task status and adding it back to: schedulingDecisions, unassignedTasks and allTasks
nit: maybe also say how tasks end up in the noSuitableNode state?
| assert.Regexp(t, assignment4.NodeID, "(node1|node2)")
| }
|
| func TestSchedulerUnassignedMap(t *testing.T) {
please add a comment on what you're testing and how you're doing it.
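To illustrate what a test of this leak has to verify, here is a toy model of the scheduler's unassigned-task map. The names (`Scheduler`, `tick`, `unassigned`) are illustrative only and not swarmkit's actual types; the point is that, with the fix, a deleted service's tasks drop out of the map on the next tick rather than lingering until a leader election.

```go
package main

import "fmt"

// Scheduler is a toy model: services that exist, and unassigned tasks
// mapped to their service ID. Not swarmkit's real scheduler.
type Scheduler struct {
	services   map[string]bool
	unassigned map[string]string // task ID -> service ID
}

// tick rebuilds the unassigned set, dropping tasks whose service no
// longer exists (the fix in this PR). Without the check, tasks of a
// deleted service would stay in the map until a new leader election.
func (s *Scheduler) tick() {
	next := map[string]string{}
	for taskID, svcID := range s.unassigned {
		if !s.services[svcID] {
			continue // service deleted: do not re-enqueue
		}
		next[taskID] = svcID
	}
	s.unassigned = next
}

func main() {
	s := &Scheduler{
		services:   map[string]bool{"svc1": true},
		unassigned: map[string]string{"t1": "svc1"},
	}
	s.tick()
	fmt.Println(len(s.unassigned)) // 1: service still exists, task kept

	delete(s.services, "svc1") // delete the service of an unassigned task
	s.tick()
	fmt.Println(len(s.unassigned)) // 0: task removed, leak avoided
}
```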
| // delete the service of an unassigned task
| err = s.Update(func(tx store.Tx) error {
| assert.NoError(t, store.DeleteService(tx, service1.ID))
It might be a much simpler test to call tick() directly.
I was testing the full stack call and letting the scheduler do its work, but in this case I'll make sure to limit the change to the unassigned map and the tick() call.
Force-pushed from 1025825 to ec1d9b6
Signed-off-by: Dani Louca <dani.louca@docker.com>

Add a comment describing the function and adjust the log message
Signed-off-by: Dani Louca <dani.louca@docker.com>

Fixing existing unit tests
Signed-off-by: Dani Louca <dani.louca@docker.com>

Adding a test case to verify the leak fix
Signed-off-by: Dani Louca <dani.louca@docker.com>

simplifying the test
Signed-off-by: Dani Louca <dani.louca@docker.com>

comment
Signed-off-by: Dani Louca <dani.louca@docker.com>
Codecov Report
@@ Coverage Diff @@
## master #2706 +/- ##
==========================================
+ Coverage 61.84% 61.89% +0.04%
==========================================
Files 134 134
Lines 21764 21771 +7
==========================================
+ Hits 13461 13476 +15
+ Misses 6853 6836 -17
- Partials 1450 1459 +9
| service = store.GetService(tx, t.ServiceID)
| })
| if service == nil {
| log.G(ctx).WithField("task.id", t.ID).Debug("removing task from the scheduler")
we should probably remove the task from the taskGroup?
No need to: this map (tasksByCommonSpec, built out of unassignedTasks during tick) is reset on every tick. By skipping the task here we are simply not adding it back to unassignedTasks.
Unless you mean to delete it in the current scope of noSuitableNode? If that's the case, I am not sure what the benefit would be.
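The lifecycle the author describes can be sketched as follows. This is a loose model with illustrative names, not swarmkit's code: tasksByCommonSpec is rebuilt from unassignedTasks on every tick, so a task that is skipped (never re-enqueued) disappears from both on the next tick, and no explicit delete from the taskGroup is needed.

```go
package main

import "fmt"

// sched models only the pieces discussed in this thread: the
// unassignedTasks map, and the per-tick grouping by common spec.
type sched struct {
	unassignedTasks map[string]string // task ID -> spec key
}

// tick rebuilds tasksByCommonSpec from unassignedTasks. A task that the
// keep predicate rejects is neither grouped nor re-enqueued, so it is
// gone by the next tick without an explicit delete.
func (s *sched) tick(keep func(taskID string) bool) map[string][]string {
	tasksByCommonSpec := map[string][]string{}
	next := map[string]string{}
	for id, spec := range s.unassignedTasks {
		if !keep(id) {
			continue // skipped: never added back to unassignedTasks
		}
		tasksByCommonSpec[spec] = append(tasksByCommonSpec[spec], id)
		next[id] = spec
	}
	s.unassignedTasks = next
	return tasksByCommonSpec
}

func main() {
	s := &sched{unassignedTasks: map[string]string{"t1": "specA", "t2": "specA"}}
	groups := s.tick(func(id string) bool { return id != "t2" }) // skip t2
	fmt.Println(len(groups["specA"]), len(s.unassignedTasks))    // 1 1: t2 is gone from both
}
```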
Non-blocking: Would it make sense to check for the existence of the service at the top of scheduleTaskGroup, as opposed to at the end of that function where noSuitableNode is called?
If the service no longer exists, then no task in that task group needs to be handled (since it seems like the tasks are grouped by service and spec version), and we can skip all the computational processing attempting to schedule all the tasks there entirely.
Otherwise LGTM if we need this in right away, since it seems to do the correct thing; but it may be less efficient, because this implementation loops over every task in the same task group to check whether the service exists before deciding whether or not to re-enqueue it.
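The reviewer's suggestion can be sketched as follows. Names and signatures here are illustrative, not swarmkit's actual scheduleTaskGroup: since a task group shares one service, checking the service once at the top skips the whole group's scheduling work, instead of checking per task at the point of re-enqueueing.

```go
package main

import "fmt"

// scheduleTaskGroup models the suggested early exit: one service lookup
// per group, before any per-task scheduling work. Returns the number of
// tasks processed (a stand-in for the real scheduling work).
func scheduleTaskGroup(services map[string]bool, serviceID string, taskGroup map[string]bool) int {
	if !services[serviceID] {
		// Service deleted: skip all computation for the whole group.
		return 0
	}
	scheduled := 0
	for range taskGroup {
		scheduled++ // placeholder for attempting to schedule each task
	}
	return scheduled
}

func main() {
	services := map[string]bool{"svc1": true}
	group := map[string]bool{"t1": true, "t2": true}
	fmt.Println(scheduleTaskGroup(services, "svc1", group)) // 2: service exists, group processed
	fmt.Println(scheduleTaskGroup(services, "gone", group)) // 0: service deleted, group skipped
}
```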
Signed-off-by: Dani Louca <dani.louca@docker.com>
- What I did

If a task is not yet assigned (for ex: noSuitableNode) and its service has been removed, the task stays in the unassignedTasks map until a new leader election.

- How I did it

Before re-adding the task to the unassignedTasks map, the fix checks if the task has a valid service in the store.

- How to test it

Steps to repro, ex:
docker service create -d --constraint 'node.labels.type == queue' alpine sleep 10000
=> no suitable node available for task

- Description for the changelog

Avoid a leak when a service with unassigned tasks is deleted