Merged
agent/task.go (14 additions & 1 deletion)
@@ -93,7 +93,7 @@ func (tm *taskManager) run(ctx context.Context) {
 	case <-run:
 		// always check for shutdown before running.
 		select {
-		case <-shutdown:
+		case <-tm.shutdown:
 			continue // ignore run request and handle shutdown
 		case <-tm.closed:
 			continue
@@ -142,6 +142,13 @@ func (tm *taskManager) run(ctx context.Context) {
 		// goal is to decide whether or not we re-dispatch the operation.
 		cancel = nil
 
+		select {
+		case <-tm.shutdown:
+			shutdown = tm.shutdown // re-enable the shutdown branch
+			continue               // no dispatch if we are in shutdown.
+		default:
+		}
+
 		switch err {
 		case exec.ErrTaskNoop:
 			if !updated {
@@ -203,6 +210,12 @@ func (tm *taskManager) run(ctx context.Context) {
 	if cancel != nil {
 		// cancel outstanding operation.
 		cancel()
Contributor:
I tested with this patch applied, and the problem of orphan containers was still there. The race illustrated in #1159 did not get fixed. The point is that calling cancel() is disruptive to the ongoing create operation: it causes the create operation to return without waiting for the response from the docker daemon, because the context was cancelled. The agent task then proceeds to remove the container while the docker daemon is still creating it. When the remove request reaches the daemon before the container has been created, the remove operation fails, and the container becomes an orphan once it is created.

In simple terms: the problem can happen whenever cancel() is called while the agent task is still creating the container.
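
To make the race concrete, here is a minimal, self-contained Go sketch. createContainer and its timings are hypothetical stand-ins, not the agent's or docker's actual API: cancelling the context makes the caller return early while the simulated daemon keeps working, so a follow-up remove would target a container that does not exist yet.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// createContainer is a hypothetical stand-in for the create call. The
// daemon-side goroutine keeps working even after ctx is cancelled.
func createContainer(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		time.Sleep(200 * time.Millisecond) // daemon still creating the container
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		// the caller returns early; the container may still appear later.
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // the disruptive cancel() described above
	}()

	if err := createContainer(ctx); err == context.Canceled {
		// A remove issued here races the in-flight create: it fails,
		// and the container is orphaned once the daemon finishes.
		fmt.Println("create returned early:", err)
	}
	time.Sleep(300 * time.Millisecond) // let the "daemon" finish for the demo
}
```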

Contributor Author:
I see. So it sounds like the bug is in the context handling. #1159 works because it completely disables the cancellation.

This is a race condition in the docker daemon. I'm not sure how to deal with it without a retry loop. Even then, since we don't have a guarantee about the state of the container after the Create method returns, we'd have to keep that container ID around until we could be absolutely sure.

We'll really need to fix this in the docker daemon, so that one can remove a container that is in the process of being created.

I've filed moby/moby#24858.

This PR should ensure that Remove and container.create aren't issued concurrently, so it is still needed.
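
As a rough illustration of that last point (hypothetical names; the PR itself achieves the ordering through the task manager's select loop, not a mutex): serializing the two calls means a remove that arrives mid-create simply blocks until the create call has returned.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// containerOps serializes create and remove with a mutex so the two
// calls are never in flight at the same time. (Hypothetical sketch only.)
type containerOps struct {
	mu sync.Mutex
}

func (c *containerOps) create() {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(100 * time.Millisecond) // pretend daemon round trip
	fmt.Println("create returned")
}

func (c *containerOps) remove() {
	c.mu.Lock()
	defer c.mu.Unlock()
	fmt.Println("remove issued only after create returned")
}

func main() {
	var ops containerOps
	go ops.create()
	time.Sleep(10 * time.Millisecond) // remove request arrives mid-create
	ops.remove()
}
```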

Contributor Author:
@jinuxstyle We are going to merge this PR. It should address this issue as encountered in the docker daemon integration. There is still a race condition that needs to be tackled as part of moby/moby#24858, but that will be a more complex PR.

Contributor:
This PR looks good, and it should make the task operations more robust. +1.

I'm glad to see you also raised an issue with the docker daemon. I agree that it would be better if docker provided a way to cancel ongoing container operations like create.


+
+		// subtle: after a cancellation, we want to avoid busy wait
+		// here. this gets re-enabled in the errs branch and we'll come
+		// back around and try shutdown again.
+		shutdown = nil // turn off this branch until op proceeds
+		continue       // wait until operation actually exits.
 	}

 	// TODO(stevvooe): This should be left for the reaper.
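
For readers unfamiliar with the shutdown = nil trick in the diff above, here is a minimal, self-contained sketch of the pattern (names are illustrative, not the agent's code): receiving from a nil channel blocks forever, so nilling the variable disables that select branch instead of busy-waiting on an already-closed channel, and assigning the real channel back re-enables it.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	shutdownCh := make(chan struct{})
	opDone := make(chan struct{})

	go func() { // shutdown is requested early...
		time.Sleep(50 * time.Millisecond)
		close(shutdownCh)
	}()
	go func() { // ...while an operation is still in flight.
		time.Sleep(200 * time.Millisecond)
		close(opDone)
	}()

	shutdown := shutdownCh // shutdown branch enabled
	inFlight := true
	for {
		select {
		case <-shutdown:
			if inFlight {
				// Don't spin on the closed channel while the operation
				// is still running: disable this branch for now.
				shutdown = nil
				continue
			}
			fmt.Println("clean shutdown")
			return
		case <-opDone:
			inFlight = false
			opDone = nil          // this branch has fired; disable it
			shutdown = shutdownCh // re-enable the shutdown branch
		}
	}
}
```

This is the same shape as the patch: the run loop disables the shutdown branch while an operation is outstanding and re-enables it (in the errs branch) once the operation settles, so shutdown is only acted on between operations.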