
return exit status from executor Run/Exec #1619

Merged
tonistiigi merged 5 commits into moby:master from coryb:execute-status on Aug 3, 2020

Conversation

Collaborator

@coryb coryb commented Jul 31, 2020

More incremental work for #749

This returns the process exit status from Run/Exec so we can send it in the ExitMessage. This also plumbs in a Resize channel to handle resize events.

I will have another PR soon with the proto changes and server-side changes for that; I just wanted to get these executor changes in to keep the PRs manageable. I have some questions about the client side, so I will add those questions/proposals to #749.

Through testing I found runc to be challenging since it will not create a pty like containerd does. So for now I modified runc to just return a "not implemented" error when a tty is requested. I am trying to understand how containerd manages the pty with the console-socket option on runc, so I hope to find a solution soon, but it will likely be another follow-up PR.

return err
}

var cancel func()
Collaborator Author


All this wait and error-handling code I moved into a common runProcess function shared between Run and Exec. (I failed to realize the previous Exec usage was not synchronous and just returned immediately after starting, whoops.)

Collaborator Author

coryb commented Jul 31, 2020

Test failures seem unrelated to my changes; I retried them a number of times and most are similar to:

        dockerfile_test.go:1618: 
            	Error Trace:	dockerfile_test.go:1618
            	            				run.go:169
            	Error:      	Received unexpected error:
            	            	rpc error: code = Unavailable desc = transport is closing
            	            	failed to receive status
            	            	github.com/moby/buildkit/client.(*Client).solve.func4
            	            		/src/client/solve.go:257
            	            	golang.org/x/sync/errgroup.(*Group).Go.func1
            	            		/src/vendor/golang.org/x/sync/errgroup/errgroup.go:57
            	            	runtime.goexit
            	            		/usr/local/go/src/runtime/asm_amd64.s:1357
            	Test:       	TestIntegration/TestCmdShell/worker=containerd/frontend=gateway
        run.go:164: 
            	Error Trace:	run.go:164
            	            				panic.go:563
            	            				testing.go:653
            	            				dockerfile_test.go:1618
            	            				run.go:169
            	Error:      	Received unexpected error:
            	            	unlinkat /tmp/bktest_containerd331672296/state/io.containerd.runtime.v2.task/buildkit/hh3yur6h22jnueql623fheuhl/rootfs: device or resource busy
            	Test:       	TestIntegration/TestCmdShell/worker=containerd/frontend=gateway

@tonistiigi
Member

Test failures seem unrelated to my changes,

I restarted and it failed the same way. I've never seen this specific error, and master is green, so it looks like it is related.

Collaborator Author

coryb commented Jul 31, 2020

Interesting, okay, I will dig into the tests.

Collaborator Author

coryb commented Jul 31, 2020

Yup, sorry for the noise; I failed to check whether the started chan was nil before closing it in the common process handling for Exec/Run. The tests all pass now.

Member

@tonistiigi tonistiigi left a comment


For the status, perhaps easier would have been to return a (wrapped) typed error defined in the executor package (or in the more general errdefs if it needs to be more reusable). Currently the status seems to be 0 for non-exec errors. Also, I'm not a fan of returning values together with an error return in general, although sometimes it is needed. wdyt?

Collaborator Author

coryb commented Jul 31, 2020

Yeah, makes sense, I will make the change in the morning. I think something like executor.ExitError makes the most sense.

Signed-off-by: Cory Bennett <cbennett@netflix.com>
Comment thread executor/containerdexecutor/executor.go Outdated
}
select {
case <-ctx.Done():
err = errors.Wrap(ctx.Err(), err.Error())
Member

@tonistiigi tonistiigi Jul 31, 2020


This wrap is the wrong way around if we want to keep the exit code typed. Maybe we can have a custom Unwrap so that both remain typed. Or just set err.Err to ctx.Err()?

Collaborator Author


Yeah, I debated whether the ctx.Err was more or less relevant than the Run error. I think setting err.Err to ctx.Err() likely makes the most sense here; I will update.

Comment thread executor/executor.go Outdated
type WinSize struct {
Rows uint32
Cols uint32
Xpixel uint32
Member


how are *pixel used?

Collaborator Author


Honestly not sure; they are defined in the kernel winsize struct, but afaict they are unused on Linux. They seem intended for terminal width/height in pixels instead of characters, though I'm not sure any platforms actually use them. I didn't find much useful information googling, so it seems safe to remove them. They are unused for the containerd pty.Resize call anyway.

Member


Maybe this is the cursor position. But not sure if we should allow it or if it is even possible in containerd.

Signed-off-by: Cory Bennett <cbennett@netflix.com>
Collaborator Author

coryb commented Jul 31, 2020

While working through the runc test failure (err test needed Unwrap) I found that the containerd tests are hanging with:

	// simulate: echo -n "hello" | cat > /tmp/msg
	stdin := bytes.NewReader([]byte("hello"))
	stdout.Reset()
	stderr.Reset()
	err = w.WorkerOpt.Executor.Exec(ctx, id, executor.ProcessInfo{
		Meta: executor.Meta{
			Args: []string{"sh", "-c", "cat > /tmp/msg"},
		},
		Stdin:  ioutil.NopCloser(stdin),
		Stdout: &nopCloser{stdout},
		Stderr: &nopCloser{stderr},
	})

Now that the Exec handler correctly waits for the process to finish, I can see a problem: the container is not exiting after receiving the EOF from stdin; instead it hangs until the pid1 process exits, killing the exec. I am trying to work through this problem now, but it might take a bit since I don't really have any leads at the moment.

Collaborator Author

coryb commented Aug 1, 2020

Okay, I think this is good now: I removed the *pixel stuff, fixed the error wrapping, and fixed the containerd tests hanging. That last one was a pain to debug, but after combing through the ctr task exec code I found the problem: we need to explicitly close the container IO when the stdin reader gets an EOF.

Comment thread executor/containerdexecutor/executor.go Outdated
func (s *stdinCloser) Read(p []byte) (int, error) {
n, err := s.stdin.Read(p)
if err == io.EOF {
if s.closer != nil {
Member


I'm not sure this is safe. Bytes are read and then written to the fifo; only after the write completes can we close the fifo. If we close on read, there could be a race between writing the last buffer and closing.

Also, not sure about that CloseIO call. Afaik it signals the shim and closes stdin there. I'm not even sure what stdin is in the shim and how it is safe to synchronize it (e.g. if it is the other side of the fifo). I think it would be much safer to just close the fifo on the client side after writing the last message.

Member


Actually, there is a close in https://github.com/moby/buildkit/blob/master/vendor/github.com/containerd/containerd/cio/io_unix.go#L70 so I'm not sure. So I think CloseIO just needs to be called after the fifo has been opened and the task created. This is based on looking at the Docker code, but it is quite confusing. If this is true then we shouldn't wait for EOF: just close early, and we can test that calling CloseIO doesn't drop stdin before the fifo gets closed as well.

Collaborator Author

@coryb coryb Aug 1, 2020


btw, I did test closing the fifo on client side, it had no impact. The signal to the server side via CloseIO was the only thing that allowed the container to exit on EOF on Stdin. I did not test the ordering of CloseIO though.

Member


I see. But I can't think of any case in which it would be correct for this call to happen between reading EOF from a Go reader and writing the last bytes to a fifo. That timing is only relevant on the client side and not synchronized with the server side at all.

Collaborator Author


I think it is working without the stdin wrapper now. I also found there was an issue on pid1 when stdin closed; I have added another test to verify that closing stdin on pid1 will cause the container to exit normally.

First I tried calling CloseIO after container.NewTask and task.Exec (after the fifo was created); this worked once, then retries failed, so it was racy.
Next I tried calling CloseIO after task.Wait, which worked more frequently but also failed repeatedly, so still racy.
Finally I tried calling CloseIO after task.Start, which seems to work consistently.

Anyway, it appears you are right: we can call CloseIO any time after the container is running. Not sure why the ctr code is written that way.

@coryb coryb force-pushed the execute-status branch 2 times, most recently from 6d02561 to ce70e1f Compare August 1, 2020 16:39
Comment thread worker/tests/common.go Outdated
id := identity.NewID()

// verify pid1 exits when stdin sees EOF
stdin := bytes.NewReader([]byte("hello"))
Member


could you add a case where, instead of using bytes.NewReader, this is a pipe, and only after the started channel has returned do you write into stdin and then close it?

Collaborator Author


Updated to use io.Pipe

… stdin

Signed-off-by: Cory Bennett <cbennett@netflix.com>
Comment thread executor/containerdexecutor/executor.go Outdated
Comment on lines +330 to +334
err := p.Resize(ctx, size.Cols, size.Rows)
if err != nil {
cancel()
return err
}
Collaborator


I think there might be a race where a client-side window resize happens right around the same time the task exits, in which case the resize chan could get triggered first and then p.Resize returns an expected error because the task process has exited. That would result in the resize error getting returned instead of the actual task status (which could have been a normal exit with 0 status).

I think it also wouldn't hurt to have a timeout on the Resize call just in case it hangs for unexpected reasons.

Collaborator Author


Yeah, good catch. I wonder if we should just ignore resize errors in general and maybe just log them? If Resize is not working the user will likely be having other issues anyway. I will add a context.WithTimeout, and should probably also run it in a goroutine to prevent it from blocking the cancel/exit handling.

Collaborator Author


I updated the resize to run later in the select loop and in a short-lived goroutine with a 1s timeout.

Comment on lines +327 to +328
killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
p.Kill(killCtx, syscall.SIGKILL)
Collaborator


If this timeout on Kill got hit, it seems like the for loop would just continue on. So if the SIGKILL was never actually sent due to the timeout, I think that could result in being blocked in this loop indefinitely.

Collaborator Author


This is old code I just moved into a common function, so I am not sure, but that does seem plausible if p.Kill never completes. To resolve it, I think we could define a killCtxDone outside the loop and add a case <-killCtxDone to return a "failed to kill" error, something like:

	var cancel func()
	var killCtxDone <-chan struct{}
	ctxDone := ctx.Done()
	for {
		select {
		case <-killCtxDone:
			return fmt.Errorf("failed to kill process on cancel")
		case <-ctxDone:
			ctxDone = nil
			var killCtx context.Context
			killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
			killCtxDone = killCtx.Done()
			p.Kill(killCtx, syscall.SIGKILL)

Collaborator Author


Updated the code mostly as above, but moved the killCtxDone case after the statusCh receive.

prevent resize from blocking exit
fix edgecase where kill signal never reaches process

Signed-off-by: Cory Bennett <cbennett@netflix.com>
Comment thread executor/containerdexecutor/executor.go Outdated
if cancel != nil {
cancel()
}
return fmt.Errorf("failed to kill process on cancel")
Member


errors.Errorf

Comment thread executor/containerdexecutor/executor.go Outdated
}
return fmt.Errorf("failed to kill process on cancel")
case size := <-resize:
ctxTimeout, cancelTimeout := context.WithTimeout(ctx, time.Second)
Member


Why 1 sec? Seems possible to hit under just high load. Should we protect against p.Resize calls possibly getting out of order? Also, it would be nice to call cancelTimeout in a runProcess defer.

Collaborator Author


1s was arbitrary; is 10s better? I was just thinking about moving the resize processing into a separate for-loop in a goroutine. That would take care of the event-sequence processing and prevent it from blocking the cancel/exit loop. I will also add a defer on the cancelTimeout for runProcess.

Member


Now that the error isn't fatal I don't think we need a special timeout at all. Just force ordering and wrap the context so goroutines get canceled on defer.

Collaborator Author


fixed.

}

func (w *containerdExecutor) runProcess(ctx context.Context, p containerd.Process, resize <-chan executor.WinSize, started func()) error {
statusCh, err := p.Wait(context.Background())
Member


why context.Background? Maybe leave a comment for future readers as well.

Collaborator Author


Good question, I was wondering the same thing; this code was just moved from Run to share with Exec. We need to ask your 3-years-younger self :)
https://github.com/moby/buildkit/blame/master/executor/containerdexecutor/executor.go#L213

My hunch is that we wanted to keep the statusCh alive if we get ctx.Done so that we can send SIGKILL and capture the status from that?

Member


Looking at the Wait implementation I now understand it. Wait is non-blocking regardless of the context passed; the context only affects the statusCh, which we shouldn't cancel.

Collaborator Author


I added a comment with your findings.

Member

@tonistiigi tonistiigi left a comment


LGTM, but linter failed

…cancel loop to prevent blocking.

Signed-off-by: Cory Bennett <cbennett@netflix.com>
Collaborator Author

coryb commented Aug 3, 2020

Doh, fixed spelling error.

@tonistiigi
Member

@sipsma lgty?

Collaborator

@sipsma sipsma left a comment


LGTM
