return exit status from executor Run/Exec #1619
Conversation
```go
	return err
}

var cancel func()
```
All this wait & error handling code I moved into a common runProcess function shared between Run and Exec. (I failed to realize the previous Exec usage was not synchronous and just returned immediately after starting, whoops.)
Test failures seem unrelated to my changes; I retried them a number of times and most are similar to:
I restarted and it failed the same way. I've never seen this specific error, and master is green, so it looks like it is related.
Interesting, okay, I will dig into the tests.
Yup, sorry for the noise, I failed to check if the …
tonistiigi left a comment
For the status, perhaps easier would have been to return a (wrapped) typed error defined in the executor package (or in the more general errdefs if it needs to be more reusable). Currently the status seems to be 0 for the non-exec errors. Also not a fan of returning values together with an error return in general, although sometimes it is needed. wdyt?
Yeah, makes sense, I will make the change in the morning. I think something like …
Signed-off-by: Cory Bennett <cbennett@netflix.com>
```go
}
select {
case <-ctx.Done():
	err = errors.Wrap(ctx.Err(), err.Error())
```
This wrap is the wrong way around if we want to keep the exit code typed. Maybe we can have a custom Unwrap so that both remain typed. Or just set err.Err to ctx.Err()?
Yeah, I debated whether the ctx.Err was more or less relevant than the Run error. I think setting err.Err to ctx.Err likely makes the most sense here; I will update.
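A minimal sketch of that shape (the ExitError name, fields, and package placement here are illustrative, not the exact final buildkit API):

```go
package executor

import "fmt"

// ExitError is a hypothetical typed error carrying the process exit
// status, replacing a status value returned alongside the error.
type ExitError struct {
	ExitCode uint32
	Err      error // set to ctx.Err() on cancellation, per the discussion above
}

func (e *ExitError) Error() string {
	if e.Err != nil {
		return fmt.Sprintf("exit code: %d: %s", e.ExitCode, e.Err)
	}
	return fmt.Sprintf("exit code: %d", e.ExitCode)
}

// Unwrap keeps the inner error typed as well, so
//   var ee *ExitError; errors.As(err, &ee)
// and errors.Is(err, context.Canceled) both work on the same chain.
func (e *ExitError) Unwrap() error {
	return e.Err
}
```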
```go
type WinSize struct {
	Rows   uint32
	Cols   uint32
	Xpixel uint32
```
Honestly not sure; they are defined in the kernel winsize struct, but afaict they are unused on Linux. They seem intended for terminal width/height in pixels instead of characters, but I'm not sure any platforms actually use them. I didn't find much useful information googling, so it seems safe to remove them. They are unused by the containerd pty.Resize call anyway.
Maybe this is the cursor position. But not sure if we should allow it or if it is even possible in containerd.
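For reference, dropping the pixel fields (as done later in this PR) leaves just the rows/columns that containerd's pty Resize call actually consumes; a sketch:

```go
// WinSize without the unused *pixel fields.
type WinSize struct {
	Rows uint32
	Cols uint32
}
```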
Signed-off-by: Cory Bennett <cbennett@netflix.com>
While working through the runc test failure (the err test needed Unwrap) I found that the containerd tests are hanging with: Now that the Exec handler correctly waits for the process to finish, I can see a problem: the container is not exiting after receiving the EOF from stdin; instead it hangs until the pid1 process exits, killing the exec. I am trying to work through this problem now, but it might take a bit since I don't really have any leads at the moment.
Okay, I think this is good now: I removed the *pixel stuff, fixed the error wrapping, and fixed the containerd tests hanging. That last one was a pain to debug, but after combing through the …
```go
func (s *stdinCloser) Read(p []byte) (int, error) {
	n, err := s.stdin.Read(p)
	if err == io.EOF {
		if s.closer != nil {
```
I'm not sure if this is safe. Bytes are read and then written to the fifo; only after the write is complete can we close the fifo. If we close on read, there would be a race between writing the last buffer and closing.
Also, not sure about that CloseIO call. Afaik it signals the shim and closes stdin there. Not even sure what stdin is in the shim and how it is safe to synchronize it (e.g. if it is the other side of the fifo). I think it would be much safer to just close the fifo on the client side after writing the last message.
Actually, there is a close in https://github.com/moby/buildkit/blob/master/vendor/github.com/containerd/containerd/cio/io_unix.go#L70, so not sure. So I think CloseIO just needs to be called after the fifo has been opened and the task created. This is based on looking at the Docker code, but it's quite confusing. If this is true then we shouldn't wait for EOF, just close early, and we can test that calling CloseIO doesn't actually drop stdin before the fifo gets closed as well.
I can test early CloseIO. I lifted this code from containerd, so I was hoping their logic is correct :)
https://github.com/containerd/containerd/blob/master/cmd/ctr/commands/tasks/exec.go#L131-L133
https://github.com/containerd/containerd/blob/master/cmd/ctr/commands/tasks/exec.go#L181-L194
btw, I did test closing the fifo on the client side; it had no impact. The signal to the server side via CloseIO was the only thing that allowed the container to exit on EOF on stdin. I did not test the ordering of CloseIO though.
I see. But I can't think of any case where it would be correct for this call to happen between reading EOF from a Go reader and writing the last bytes to a fifo. That timing is only relevant client-side and is not synchronized with the server side at all.
I think it is working without the stdin wrapper now. I also found there was an issue with pid1 when stdin closed. I have added another test to verify that closing stdin on pid1 will cause the container to exit normally.
First I tried calling CloseIO after container.NewTask and task.Exec (after the fifo was created). This worked once, then retries failed, so it was racy.
Next I tried calling CloseIO after task.Wait, which worked more frequently, but also failed repeatedly, so still racy.
Finally I tried calling CloseIO after task.Start, which seems to work consistently.
Anyway, it appears you are right: we can call CloseIO any time after the container is running. Not sure why the ctr code is written that way.
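A minimal sketch of the ordering that tested out consistently, against containerd's client API (the runWithStdinEOF helper and its surrounding setup are illustrative, not code from this PR):

```go
import (
	"context"
	"fmt"
	"io"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
)

// runWithStdinEOF illustrates the ordering found above:
// NewTask -> Wait -> Start -> CloseIO.
func runWithStdinEOF(ctx context.Context, container containerd.Container, stdin io.Reader, stdout, stderr io.Writer) error {
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStreams(stdin, stdout, stderr)))
	if err != nil {
		return err
	}
	defer task.Delete(ctx)

	statusCh, err := task.Wait(ctx)
	if err != nil {
		return err
	}
	if err := task.Start(ctx); err != nil {
		return err
	}
	// Signal the shim that stdin is done; after Start this appears safe
	// to call at any point, and the container can exit once it reads EOF.
	if err := task.CloseIO(ctx, containerd.WithStdinCloser); err != nil {
		return err
	}

	status := <-statusCh
	code, _, err := status.Result()
	if err != nil {
		return err
	}
	if code != 0 {
		return fmt.Errorf("process exited with status %d", code)
	}
	return nil
}
```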
Force-pushed 6d02561 to ce70e1f
```go
id := identity.NewID()

// verify pid1 exits when stdin sees EOF
stdin := bytes.NewReader([]byte("hello"))
```
Could you add a case where, instead of using bytes.NewReader, this is a pipe, and only after the started channel has returned do you write into stdin and then close it?
Updated to use io.Pipe
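Roughly this shape (a sketch; the started channel and how stdin gets wired into the executor are stand-ins for the actual test harness):

```go
pr, pw := io.Pipe()
started := make(chan struct{})

go func() {
	<-started                 // write only once the process reports it has started
	pw.Write([]byte("hello")) // the bytes the container reads from stdin
	pw.Close()                // deliver EOF; pid1 should then exit normally
}()

// pr is passed as the process stdin; the executor closes started
// once the process is running.
```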
… stdin
Signed-off-by: Cory Bennett <cbennett@netflix.com>
```go
err := p.Resize(ctx, size.Cols, size.Rows)
if err != nil {
	cancel()
	return err
}
```
I think there might be a race where a client-side window resize happens right around the same time the task exits, in which case the resize chan could get triggered first and then p.Resize returns an expected error because the task process has exited. That would result in the resize error getting returned instead of the actual task status (which could have been an expected exit with 0 status).
I think it also wouldn't hurt to have a timeout on the Resize call just in case it hangs for unexpected reasons.
Yeah, good catch. I wonder if we should just ignore resize errors in general, maybe just log them? If Resize is not working, the user will likely be having other issues anyway. I will add a context.WithTimeout, and should probably also throw it in a goroutine to prevent it blocking the cancel/exit handling.
I updated the resize to run later in the select loop and in a short-lived goroutine with a 1s timeout.
```go
killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
p.Kill(killCtx, syscall.SIGKILL)
```
If this timeout on Kill got hit, it seems like the for loop would just continue on. So if the SIGKILL was never actually sent due to the timeout, I think that could result in being blocked in this loop indefinitely.
This is old code I just moved into a common function, so I am not sure, but that does seem plausible if p.Kill never completes. To resolve it, I think we could define a killCtxDone outside the loop and add a case <-killCtxDone that returns a "failed to kill" error, something like:
```go
var cancel func()
var killCtxDone <-chan struct{}
ctxDone := ctx.Done()
for {
	select {
	case <-killCtxDone:
		return fmt.Errorf("failed to kill process on cancel")
	case <-ctxDone:
		ctxDone = nil
		var killCtx context.Context
		killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
		killCtxDone = killCtx.Done()
		p.Kill(killCtx, syscall.SIGKILL)
```
Updated the code mostly as above, but moved the killCtxDone case after the statusCh receive.
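Putting it together, the wait/cancel loop now looks roughly like this (a sketch against containerd's Process API, not the exact final code):

```go
var cancel func()
var killCtxDone <-chan struct{}
ctxDone := ctx.Done()
for {
	select {
	case <-ctxDone:
		ctxDone = nil // enter this case only once
		var killCtx context.Context
		killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
		killCtxDone = killCtx.Done()
		p.Kill(killCtx, syscall.SIGKILL)
	case status := <-statusCh:
		if cancel != nil {
			cancel()
		}
		if status.ExitCode() != 0 {
			return errors.Errorf("process exited with status %d", status.ExitCode())
		}
		return nil
	case <-killCtxDone:
		// SIGKILL itself timed out; bail out instead of looping forever
		if cancel != nil {
			cancel()
		}
		return errors.Errorf("failed to kill process on cancel")
	}
}
```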
prevent resize from blocking exit; fix edge case where kill signal never reaches process
Signed-off-by: Cory Bennett <cbennett@netflix.com>
```go
		if cancel != nil {
			cancel()
		}
		return fmt.Errorf("failed to kill process on cancel")
	}
	return fmt.Errorf("failed to kill process on cancel")
case size := <-resize:
	ctxTimeout, cancelTimeout := context.WithTimeout(ctx, time.Second)
```
Why 1 sec? Seems possible to hit with just high load. Should we protect against p.Resize calls possibly getting out of order? Also, it would be nice to call cancelTimeout in a runProcess defer.
1s was arbitrary; is 10s better? I was just thinking about moving the resize processing into a separate for-loop in a goroutine. That would take care of the event sequence processing and prevent it from blocking the cancel/exit loop. I will also add a defer on the cancelTimeout for runProcess.
Now that the error isn't fatal I don't think we need a special timeout at all. Just force order and wrap context so goroutines get canceled on defer.
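That ends up looking something like this sketch: a single goroutine drains the resize channel in order, resize failures are only logged, and the wrapped context tears the goroutine down via the runProcess defer (logrus usage assumed):

```go
ctx, cancel := context.WithCancel(ctx)
defer cancel() // stops the resize goroutine when runProcess returns

go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case size, ok := <-resize:
			if !ok {
				return
			}
			// resize failures are not fatal to the process; just log them
			if err := p.Resize(ctx, size.Cols, size.Rows); err != nil {
				logrus.WithError(err).Error("failed to resize ptm")
			}
		}
	}
}()
```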
```go
}

func (w *containerdExecutor) runProcess(ctx context.Context, p containerd.Process, resize <-chan executor.WinSize, started func()) error {
	statusCh, err := p.Wait(context.Background())
```
why context.Background? Maybe leave a comment for future readers as well.
Good question, I was wondering the same thing; this code was just moved from Run to share with Exec. We need to ask your 3-year-younger self :)
https://github.com/moby/buildkit/blame/master/executor/containerdexecutor/executor.go#L213
My hunch is that we wanted to keep the statusCh alive if we get ctx.Done, so that we can send SIGKILL and capture the status from that?
Looking at the Wait implementation I now understand it. Wait is non-blocking regardless of the context passed; the context only affects the statusCh, which we shouldn't cancel.
I added a comment with your findings.
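Something to this effect (paraphrasing; the exact wording in the code may differ):

```go
// p.Wait is non-blocking regardless of the context passed in; the
// context only controls the returned statusCh. Use Background so the
// channel survives ctx being canceled, letting us SIGKILL the process
// and still capture its exit status.
statusCh, err := p.Wait(context.Background())
```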
…cancel loop to prevent blocking.
Signed-off-by: Cory Bennett <cbennett@netflix.com>
Doh, fixed spelling error.
@sipsma lgty?
More incremental work for #749
This returns the process exit status from Run/Exec so we can send it in the ExitMessage. This also plumbs in a Resize channel to handle resize events.
I will have another PR soon with the proto changes and server-side changes for that; I just wanted to get these executor changes in to keep the PRs manageable. I have some questions about the client side, so I will add those questions/proposals to #749.
Through testing I found `runc` to be challenging since it will not create a pty like `containerd` does. So for now I modified the runc executor to just return a "not implemented" error when a tty is requested. I am trying to understand how `containerd` manages the pty with the console-socket option on `runc`, so I hope to find a solution soon, but that will likely be another follow-up PR.