return exit status from executor Run/Exec #1619
Conversation
```go
	return err
}

var cancel func()
```
All this wait & error handling code I moved into a common runProcess function shared between Run and Exec. (I failed to realize the previous Exec usage was not synchronous and just returned immediately after starting, whoops.)
Test failures seem unrelated to my changes; I retried them a number of times and most are similar to:
I restarted and it failed the same way. I've never seen this specific error, and master is green, so it looks like it is related.
Interesting, okay, I will dig into the tests.
Yup, sorry for the noise, I failed to check if the …
tonistiigi left a comment
For the status, perhaps easier would have been to return a (wrapped) typed error defined in the executor package (or in the more general errdefs if it needs to be more reusable). Currently the status seems to be 0 for the non-exec errors. Also not a fan of returning values together with an error return in general, although sometimes it is needed. wdyt?
Yeah, makes sense, I will make the change in the morning. I think something like …
Signed-off-by: Cory Bennett <cbennett@netflix.com>
```go
}
select {
case <-ctx.Done():
	err = errors.Wrap(ctx.Err(), err.Error())
```
This wrap is the wrong way around if we want to keep the exit code typed. Maybe we can have a custom Unwrap so that both remain typed. Or just set err.Err to ctx.Err()?
Yeah, I debated whether the ctx.Err was more or less relevant than the Run error. I think setting err.Err to ctx.Err likely makes the most sense here; I will update.
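A minimal sketch of that shape (the ExitError name, fields, and package placement here are illustrative, not the exact final buildkit API):

```go
package executor

import "fmt"

// ExitError is a hypothetical typed error carrying the process exit
// status, replacing a status value returned alongside the error.
type ExitError struct {
	ExitCode uint32
	Err      error // set to ctx.Err() on cancellation, per the discussion above
}

func (e *ExitError) Error() string {
	if e.Err != nil {
		return fmt.Sprintf("exit code: %d: %s", e.ExitCode, e.Err)
	}
	return fmt.Sprintf("exit code: %d", e.ExitCode)
}

// Unwrap keeps the inner error typed as well, so
//   var ee *ExitError; errors.As(err, &ee)
// and errors.Is(err, context.Canceled) both work on the same chain.
func (e *ExitError) Unwrap() error {
	return e.Err
}
```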
```go
type WinSize struct {
	Rows   uint32
	Cols   uint32
	Xpixel uint32
```
Honestly not sure; they are defined in the kernel winsize struct, but afaict they are unused on Linux. They seem intended for terminal width/height in pixels instead of characters, but I'm not sure any platforms actually use them. I didn't find much useful information googling, so it seems safe to remove them. They are unused by the containerd pty.Resize call anyway.
Maybe this is the cursor position. But not sure if we should allow it or if it is even possible in containerd.
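For reference, dropping the pixel fields (as done later in this PR) leaves just the rows/columns that containerd's pty Resize call actually consumes; a sketch:

```go
// WinSize without the unused *pixel fields.
type WinSize struct {
	Rows uint32
	Cols uint32
}
```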
Signed-off-by: Cory Bennett <cbennett@netflix.com>
While working through the runc test failure (the err test needed Unwrap) I found that the containerd tests are hanging with: Now that the Exec handler correctly waits for the process to finish, I can see a problem: the container is not exiting after receiving the EOF from stdin; instead it hangs until the pid1 process exits, killing the exec. I am trying to work through this problem now, but it might take a bit since I don't really have any leads at the moment.
Okay, I think this is good now: I removed the *pixel stuff, fixed the error wrapping, and fixed the containerd tests hanging. That last one was a pain to debug, but after combing through the …
```go
func (s *stdinCloser) Read(p []byte) (int, error) {
	n, err := s.stdin.Read(p)
	if err == io.EOF {
		if s.closer != nil {
```
I'm not sure if this is safe. Bytes are read and then written to the fifo; only after the write is complete can we close the fifo. If we close on read, there would be a race between writing the last buffer and closing.
Also, not sure about that CloseIO call. Afaik it signals the shim and closes stdin there. Not even sure what stdin is in the shim and how it is safe to synchronize it (e.g. if it is the other side of the fifo). I think it would be much safer to just close the fifo on the client side after writing the last message.
Actually, there is a close in https://github.com/moby/buildkit/blob/master/vendor/github.com/containerd/containerd/cio/io_unix.go#L70, so not sure. So I think CloseIO just needs to be called after the fifo has been opened and the task created. This is based on looking at the Docker code, but it's quite confusing. If this is true then we shouldn't wait for EOF, just close early, and we can test that calling CloseIO doesn't actually drop stdin before the fifo gets closed as well.
I can test early CloseIO. I lifted this code from containerd, so I was hoping their logic is correct :)
https://github.com/containerd/containerd/blob/master/cmd/ctr/commands/tasks/exec.go#L131-L133
https://github.com/containerd/containerd/blob/master/cmd/ctr/commands/tasks/exec.go#L181-L194
btw, I did test closing the fifo on the client side; it had no impact. The signal to the server side via CloseIO was the only thing that allowed the container to exit on EOF on stdin. I did not test the ordering of CloseIO though.
I see. But I can't think of any case where it would be correct for this call to happen between reading EOF from a Go reader and writing the last bytes to a fifo. That timing is only relevant client-side and is not synchronized with the server side at all.
I think it is working without the stdin wrapper now. I also found there was an issue with pid1 when stdin closed. I have added another test to verify that closing stdin on pid1 will cause the container to exit normally.
First I tried calling CloseIO after container.NewTask and task.Exec (after the fifo was created). This worked once, then retries failed, so it was racy.
Next I tried calling CloseIO after task.Wait, which worked more frequently, but also failed repeatedly, so still racy.
Finally I tried calling CloseIO after task.Start, which seems to work consistently.
Anyway, it appears you are right: we can call CloseIO any time after the container is running. Not sure why the ctr code is written that way.
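A minimal sketch of the ordering that tested out consistently, against containerd's client API (the runWithStdinEOF helper and its surrounding setup are illustrative, not code from this PR):

```go
import (
	"context"
	"fmt"
	"io"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
)

// runWithStdinEOF illustrates the ordering found above:
// NewTask -> Wait -> Start -> CloseIO.
func runWithStdinEOF(ctx context.Context, container containerd.Container, stdin io.Reader, stdout, stderr io.Writer) error {
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStreams(stdin, stdout, stderr)))
	if err != nil {
		return err
	}
	defer task.Delete(ctx)

	statusCh, err := task.Wait(ctx)
	if err != nil {
		return err
	}
	if err := task.Start(ctx); err != nil {
		return err
	}
	// Signal the shim that stdin is done; after Start this appears safe
	// to call at any point, and the container can exit once it reads EOF.
	if err := task.CloseIO(ctx, containerd.WithStdinCloser); err != nil {
		return err
	}

	status := <-statusCh
	code, _, err := status.Result()
	if err != nil {
		return err
	}
	if code != 0 {
		return fmt.Errorf("process exited with status %d", code)
	}
	return nil
}
```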
Force-pushed 6d02561 to ce70e1f
```go
id := identity.NewID()

// verify pid1 exits when stdin sees EOF
stdin := bytes.NewReader([]byte("hello"))
```
Could you add a case where, instead of using bytes.NewReader, this is a pipe, and only after the started channel has returned do you write into stdin and then close it?
Updated to use io.Pipe
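Roughly this shape (a sketch; the started channel and how stdin gets wired into the executor are stand-ins for the actual test harness):

```go
pr, pw := io.Pipe()
started := make(chan struct{})

go func() {
	<-started                 // write only once the process reports it has started
	pw.Write([]byte("hello")) // the bytes the container reads from stdin
	pw.Close()                // deliver EOF; pid1 should then exit normally
}()

// pr is passed as the process stdin; the executor closes started
// once the process is running.
```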
… stdin
Signed-off-by: Cory Bennett <cbennett@netflix.com>
```go
err := p.Resize(ctx, size.Cols, size.Rows)
if err != nil {
	cancel()
	return err
}
```
I think there might be a race where a client-side window resize happens right around the same time the task exits, in which case the resize chan could get triggered first and then p.Resize returns an expected error because the task process has exited. That would result in the resize error getting returned instead of the actual task status (which could have been an expected exit with 0 status).
I think it also wouldn't hurt to have a timeout on the Resize call just in case it hangs for unexpected reasons.
Yeah, good catch. I wonder if we should just ignore resize errors in general, maybe just log them? If Resize is not working, the user will likely be having other issues anyway. I will add a context.WithTimeout, and should probably also throw it in a goroutine to prevent it blocking the cancel/exit handling.
I updated the resize to run later in the select loop and in a short-lived goroutine with a 1s timeout.
```go
killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
p.Kill(killCtx, syscall.SIGKILL)
```
If this timeout on Kill got hit, it seems like the for loop would just continue on. So if the SIGKILL was never actually sent due to the timeout, I think that could result in being blocked in this loop indefinitely.
This is old code I just moved into a common function, so I am not sure, but that does seem plausible if p.Kill never completes. To resolve it, I think we could define a killCtxDone outside the loop and add a case <-killCtxDone that returns a "failed to kill" error, something like:
```go
var cancel func()
var killCtxDone <-chan struct{}
ctxDone := ctx.Done()
for {
	select {
	case <-killCtxDone:
		return fmt.Errorf("failed to kill process on cancel")
	case <-ctxDone:
		ctxDone = nil
		var killCtx context.Context
		killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
		killCtxDone = killCtx.Done()
		p.Kill(killCtx, syscall.SIGKILL)
```
Updated the code mostly as above, but moved the killCtxDone case after the statusCh receive.
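Putting it together, the wait/cancel loop now looks roughly like this (a sketch against containerd's Process API, not the exact final code):

```go
var cancel func()
var killCtxDone <-chan struct{}
ctxDone := ctx.Done()
for {
	select {
	case <-ctxDone:
		ctxDone = nil // enter this case only once
		var killCtx context.Context
		killCtx, cancel = context.WithTimeout(context.Background(), 10*time.Second)
		killCtxDone = killCtx.Done()
		p.Kill(killCtx, syscall.SIGKILL)
	case status := <-statusCh:
		if cancel != nil {
			cancel()
		}
		if status.ExitCode() != 0 {
			return errors.Errorf("process exited with status %d", status.ExitCode())
		}
		return nil
	case <-killCtxDone:
		// SIGKILL itself timed out; bail out instead of looping forever
		if cancel != nil {
			cancel()
		}
		return errors.Errorf("failed to kill process on cancel")
	}
}
```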
prevent resize from blocking exit; fix edge case where kill signal never reaches process
Signed-off-by: Cory Bennett <cbennett@netflix.com>
```go
		if cancel != nil {
			cancel()
		}
		return fmt.Errorf("failed to kill process on cancel")
	}
	return fmt.Errorf("failed to kill process on cancel")
case size := <-resize:
	ctxTimeout, cancelTimeout := context.WithTimeout(ctx, time.Second)
```
Why 1 sec? Seems possible to hit with just high load. Should we protect against p.Resize calls possibly getting out of order? Also, it would be nice to call cancelTimeout in a runProcess defer.
1s was arbitrary; is 10s better? I was just thinking about moving the resize processing into a separate for-loop in a goroutine. That would take care of the event sequence processing and prevent it from blocking the cancel/exit loop. I will also add a defer on the cancelTimeout for runProcess.
Now that the error isn't fatal I don't think we need a special timeout at all. Just force order and wrap context so goroutines get canceled on defer.
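That ends up looking something like this sketch: a single goroutine drains the resize channel in order, resize failures are only logged, and the wrapped context tears the goroutine down via the runProcess defer (logrus usage assumed):

```go
ctx, cancel := context.WithCancel(ctx)
defer cancel() // stops the resize goroutine when runProcess returns

go func() {
	for {
		select {
		case <-ctx.Done():
			return
		case size, ok := <-resize:
			if !ok {
				return
			}
			// resize failures are not fatal to the process; just log them
			if err := p.Resize(ctx, size.Cols, size.Rows); err != nil {
				logrus.WithError(err).Error("failed to resize ptm")
			}
		}
	}
}()
```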
```go
}

func (w *containerdExecutor) runProcess(ctx context.Context, p containerd.Process, resize <-chan executor.WinSize, started func()) error {
	statusCh, err := p.Wait(context.Background())
```
why context.Background? Maybe leave a comment for future readers as well.
Good question, I was wondering the same thing; this code was just moved from Run to share with Exec. We need to ask your 3-year-younger self :)
https://github.com/moby/buildkit/blame/master/executor/containerdexecutor/executor.go#L213
My hunch is that we wanted to keep the statusCh alive if we get ctx.Done, so that we can send SIGKILL and capture the status from that?
Looking at the Wait implementation I now understand it. Wait is non-blocking regardless of the context passed; the context only affects the statusCh, which we shouldn't cancel.
I added a comment with your findings.
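Something to this effect (paraphrasing; the exact wording in the code may differ):

```go
// p.Wait is non-blocking regardless of the context passed in; the
// context only controls the returned statusCh. Use Background so the
// channel survives ctx being canceled, letting us SIGKILL the process
// and still capture its exit status.
statusCh, err := p.Wait(context.Background())
```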
…cancel loop to prevent blocking.
Signed-off-by: Cory Bennett <cbennett@netflix.com>
Doh, fixed spelling error.
@sipsma lgty?
More incremental work for #749
This returns the process exit status from Run/Exec so we can send it in the ExitMessage. This also plumbs in a Resize channel to handle resize events.
I will have another PR soon with the proto changes and server-side changes for that; I just wanted to get these executor changes in to keep the PRs manageable. I have some questions about the client side, so I will add those questions/proposals to #749.
Through testing I found `runc` to be challenging since it will not create a pty like `containerd` does. So for now I modified the runc executor to just return a "not implemented" error when a tty is requested. I am trying to understand how `containerd` manages the pty with the console-socket option on `runc`, so I hope to find a solution soon, but that will likely be another follow-up PR.