add tty support for runc executor#1731
Conversation
|
I'm looking into the runc worker test failures, the integration tests were passing, not sure what is wrong with these tests yet. |
| // testClientGatewayContainerPID1Tty is testing that we can get a tty via | ||
| // a container pid1, executor.Run | ||
| func testClientGatewayContainerPID1Tty(t *testing.T, sb integration.Sandbox) { | ||
| requiresLinux(t) |
There was a problem hiding this comment.
Do we need both requiresLinux and +build linux?
There was a problem hiding this comment.
we need the +build linux otherwise the test wont compile on windows due to container/console dep. So, yeah, I think I can remove the requiresLinux
| // TODO fix this | ||
| // We get `panic: cannot statfs cgroup root` when running this test | ||
| // with runc-rootless | ||
| t.Skip("Skipping runc-rootless for cgroup error") |
There was a problem hiding this comment.
Is this waiting for: opencontainers/runc#2634?
| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return ctx.Err() |
There was a problem hiding this comment.
Do we need to signal.Stop(signals) and close(signals) here?
There was a problem hiding this comment.
yeah, will add. If the runc worker is enabled this loop will effectively run forever though.
| if err != nil { | ||
| return containerd.UnknownExitStatus, errors.Wrap(err, "failed to create pidfile") | ||
| } | ||
| pidfile.Close() |
There was a problem hiding this comment.
It should be unnecessary for pid1 since we os.RemoveAll for the entire bundle directory when (*runcExecutor).Run ends.
| } | ||
|
|
||
| status := waitpid(ctx, pid, exitCh) | ||
| defer endedOnce.Do(func() { |
There was a problem hiding this comment.
Curious why do we need a sync.Once here but not in the non-linux one?
There was a problem hiding this comment.
I was originally thinking the timing of closing ended was important and didn't want it delayed by the deferred cleanup for console shutdown. Looking again, I dont think the timing is important, so will just defer close(ended) in the linux impl as well.
| return status, nil | ||
| } | ||
|
|
||
| func (w *runcExecutor) exec(ctx context.Context, id, bundle string, pid0 int, specsProcess *specs.Process, process executor.ProcessInfo) (int, error) { |
There was a problem hiding this comment.
A lot of this seems duplicated with (*runcExecutor).run.
There was a problem hiding this comment.
Yeah, I will take a look to try to dedup.
| defer wg.Wait() | ||
|
|
||
| status := waitpid(ctx, pid, exitCh) | ||
| cancel() // stop the parent waitpid if still running |
There was a problem hiding this comment.
Why do we need this cancel when we already have a defer cancel() on line 124? AFAIK, this call isn't blocking.
There was a problem hiding this comment.
We need to cancel so the parent waitpid will exit:
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
// watch for parent process to die, if that happens then cancel the
// exec context
exitCh := reaper.Default.Subscribe()
waitpid(ctx, pid0, exitCh)
cancel()
}()
defer wg.Wait()
The defer wg.Wait() will hang until cancel() is called, or the parent exits. In the case that the parent continues to run we have to manually call cancel() so the defer wg.Wait() will exit. The previous defer cancel() is just to clean up resources leak in the case of error.
| socket, err := runc.NewTempConsoleSocket() | ||
| if err != nil { | ||
| return nil, nil, errors.Wrap(err, "failed to create OCI runtime console socket") | ||
| } |
There was a problem hiding this comment.
I think the code may be simpler and without the initErrCh if we just create the console socket and then handle everything below in one of the (*errgroup.Group).Go on line 103.
Then, the goroutines on line 235 and 253 can just be synchronous.
One more thing, the cleaners aren't returned on errors so: e.g. an error from cio.NewFIFOSetInDir means socket.Close won't get run.
There was a problem hiding this comment.
I think I see what you are saying, I will try to simplify. Currently the cleaners are not returned on errors, they are run directly via a hacky defer, this will likely be simplified also with the refactoring you suggest.
| "github.com/containerd/console" | ||
| "github.com/containerd/containerd" | ||
| "github.com/containerd/containerd/cio" | ||
| runcV2 "github.com/containerd/containerd/runtime/v2/runc" |
There was a problem hiding this comment.
Looking to keep these dependencies somewhat under control. iiuc this is what adds most of them while we only use single stdio.Platform helper implementation from there. I wouldn't mind copy/paste for this and eventually refactoring this code so it can be cleanly imported.
There was a problem hiding this comment.
There are a bunch of deps coming from the reaper logic as well. I will look into extracting the logic we need for the console and the reaper to see if there is a way to trim what is being pulled in.
|
I spent a while yesterday debugging the I dont have any workarounds currently. We have to run runc with the It seems like we need add our own shim to make this work properly. I will keep looking for alternate solutions. I just noticed there is a |
ad23f18 to
0297aa8
Compare
|
Okay, reimplemented everything, I think it is much cleaner. I was able to determine what was missing when creating a pty and running runc in the foreground ... the SIGWINCH needs to be sent to the process after resizing. I had tried sending SIGWINCH the first time I tried this but I used |
| // test resize, note we cannot verify the console size in the container | ||
| // because we need a controlling terminal to report size back, and the | ||
| // simple ptm we are using is not enough. Ideally we could just use | ||
| // `/bin/ttysize` but that reports `0 0` with this test. |
There was a problem hiding this comment.
what's the reason for it reporting 0 0? Afaics we don't need NewPty at all for this test, could just use io.Pipe.
There was a problem hiding this comment.
Not sure why ttysize does not work, I thought an actual terminal was necessary for this, but I might be wrong. I will try it again after the updates, also will try with io.Pipe()
There was a problem hiding this comment.
updated the tests, removed the NewPty usage. ttysize works now, the problem was I was sending the SIGWINCH to the pid from the pidfile but that was Meta.Args program (ie sh) and the resize only takes effect when we send SIGWINCH to the runc run process.
It was a bit fussy getting the test reliable, there were spurious control characters printed at random times, not sure where they are coming from but I only noticed them after ttysize was added, not sure if is related. To make the test more reliable we should probably use github.com/hinshun/vt10x or github.com/Netflix/go-expect, but those will bring in a few deps. I figure we can try it as is, the short sleeps should keeps things consistent, but if it becomes unreliable we can disable or rework.
There was a problem hiding this comment.
One way to clean this up would be to run stty -echo as the first command. Then you don't need to worry about the timings between prompt echo and response strings. This leaves the 500ms timeouts that waits for the resize. I guess these are fine but you could also write a helper that tests the buffer contents for expected contents every 50-100ms and has a longer 5-10sec timeout that produces failure if no content appears.
hinshun
left a comment
There was a problem hiding this comment.
Looks a lot cleaner now. I only have a small nit.
| if err != nil { | ||
| logrus.Warningf("error while shutting down tty io: %s", err) | ||
| } | ||
| }() |
There was a problem hiding this comment.
Prefer defer closer to line 78 in case code changes in the future and introduces early exit code paths.
a766964 to
1a9fdf3
Compare
| }) | ||
| require.NoError(t, err) | ||
|
|
||
| err = pid2.Resize(ctx, client.WinSize{ |
There was a problem hiding this comment.
most of this test looks very similar. can we put the shared part in a helper function or use subtest.
There was a problem hiding this comment.
updated to add common routine for the terminal input and output
|
|
||
| if process.Stdout != nil { | ||
| eg.Go(func() error { | ||
| io.Copy(process.Stdout, ptm) |
There was a problem hiding this comment.
no error handling in here? don't they exit cleanly? add comment with the error if that is the case.
There was a problem hiding this comment.
They didnt error cleanly, but I have updated to ignore the expected errors.
|
|
||
| func (w *runcExecutor) run(ctx context.Context, id, bundle string, process executor.ProcessInfo) (int, error) { | ||
| return w.callWithIO(ctx, id, bundle, process, func(ctx context.Context, pidfile string, io runc.IO) (int, error) { | ||
| return w.runc.Run(ctx, id, bundle, &runc.CreateOpts{ |
There was a problem hiding this comment.
Instead of the proc/ppid hack could we make a upstream change so we can get the pid directly from this/exec function. A Started chan int field in runc.CreateOpts should suffice. https://github.com/containerd/go-runc/blob/master/runc.go#L267 would call
if opts.Started != nil {
opts.Started <- cmd.Process.Pid
}
We can still merge the current and fix later when upstream has updated.
There was a problem hiding this comment.
Yeah, we will need to fix upstream. There is another issue upstream which prevents us from discovering the exit status of runc.Exec when there is stdio associated. These lines discard the status code:
https://github.com/containerd/go-runc/blob/master/runc.go#L241-L245
I will try to get these changes in, but I suspect it will be a slow process.
There was a problem hiding this comment.
I have created a PR: containerd/go-runc#69 I will remove our local hacks when hopefully the RP is merged eventually.
Signed-off-by: Cory Bennett <cbennett@netflix.com>
| Stdin: inputR, | ||
| Stdout: &nopCloser{output}, | ||
| Stderr: &nopCloser{output}, | ||
| Env: []string{"PS1=% "}, |
There was a problem hiding this comment.
Just curious, what is this for?
There was a problem hiding this comment.
Just sets the prompt to constant % to prevent the output from changing if the the defaults change in the future. The default prompt currently echos the directory first in sh on busybox, and echos ansi color codes on alpine, so I was just trying to make it consistent for this test. It was more useful originally when I was trying to parse the output per prompt, but I gave up on that without a vt10x terminal to process the terminal escape sequences, so it is less useful now.
There was a problem hiding this comment.
did you try getting rid of the prompt per #1731 (comment) to remove the sleeps per line?
There was a problem hiding this comment.
I am working on adding a simple output scanner looking for the prompt instead of the sleeps, it seems to work fine. The expected output is a lot more confusing with the stty -echo so I think it works fine to go without that. We will still have the sleeps to allow the resize signals to go through, 500ms is pessimistic I think, but these tests systems can be slow.
I am now trying to work though an issue with the test hangs (ctrProc.Wait hangs) when the expected prompt does not show up in time (ie context cancelled). It appears to be waiting for the container process Done signal even after ctr.Release() is called. I send another update when I get to the bottom of this.
There was a problem hiding this comment.
The expected output is a lot more confusing with the stty -echo so I think it works fine to go without that.
What's the new confusing part in the output? I'd expect it to be same, but without repeating all the commands you send. So you can send the commands together and don't need to time between waiting results from the previous command and sending the next one. Still need to wait for Resize() calls to trigger but size check can happen multiple times until it eventually gets the correct size or times out.
There was a problem hiding this comment.
If we leave % the prompt in, with stty -echo we will see something like:
% stty -echo
% 80 40
% % /tmp
% % foobar
% 100 60
vs what we see with the echo on:
% ttysize
80 40
% cd /tmp
% pwd
/tmp
% echo foobar > newfile
% cat /tmp/newfile
foobar
% ttysize
100 60
% exit 99
We need to leave the prompt in so we can scan for it. There is only one prompt we really need to wait for, that is the one after the first ttysize command, otherwise there is a race between the ttysize and the second ctrProc.Resize(ctx, client.WinSize{Rows: 60, Cols: 100}) so that we know the results of the ttysize are from the first resize event.
There was a problem hiding this comment.
I thought one way to do this is to just:
for {
input.Write([]byte("ttysize\n"))
time.Sleep(100 * time.Millisecond)
if strings.Contains(output.String(), expectedSize) {
break
} if timeout {
error
}
}
and not worry about how many times ttysize was called if it eventually switches to correct size before timeout. But that only works if echo is disabled.
There was a problem hiding this comment.
Sure something like that should work, will update when I figure out what is causing the hanging on timeout errors.
There was a problem hiding this comment.
I refactored the prompting into a simple helper, it seems cleaner to me. I also found the hanging issue where if the context was cancelled used for container Start then the event loop was not finishing so stdin continued to be read from.
Signed-off-by: Cory Bennett <cbennett@netflix.com>
Signed-off-by: Cory Bennett <cbennett@netflix.com>
| err = pid1.Resize(ctx, client.WinSize{Rows: 60, Cols: 100}) | ||
| require.NoError(t, err) | ||
| prompt.SendExpect("ttysize", "100 60") | ||
| prompt.SendExit(99) |
| if errors.As(err, &ptmClosedError) { | ||
| if ptmClosedError.Op == "read" && | ||
| ptmClosedError.Path == "/dev/ptmx" && | ||
| ptmClosedError.Err == syscall.Errno(0x5) { |
Signed-off-by: Cory Bennett <cbennett@netflix.com>
This reuses some of the containerd logic for handling console io. The basic outline is that now all runc calls are made with
detach: trueand an associated pidfile. We have aSIGCHLDreaper so we can get notified when the detached pid exits (and determine exit status). When a tty is requested we use theconsole-socketoptions in runc so that runc creates the pty pair and returns the ptm via the socket.I tried naively creating a pty pair in buildkit and setting the input streams to the pts, which mostly worked, except the ptm resize had no effect this way, it seems using the
console-socketis the only way to allow resize events to propagate.Unfortunately this is a linux only implementation since the containerd libraries in use are linux only:
github.com/containerd/containerd/sys/reaper
github.com/containerd/containerd/runtime/v2/runc
The non linux implementation will work as before, no tty support for runc.
Also I added a basic tty test with generated pty (again linux only via github.com/containerd/console), which works to test out most of the tty logic, but that is obviously not a fully implemented vt10x terminal so we cannot check the success of resize events.