fix gateway exec tty cleanup on context.Canceled #3658
Conversation
Force-pushed 9eec9d1 to 1dca040
Looking at the test failures ...
Force-pushed 1dca040 to 8d521a7
```go
	timeout()
	select {
	case <-time.After(50 * time.Millisecond):
		cancelRun()
```
This allows the tests to pass, but I am a little confused about why it was necessary. The pid1 `sleep 10` process kept running even though ctx.Done was triggered almost immediately. Then w.runc.Kill ran and did not return an error, but the w.run call below didn't terminate. So this loop waited 50ms and then ran the kill -9 again and again, until pid1 exited after 10s. I am not sure how the kill -9 is not actually terminating the runc process. With this change we cancel the ctx passed into w.run after the kill -9 is ignored, and the process ends up exiting as expected.
This does not look quite right. I'm not sure why SIGKILL wouldn't work (can we replicate it outside of the test?), but even when we handle a misbehaving runc that doesn't react (properly) to SIGKILL, we should give it more than 50ms to shut down normally.
I have tried to reproduce the steps by calling runc directly, and so far it works; I am not sure what the difference is. I will keep looking, although I have very little time these days. The only thing I can think of is that the SIGKILL signal is not actually being sent to the runc or sleep process, though I am not sure how or why.
If you can confirm that this works outside of the test, can you just add a counter before calling cancelRun() so that case does not trigger at 50ms but after a couple of seconds, for example? Also add some comments describing why we are doing this.
I spent some time debugging this; I think the problem is with the test container, not with buildkit. The sleep process gets wedged in zap_pid_ns_processes, which seems to be related to parent/child reaping. The cancelRun was just giving up and shutting things down even though the sleep process persisted. I have updated the test container to run with tini as the entrypoint to better handle reaping during the Go tests, and so far it seems to work well. Hopefully it will make it through the GitHub workflows...
Looks like the tests all pass when run under tini without the cancelRun() hack.
I have another idea why the zombie processes might be happening, let me test that out. In theory runc should be doing the waitpid, so there should be no zombies for tini to handle...
I have updated the PR again; the bug was in (*runcExecutor).Exec, and the tini hack put me on the right path. The runc pid1 (sleep) was getting wedged in zap_pid_ns_processes, which according to the docs:

> The `zap_pid_ns_processes` function is used in Linux to terminate all processes within a specific namespace.

So the problem was not with the pid1 Kill directly; it was actually a zombie process from the Exec (the sh command) which was preventing zap_pid_ns_processes from finishing. It turns out we were using context.Background() for the runc call via (*runcExecutor).Run but were using the request ctx for the runc call via (*runcExecutor).Exec. So when ctx.Done happened, the parent runc command for the Exec was terminated immediately, before it could call waitpid on the child sh command.
I have moved the wait/kill logic into a common routine that is now called for both Run and Exec, and we use context.Background() for both runc calls.
Force-pushed f7eb53c to 38e37c5
I got a build failure that seems unrelated, retrying:
Okay, tests passing, it looks like
The flakiness is likely from #3401.
@tonistiigi would be great to get your review on this one again, I think all the issues are known/fixed at this point.
Looks good to me, the fix makes sense 🎉 I did spend some time trying to actually find the part of the PR that was relevant to the fix vs the runc issue. Could you split the busy loop fix into a separate commit from the runc fixes (and maybe from the tests as well if there's the potential to cherry-pick just the busy loop fix), to make it a bit easier to trace back in the history?
Am I right in guessing that we could just take the busy loop break for cherry-pick, and leave the tests and runc changes on master? Bit frustrating to not take the tests as well, but from my understanding we'd need the runc changes as well (which seem a bit riskier to cherry-pick to me).
Yeah, that is correct, we have two fixes here. One for the busy loop, one for zombie processes via runc.Exec. I can certainly split up the patch. One patch for the runc.Exec fix, one patch for the busy loop + test.
I suspect we really should cherry-pick both fixes. The zombie processes accumulating from context cancel for gateway exec processes is not great and leads to solves "getting stuck". In practice, I don't think gateway containers with a TTY are used much, so that is likely why we have not seen many problems with this, but both issues are pretty frustrating when trying to use gateway containers.
Force-pushed 38e37c5 to 21a1d11
I have split out the runc.Exec fix into #3722; we will need that merged first before the tests in this PR will pass.
This fixes an issue where the tty message handling loop will go into a tight loop and never exit upon context.Canceled. There is a select statement in `(*procMessageForwarder).Recv` that returns nil on ctx.Done, but the control loop in `(*container).Start` did not exit on this condition. I think the intent was to flush out any inflight messages on cancel, but this is already done in `(*procMessageForwarder).Close`.

Signed-off-by: coryb <cbennett@netflix.com>
Force-pushed 21a1d11 to aa827f5