libcontainer: map pre-exec termination signals to 128+signal#5189
Conversation
|
If SIGTERM is ignored, it seems like it's dropped forever then? Which would mean (in the container termination case) the init/exec process would start and a higher-level runtime would eventually need to send another SIGTERM or SIGKILL? Also looking at https://github.com/golang/go/blob/master/src/runtime/signal_unix.go#L993 the exit(2) there comes after a bunch of other attempts for the signal to be handled properly. |
|
agree ignoring doesn't seem quite correct ... https://man7.org/linux/man-pages/man7/signal.7.html
Rather than ignoring (which appears to get replicated to the execve process?), should we be registering handlers for the behavior we want runc to have prior to execve completing successfully, which will apparently get reset by execve? |
|
@samuelkarp @liggitt totally! meant to open as draft and clicked the wrong button. let me see how i can morph this to get what we want. |
48eb982 to
75ba546
Compare
|
Indeed this is what happens; the strace output matches the code in https://github.com/golang/go/blob/master/src/runtime/signal_unix.go#L993 [kir@kir-tp1 cli]$ ps ax | grep runc
2850610 pts/14 Sl+ 0:00 ./runc --debug run 1234
2850624 ? Ssl 0:00 ./runc init
2850631 pts/1 S+ 0:00 grep --color=auto runc
[kir@kir-tp1 cli]$ strace -p 2850624
strace: Process 2850624 attached
futex(0x5607913d3ed0, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
getpid() = 1
gettid() = 1
tgkill(1, 1, SIGTERM) = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fde050b8290}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
getpid() = 1
gettid() = 1
tgkill(1, 1, SIGTERM) = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fde050b8290}, NULL, 8) = 0
getpid() = 1
gettid() = 1
tgkill(1, 1, SIGTERM) = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
exit_group(2) = ?Perhaps an easier solution would be to ignore SIGTERM? I.e. add signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)to roughly the same place the patch here adds |
but... we want the signal to result in termination (with an exit code of exitSignalOffset+signal), not be ignored, right? |
Usually SIGTERM is followed by SIGKILL, and this is when we get what we want (exit status 137). Basically the same outcome, only a tad later. |
My bad, it probably won't work as the container init will inherit that, which is definitely not what we want here. |
|
Right, Go expects the process to die from SIGTERM and similar signals, because the signal handler is not set explicitly, and by default it should die from SIGTERM. Yet it does not because With that, the proposed patch is probably what we have to do here. While it doesn't give us true "killed by signal 15" semantics (from the child reaper side), it's the second best. Currently, my only concerns with the proposed implementation are:
It seems that the list of signals is correct (corresponding to those that have One more minor thing -- I don't understand why is this being tested in |
|
@kolyshkin please take a look now, i think i incorporated everything you mentioned. 🤞🏾 |
|
|
||
| // Translate termination signals to conventional shell-style exit codes | ||
| // while PID 1 is still the Go-based runc init helper. | ||
| setupPreExecSignalExit() |
There was a problem hiding this comment.
If runc gets a SIGTERM/SIGINT/SIGHUP prior to this point, does it exit with the code we want?
If so, what changes between that point earlier in runc's execution and here?
If not, should this move closer to the top of the Init() call?
There was a problem hiding this comment.
excellent question, will defer to @kolyshkin for placing this call.
There was a problem hiding this comment.
Yes, I think that if signal is delivered earlier, we'll end up with exit(2) as before, so it should probably be set in Init(). We still have
I think we also need to explain in the commit message and/or code that this special treatment is needed because of discrepancy between Go runtime expectations (that the kernel will finish the process) and the actual kernel behavior for PID 1. It is a special case that can be seen as a bug in Go runtime, here:
(and adding something like && getpid() != 1 to the condition should probably fix it, although I am not an expert in Go runtime to be sure).
@dims can you squash your commits please, and use 80-col width for the commit message?
There was a problem hiding this comment.
@kolyshkin thanks for the quick reviews and feedback. I think i got it all squared away now hopefully 🤞🏾
There was a problem hiding this comment.
Yes, I think that if signal is delivered earlier, we'll end up with exit(2) as before, so it should probably be set in
Init().
Were we going to relocate the setupPreExecSignalExit() call earlier because of this? Maybe up into libcontainer.Init()?
There was a problem hiding this comment.
It won't work this way if the signal will be delivered during waiting on FIFO (which is where runc init spends most of the time I guess), as go runtime will process it immediately.
I guess we can make it work if we add pathrs.Reopen(l.fifoFile, ...) to under the same select which checks the interrupt channel. Which requires a goroutine, so we're back to square 1.
The more I look at it, the more I think this should be handled by Go runtime (the fix is to not assume PID 1 will be killed by SIGTERM/SIGINT/SIGHUP).
There was a problem hiding this comment.
The more I look at it, the more I think this should be handled by Go runtime (the fix is to not assume PID 1 will be killed by SIGTERM/SIGINT/SIGHUP).
yeah
There was a problem hiding this comment.
I believe this issue should be addressed in the Go runtime itself.
If we really need to address this issue early in runc, I have a proposal based on the following observation: once a cgroup PID limit is enforced, the kernel only checks the limit when new threads (or processes) are created. If no new threads are spawned after the limit is applied -- even if the current thread count already exceeds the limit -- the kernel won’t kill the process.
Given that, if we ensure the signal-handling goroutine is fully initialized before syncParentHooks notifies the parent (and thus before the PID limit is set), we eliminate the race window entirely. As long as no additional OS threads are created after the limit takes effect, PID 1 should remain safe.
I’ve stress-tested this approach extensively, and it appears to be stable. That said, I’d appreciate further review from the team to validate the logic and check for edge cases I might have missed.
Please see #5197
There was a problem hiding this comment.
@kolyshkin runtime.GOMAXPROCS(1) is only there to avoid us hitting very low pid cgroup limits, it doesn't actually guarantee we are single-threaded (Go doesn't provide a mechanism to do this and probably can never provide it because of their runtime model).
I think runtime.LockOSThread is more out of an abundance of caution? I would need to take a closer look at it again...
83a4629 to
d73cae2
Compare
|
the single CI failure seems like a infra/flake unrelated to the commit. |
|
|
||
| // Translate termination signals to conventional shell-style exit codes | ||
| // while PID 1 is still the Go-based runc init helper. | ||
| setupPreExecSignalExit() |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
bf66268 to
4cd15ce
Compare
There is a narrow pre-exec window spanning the exec.fifo handshake and the final execve in which the Go-based runc init helper is still the container's PID 1. If SIGTERM, SIGINT, or SIGHUP arrives in that window, Linux does not apply the default terminating action because PID 1 is special. The Go runtime signal path assumes the kernel will finish that work for terminating signals and calls dieFromSignal on that basis; see: https://github.com/golang/go/blob/c60392da/src/runtime/signal_unix.go#L993 For runc's PID 1 helper, that mismatch leaks Go's internal exit status 2 instead of the usual shell-style 128+signal. Install a narrow pre-exec signal handler for those signals while the helper is PID 1, and translate them to 128+signal until execve replaces the helper with the container payload. Add libcontainer integration coverage for the regression. The test uses a StartContainer hook to hold the process in the post-fifo, pre-exec window, signals init through the libcontainer API, and verifies the resulting exit status for SIGTERM, SIGINT, and SIGHUP. Signed-off-by: Davanum Srinivas <davanum@gmail.com>
4cd15ce to
6e49ffb
Compare
|
closing in favor of better approach. #5189 (comment) |
|
FYI I opened https://go-review.googlesource.com/c/go/+/759040 (mostly as an RFC for now) |
|
Maybe I'm missing something, but why can't we just unmask these signals? PID 1 does get a special signal mask by default but IIRC you can just update the signal mask and things should work okay? Signal masks are preserved across |
However, there is still a race condition during the reset call. |
@kolyshkin I looked into your patch for the Go runtime. From my understanding, your implementation also requires Additionally, your patch breaks containers running Go-based applications. For example: package main
import (
"fmt"
"time"
)
func main() {
done := make(chan bool)
go func() {
for i := 1; ; i++ {
fmt.Printf("Some background tasks... %d\n", i)
}
}()
<-done
}With this program, you can no longer terminate the container using |
I think this is not about the signal mask or signal delivery, but about the kernel behavior very specific to PID 1 (which is to NOT terminate the process which received e.g. SIGHUP but does not have a handler installed for it). This kernel behavior seems to be controlled by having IOW this behavior is hardcoded in the kernel and is not affected by the signal mask. |
A curious bug was reported to kubernetes[1] and runc[2] recently: sometimes runc init reports exit status of 2. Turns out, Go runtime assumes that on any UNIX system signals such as SIGTERM (or any other that has _sigKill flag set in sigtable) with no signal handler set up, will result in kernel terminating the program. This is true, except for PID 1 which gets a custom treatment from the kernel. As a result, when a Go program that runs as PID 1 (which is easy to achieve in Linux by using a new PID namespace) receives such a signal, Go runtime calls dieFromSignal which falls through all the way to exit(2), which is very confusing to a user. This issue can be worked around by the program by adding custom handlers for SIGTERM/SIGINT/SIGHUP, but that requires a goroutine to handle those signals, which, in case of runc, unnecessarily raises its NPROC/pid.max requirement (see discussion at [2]). Since practically exit(2) in dieFromSignal can only happen when the process is running as PID 1, replace it with exit(128+sig) to mimic the shell convention when a child is terminated by a signal. Add a test case which demonstrates the issue and validates the fix (albeit only on Linux). [An earlier version of this patch used to do nothing in dieFromSignal for PID 1 case, but such behavior might be a breaking change for a Go program running in a Linux container as PID 1.] Fixes #78442 [1]: kubernetes/kubernetes#135713 [2]: opencontainers/runc#5189 Change-Id: I196e09e4b5ce84ce2c747a0c2d1fc6e9cf3a6131 Reviewed-on: https://go-review.googlesource.com/c/go/+/759040 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
There is a narrow pre-exec window spanning the exec.fifo handshake and the final execve in which PID 1 is still the Go-based runc init helper rather than the container payload.
If
SIGTERM,SIGINT, orSIGHUParrives during that interval, Linux does not apply the default terminating action because PID 1 is special. The Go runtime signal path assumes the kernel will finish that work for terminating signals and callsdieFromSignalon that basis. Forrunc's PID 1 helper, that mismatch leaks Go's internal exit status of2instead of the usual shell-style128+signal.https://github.com/golang/go/blob/c60392da8b6f18b2aa92db5d22c4963ec25ae0ad/src/runtime/signal_unix.go#L749-L750
This change installs a narrow pre-exec signal handler for those signals while the helper is PID 1. If one of them arrives during that pre-exec window, the helper exits with the conventional shell-style status 128+signal instead of leaking exit status 2.
It also adds
libcontainerintegration coverage to reproduce the pre-exec race and verify the resulting exit statuses forSIGTERM,SIGINT, andSIGHUP.