libcontainer: map pre-exec termination signals to 128+signal by dims · Pull Request #5189 · opencontainers/runc

dims · 2026-03-20T16:54:47Z

There is a narrow pre-exec window spanning the exec.fifo handshake and the final execve in which PID 1 is still the Go-based runc init helper rather than the container payload.

If SIGTERM, SIGINT, or SIGHUP arrives during that interval, Linux does not apply the default terminating action because PID 1 is special. The Go runtime signal path assumes the kernel will finish that work for terminating signals and calls dieFromSignal on that basis. For runc's PID 1 helper, that mismatch leaks Go's internal exit status of 2 instead of the usual shell-style 128+signal.

https://github.com/golang/go/blob/c60392da8b6f18b2aa92db5d22c4963ec25ae0ad/src/runtime/signal_unix.go#L749-L750

This change installs a narrow pre-exec signal handler for those signals while the helper is PID 1. If one of them arrives during that pre-exec window, the helper exits with the conventional shell-style status 128+signal instead of leaking exit status 2.

It also adds libcontainer integration coverage to reproduce the pre-exec race and verify the resulting exit statuses for SIGTERM, SIGINT, and SIGHUP.

dims · 2026-03-20T16:56:23Z

xref: kubernetes/kubernetes#135713

samuelkarp · 2026-03-20T17:03:41Z

If SIGTERM is ignored, it seems like it's dropped forever then? Which would mean (in the container termination case) the init/exec process would start and a higher-level runtime would eventually need to send another SIGTERM or SIGKILL?

Also looking at https://github.com/golang/go/blob/master/src/runtime/signal_unix.go#L993 the exit(2) there comes after a bunch of other attempts for the signal to be handled properly.

liggitt · 2026-03-20T18:02:33Z

agree ignoring doesn't seem quite correct ... https://man7.org/linux/man-pages/man7/signal.7.html

During an execve(2), the dispositions of handled
signals are reset to the default; the dispositions of ignored
signals are left unchanged.

Rather than ignoring (which appears to get replicated to the execve process?), should we be registering handlers for the behavior we want runc to have prior to execve completing successfully, which will apparently get reset by execve?

dims · 2026-03-20T18:52:01Z

@samuelkarp @liggitt totally! meant to open as draft and clicked the wrong button. let me see how i can morph this to get what we want.

kolyshkin · 2026-03-21T02:31:31Z

Indeed this is what happens; the strace output matches the code in https://github.com/golang/go/blob/master/src/runtime/signal_unix.go#L993

[kir@kir-tp1 cli]$ ps ax | grep runc
2850610 pts/14   Sl+    0:00 ./runc --debug run 1234
2850624 ?        Ssl    0:00 ./runc init
2850631 pts/1    S+     0:00 grep --color=auto runc
[kir@kir-tp1 cli]$ strace -p 2850624
strace: Process 2850624 attached
futex(0x5607913d3ed0, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
getpid()                                = 1
gettid()                                = 1
tgkill(1, 1, SIGTERM)                   = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fde050b8290}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
getpid()                                = 1
gettid()                                = 1
tgkill(1, 1, SIGTERM)                   = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fde050b8290}, NULL, 8) = 0
getpid()                                = 1
gettid()                                = 1
tgkill(1, 1, SIGTERM)                   = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
exit_group(2)                           = ?

Perhaps an easier solution would be to ignore SIGTERM? I.e. add

signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)

to roughly the same place the patch here adds defer setupPreExecSignalExit()()?

liggitt · 2026-03-21T04:11:41Z

> Perhaps an easier solution would be to ignore SIGTERM? I.e. add
> signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)

but... we want the signal to result in termination (with an exit code of exitSignalOffset+signal), not be ignored, right?

kolyshkin · 2026-03-21T05:05:05Z

> > Perhaps an easier solution would be to ignore SIGTERM? I.e. add
> > signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)
>
> but... we want the signal to result in termination (with an exit code of exitSignalOffset+signal), not be ignored, right?

Usually SIGTERM is followed by SIGKILL, and this is when we get what we want (exit status 137). Basically the same outcome, only a tad later.

kolyshkin · 2026-03-21T05:13:57Z

> Perhaps an easier solution would be to ignore SIGTERM?

My bad, it probably won't work as the container init will inherit that, which is definitely not what we want here.

kolyshkin · 2026-03-21T06:17:00Z

Right, Go expects the process to die from SIGTERM and similar signals: no signal handler is set explicitly, so the default action should terminate it. Yet it does not, because runc init is PID 1, which gets special treatment from the kernel, and so we end up with what we see.

With that, the proposed patch is probably what we have to do here. While it doesn't give us true "killed by signal 15" semantics (from the child reaper side), it's the second best.

Currently, my only concerns with the proposed implementation are:

this is obviously not needed for exec (aka (*linuxSetnsInit).Init), since runc init won't be PID 1;
this is not needed for create either when a PID namespace is not specified (i.e. add if unix.Getpid() != 1 {return} to setupPreExecSignalExit);
perhaps the cleanup/defer is also not needed? We either end up execve()'ing the process or exiting with an error; in neither case do we need to undo the custom signal handler.

It seems that the list of signals is correct (corresponding to those that have _SigKill in the list in https://github.com/golang/go/blob/master/src/runtime/sigtab_linux_generic.go).

One more minor thing: I don't understand why this is being tested in pidfd-socket.bats. This issue seems unrelated to pidfd. Or is it just more convenient to use pidfd-socket in the test? I don't see it either.

dims · 2026-03-21T11:08:39Z

@kolyshkin please take a look now, i think i incorporated everything you mentioned. 🤞🏾

liggitt · 2026-03-21T14:19:01Z


+	// Translate termination signals to conventional shell-style exit codes
+	// while PID 1 is still the Go-based runc init helper.
+	setupPreExecSignalExit()


If runc gets a SIGTERM/SIGINT/SIGHUP prior to this point, does it exit with the code we want?

If so, what changes between that point earlier in runc's execution and here?

If not, should this move closer to the top of the Init() call?

excellent question, will defer to @kolyshkin for placing this call.

Yes, I think that if signal is delivered earlier, we'll end up with exit(2) as before, so it should probably be set in Init(). We still have

I think we also need to explain in the commit message and/or code that this special treatment is needed because of discrepancy between Go runtime expectations (that the kernel will finish the process) and the actual kernel behavior for PID 1. It is a special case that can be seen as a bug in Go runtime, here:

https://github.com/golang/go/blob/c60392da8b6f18b2aa92db5d22c4963ec25ae0ad/src/runtime/signal_unix.go#L749-L750

(and adding something like && getpid() != 1 to the condition should probably fix it, although I am not an expert in Go runtime to be sure).

@dims can you squash your commits please, and use 80-col width for the commit message?

@kolyshkin thanks for the quick reviews and feedback. I think i got it all squared away now hopefully 🤞🏾

> Yes, I think that if signal is delivered earlier, we'll end up with exit(2) as before, so it should probably be set in Init().

Were we going to relocate the setupPreExecSignalExit() call earlier because of this? Maybe up into libcontainer.Init()?

It won't work this way if the signal is delivered while waiting on the FIFO (which is where runc init spends most of its time, I guess), as the Go runtime will process it immediately.

I guess we can make it work if we add pathrs.Reopen(l.fifoFile, ...) under the same select which checks the interrupt channel. That requires a goroutine, so we're back to square one.

The more I look at it, the more I think this should be handled by Go runtime (the fix is to not assume PID 1 will be killed by SIGTERM/SIGINT/SIGHUP).

> The more I look at it, the more I think this should be handled by Go runtime (the fix is to not assume PID 1 will be killed by SIGTERM/SIGINT/SIGHUP).

yeah

I believe this issue should be addressed in the Go runtime itself.

If we really need to address this issue early in runc, I have a proposal based on the following observation: once a cgroup PID limit is enforced, the kernel only checks the limit when new threads (or processes) are created. If no new threads are spawned after the limit is applied -- even if the current thread count already exceeds the limit -- the kernel won’t kill the process.

Given that, if we ensure the signal-handling goroutine is fully initialized before syncParentHooks notifies the parent (and thus before the PID limit is set), we eliminate the race window entirely. As long as no additional OS threads are created after the limit takes effect, PID 1 should remain safe.

I’ve stress-tested this approach extensively, and it appears to be stable. That said, I’d appreciate further review from the team to validate the logic and check for edge cases I might have missed.

Please see #5197

@lifubang Awesome! happy to close this and defer to your #5197 PR. thank you for digging in.

@kolyshkin runtime.GOMAXPROCS(1) is only there to avoid us hitting very low pid cgroup limits, it doesn't actually guarantee we are single-threaded (Go doesn't provide a mechanism to do this and probably can never provide it because of their runtime model).

I think runtime.LockOSThread is more out of an abundance of caution? I would need to take a closer look at it again...

dims · 2026-03-22T05:12:12Z

the single CI failure seems like an infra flake unrelated to the commit.

+ wget https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz.asc
--2026-03-21 20:22:26--  https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2026-03-21 20:22:26 ERROR 502: Bad Gateway.

--2026-03-21 20:22:26--  https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz.asc
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2026-03-21 20:22:26 ERROR 502: Bad Gateway.

Error: Process completed with exit code 8.

lifubang

LGTM with a few nits.


+	// Translate termination signals to conventional shell-style exit codes
+	// while PID 1 is still the Go-based runc init helper.
+	setupPreExecSignalExit()


There is a narrow pre-exec window spanning the exec.fifo handshake and the final execve in which the Go-based runc init helper is still the container's PID 1. If SIGTERM, SIGINT, or SIGHUP arrives in that window, Linux does not apply the default terminating action because PID 1 is special.

The Go runtime signal path assumes the kernel will finish that work for terminating signals and calls dieFromSignal on that basis; see:
https://github.com/golang/go/blob/c60392da/src/runtime/signal_unix.go#L993

For runc's PID 1 helper, that mismatch leaks Go's internal exit status 2 instead of the usual shell-style 128+signal.

Install a narrow pre-exec signal handler for those signals while the helper is PID 1, and translate them to 128+signal until execve replaces the helper with the container payload.

Add libcontainer integration coverage for the regression. The test uses a StartContainer hook to hold the process in the post-fifo, pre-exec window, signals init through the libcontainer API, and verifies the resulting exit status for SIGTERM, SIGINT, and SIGHUP.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>

Needs more discussion.

dims · 2026-03-25T05:29:05Z

closing in favor of better approach. #5189 (comment)

kolyshkin · 2026-03-25T05:54:02Z

FYI I opened https://go-review.googlesource.com/c/go/+/759040 (mostly as an RFC for now)

cyphar · 2026-03-25T07:03:41Z

Maybe I'm missing something, but why can't we just unmask these signals? PID 1 does get a special signal mask by default but IIRC you can just update the signal mask and things should work okay? Signal masks are preserved across exec IIRC, so we would need to reset the mask before execing the user process.

lifubang · 2026-03-25T07:13:05Z

> so we would need to reset the mask before execing the user process.

However, there is still a race condition during the reset call.

lifubang · 2026-03-25T08:28:24Z

> FYI I opened https://go-review.googlesource.com/c/go/+/759040 (mostly as an RFC for now)

@kolyshkin I looked into your patch for the Go runtime. From my understanding, your implementation also requires runc to explicitly handle these signals. Otherwise, the behavior of runc kill <container-id> SIGINT becomes inconsistent depending on whether the container’s user process has already been execed.

Additionally, your patch breaks containers running Go-based applications. For example:

package main

import (
	"fmt"
	"time"
)

func main() {
	done := make(chan bool)

	go func() {
		for i := 1; ; i++ {
			fmt.Printf("Some background tasks... %d\n", i)
		}
	}()

	<-done
}

With this program, you can no longer terminate the container using Ctrl+C (i.e., sending SIGINT), because the signal is ignored or blocked by the Go runtime under your changes.

kolyshkin · 2026-03-26T05:37:14Z

> Maybe I'm missing something, but why can't we just unmask these signals? PID 1 does get a special signal mask by default but IIRC you can just update the signal mask and things should work okay?

I think this is not about the signal mask or signal delivery, but about the kernel behavior very specific to PID 1 (which is to NOT terminate the process which received e.g. SIGHUP but does not have a handler installed for it). This kernel behavior seems to be controlled by having SIGNAL_UNKILLABLE flag in task_struct's signal->flags. The flag is set for any new process which has pid == 1.

IOW this behavior is hardcoded in the kernel and is not affected by the signal mask.

A curious bug was reported to kubernetes[1] and runc[2] recently: sometimes runc init reports exit status of 2.

Turns out, Go runtime assumes that on any UNIX system signals such as SIGTERM (or any other that has _sigKill flag set in sigtable) with no signal handler set up, will result in kernel terminating the program. This is true, except for PID 1 which gets a custom treatment from the kernel.

As a result, when a Go program that runs as PID 1 (which is easy to achieve in Linux by using a new PID namespace) receives such a signal, Go runtime calls dieFromSignal which falls through all the way to exit(2), which is very confusing to a user.

This issue can be worked around by the program by adding custom handlers for SIGTERM/SIGINT/SIGHUP, but that requires a goroutine to handle those signals, which, in case of runc, unnecessarily raises its NPROC/pid.max requirement (see discussion at [2]).

Since practically exit(2) in dieFromSignal can only happen when the process is running as PID 1, replace it with exit(128+sig) to mimic the shell convention when a child is terminated by a signal.

Add a test case which demonstrates the issue and validates the fix (albeit only on Linux).

[An earlier version of this patch used to do nothing in dieFromSignal for PID 1 case, but such behavior might be a breaking change for a Go program running in a Linux container as PID 1.]

Fixes #78442

[1]: kubernetes/kubernetes#135713
[2]: opencontainers/runc#5189

Change-Id: I196e09e4b5ce84ce2c747a0c2d1fc6e9cf3a6131
Reviewed-on: https://go-review.googlesource.com/c/go/+/759040
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>

dims changed the title ~~Dsrinivas/runc exit2 repro~~ libcontainer: ignore termination signals in the pre-exec window Mar 20, 2026

dims mentioned this pull request Mar 20, 2026

[Flaky test] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails - unexpected exit code 2 kubernetes/kubernetes#135713

Closed

dims marked this pull request as draft March 20, 2026 18:36

dims force-pushed the dsrinivas/runc-exit2-repro branch from 48eb982 to 75ba546 Compare March 20, 2026 20:13

dims changed the title ~~libcontainer: ignore termination signals in the pre-exec window~~ libcontainer: map pre-exec termination signals to 128+signal Mar 20, 2026

samuelkarp reviewed Mar 20, 2026

Comment thread tests/integration/pidfd-socket.bats Outdated

liggitt reviewed Mar 21, 2026

dims force-pushed the dsrinivas/runc-exit2-repro branch 4 times, most recently from 83a4629 to d73cae2 Compare March 21, 2026 20:22

dims marked this pull request as ready for review March 21, 2026 20:27

lifubang previously approved these changes Mar 23, 2026

dims force-pushed the dsrinivas/runc-exit2-repro branch 2 times, most recently from bf66268 to 4cd15ce Compare March 23, 2026 12:53

dims force-pushed the dsrinivas/runc-exit2-repro branch from 4cd15ce to 6e49ffb Compare March 23, 2026 15:05

liggitt reviewed Mar 23, 2026

Comment thread libcontainer/init_linux.go

lifubang mentioned this pull request Mar 25, 2026

libcontainer: map pre-exec PID 1 signals to 128+signal #5197

Closed

dims closed this Mar 25, 2026

kolyshkin mentioned this pull request Mar 29, 2026

runtime: exit 2 from PID 1 upon SIGTERM/INT/HUP golang/go#78442

Closed

kolyshkin removed the backport/1.3-todo, backport/1.4-todo, and backport/1.5-todo labels Mar 30, 2026

7 participants
