Skip to content

libcontainer: map pre-exec termination signals to 128+signal#5189

Closed
dims wants to merge 1 commit into
opencontainers:mainfrom
dims:dsrinivas/runc-exit2-repro
Closed

libcontainer: map pre-exec termination signals to 128+signal#5189
dims wants to merge 1 commit into
opencontainers:mainfrom
dims:dsrinivas/runc-exit2-repro

Conversation

@dims
Copy link
Copy Markdown
Contributor

@dims dims commented Mar 20, 2026

There is a narrow pre-exec window spanning the exec.fifo handshake and the final execve in which PID 1 is still the Go-based runc init helper rather than the container payload.

If SIGTERM, SIGINT, or SIGHUP arrives during that interval, Linux does not apply the default terminating action because PID 1 is special. The Go runtime signal path assumes the kernel will finish that work for terminating signals and calls dieFromSignal on that basis. For runc's PID 1 helper, that mismatch leaks Go's internal exit status of 2 instead of the usual shell-style 128+signal.

https://github.com/golang/go/blob/c60392da8b6f18b2aa92db5d22c4963ec25ae0ad/src/runtime/signal_unix.go#L749-L750

This change installs a narrow pre-exec signal handler for those signals while the helper is PID 1. If one of them arrives during that pre-exec window, the helper exits with the conventional shell-style status 128+signal instead of leaking exit status 2.

It also adds libcontainer integration coverage to reproduce the pre-exec race and verify the resulting exit statuses for SIGTERM, SIGINT, and SIGHUP.

@dims dims changed the title Dsrinivas/runc exit2 repro libcontainer: ignore termination signals in the pre-exec window Mar 20, 2026
@dims
Copy link
Copy Markdown
Contributor Author

dims commented Mar 20, 2026

xref: kubernetes/kubernetes#135713

@samuelkarp
Copy link
Copy Markdown
Member

If SIGTERM is ignored, it seems like it's dropped forever then? Which would mean (in the container termination case) the init/exec process would start and a higher-level runtime would eventually need to send another SIGTERM or SIGKILL?

Also looking at https://github.com/golang/go/blob/master/src/runtime/signal_unix.go#L993 the exit(2) there comes after a bunch of other attempts for the signal to be handled properly.

@liggitt
Copy link
Copy Markdown
Contributor

liggitt commented Mar 20, 2026

agree ignoring doesn't seem quite correct ... https://man7.org/linux/man-pages/man7/signal.7.html

During an execve(2), the dispositions of handled
signals are reset to the default; the dispositions of ignored
signals are left unchanged.

Rather than ignoring (which appears to get replicated to the execve process?), should we be registering handlers for the behavior we want runc to have prior to execve completing successfully, which will apparently get reset by execve?

@dims dims marked this pull request as draft March 20, 2026 18:36
@dims
Copy link
Copy Markdown
Contributor Author

dims commented Mar 20, 2026

@samuelkarp @liggitt totally! meant to open as draft and clicked the wrong button. let me see how i can morph this to get what we want.

@dims dims force-pushed the dsrinivas/runc-exit2-repro branch from 48eb982 to 75ba546 Compare March 20, 2026 20:13
@dims dims changed the title libcontainer: ignore termination signals in the pre-exec window libcontainer: map pre-exec termination signals to 128+signal Mar 20, 2026
Comment thread tests/integration/pidfd-socket.bats Outdated
@kolyshkin
Copy link
Copy Markdown
Contributor

Indeed this is what happens; the strace output matches the code in https://github.com/golang/go/blob/master/src/runtime/signal_unix.go#L993

[kir@kir-tp1 cli]$ ps ax | grep runc
2850610 pts/14   Sl+    0:00 ./runc --debug run 1234
2850624 ?        Ssl    0:00 ./runc init
2850631 pts/1    S+     0:00 grep --color=auto runc
[kir@kir-tp1 cli]$ strace -p 2850624
strace: Process 2850624 attached
futex(0x5607913d3ed0, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
getpid()                                = 1
gettid()                                = 1
tgkill(1, 1, SIGTERM)                   = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fde050b8290}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
getpid()                                = 1
gettid()                                = 1
tgkill(1, 1, SIGTERM)                   = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fde050b8290}, NULL, 8) = 0
getpid()                                = 1
gettid()                                = 1
tgkill(1, 1, SIGTERM)                   = 0
--- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=1, si_uid=0} ---
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
exit_group(2)                           = ?

Perhaps an easier solution would be to ignore SIGTERM? I.e. add

signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)

to roughly the same place the patch here adds defer setupPreExecSignalExit()()?

@liggitt
Copy link
Copy Markdown
Contributor

liggitt commented Mar 21, 2026

Perhaps an easier solution would be to ignore SIGTERM? I.e. add

signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)

but... we want the signal to result in termination (with an exit code of exitSignalOffset+signal), not be ignored, right?

@kolyshkin
Copy link
Copy Markdown
Contributor

Perhaps an easier solution would be to ignore SIGTERM? I.e. add

signal.Ignore(os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)

but... we want the signal to result in termination (with an exit code of exitSignalOffset+signal), not be ignored, right?

Usually SIGTERM is followed by SIGKILL, and this is when we get what we want (exit status 137). Basically the same outcome, only a tad later.

@kolyshkin
Copy link
Copy Markdown
Contributor

Perhaps an easier solution would be to ignore SIGTERM?

My bad, it probably won't work as the container init will inherit that, which is definitely not what we want here.

@kolyshkin
Copy link
Copy Markdown
Contributor

Right, Go expects the process to die from SIGTERM and similar signals, because the signal handler is not set explicitly, and by default it should die from SIGTERM. Yet it does not because runc init is pid 1 which enjoys a special treatment from the kernel and so we end up with what we see.

With that, the proposed patch is probably what we have to do here. While it doesn't give us true "killed by signal 15" semantics (from the child reaper side), it's the second best.

Currently, my only concerns with the proposed implementation are:

  • this is obv not needed for exec (aka (*linuxSetnsInit).Init) since runc init won't be pid 1;
  • this is not needed for create either when PID namespace is not specified (i.e. add if unix.Getpid() != 1 {return} to setupPreExecSignalExit;
  • perhaps the cleanup/defer is also not needed? We either end up execve()'ing the process, or exiting with an error, in neither case we need to undo the custom sighandler.

It seems that the list of signals is correct (corresponding to those that have _SigKill in the list in https://github.com/golang/go/blob/master/src/runtime/sigtab_linux_generic.go

One more minor thing -- I don't understand why is this being tested in pidfd-socket.bats. This issue seems unrelated to pidfd. Or is it just more convenient to use pidfd-socket in the test? I don't see it either.

@dims
Copy link
Copy Markdown
Contributor Author

dims commented Mar 21, 2026

@kolyshkin please take a look now, i think i incorporated everything you mentioned. 🤞🏾

Comment thread libcontainer/standard_init_linux.go Outdated

// Translate termination signals to conventional shell-style exit codes
// while PID 1 is still the Go-based runc init helper.
setupPreExecSignalExit()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If runc gets a SIGTERM/SIGINT/SIGHUP prior to this point, does it exit with the code we want?

If so, what changes between that point earlier in runc's execution and here?

If not, should this move closer to the top of the Init() call?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent question, will defer to @kolyshkin for placing this call.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that if signal is delivered earlier, we'll end up with exit(2) as before, so it should probably be set in Init(). We still have

I think we also need to explain in the commit message and/or code that this special treatment is needed because of discrepancy between Go runtime expectations (that the kernel will finish the process) and the actual kernel behavior for PID 1. It is a special case that can be seen as a bug in Go runtime, here:

https://github.com/golang/go/blob/c60392da8b6f18b2aa92db5d22c4963ec25ae0ad/src/runtime/signal_unix.go#L749-L750

(and adding something like && getpid() != 1 to the condition should probably fix it, although I am not an expert in Go runtime to be sure).

@dims can you squash your commits please, and use 80-col width for the commit message?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin thanks for the quick reviews and feedback. I think i got it all squared away now hopefully 🤞🏾

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that if signal is delivered earlier, we'll end up with exit(2) as before, so it should probably be set in Init().

Were we going to relocate the setupPreExecSignalExit() call earlier because of this? Maybe up into libcontainer.Init()?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't work this way if the signal will be delivered during waiting on FIFO (which is where runc init spends most of the time I guess), as go runtime will process it immediately.

I guess we can make it work if we add pathrs.Reopen(l.fifoFile, ...) to under the same select which checks the interrupt channel. Which requires a goroutine, so we're back to square 1.

The more I look at it, the more I think this should be handled by Go runtime (the fix is to not assume PID 1 will be killed by SIGTERM/SIGINT/SIGHUP).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at it, the more I think this should be handled by Go runtime (the fix is to not assume PID 1 will be killed by SIGTERM/SIGINT/SIGHUP).

yeah

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this issue should be addressed in the Go runtime itself.

If we really need to address this issue early in runc, I have a proposal based on the following observation: once a cgroup PID limit is enforced, the kernel only checks the limit when new threads (or processes) are created. If no new threads are spawned after the limit is applied -- even if the current thread count already exceeds the limit -- the kernel won’t kill the process.

Given that, if we ensure the signal-handling goroutine is fully initialized before syncParentHooks notifies the parent (and thus before the PID limit is set), we eliminate the race window entirely. As long as no additional OS threads are created after the limit takes effect, PID 1 should remain safe.

I’ve stress-tested this approach extensively, and it appears to be stable. That said, I’d appreciate further review from the team to validate the logic and check for edge cases I might have missed.

Please see #5197

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lifubang Awesome! happy to close this and defer to your #5197 PR. thank you for digging in.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin runtime.GOMAXPROCS(1) is only there to avoid us hitting very low pid cgroup limits, it doesn't actually guarantee we are single-threaded (Go doesn't provide a mechanism to do this and probably can never provide it because of their runtime model).

I think runtime.LockOSThread is more out of an abundance of caution? I would need to take a closer look at it again...

@dims dims force-pushed the dsrinivas/runc-exit2-repro branch 4 times, most recently from 83a4629 to d73cae2 Compare March 21, 2026 20:22
@dims dims marked this pull request as ready for review March 21, 2026 20:27
@dims
Copy link
Copy Markdown
Contributor Author

dims commented Mar 22, 2026

the single CI failure seems like a infra/flake unrelated to the commit.

+ wget https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz.asc
--2026-03-21 20:22:26--  https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2026-03-21 20:22:26 ERROR 502: Bad Gateway.

--2026-03-21 20:22:26--  https://github.com/cyphar/libpathrs/releases/download/v0.2.4/libpathrs-0.2.4.tar.xz.asc
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2026-03-21 20:22:26 ERROR 502: Bad Gateway.

Error: Process completed with exit code 8.

lifubang
lifubang previously approved these changes Mar 23, 2026
Copy link
Copy Markdown
Member

@lifubang lifubang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a few nits.

Comment thread libcontainer/integration/preexec_signal_test.go
Comment thread libcontainer/init_linux.go Outdated
Comment thread libcontainer/standard_init_linux.go Outdated

// Translate termination signals to conventional shell-style exit codes
// while PID 1 is still the Go-based runc init helper.
setupPreExecSignalExit()

This comment was marked as outdated.

@lifubang lifubang added kind/bug backport/1.3-todo A PR in main branch which needs to be backported to release-1.3 backport/1.4-todo A PR in main branch which needs to backported to release-1.4 backport/1.5-todo A PR in main branch which needs to be backported to release-1.5 enhancement and removed kind/bug labels Mar 23, 2026
@dims dims force-pushed the dsrinivas/runc-exit2-repro branch 2 times, most recently from bf66268 to 4cd15ce Compare March 23, 2026 12:53
There is a narrow pre-exec window spanning the exec.fifo handshake
and the final execve in which the Go-based runc init helper is still
the container's PID 1.

If SIGTERM, SIGINT, or SIGHUP arrives in that window, Linux does not
apply the default terminating action because PID 1 is special. The Go
runtime signal path assumes the kernel will finish that work for
terminating signals and calls dieFromSignal on that basis; see:
https://github.com/golang/go/blob/c60392da/src/runtime/signal_unix.go#L993

For runc's PID 1 helper, that mismatch leaks Go's internal exit status
2 instead of the usual shell-style 128+signal.

Install a narrow pre-exec signal handler for those signals while the
helper is PID 1, and translate them to 128+signal until execve
replaces the helper with the container payload.

Add libcontainer integration coverage for the regression. The test uses
a StartContainer hook to hold the process in the post-fifo, pre-exec
window, signals init through the libcontainer API, and verifies the
resulting exit status for SIGTERM, SIGINT, and SIGHUP.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
@dims dims force-pushed the dsrinivas/runc-exit2-repro branch from 4cd15ce to 6e49ffb Compare March 23, 2026 15:05
@lifubang lifubang dismissed their stale review March 23, 2026 16:08

Needs more discussion.

Comment thread libcontainer/init_linux.go
@dims
Copy link
Copy Markdown
Contributor Author

dims commented Mar 25, 2026

closing in favor of better approach. #5189 (comment)

@dims dims closed this Mar 25, 2026
@kolyshkin
Copy link
Copy Markdown
Contributor

FYI I opened https://go-review.googlesource.com/c/go/+/759040 (mostly as an RFC for now)

@cyphar
Copy link
Copy Markdown
Member

cyphar commented Mar 25, 2026

Maybe I'm missing something, but why can't we just unmask these signals? PID 1 does get a special signal mask by default but IIRC you can just update the signal mask and things should work okay? Signal masks are preserved across exec IIRC, so we would need to reset the mask before execing the user proces.

@lifubang
Copy link
Copy Markdown
Member

so we would need to reset the mask before execing the user proces.

However, there is still a race condition during the reset call.

@lifubang
Copy link
Copy Markdown
Member

FYI I opened https://go-review.googlesource.com/c/go/+/759040 (mostly as an RFC for now)

@kolyshkin I looked into your patch for the Go runtime. From my understanding, your implementation also requires runc to explicitly handle these signals. Otherwise, the behavior of runc kill <container-id> SIGINT becomes inconsistent depending on whether the container’s user process has already been execed.

Additionally, your patch breaks containers running Go-based applications. For example:

package main

import (
	"fmt"
	"time"
)

func main() {
	done := make(chan bool)

	go func() {
		for i := 1; ; i++ {
			fmt.Printf("Some background tasks... %d\n", i)
		}
	}()

	<-done
}

With this program, you can no longer terminate the container using Ctrl+C (i.e., sending SIGINT), because the signal is ignored or blocked by the Go runtime under your changes.

@kolyshkin
Copy link
Copy Markdown
Contributor

Maybe I'm missing something, but why can't we just unmask these signals? PID 1 does get a special signal mask by default but IIRC you can just update the signal mask and things should work okay?

I think this is not about the signal mask or signal delivery, but about the kernel behavior very specific to PID 1 (which is to NOT terminate the process which received e.g. SIGHUP but does not have a handler installed for it). This kernel behavior seems to be controlled by having SIGNAL_UNKILLABLE flag in task_struct's signal->flags. The flag is set for any new process which has pid == 1.

IOW this behavior is hardcoded in the kernel and is not affected by the signal mask.

@kolyshkin kolyshkin removed backport/1.3-todo A PR in main branch which needs to be backported to release-1.3 backport/1.4-todo A PR in main branch which needs to backported to release-1.4 backport/1.5-todo A PR in main branch which needs to be backported to release-1.5 labels Mar 30, 2026
gopherbot pushed a commit to golang/go that referenced this pull request Mar 31, 2026
A curious bug was reported to kubernetes[1] and runc[2] recently:
sometimes runc init reports exit status of 2.

Turns out, Go runtime assumes that on any UNIX system signals such as
SIGTERM (or any other that has _sigKill flag set in sigtable) with no
signal handler set up, will result in kernel terminating the program.
This is true, except for PID 1 which gets a custom treatment from the
kernel.

As a result, when a Go program that runs as PID 1 (which is easy to
achieve in Linux by using a new PID namespace) receives such a signal,
Go runtime calls dieFromSignal which falls through all the way to
exit(2), which is very confusing to a user.

This issue can be worked around by the program by adding custom handlers
for SIGTERM/SIGINT/SIGHUP, but that requires a goroutine to handle those
signals, which, in case of runc, unnecessarily raises its NPROC/pid.max
requirement (see discussion at [2]).

Since practically exit(2) in dieFromSignal can only happen when the
process is running as PID 1, replace it with exit(128+sig) to mimic
the shell convention when a child is terminated by a signal.

Add a test case which demonstrates the issue and validates the fix
(albeit only on Linux).

[An earlier version of this patch used to do nothing in dieFromSignal
for PID 1 case, but such behavior might be a breaking change for a Go
program running in a Linux container as PID 1.]

Fixes #78442

[1]: kubernetes/kubernetes#135713
[2]: opencontainers/runc#5189

Change-Id: I196e09e4b5ce84ce2c747a0c2d1fc6e9cf3a6131
Reviewed-on: https://go-review.googlesource.com/c/go/+/759040
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants