Skip to content

libct: fix a race with systemd removal#3812

Merged
AkihiroSuda merged 1 commit intoopencontainers:mainfrom
kolyshkin:sd-rm-race
Apr 21, 2023
Merged

libct: fix a race with systemd removal#3812
AkihiroSuda merged 1 commit intoopencontainers:mainfrom
kolyshkin:sd-rm-race

Conversation

@kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Apr 5, 2023

For the previous attempt to fix that (and added test cases), see commit 9087f2e (PR #2338).

Alas, it's not always working because of cgroup directory TOCTOU.

To solve this and avoid the race, add an error after the operation. Implement it as a method that ignores the error that should be ignored. Instead of currentStatus(), use faster runType(), since we are not interested in Paused status here.

For Processes(), remove the pre-op check, and only use it after getting an error, making the non-error path more straightforward.

For Signal(), add a second check after getting an error. The first check is left as is because signalAllProcesses might print a warning if the cgroup does not exist, and we'd like to avoid that.

This should fix an occasional failure like this one:

not ok 84 kill detached busybox
# (in test file tests/integration/kill.bats, line 27)
#   `[ "$status" -eq 0 ]' failed
....
# runc kill test_busybox KILL (status=0):
# runc kill -a test_busybox 0 (status=1):
# time="2023-04-04T18:24:27Z" level=error msg="lstat /sys/fs/cgroup/devices/system.slice/runc-test_busybox.scope: no such file or directory"

Fixes: #3372
Fixes: #3744

@kolyshkin kolyshkin added this to the 1.2.0 milestone Apr 5, 2023
AkihiroSuda
AkihiroSuda previously approved these changes Apr 6, 2023
@kolyshkin
Copy link
Contributor Author

@lifubang @thaJeztah PTAL

@kolyshkin kolyshkin added area/systemd backport/1.1-pr A backport PR to release-1.1 backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 and removed backport/1.1-pr A backport PR to release-1.1 labels Apr 6, 2023
// for systemd cgroup, the unit's cgroup path will be auto removed if container's all processes exited
if status == Stopped && !c.cgroupManager.Exists() {
pids, err := c.cgroupManager.GetAllPids()
if c.ignoreCgroupError(err) == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is a more readable implement like this:

  1. change the return type of ignoreCgroupError to bool;
  2. if err == nil || c.ignoreCgroupError(err) {
    Otherwise the reader needs to see the function body of ignoreCgroupError to find out what happened here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps reverse the logic, which seems more natural;

if err := c.ignoreCgroupError(err); err != nil {
    return nil, fmt.Errorf("unable to get all container pids: %w", err)
}

return pids, nil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh well, my initial implementation was called isIgnorableCgroupError and returned bool. Then I took the approach used by func ignoreTerminateErrors.

Either way is fine with me.

@lifubang
Copy link
Member

lifubang commented Apr 6, 2023

Yes, I think there is a race condition here, because systemd is a service, we can't know the exactly time of the removal of the cgroup path when there is no pid in this cgroup.

// for systemd cgroup, the unit's cgroup path will be auto removed if container's all processes exited
if status == Stopped && !c.cgroupManager.Exists() {
pids, err := c.cgroupManager.GetAllPids()
if c.ignoreCgroupError(err) == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps reverse the logic, which seems more natural;

if err := c.ignoreCgroupError(err); err != nil {
    return nil, fmt.Errorf("unable to get all container pids: %w", err)
}

return pids, nil

@kolyshkin kolyshkin force-pushed the sd-rm-race branch 2 times, most recently from f4e94f3 to 463675a Compare April 7, 2023 00:39
@kolyshkin
Copy link
Contributor Author

Now, here'a an interesting question: why runc kill <CTID> <SIG> fails if the container is stopped, while runc kill -a <CTID> <SIG> does not?

@lifubang
Copy link
Member

lifubang commented Apr 8, 2023

Now, here'a an interesting question: why runc kill <CTID> <SIG> fails if the container is stopped, while runc kill -a <CTID> <SIG> does not?

I think this happened when randomly hitting this issue.
Both kill with -a or not may hit this issue randomly. Because ps and kill with -a are all calling m.GetAllPids.

@kolyshkin
Copy link
Contributor Author

Now, here'a an interesting question: why runc kill <CTID> <SIG> fails if the container is stopped, while runc kill -a <CTID> <SIG> does not?

I think this happened when randomly hitting this issue. Both kill with -a or not may hit this issue randomly. Because ps and kill with -a are all calling m.GetAllPids.

Let me rephrase my question.

Currently, on a stopped container, runc kill fails, but runc kill -a does not. See:

[root@kir-rhat runc-tst]# ./runc list
ID          PID         STATUS      BUNDLE                                                CREATED                          OWNER
123         0           stopped     /home/kir/go/src/github.com/opencontainers/runc-tst   2023-04-06T21:51:20.521334579Z   root
[root@kir-rhat runc-tst]# ./runc kill 123; echo $?
ERRO[0000] container not running                        
1
[root@kir-rhat runc-tst]# ./runc kill -a 123; echo $?
0

My question was -- is this (the fact that adding -a makes the error go away) an intended behavior, and why?

@lifubang
Copy link
Member

My question was -- is this (the fact that adding -a makes the error go away) an intended behavior, and why?

As described in runtime-spec, please see https://github.com/opencontainers/runtime-spec/blob/main/runtime.md?plain=1#L131-L137
Attempting to send a signal to a container that is neither created nor running MUST have no effect on the container and MUST generate an error.
I think this is an intended behavior.

For -a, it's not defined in runtime-spec. But I think it's also right for the stopped container, for example: shared namespace containers. So, shall we need to add a description for -a in runtime-spec?

@kolyshkin
Copy link
Contributor Author

For -a, it's not defined in runtime-spec. But I think it's also right for the stopped container, for example: shared namespace containers. So, shall we need to add a description for -a in runtime-spec?

Good question. So, it looks that kill -a currently violates the spec.

There are a few ways to look at how runc kill -a should work:

  1. Adhere to runtime-spec description quoted above, and return a "container not running" error (i.e. same behavior as without -a option).
  2. Amend the -a description with "Ignore the container-not-running error". This makes it possible to use the command in a scenario where we want to kill and destroy a container no matter what.
  3. Add --ignore-stopped boolean option to achieve the same scenario as in 2 (and optionally have -a implying --ignore-stopped, which can be reverted by specifying --ignore-stopped=false or smth).

@kolyshkin
Copy link
Contributor Author

@giuseppe WDYT? (see previous comment)

@giuseppe
Copy link
Member

I've no strong opinion, I've followed the same runc behavior in crun.

I am afraid to change the current behavior if anyone depends on it (although the entire mechanism seems like a big race condition).

-a is useful only when there is no PID namespace, since otherwise it is fine to just kill the first process. If -a would follow the runtime spec and fail when the container is stopped, then it wouldn't be able to terminate the other processes. Perhaps the behavior is fine and should just be documented?

@kolyshkin
Copy link
Contributor Author

OK let's agree to merely document the existing behavior of -a.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah
Copy link
Member

OK let's agree to merely document the existing behavior of -a.

Are you planning to open a PR for that, or do we need a tracking ticket?

(looks like this PR may need a rebase, as it's marked "outdated")

For a previous attempt to fix that (and added test cases), see commit
9087f2e.

Alas, it's not always working because of cgroup directory TOCTOU.

To solve this and avoid the race, add an error _after_ the operation.
Implement it as a method that ignores the error that should be ignored.
Instead of currentStatus(), use faster runType(), since we are not
interested in Paused status here.

For Processes(), remove the pre-op check, and only use it after getting
an error, making the non-error path more straightforward.

For Signal(), add a second check after getting an error. The first check
is left as is because signalAllProcesses might print a warning if the
cgroup does not exist, and we'd like to avoid that.

This should fix an occasional failure like this one:

	not ok 84 kill detached busybox
	# (in test file tests/integration/kill.bats, line 27)
	#   `[ "$status" -eq 0 ]' failed
	....
	# runc kill test_busybox KILL (status=0):
	# runc kill -a test_busybox 0 (status=1):
	# time="2023-04-04T18:24:27Z" level=error msg="lstat /sys/fs/cgroup/devices/system.slice/runc-test_busybox.scope: no such file or directory"

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin force-pushed the sd-rm-race branch 2 times, most recently from be1682d to fe278b9 Compare April 21, 2023 00:59
@kolyshkin
Copy link
Contributor Author

Are you planning to open a PR for that, or do we need a tracking ticket?

Opened #3834

(looks like this PR may need a rebase, as it's marked "outdated")

Rebased

@kolyshkin kolyshkin requested a review from AkihiroSuda April 21, 2023 01:01
@AkihiroSuda AkihiroSuda merged commit dac3852 into opencontainers:main Apr 21, 2023
leonardoadame

This comment was marked as off-topic.

@kolyshkin kolyshkin added backport/1.1-done A PR in main branch which has been backported to release-1.1 and removed backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 labels May 22, 2023
@kolyshkin
Copy link
Contributor Author

1.1 backport: #3877

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/systemd backport/1.1-done A PR in main branch which has been backported to release-1.1

Projects

None yet

6 participants