cgroup: check systemd unit creation/removal succeeded by kolyshkin · Pull Request #331 · containers/crun

kolyshkin · 2020-04-16T15:28:42Z

While playing with Fedora 31 host with old/broken selinux packages,
I found out that systemd fails to create a transient unit. Here is
an except from journalctl:

audit[1]: AVC avc: denied { setsched } for pid=1 comm="systemd" scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=process permissive=0
systemd[1]: crun-555.scope: Failed to add PIDs to scope's control group: Permission denied
systemd[1]: crun-555.scope: Failed with result 'resources'.
systemd[1]: Failed to start libcrun container.

and yet crun did not show any error and proceeded to start the
container, which lead to a number of issues.

Since the cgroup was not created by systemd,
but the error was not detected, the container process was put
into the wrong cgroup (but left in the same cgroup as the shell from which crun
was called).
Since crun gets the cgroup name from /proc/$PID/cgroup
(where $PID is container process PID), it proceeded to set the
limits for that (wrong) cgroup:

# cat /sys/fs/cgroup/system.slice/sshd.service/memory.max
536870912

crun delete apparently removes the system.slice/sshd.service
cgroup :(

The primary cause is the missing check that the transient unit has
been created. This is what this patch adds (similar to how it's done
in cgroup-run code).

After this patch:

 # ../crun --systemd-cgroup run -d 555
 error creating systemd unit crun-555.scope: failed

For more background on how the issue was found, steps to repro etc
please see a similar (but much less brutal -- it just fails to start
the container) issue in runc:

runc --systemd-cgroup run: need to check that StartTransientUnit succeeds opencontainers/runc#2313

While at it, abstract out the code preparing and doing the check.

giuseppe

left some comments inline.

Also it seems the libocispec git module is also committed to a previous version

kolyshkin · 2020-04-16T17:07:29Z

  if (strcmp (p->path, path) == 0)
-    *p->terminated = 1;
+    {
+      p->terminated = 1;


theoretically, we can get rid of terminated, but I am not sure if unit and result are guaranteed to be set.

kolyshkin · 2020-04-16T17:07:50Z

+{
+  int ret;
+
+  *data = xmalloc(sizeof(struct systemd_job_removed_s));


Forgot the memset :(

coding style (missing space after ()

don't worry about the coding style, I will re-format and add a check for it with indent after the release :-)

giuseppe · 2020-04-16T19:51:31Z

+      free(d->unit);
+      free(d->result);


these won't be freed if the function returns earlier

I think it will easier if the ownership for data is owned by the caller (so the cleanup can happen in the exit: section), and free both the strings and the struct itselct could be on the stack as it is really small

Are you suggesting moving the code that checks and generates the error into the callback function (systemd_job_removed)? Let me see...

It looks like the only way to allocate the struct systemd_job_removed_s on the stack would be to merge systemd_check_job_status_setup into systemd_check_job_status.

I don't see any way to allocate the unit and result fields on the stack since systemd_job_removed is a callback. The only way to not allocate those are to remove those (moving the error generation into systemd_job_removed (which might not be a bad idea).

giuseppe · 2020-04-16T19:51:44Z

+{
+  int ret;
+
+  *data = xmalloc(sizeof(struct systemd_job_removed_s));


don't worry about the coding style, I will re-format and add a check for it with indent after the release :-)

kolyshkin · 2020-04-16T20:37:14Z

Updated, addressed review comments + using _async to set the match, should make the code somewhat faster (again stealing from systemd-run).

Alternatively, we can merge systemd_check_job_status_setup into systemd_check_job_status, making the code simpler.

giuseppe · 2020-04-16T20:51:05Z

+  if (d->unit && d->result)
+    {
+      if (strcmp (d->result, "done") != 0)
+          return crun_make_error (err, 0, "error %s systemd unit %s: %s", op, d->unit, d->result);


could we make sure we don't leak d->unit and d->result here?

We could just store the return code instead of returning here.

If you feel like playing with autocleanup (but absolutely not necessary), another alternative could be something like: https://github.com/containers/crun/blob/master/src/libcrun/linux.c#L1556 and define a function to do the cleanup for data instead of cleanup_free, and the function could free both strings and the struct itself.

Waaaay too much golang for me 😭

kolyshkin · 2020-04-17T01:25:51Z

Here's how to reproduce.

In crun top srcdir:

sudo dnf install vagrant vagrant-libvirt
cat << EOF > Vagrantfile
Vagrant.configure("2") do |config|
  config.vm.box = "fedora/31-cloud-base"
  config.vm.provider :virtualbox do |v|
    v.memory = 2048
    v.cpus = 2
  end
  config.vm.provider :libvirt do |v|
    v.memory = 2048
    v.cpus = 2
  end
  config.vm.provision "shell", inline: <<-SHELL
    cat << EOF | dnf -y shell
config install_weak_deps False
install podman make autoconf \
 automake libtool gcc python libcap-devel systemd-devel yajl-devel \
 libseccomp-devel python3-libmount go-md2man
ts run
EOF
  SHELL
end
<< EOF
vagrant up
vagrant ssh

On a vagrant box:

rpm -q container-selinux # should show container-selinux-2.117.0-1.gitbfde70a.fc31.noarch or so
cd /vagrant
./configure && make
sudo -s
mkdir t && cd t
# setup config.json and rootfs from e.g. busybox image
../crun --systemd-cgroup run -d foobar
journalctl --tail

giuseppe · 2020-04-17T07:39:46Z

could you amend these changes? I think it simplifies how we deal with the data struct:

diff --git a/src/libcrun/cgroup.c b/src/libcrun/cgroup.c
index 95a99d5..6dede85 100644
--- a/src/libcrun/cgroup.c
+++ b/src/libcrun/cgroup.c
@@ -741,20 +741,18 @@ systemd_job_removed (sd_bus_message *m, void *userdata, sd_bus_error *error arg_
     {
       d->terminated = 1;
       if (strcmp (result, "done") != 0)
-        crun_make_error (&d->err, 0, "error %s systemd unit %s: %s", d->op, unit, result);
+        crun_make_error (&d->err, 0, "error %s systemd unit `%s`: got `%s`", d->op, unit, result);
     }
   return 0;
 }
 
 static int
 systemd_check_job_status_setup (sd_bus *bus,
-                                void **data,
+                                struct systemd_job_removed_s *data,
                                 libcrun_error_t *err)
 {
   int ret;
 
-  *data = xmalloc (sizeof (struct systemd_job_removed_s));
-  memset (*data, 0, sizeof (struct systemd_job_removed_s));
   ret = sd_bus_match_signal_async (bus,
                                    NULL,
                                    "org.freedesktop.systemd1",
@@ -763,29 +761,25 @@ systemd_check_job_status_setup (sd_bus *bus,
                                    "JobRemoved",
                                    systemd_job_removed,
                                    NULL,
-                                   *data);
+                                   data);
   if (UNLIKELY (ret < 0))
-    {
-      free (*data);
-      return crun_make_error (err, -ret, "sd-bus match signal");
-    }
+    return crun_make_error (err, -ret, "sd-bus match signal");
 
   return 0;
 }
 
 static int
 systemd_check_job_status (sd_bus *bus,
-                          void *data,
+                          struct systemd_job_removed_s *data,
                           const char *path,
                           const char *op,
                           libcrun_error_t *err)
 {
   int sd_err;
-  cleanup_free struct systemd_job_removed_s *d = data;
 
-  d->path = path;
-  d->op = op;
-  while (!d->terminated)
+  data->path = path;
+  data->op = op;
+  while (!data->terminated)
     {
       sd_err = sd_bus_process (bus, NULL);
       if (UNLIKELY (sd_err < 0))
@@ -800,10 +794,11 @@ systemd_check_job_status (sd_bus *bus,
 
     }
 
-  if (d->err != NULL) {
-    *err = d->err;
-    return -1;
-  }
+  if (data->err != NULL)
+    {
+      *err = data->err;
+      return -1;
+    }
 
   return 0;
 }
@@ -958,7 +953,7 @@ int enter_systemd_cgroup_scope (runtime_spec_schema_config_linux_resources *reso
   int sd_err, ret = 0;
   sd_bus_error error = SD_BUS_ERROR_NULL;
   const char *object;
-  void *data;
+  struct systemd_job_removed_s job_data = {};
   int i;
   const char *boolean_opts[10];
 
@@ -989,7 +984,7 @@ int enter_systemd_cgroup_scope (runtime_spec_schema_config_linux_resources *reso
         }
     }
 
-  ret = systemd_check_job_status_setup (bus, &data, err);
+  ret = systemd_check_job_status_setup (bus, &job_data, err);
   if (UNLIKELY (ret < 0))
       goto exit;
 
@@ -1107,7 +1102,7 @@ int enter_systemd_cgroup_scope (runtime_spec_schema_config_linux_resources *reso
       goto exit;
     }
 
-  ret = systemd_check_job_status (bus, data, object, "creating", err);
+  ret = systemd_check_job_status (bus, &job_data, object, "creating", err);
 
 exit:
   if (bus)
@@ -1129,7 +1124,7 @@ int destroy_systemd_cgroup_scope (const char *scope, libcrun_error_t *err)
   int ret = 0;
   sd_bus_error error = SD_BUS_ERROR_NULL;
   const char *object;
-  void *data;
+  struct systemd_job_removed_s job_data = {};
 
   if (sd_bus_default (&bus) < 0)
     {
@@ -1137,7 +1132,7 @@ int destroy_systemd_cgroup_scope (const char *scope, libcrun_error_t *err)
       goto exit;
     }
 
-  ret = systemd_check_job_status_setup (bus, &data, err);
+  ret = systemd_check_job_status_setup (bus, &job_data, err);
   if (UNLIKELY (ret < 0))
     goto exit;
 
@@ -1172,7 +1167,7 @@ int destroy_systemd_cgroup_scope (const char *scope, libcrun_error_t *err)
       goto exit;
     }
 
-  ret = systemd_check_job_status (bus, data, object, "removing", err);
+  ret = systemd_check_job_status (bus, &job_data, object, "removing", err);
 
 exit:
   if (bus)

While playing with Fedora 31 host with old/broken selinux packages, I found out that systemd fails to create a transient unit. Here is an except from journalctl: > audit[1]: AVC avc: denied { setsched } for pid=1 comm="systemd" scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=process permissive=0 > systemd[1]: crun-555.scope: Failed to add PIDs to scope's control group: Permission denied > systemd[1]: crun-555.scope: Failed with result 'resources'. > systemd[1]: Failed to start libcrun container. and yet crun did not show any error and proceeded to start the container, which lead to a number of issues. 1. Since the cgroup was not created by systemd, but the error was not detected, the container process was not put into its own cgroup (but left in the same cgroup as the shell from which `crun` was called). 2. Since crun gets the cgroup name from /proc/$PID/cgroup (where $PID is container process PID), it proceeded to set the limits for that (wrong) cgroup: # cat /sys/fs/cgroup/system.slice/sshd.service/memory.max 536870912 3. `crun delete` apparently removes the `system.slice/sshd.service` cgroup :( The primary cause is the missing check that the transient unit has been created. This is what this patch adds (similar to how it's done in cgroup-run code). After this patch: # ../crun --systemd-cgroup run -d 555 2020-04-16T14:47:34.000354150Z: error creating systemd unit crun-555.scope: failed For more background on how the issue was found, steps to repro etc please see a similar (but much less brutal -- it just fails to start the container) issue in runc: - opencontainers/runc#2313 While at it, abstract out the code preparing and doing the check. Fixes: eaccb4b Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin · 2020-04-17T08:01:57Z

could you amend these changes?

done

I think it simplifies how we deal with the data struct:

I agree, it is more practical that way.

kolyshkin · 2020-04-17T08:03:22Z

Tested to work using repro described here: #331 (comment)

The only issue is the error message is printed twice (filed as #335)

giuseppe

LGTM

giuseppe reviewed Apr 16, 2020

View reviewed changes

Comment thread src/libcrun/cgroup.c Outdated

Comment thread src/libcrun/cgroup.c Outdated

kolyshkin force-pushed the systemd-unit branch from 52d2d13 to e4d9ef9 Compare April 16, 2020 16:26

kolyshkin changed the title ~~cgroup: check systemd unit creation succeeded~~ cgroup: check systemd unit creation/removal succeeded Apr 16, 2020

kolyshkin force-pushed the systemd-unit branch from e4d9ef9 to 4c671d8 Compare April 16, 2020 16:33

giuseppe reviewed Apr 16, 2020

View reviewed changes

Comment thread src/libcrun/cgroup.c Outdated

Comment thread src/libcrun/cgroup.c Outdated

kolyshkin force-pushed the systemd-unit branch from 4c671d8 to cd9cc5e Compare April 16, 2020 17:02

kolyshkin commented Apr 16, 2020

View reviewed changes

giuseppe reviewed Apr 16, 2020

View reviewed changes

kolyshkin force-pushed the systemd-unit branch from cd9cc5e to ac2a7ff Compare April 16, 2020 20:34

giuseppe reviewed Apr 16, 2020

View reviewed changes

kolyshkin force-pushed the systemd-unit branch from ac2a7ff to b4402e4 Compare April 17, 2020 00:32

kolyshkin mentioned this pull request Apr 17, 2020

cgroup: check for systemd job result #328

Closed

kolyshkin force-pushed the systemd-unit branch from b4402e4 to 2a67166 Compare April 17, 2020 07:59

giuseppe approved these changes Apr 17, 2020

View reviewed changes

giuseppe merged commit 4ad44b8 into containers:master Apr 17, 2020

kolyshkin mentioned this pull request Apr 20, 2020

check that StartTransientUnit/StopUnit succeeds opencontainers/runc#2331

Merged

hswong3i mentioned this pull request May 18, 2020

crun master require systemd >= 237 #361

Closed

Conversation

kolyshkin commented Apr 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

giuseppe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kolyshkin Apr 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kolyshkin commented Apr 16, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kolyshkin commented Apr 17, 2020

Uh oh!

giuseppe commented Apr 17, 2020

Uh oh!

kolyshkin commented Apr 17, 2020

Uh oh!

kolyshkin commented Apr 17, 2020

Uh oh!

giuseppe left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kolyshkin commented Apr 16, 2020 •

edited

Loading

kolyshkin Apr 16, 2020 •

edited

Loading