Skip to content

cgroup: check systemd unit creation/removal succeeded#331

Merged
giuseppe merged 1 commit into
containers:masterfrom
kolyshkin:systemd-unit
Apr 17, 2020
Merged

cgroup: check systemd unit creation/removal succeeded#331
giuseppe merged 1 commit into
containers:masterfrom
kolyshkin:systemd-unit

Conversation

@kolyshkin
Copy link
Copy Markdown
Collaborator

@kolyshkin kolyshkin commented Apr 16, 2020

While playing with Fedora 31 host with old/broken selinux packages,
I found out that systemd fails to create a transient unit. Here is
an except from journalctl:

audit[1]: AVC avc: denied { setsched } for pid=1 comm="systemd" scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=process permissive=0
systemd[1]: crun-555.scope: Failed to add PIDs to scope's control group: Permission denied
systemd[1]: crun-555.scope: Failed with result 'resources'.
systemd[1]: Failed to start libcrun container.

and yet crun did not show any error and proceeded to start the
container, which lead to a number of issues.

  1. Since the cgroup was not created by systemd,
    but the error was not detected, the container process was put
    into the wrong cgroup (but left in the same cgroup as the shell from which crun
    was called).

  2. Since crun gets the cgroup name from /proc/$PID/cgroup
    (where $PID is container process PID), it proceeded to set the
    limits for that (wrong) cgroup:

# cat /sys/fs/cgroup/system.slice/sshd.service/memory.max
536870912
  1. crun delete apparently removes the system.slice/sshd.service
    cgroup :(

The primary cause is the missing check that the transient unit has
been created. This is what this patch adds (similar to how it's done
in cgroup-run code).

After this patch:

 # ../crun --systemd-cgroup run -d 555
 error creating systemd unit crun-555.scope: failed

For more background on how the issue was found, steps to repro etc
please see a similar (but much less brutal -- it just fails to start
the container) issue in runc:

While at it, abstract out the code preparing and doing the check.

Comment thread src/libcrun/cgroup.c Outdated
Comment thread src/libcrun/cgroup.c Outdated
@kolyshkin kolyshkin changed the title cgroup: check systemd unit creation succeeded cgroup: check systemd unit creation/removal succeeded Apr 16, 2020
Copy link
Copy Markdown
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments inline.

Also it seems the libocispec git module is also committed to a previous version

Comment thread src/libcrun/cgroup.c Outdated
Comment thread src/libcrun/cgroup.c Outdated
Comment thread src/libcrun/cgroup.c Outdated
if (strcmp (p->path, path) == 0)
*p->terminated = 1;
{
p->terminated = 1;
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

theoretically, we can get rid of terminated, but I am not sure if unit and result are guaranteed to be set.

Comment thread src/libcrun/cgroup.c Outdated
{
int ret;

*data = xmalloc(sizeof(struct systemd_job_removed_s));
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot the memset :(

Copy link
Copy Markdown
Collaborator Author

@kolyshkin kolyshkin Apr 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding style (missing space after ()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't worry about the coding style, I will re-format and add a check for it with indent after the release :-)

Comment thread src/libcrun/cgroup.c Outdated
Comment on lines +802 to +803
free(d->unit);
free(d->result);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these won't be freed if the function returns earlier

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will easier if the ownership for data is owned by the caller (so the cleanup can happen in the exit: section), and free both the strings and the struct itselct could be on the stack as it is really small

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting moving the code that checks and generates the error into the callback function (systemd_job_removed)? Let me see...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the only way to allocate the struct systemd_job_removed_s on the stack would be to merge systemd_check_job_status_setup into systemd_check_job_status.

I don't see any way to allocate the unit and result fields on the stack since systemd_job_removed is a callback. The only way to not allocate those are to remove those (moving the error generation into systemd_job_removed (which might not be a bad idea).

Comment thread src/libcrun/cgroup.c Outdated
{
int ret;

*data = xmalloc(sizeof(struct systemd_job_removed_s));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't worry about the coding style, I will re-format and add a check for it with indent after the release :-)

@kolyshkin
Copy link
Copy Markdown
Collaborator Author

Updated, addressed review comments + using _async to set the match, should make the code somewhat faster (again stealing from systemd-run).

Alternatively, we can merge systemd_check_job_status_setup into systemd_check_job_status, making the code simpler.

Comment thread src/libcrun/cgroup.c Outdated
if (d->unit && d->result)
{
if (strcmp (d->result, "done") != 0)
return crun_make_error (err, 0, "error %s systemd unit %s: %s", op, d->unit, d->result);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make sure we don't leak d->unit and d->result here?

We could just store the return code instead of returning here.

If you feel like playing with autocleanup (but absolutely not necessary), another alternative could be something like: https://github.com/containers/crun/blob/master/src/libcrun/linux.c#L1556 and define a function to do the cleanup for data instead of cleanup_free, and the function could free both strings and the struct itself.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waaaay too much golang for me 😭

@kolyshkin
Copy link
Copy Markdown
Collaborator Author

Here's how to reproduce.

In crun top srcdir:

sudo dnf install vagrant vagrant-libvirt
cat << EOF > Vagrantfile
Vagrant.configure("2") do |config|
  config.vm.box = "fedora/31-cloud-base"
  config.vm.provider :virtualbox do |v|
    v.memory = 2048
    v.cpus = 2
  end
  config.vm.provider :libvirt do |v|
    v.memory = 2048
    v.cpus = 2
  end
  config.vm.provision "shell", inline: <<-SHELL
    cat << EOF | dnf -y shell
config install_weak_deps False
install podman make autoconf \
 automake libtool gcc python libcap-devel systemd-devel yajl-devel \
 libseccomp-devel python3-libmount go-md2man
ts run
EOF
  SHELL
end
<< EOF
vagrant up
vagrant ssh

On a vagrant box:

rpm -q container-selinux # should show container-selinux-2.117.0-1.gitbfde70a.fc31.noarch or so
cd /vagrant
./configure && make
sudo -s
mkdir t && cd t
# setup config.json and rootfs from e.g. busybox image
../crun --systemd-cgroup run -d foobar
journalctl --tail

@giuseppe
Copy link
Copy Markdown
Member

could you amend these changes? I think it simplifies how we deal with the data struct:

diff --git a/src/libcrun/cgroup.c b/src/libcrun/cgroup.c
index 95a99d5..6dede85 100644
--- a/src/libcrun/cgroup.c
+++ b/src/libcrun/cgroup.c
@@ -741,20 +741,18 @@ systemd_job_removed (sd_bus_message *m, void *userdata, sd_bus_error *error arg_
     {
       d->terminated = 1;
       if (strcmp (result, "done") != 0)
-        crun_make_error (&d->err, 0, "error %s systemd unit %s: %s", d->op, unit, result);
+        crun_make_error (&d->err, 0, "error %s systemd unit `%s`: got `%s`", d->op, unit, result);
     }
   return 0;
 }
 
 static int
 systemd_check_job_status_setup (sd_bus *bus,
-                                void **data,
+                                struct systemd_job_removed_s *data,
                                 libcrun_error_t *err)
 {
   int ret;
 
-  *data = xmalloc (sizeof (struct systemd_job_removed_s));
-  memset (*data, 0, sizeof (struct systemd_job_removed_s));
   ret = sd_bus_match_signal_async (bus,
                                    NULL,
                                    "org.freedesktop.systemd1",
@@ -763,29 +761,25 @@ systemd_check_job_status_setup (sd_bus *bus,
                                    "JobRemoved",
                                    systemd_job_removed,
                                    NULL,
-                                   *data);
+                                   data);
   if (UNLIKELY (ret < 0))
-    {
-      free (*data);
-      return crun_make_error (err, -ret, "sd-bus match signal");
-    }
+    return crun_make_error (err, -ret, "sd-bus match signal");
 
   return 0;
 }
 
 static int
 systemd_check_job_status (sd_bus *bus,
-                          void *data,
+                          struct systemd_job_removed_s *data,
                           const char *path,
                           const char *op,
                           libcrun_error_t *err)
 {
   int sd_err;
-  cleanup_free struct systemd_job_removed_s *d = data;
 
-  d->path = path;
-  d->op = op;
-  while (!d->terminated)
+  data->path = path;
+  data->op = op;
+  while (!data->terminated)
     {
       sd_err = sd_bus_process (bus, NULL);
       if (UNLIKELY (sd_err < 0))
@@ -800,10 +794,11 @@ systemd_check_job_status (sd_bus *bus,
 
     }
 
-  if (d->err != NULL) {
-    *err = d->err;
-    return -1;
-  }
+  if (data->err != NULL)
+    {
+      *err = data->err;
+      return -1;
+    }
 
   return 0;
 }
@@ -958,7 +953,7 @@ int enter_systemd_cgroup_scope (runtime_spec_schema_config_linux_resources *reso
   int sd_err, ret = 0;
   sd_bus_error error = SD_BUS_ERROR_NULL;
   const char *object;
-  void *data;
+  struct systemd_job_removed_s job_data = {};
   int i;
   const char *boolean_opts[10];
 
@@ -989,7 +984,7 @@ int enter_systemd_cgroup_scope (runtime_spec_schema_config_linux_resources *reso
         }
     }
 
-  ret = systemd_check_job_status_setup (bus, &data, err);
+  ret = systemd_check_job_status_setup (bus, &job_data, err);
   if (UNLIKELY (ret < 0))
       goto exit;
 
@@ -1107,7 +1102,7 @@ int enter_systemd_cgroup_scope (runtime_spec_schema_config_linux_resources *reso
       goto exit;
     }
 
-  ret = systemd_check_job_status (bus, data, object, "creating", err);
+  ret = systemd_check_job_status (bus, &job_data, object, "creating", err);
 
 exit:
   if (bus)
@@ -1129,7 +1124,7 @@ int destroy_systemd_cgroup_scope (const char *scope, libcrun_error_t *err)
   int ret = 0;
   sd_bus_error error = SD_BUS_ERROR_NULL;
   const char *object;
-  void *data;
+  struct systemd_job_removed_s job_data = {};
 
   if (sd_bus_default (&bus) < 0)
     {
@@ -1137,7 +1132,7 @@ int destroy_systemd_cgroup_scope (const char *scope, libcrun_error_t *err)
       goto exit;
     }
 
-  ret = systemd_check_job_status_setup (bus, &data, err);
+  ret = systemd_check_job_status_setup (bus, &job_data, err);
   if (UNLIKELY (ret < 0))
     goto exit;
 
@@ -1172,7 +1167,7 @@ int destroy_systemd_cgroup_scope (const char *scope, libcrun_error_t *err)
       goto exit;
     }
 
-  ret = systemd_check_job_status (bus, data, object, "removing", err);
+  ret = systemd_check_job_status (bus, &job_data, object, "removing", err);
 
 exit:
   if (bus)

While playing with Fedora 31 host with old/broken selinux packages,
I found out that systemd fails to create a transient unit. Here is
an except from journalctl:

> audit[1]: AVC avc:  denied  { setsched } for  pid=1 comm="systemd" scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=process permissive=0
> systemd[1]: crun-555.scope: Failed to add PIDs to scope's control group: Permission denied
> systemd[1]: crun-555.scope: Failed with result 'resources'.
> systemd[1]: Failed to start libcrun container.

and yet crun did not show any error and proceeded to start the
container, which lead to a number of issues.

1. Since the cgroup was not created by systemd, but the error
was not detected, the container process was not put into its own
cgroup (but left in the same cgroup as the shell from which `crun`
was called).

2. Since crun gets the cgroup name from /proc/$PID/cgroup
(where $PID is container process PID), it proceeded to set the
limits for that (wrong) cgroup:

 # cat /sys/fs/cgroup/system.slice/sshd.service/memory.max
 536870912

3. `crun delete` apparently removes the `system.slice/sshd.service`
cgroup :(

The primary cause is the missing check that the transient unit has
been created. This is what this patch adds (similar to how it's done
in cgroup-run code).

After this patch:

 # ../crun --systemd-cgroup run -d 555
 2020-04-16T14:47:34.000354150Z: error creating systemd unit crun-555.scope: failed

For more background on how the issue was found, steps to repro etc
please see a similar (but much less brutal -- it just fails to start
the container) issue in runc:

 - opencontainers/runc#2313

While at it, abstract out the code preparing and doing the check.

Fixes: eaccb4b
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Copy Markdown
Collaborator Author

could you amend these changes?

done

I think it simplifies how we deal with the data struct:

I agree, it is more practical that way.

@kolyshkin
Copy link
Copy Markdown
Collaborator Author

Tested to work using repro described here: #331 (comment)

The only issue is the error message is printed twice (filed as #335)

Copy link
Copy Markdown
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants