cgroup: check systemd unit creation/removal succeeded #331
giuseppe
left a comment
left some comments inline.
It also seems the libocispec git submodule has been committed at a previous version.
| if (strcmp (p->path, path) == 0) | ||
| *p->terminated = 1; | ||
| { | ||
| p->terminated = 1; |
theoretically, we can get rid of terminated, but I am not sure if unit and result are guaranteed to be set.
| { | ||
| int ret; | ||
|
|
||
| *data = xmalloc(sizeof(struct systemd_job_removed_s)); |
Forgot the memset :(
coding style (missing space after ()
don't worry about the coding style, I will re-format and add a check for it with indent after the release :-)
| free(d->unit); | ||
| free(d->result); |
these won't be freed if the function returns earlier
I think it will be easier if data is owned by the caller (so the cleanup can happen in the exit: section), which frees both the strings and the struct itself. The struct could be on the stack as it is really small.
Are you suggesting moving the code that checks and generates the error into the callback function (systemd_job_removed)? Let me see...
It looks like the only way to allocate the struct systemd_job_removed_s on the stack would be to merge systemd_check_job_status_setup into systemd_check_job_status.
I don't see any way to allocate the unit and result fields on the stack since systemd_job_removed is a callback. The only way to avoid allocating them is to remove them, moving the error generation into systemd_job_removed (which might not be a bad idea).
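For illustration, the caller-owned variant discussed in this thread could look roughly like the sketch below. It is not the code from this PR: `struct job_removed`, `job_removed_cb` and `check_job` are hypothetical stand-ins, and the sd-bus plumbing is replaced by a direct call so the ownership flow is easy to follow. The struct lives on the caller's stack, the callback only copies the two strings, and a failure is remembered in `ret` instead of returning early.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the PR's struct systemd_job_removed_s.  */
struct job_removed
{
  const char *path; /* job path we are waiting for (not owned) */
  char *unit;       /* owned copy, set by the callback */
  char *result;     /* owned copy, set by the callback */
  int terminated;
};

/* Hypothetical callback: it copies the strings because the buffers it
   receives are only valid for the duration of the call.  */
static int
job_removed_cb (struct job_removed *d, const char *path,
                const char *unit, const char *result)
{
  assert (d != NULL);
  if (strcmp (d->path, path) != 0)
    return 0; /* some other job, ignore it */
  d->unit = strdup (unit);
  d->result = strdup (result);
  d->terminated = 1;
  return (d->unit && d->result) ? 0 : -1;
}

/* The caller owns the struct (on the stack) and frees the strings in a
   single exit section, whatever the outcome.  */
static int
check_job (const char *path, const char *unit, const char *result)
{
  struct job_removed d = { .path = path };
  int ret;

  /* In real code this call would be made by the bus library.  */
  ret = job_removed_cb (&d, path, unit, result);
  if (ret < 0)
    goto exit;
  if (d.terminated && strcmp (d.result, "done") != 0)
    ret = -1; /* remember the failure; do not return early */

exit:
  free (d.unit);
  free (d.result);
  return ret;
}
```

With this shape, `check_job ("/job/1", "crun-555.scope", "failed")` reports an error and still releases both copies; the job path used here is made up.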
Updated, addressed review comments + using Alternatively, we can merge |
| if (d->unit && d->result) | ||
| { | ||
| if (strcmp (d->result, "done") != 0) | ||
| return crun_make_error (err, 0, "error %s systemd unit %s: %s", op, d->unit, d->result); |
could we make sure we don't leak d->unit and d->result here?
We could just store the return code instead of returning here.
If you feel like playing with autocleanup (but absolutely not necessary), another alternative could be something like: https://github.com/containers/crun/blob/master/src/libcrun/linux.c#L1556 and define a function to do the cleanup for data instead of cleanup_free, and the function could free both strings and the struct itself.
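A minimal sketch of the autocleanup alternative, assuming GCC or Clang (the `cleanup` variable attribute is a compiler extension). The names `struct job_data`, `cleanup_job_datap` and `cleanup_job_data` are hypothetical and only mirror the `cleanup_free` naming mentioned above; the `cleanup_calls` counter exists purely so the behaviour can be observed.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified version of the callback data.  */
struct job_data
{
  char *unit;
  char *result;
};

static int cleanup_calls; /* instrumentation for the example only */

/* Cleanup handler: receives a pointer to the annotated variable and
   frees both strings and the struct itself.  */
static void
cleanup_job_datap (struct job_data **p)
{
  struct job_data *d = *p;

  if (d == NULL)
    return;
  free (d->unit);
  free (d->result);
  free (d);
  cleanup_calls++;
}

#define cleanup_job_data __attribute__ ((cleanup (cleanup_job_datap)))

static int
check_result (const char *unit, const char *result)
{
  cleanup_job_data struct job_data *d = calloc (1, sizeof (*d));

  if (d == NULL)
    return -1;
  d->unit = strdup (unit);
  d->result = strdup (result);
  if (d->unit == NULL || d->result == NULL)
    return -1; /* d and any copied string are still freed */
  if (strcmp (d->result, "done") != 0)
    return -1; /* an early return no longer leaks */
  return 0;
}
```

Every return path in `check_result` runs the handler once the variable goes out of scope, so the early return on a non-"done" result stays as it is in the diff.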
Waaaay too much golang for me 😭
|
Here's how to reproduce. In crun top srcdir:
sudo dnf install vagrant vagrant-libvirt
cat << 'VAGRANTFILE' > Vagrantfile
Vagrant.configure("2") do |config|
  config.vm.box = "fedora/31-cloud-base"
  config.vm.provider :virtualbox do |v|
    v.memory = 2048
    v.cpus = 2
  end
  config.vm.provider :libvirt do |v|
    v.memory = 2048
    v.cpus = 2
  end
  config.vm.provision "shell", inline: <<-SHELL
    cat << EOF | dnf -y shell
config install_weak_deps False
install podman make autoconf \
  automake libtool gcc python libcap-devel systemd-devel yajl-devel \
  libseccomp-devel python3-libmount go-md2man
ts run
EOF
  SHELL
end
VAGRANTFILE
vagrant up
vagrant ssh
On a vagrant box:
rpm -q container-selinux # should show container-selinux-2.117.0-1.gitbfde70a.fc31.noarch or so
cd /vagrant
./configure && make
sudo -s
mkdir t && cd t
# setup config.json and rootfs from e.g. busybox image
../crun --systemd-cgroup run -d foobar
journalctl --tail |
|
could you amend these changes? I think it simplifies how we deal with the data struct: |
While playing with Fedora 31 host with old/broken selinux packages,
I found out that systemd fails to create a transient unit. Here is
an excerpt from journalctl:
> audit[1]: AVC avc: denied { setsched } for pid=1 comm="systemd" scontext=system_u:system_r:init_t:s0 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=process permissive=0
> systemd[1]: crun-555.scope: Failed to add PIDs to scope's control group: Permission denied
> systemd[1]: crun-555.scope: Failed with result 'resources'.
> systemd[1]: Failed to start libcrun container.
and yet crun did not show any error and proceeded to start the
container, which led to a number of issues.
1. Since the cgroup was not created by systemd, but the error
was not detected, the container process was not put into its own
cgroup (but left in the same cgroup as the shell from which `crun`
was called).
2. Since crun gets the cgroup name from /proc/$PID/cgroup
(where $PID is container process PID), it proceeded to set the
limits for that (wrong) cgroup:
# cat /sys/fs/cgroup/system.slice/sshd.service/memory.max
536870912
3. `crun delete` apparently removes the `system.slice/sshd.service`
cgroup :(
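To make item 2 concrete, here is a rough sketch of how a cgroup path can be read from /proc/$PID/cgroup. It is not crun's actual code: both function names are hypothetical, and it assumes a cgroup v2 (unified) hierarchy, where the relevant line has the form `0::/path`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed copy of the path from a cgroup v2 line of the
   form "0::/path\n", or NULL if the line is not a v2 entry.  */
static char *
cgroup_v2_path_from_line (const char *line)
{
  size_t len;
  char *path;

  assert (line != NULL);
  if (strncmp (line, "0::", 3) != 0)
    return NULL;
  line += 3;
  len = strcspn (line, "\n"); /* stop at the newline, if any */
  path = malloc (len + 1);
  if (path == NULL)
    return NULL;
  memcpy (path, line, len);
  path[len] = '\0';
  return path;
}

/* Read /proc/<pid>/cgroup and return the v2 path, or NULL on error.
   The caller frees the result.  */
char *
read_cgroup_v2_path (long pid)
{
  char file[64], line[4096];
  char *path = NULL;
  FILE *f;

  snprintf (file, sizeof (file), "/proc/%ld/cgroup", pid);
  f = fopen (file, "r");
  if (f == NULL)
    return NULL;
  while (path == NULL && fgets (line, sizeof (line), f))
    path = cgroup_v2_path_from_line (line);
  fclose (f);
  return path;
}
```

If the transient unit was never created, the container process is still in the caller's cgroup, so a lookup like this returns that cgroup's path (for example `/system.slice/sshd.service`), and any limits written afterwards land there.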
The primary cause is the missing check that the transient unit has
been created. This is what this patch adds (similar to how it's done
in cgroup-run code).
After this patch:
# ../crun --systemd-cgroup run -d 555
2020-04-16T14:47:34.000354150Z: error creating systemd unit crun-555.scope: failed
For more background on how the issue was found, steps to repro etc
please see a similar (but much less brutal -- it just fails to start
the container) issue in runc:
- opencontainers/runc#2313
While at it, abstract out the code preparing and doing the check.
Fixes: eaccb4b
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
done
I agree, it is more practical that way. |
|
Tested to work using repro described here: #331 (comment) The only issue is the error message is printed twice (filed as #335) |
While playing with Fedora 31 host with old/broken selinux packages,
I found out that systemd fails to create a transient unit. Here is
an excerpt from journalctl:
and yet crun did not show any error and proceeded to start the
container, which led to a number of issues.
Since the cgroup was not created by systemd,
but the error was not detected, the container process was put
into the wrong cgroup (but left in the same cgroup as the shell from which
crun was called).
Since crun gets the cgroup name from /proc/$PID/cgroup
(where $PID is container process PID), it proceeded to set the
limits for that (wrong) cgroup:
crun delete apparently removes the system.slice/sshd.service cgroup :(
The primary cause is the missing check that the transient unit has
been created. This is what this patch adds (similar to how it's done
in cgroup-run code).
After this patch:
For more background on how the issue was found, steps to repro etc
please see a similar (but much less brutal -- it just fails to start
the container) issue in runc:
While at it, abstract out the code preparing and doing the check.