cgroup/systemd: fix making CharDevice path in systemdProperties#3568

Closed
yangfeiyu20102011 wants to merge 1 commit into opencontainers:main from yangfeiyu20102011:main

@yangfeiyu20102011

cgroup/systemd: systemdProperties sets the CharDevice path to something like /dev/char/0:0,
but NVIDIA devices (195:* and 507:*) cannot be found under /dev/char/x:x.
getNVIDIAEntryPath fixes this problem.

Signed-off-by: yangfeiyu20102011 <yangfeiyu20102011@163.com>

@yangfeiyu20102011
Author

yangfeiyu20102011 commented Aug 23, 2022

PTAL, thanks! cc @AkihiroSuda @thaJeztah
#3567

Member

@thaJeztah thaJeztah left a comment

Left a comment about the implementation, but in general, this feels rather odd to have this exception for these devices, and I wonder if this should be included in runc, being the reference implementation of the OCI runtime spec; does the spec describe anything about this special case?

Comment thread libcontainer/cgroups/devices/systemd.go Outdated
Comment thread libcontainer/cgroups/devices/systemd.go Outdated
@yangfeiyu20102011
Author

yangfeiyu20102011 commented Aug 23, 2022

> Left a comment about the implementation, but in general, this feels rather odd to have this exception for these devices, and I wonder if this should be included in runc, being the reference implementation of the OCI runtime spec; does the spec describe anything about this special case?

Thanks, I have modified the code.
When doing runc update, it will skip setting the cgroup device files.
If the spec contains devices like /dev/nvidia*, it generates a DeviceAllow.conf as follows:

cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=/dev/char/195:1 rw

The DeviceAllow=/dev/char/195:0 rw entry will not work.

And if DevicePolicy.conf sets DevicePolicy=strict, devices.list may end up with
c 195:* m
after the call to
setUnitProperties(m.dbus, unitName, properties...)

cat /run/systemd/system/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope.d/50-DevicePolicy.conf
[Scope]
DevicePolicy=strict

cat /sys/fs/cgroup/devices/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podeeccd2f8_7bef_4054_a659_6554b908432a.slice/cri-containerd-74e9a65ee73edfecdf345f477e4bfb44a39428243d4a64519cd860fcc0f6901b.scope/devices.list
c 136:* rwm
c 5:2 rwm
c 195:* m
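
To make the failure mode above easier to spot, here is a minimal Python sketch (not runc code) that scans the text of a DeviceAllow drop-in and reports which absolute device paths do not exist on the host. The function name `missing_device_paths` and the injectable `exists` predicate are assumptions made for illustration only.

```python
import os
from typing import Callable, List


def missing_device_paths(
    conf_text: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the DeviceAllow= paths in a drop-in that are absent on the host."""
    missing = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("DeviceAllow="):
            continue
        value = line[len("DeviceAllow="):]
        if not value:
            # An empty assignment only resets the list; nothing to check.
            continue
        target = value.split()[0]
        # Specifiers such as char-pts or block-* are not paths; skip them.
        if target.startswith("/") and not exists(target):
            missing.append(target)
    return missing
```

Running it against the 50-DeviceAllow.conf shown above would be expected to list /dev/char/195:0 and /dev/char/195:1 on a host where those symlinks are absent.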

@kolyshkin
Contributor

Indeed, not all character devices have a /dev/char/MM:mm equivalent for some reason. Here's what I found on my machine (Fedora 36 laptop running kernel 5.17.14-300.fc36.x86_64):

[kir@kir-rhat linux]$ ls -lR /dev | grep ^c | awk '{print $10, $5, $6}' | sed -e 's|, |:|' -e 's| | /dev/char/|' | awk '{printf "/dev/" $1 "\t"; system("ls -l " $2);}' 2>&1 | grep cannot
/dev/cuse	ls: cannot access '/dev/char/10:203': No such file or directory
/dev/lp0	ls: cannot access '/dev/char/6:0': No such file or directory
/dev/lp1	ls: cannot access '/dev/char/6:1': No such file or directory
/dev/lp2	ls: cannot access '/dev/char/6:2': No such file or directory
/dev/lp3	ls: cannot access '/dev/char/6:3': No such file or directory
/dev/ppp	ls: cannot access '/dev/char/108:0': No such file or directory
/dev/uhid	ls: cannot access '/dev/char/10:239': No such file or directory
/dev/uinput	ls: cannot access '/dev/char/10:223': No such file or directory
/dev/vhci	ls: cannot access '/dev/char/10:137': No such file or directory
/dev/vhost-vsock	ls: cannot access '/dev/char/10:241': No such file or directory

(there might be some more char devices in subdirectories of /dev)
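
The same check can be sketched in Python, for readers who find the pipeline above hard to follow. This is only an approximation of the shell one-liner: it walks a device root, uses lstat so that symlinks are not counted as devices, and yields every character device whose /dev/char/MM:mm entry is missing. The helper names `char_link_path` and `char_devices_without_symlink` are made up for this example.

```python
import os
import stat
from typing import Iterator, Tuple


def char_link_path(rdev: int, char_dir: str = "/dev/char") -> str:
    """Build the /dev/char/MM:mm path for a raw device number."""
    return f"{char_dir}/{os.major(rdev)}:{os.minor(rdev)}"


def char_devices_without_symlink(dev_root: str = "/dev") -> Iterator[Tuple[str, str]]:
    """Yield (device path, expected /dev/char path) for devices lacking the link."""
    for dirpath, _dirnames, filenames in os.walk(dev_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue
            if not stat.S_ISCHR(st.st_mode):
                continue
            link = char_link_path(st.st_rdev)
            if not os.path.exists(link):
                yield path, link
```

Unlike the one-liner, this walks subdirectories of /dev as well, so it may report more devices.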

For block devices, I haven't found any that do not have a symlink in /dev/block.

Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found. WDYT @cyphar

@yangfeiyu20102011
Author

Currently the NVIDIA devices in DeviceAllow.conf are not as expected. This PR can solve some NVIDIA GPU rw problems, and it is an improvement at least.
We can solve the char devices problem completely, in a better way, in the future.
cc @thaJeztah @cyphar

@kolyshkin
Contributor

@yangfeiyu20102011 can you please provide OCI spec example with NVidia devices added?

@yangfeiyu20102011
Author

yangfeiyu20102011 commented Sep 1, 2022

> @yangfeiyu20102011 can you please provide OCI spec example with NVidia devices added?

cc @kolyshkin
OK, here are the spec and DeviceAllow.conf.
OCI spec:

{
    "ociVersion": "1.0.2-dev",
    "process":
    {
        "user":
        {
            "uid": 0,
            "gid": 0
        },
        "args":
        [
            "sleep",
            "36000"
        ],
        "env":
        [
            "PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "HOSTNAME=gpu-operator-test",
            "NVARCH=x86_64",
            "NVIDIA_REQUIRE_CUDA=cuda>=11.0 brand=tesla,driver>=418,driver<419",
            "NV_CUDA_CUDART_VERSION=11.0.221-1",
            "NV_CUDA_COMPAT_PACKAGE=cuda-compat-11-0",
            "CUDA_VERSION=11.0.3",
            "LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
            "NVIDIA_VISIBLE_DEVICES=all",
            "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
            "NVIDIA_VISIBLE_DEVICES=GPU-44f83262-58b8-2db0-7960-01d193fcf7b5",
            "NOT_HOST_NETWORK=true",
            "KUBERNETES_SERVICE_PORT=443",
            "KUBERNETES_SERVICE_PORT_HTTPS=443"
        ],
        "cwd": "/",
        "capabilities":
        {
            "bounding":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "effective":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "permitted":
            [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ]
        },
        "oomScoreAdj": 1000
    },
    "root":
    {
        "path": "rootfs"
    },
    "mounts":
    [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options":
            [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options":
            [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620",
                "gid=5"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options":
            [
                "nosuid",
                "noexec",
                "nodev",
                "relatime",
                "ro"
            ]
        },
        {
            "destination": "/dev/nvidiactl",
            "type": "bind",
            "source": "/dev/nvidiactl",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/nvidia0",
            "type": "bind",
            "source": "/dev/nvidia0",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/nvidia-uvm",
            "type": "bind",
            "source": "/dev/nvidia-uvm",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/hosts",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/etc-hosts",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/termination-log",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/containers/cuda-vector-add/e8ed181c",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/hostname",
            "type": "bind",
            "source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/hostname",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/etc/resolv.conf",
            "type": "bind",
            "source": "/media/disk1/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/resolv.conf",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "bind",
            "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd/shm",
            "options":
            [
                "rbind",
                "rprivate",
                "rw"
            ]
        },
        {
            "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "type": "bind",
            "source": "/data/kubelet/pods/c043b554-0f9c-4db6-874b-6977ee24fa96/volumes/kubernetes.io~secret/default-token",
            "options":
            [
                "rbind",
                "rprivate",
                "ro"
            ]
        }
    ],
    "hooks":
    {
        "prestart":
        [
            {
                "path": "/usr/bin/nvidia-container-runtime-hook",
                "args":
                [
                    "/usr/bin/nvidia-container-runtime-hook",
                    "prestart"
                ]
            }
        ]
    },
    "annotations":
    {
        "io.kubernetes.cri.container-name": "cuda-vector-add",
        "io.kubernetes.cri.container-type": "container",
        "io.kubernetes.cri.image-name": "docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04",
        "io.kubernetes.cri.sandbox-id": "c2141dbe87f259715bbfb6f7923cb7d85a484f9d4a809f45555234ecbbf9d7bd",
        "io.kubernetes.cri.sandbox-name": "gpu-operator-test",
        "io.kubernetes.cri.sandbox-namespace": "default"
    },
    "linux":
    {
        "resources":
        {
            "devices":
            [
                {
                    "allow": false,
                    "access": "rwm"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 195,
                    "minor": 255,
                    "access": "rw"
                },
                {
                    "allow": true,
                    "type": "c",
                    "major": 195,
                    "minor": 3,
                    "access": "rw"
                }
            ],
            "memory":
            {},
            "cpu":
            {
                "shares": 2,
                "period": 100000,
                "cpus": "0-79"
            }
        },
        "cgroupsPath": "kubepods-besteffort-podc043b554_0f9c_4db6_874b_6977ee24fa96.slice:cri-containerd:73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511",
        "namespaces":
        [
            {
                "type": "pid"
            },
            {
                "type": "ipc",
                "path": "/proc/340434/ns/ipc"
            },
            {
                "type": "uts",
                "path": "/proc/340434/ns/uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "network",
                "path": "/proc/340434/ns/net"
            }
        ],
        "devices":
        [
            {
                "path": "/dev/nvidiactl",
                "type": "c",
                "major": 195,
                "minor": 255,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            },
            {
                "path": "/dev/nvidia3",
                "type": "c",
                "major": 195,
                "minor": 3,
                "fileMode": 438,
                "uid": 0,
                "gid": 0
            }
        ],
        "maskedPaths":
        [
            "/proc/acpi",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/proc/scsi",
            "/sys/firmware"
        ],
        "readonlyPaths":
        [
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}

systemd conf:
cat /run/systemd/system/cri-containerd-73376298ce204adb73424bc020366b89281562a5560cbfaeaee0af0f39071511.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:3 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
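
To show how the two NVIDIA entries above relate to the spec, here is a rough Python sketch (not the runc implementation) that reads linux.resources.devices from an OCI spec and prints the DeviceAllow lines one would expect for allow rules that carry an explicit major and minor. The function name `device_allow_lines` is hypothetical, and wildcard rules and runc's default device list are deliberately left out.

```python
import json
from typing import List


def device_allow_lines(spec_json: str) -> List[str]:
    """Map explicit allow rules in an OCI spec to /dev/{char,block}/MM:mm entries."""
    spec = json.loads(spec_json)
    rules = spec.get("linux", {}).get("resources", {}).get("devices", [])
    lines = []
    for rule in rules:
        if not rule.get("allow"):
            continue
        major, minor = rule.get("major"), rule.get("minor")
        # Rules without a concrete type, major or minor are wildcards;
        # they are out of scope for this sketch.
        if rule.get("type") not in ("c", "b"):
            continue
        if major in (None, -1) or minor in (None, -1):
            continue
        kind = "char" if rule["type"] == "c" else "block"
        access = rule.get("access", "")
        lines.append(f"DeviceAllow=/dev/{kind}/{major}:{minor} {access}".strip())
    return lines
```

For the spec in this comment, the sketch yields the /dev/char/195:255 and /dev/char/195:3 entries, which are exactly the paths that may not exist on the host.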

@yangfeiyu20102011
Author

@thaJeztah @kolyshkin Hi, is there a better solution to this problem?

Member

@cyphar cyphar left a comment

In addition to being hardcoded, there is no reason to assume that these major and minor numbers will always be associated with NVIDIA devices. The kernel provides basically no guarantees as to what major and minor numbers will be associated with which driver.

At the very least this should be double-checked against /proc/devices -- but even then, I'm not convinced. This also has the potential to be a security issue -- a container might allow 195:254 access for another driver but then this code would cause runc to allow access to the nvidia-related device files because of this hardcoded check.
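
As an illustration of the /proc/devices double-check mentioned above, the following Python sketch parses the "Character devices:" section into a map from major number to registered driver names. The names `parse_char_majors` and `major_belongs_to` are invented for this example, and the prefix match is only one possible policy; as noted above, even a matching entry may not be enough to justify a hardcoded exception.

```python
from typing import Dict, List


def parse_char_majors(proc_devices_text: str) -> Dict[int, List[str]]:
    """Parse /proc/devices text into {char major: [driver names]}."""
    majors: Dict[int, List[str]] = {}
    section = None
    for raw in proc_devices_text.splitlines():
        line = raw.strip()
        if line == "Character devices:":
            section = "char"
            continue
        if line == "Block devices:":
            section = "block"
            continue
        if section != "char" or not line:
            continue
        number, _, name = line.partition(" ")
        if not number.isdigit():
            continue
        majors.setdefault(int(number), []).append(name.strip())
    return majors


def major_belongs_to(majors: Dict[int, List[str]], major: int, prefix: str) -> bool:
    """True if any driver registered for this char major starts with prefix."""
    return any(name.startswith(prefix) for name in majors.get(major, []))
```

Typical use would be `parse_char_majors(open("/proc/devices").read())` followed by a check such as `major_belongs_to(majors, 195, "nvidia")` before treating a major number as NVIDIA-owned.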

Unfortunately the correct way to implement this would've been to have the systemd-style semantics (or some other non-major/minor-based semantics) in the runtime-spec so we could've avoided this whole issue. But obviously it's a bit late for that discussion...

@cyphar
Member

cyphar commented Sep 19, 2022

> Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.

Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).
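
A minimal Python sketch of that fallback-with-verification idea, under the assumption that some path for the device is available to the caller (for example from the spec's linux.devices list): prefer /dev/char/MM:mm, fall back to the given path, and accept a candidate only if it is a character device with exactly the major and minor numbers from the rule. The function names and the `spec_path` and `char_dir` parameters are hypothetical, not part of runc.

```python
import os
import stat
from typing import Optional


def matches_char_device(path: str, major: int, minor: int) -> bool:
    """True if path is a char device with exactly this major:minor."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == major
        and os.minor(st.st_rdev) == minor
    )


def resolve_char_device_path(
    major: int,
    minor: int,
    spec_path: Optional[str] = None,
    char_dir: str = "/dev/char",
) -> Optional[str]:
    """Pick a verified host path for the rule, or None if nothing matches."""
    preferred = f"{char_dir}/{major}:{minor}"
    for candidate in (preferred, spec_path):
        if candidate and matches_char_device(candidate, major, minor):
            return candidate
    return None
```

Returning None when the numbers do not match, instead of trusting the path, is what keeps a rule for one device from silently granting access to another.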

@yangfeiyu20102011
Author

> > Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.
>
> Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).

@cyphar Thanks. Is there a plan for solving this issue? I can use this patch in my personal project, but I still hope this problem can be fixed in the latest runc.

@kolyshkin
Contributor

> > Perhaps what we should do is to try using device path set in spec, in case /dev/char/MM:mm is not found.
>
> Hmmm. This would work for most device configurations (though not all of them), but we should absolutely double-check that the path on the host has the same major/minor numbers as the rule that references it (otherwise we may end up with a security issue).

Checking is not an issue. The fact that LinuxDeviceCgroup in OCI runtime spec doesn't have Path field is.

Now I'm thinking about creating a device file and passing it to systemd; this might be easier and less error prone.

@zvier
Contributor

zvier commented Dec 14, 2022

Is there any better solution for this issue?

@zvier
Contributor

zvier commented Feb 27, 2023

The same problem is discussed in NVIDIA/nvidia-docker#1730, and a fix will be included in the next patch release of all supported NVIDIA GPU drivers.

@kolyshkin
Contributor

This is now being fixed by #3842.

@kolyshkin kolyshkin closed this Apr 25, 2023