Allow VFIO devices to be passed through as VFIO devices#845
Conversation
|
/test |
|
Adding a comment here as a reminder to go through this in the next few days. @devimc, regardless of how many other reviewers will review this one, I'd appreciate your input before merging this series. |
| return fmt.Errorf("Malformed VFIO option \"%s\"", opt) | ||
| } | ||
|
|
||
| hostBdf := l[0] |
There was a problem hiding this comment.
I would probably explain the format of the BDF you expect here, and maybe validate it a little (even though it is validated indirectly by reading /sys later).
There was a problem hiding this comment.
Hrm, BDF is a pretty standard concept in PCI, and it's used in other places in the code. I really can't think of what to explain that's useful without being verbose overkill.
I don't think validation makes sense here, because we don't actually care about the internal structure of this value. All we care is that it's the thing we need to replace in the PCIDEVICE_ environment variables.
There was a problem hiding this comment.
BDF is standard in PCI jargon, but not necessarily for this audience.
Generally, ACRONYMs in code are OBNOXIOUS, where ACRONYM is, obviously, pretty standard among l33t for Alphabetical Code for Remembering Odd Names You Make up, and OBNOXIOUS is similarly pretty standard (among French developers with a knack for non sense) for Obfuscated Blurbs Not Ordinarily Crossing/Intersecting Our Usual Standards.
In that specific case, though, I'm really interested in specifying that we expect hostBDF to be in the form B:D.F and matching what we find in the PCIDEVICE_... environment variable, and guestPath to be in the form GuestBridge/GuestDeviceID as matching entries in guest /sys/bus/pici/devices.
Regarding validation, I agree with you it can't be done easily that early. However, that made me think that we will need (in a later commit, no need to amend this PR) to add a validation that the loop rewriting the environment actually finds matching PCIDEVICE_ environment variables. If not, then things will fail, and I think we should at least Warn of not Error out.
There was a problem hiding this comment.
BDF is standard in PCI jargon, but not necessarily for this audience.
Hm, maybe. I mean, I'd prefer to call it "PCI address" which is what it is. Unfortunately the Kata code badly abuses "pci address" to mean several different things in different places. "BDF" was the most succinct way I could think of to make it clear that we're talking a normal DDDD:BB:SS.F style PCI address.
Generally, ACRONYMs in code are OBNOXIOUS, where ACRONYM is, obviously, pretty standard among l33t for Alphabetical Code for Remembering Odd Names You Make up, and OBNOXIOUS is similarly pretty standard (among French developers with a knack for non sense) for Obfuscated Blurbs Not Ordinarily Crossing/Intersecting Our Usual Standards.
As a general principle I absolutely disagree. Appropriate use of jargon - including acryonyms - is essential to making code readable without verbosity (which itself diminishes readability). Now, what jargon is appropriate for the context is a non-trivial question, I'll grant you.
In that specific case, though, I'm really interested in specifying that we expect hostBDF to be in the form
B:D.Fand matching what we find in thePCIDEVICE_...environment variable, andguestPathto be in the formGuestBridge/GuestDeviceIDas matching entries in guest/sys/bus/pici/devices.Regarding validation, I agree with you it can't be done easily that early. However, that made me think that we will need (in a later commit, no need to amend this PR) to add a validation that the loop rewriting the environment actually finds matching
PCIDEVICE_environment variables. If not, then things will fail, and I think we should at leastWarnof notErrorout.
Warn.. maybe. We absolutely shouldn't fail on this. It is entirely possible to have a working setup without any PCIDEVICE variables, and in fact it's quite likely if someone is creating a container manually (e.g. with podman) rather than with k8s. For such ad-hoc cases you can either manually set the right address in the payload, or have scripts inside the payload to locate the device via heuristics like "find the only VFIO-accessible PCI device of the type(s) I care about" (my test containers for development do exactly this).
| return updateSpecDevice(spec, devIdx, device.ContainerPath, device.VmPath, device.VmPath) | ||
| } | ||
|
|
||
| func VfioTranslateEnv(s *sandbox, env []string) []string { |
There was a problem hiding this comment.
Ah, this is where the magic happens. Now I finally get it. Smart.
I would add a comment showing the kind of inputs that you expects. Basically, something like "the container receives information about the PCI devices through environment variables that look like PCIDEVICE_ 0000:00:14.0=... etc, and explain the replacement you are doing.
There was a problem hiding this comment.
I believe that function VfioTranslateEnv has no unit test.
| newEnv := make([]string, 0, len(env)) | ||
|
|
||
| for _, e := range env { | ||
| if strings.HasPrefix(e, "PCIDEVICE_") { |
There was a problem hiding this comment.
I would make PCIDEVICE_ configurable, although it currently seems pretty hard-coded in the rest of the stack. That function should probably take that as input. Ask around to see if this logic would not apply to other env variables.
There was a problem hiding this comment.
Hmm. Configurable where, though?
There was a problem hiding this comment.
Just to make it clear: that would have to happen later, not in this PR. Just a note to do it later.
The agent has a config.go that lists some agent options. That would be a good starting point. I would make it a list of prefixes for the names of environment variables to rewrite, default to [ "PCIDEVICE_" ].
Also, this is something that you might want to be able to override with annotations. See https://github.com/kata-containers/runtime/blob/master/virtcontainers/pkg/annotations/annotations.go#L233. I have not checked if agent annotations work well though.
There was a problem hiding this comment.
The environment variable is actually specific to the Kubernetes device plugin being used. For example the Intel SRIOV plugin inserts PCIDEVICE_ (https://github.com/intel/sriov-network-device-plugin). Same with OpenShift SRIOV device plugin (https://github.com/openshift/sriov-network-device-plugin).
I don't think there is any standardization.
Having this environment variable prefix part of runtime config will be preferred
|
New push fixes some of Christoph's comments (others I still need to look at). I now have a first draft of the kata-runtime changes which go with this at kata-containers/runtime#2963 |
Cool, I will look at it. I will also update the comment I made regarding the runtime part with your link. |
| // If no ContainerPath is provided, we won't be able to match and | ||
| // update the device in the OCI spec device list. This is an error. | ||
| if device.ContainerPath == "" { | ||
| // Take the guest device with the given DDDD:BB:SS.F PCI address, |
There was a problem hiding this comment.
Should the comment refer to "DDDD:BB:DD.F" ? (Domain:Bus:Dev.Func )
There was a problem hiding this comment.
I guess S stands for slot? D is already taken by "Domain", so having it twice would be ambiguous.
| // If no hostPath is provided, we won't be able to match and | ||
| // update the device in the OCI spec device list. This is an | ||
| // error. | ||
| if hostPath == "" { |
There was a problem hiding this comment.
Should there be a check for empty vmPath as well?
There was a problem hiding this comment.
If it's empty, it will fail on the syscall.Stat() below with a somewhat sensible error message, so I would not create a special case for that.
|
New update is a substantial rework, rebased on #850 |
|
/test-ubuntu |
Codecov Report
@@ Coverage Diff @@
## master #845 +/- ##
==========================================
- Coverage 57.72% 56.33% -1.39%
==========================================
Files 17 17
Lines 2370 2437 +67
==========================================
+ Hits 1368 1373 +5
- Misses 840 904 +64
+ Partials 162 160 -2 |
Currently TestGetDeviceName checks just one pair of sysfs path with /dev node path, with a fair bit of setup to do that. To allow future tests with different pairs of sys/dev paths, factor out the guts of the test into a parameterized helper `oneGetDeviceNameTest`. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently getDeviceName() and listenToUdevEvents() which supplies it with information ignore events which don't supply a /dev path. However, we have upcoming use cases where we need to wait for a device (in the broad sense) to be ready, even though it doesn't have an actual /dev node. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
hotplugTimeout controls how long we'll wait for a uevent indicating that a device is ready in getDeviceName(). It's currently 3s, which is plenty of time to allow for udev processing that we're usually dealing with. However, we have upcoming cases where we may need to wait for a complete SHPC PCI hotplug to complete. SHPC has a 5s delay built into the protocol (and therefore the guest implementations). To accomodate this case, increase the timeout to 10s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If extra devices are given in the OCI spec, the user will expect them to be ready and available once the (inner) container starts executing. So, if that means hotplugging something into the Kata VM, we need to wait for that hotplug operation to complete before executing the container. For most devices that's handled by getDeviceName() which waits for a uevent indicating the device is ready. VFIO devices, however, don't have any handler so we don't explicitly wait for them to be ready. We usually get away with it, because we force a PCI rescan in in finishCreateContainer() which blocks until the rescan is complete. That really only works by accident though, completing a rescan doesn't necessarily mean all the device probing logic is complete. Worse, in some cases the PCI rescan can collide with the hotplug processing and cause the device to go into a broken state. So, instead of relying on the rescan, accept information from the runtime about what VFIO devices we expect, and explicitly wait for them to be ready. VFIO devices in Kata are (for now) weird - they will bind to whatever the Kata VM's driver for them is, so could appear as any kind of device (char, block, network interface, etc.). That means waiting for just the PCI device to be ready isn't foolproof, since there could be some extra processing time for the driver to complete probing and create the appropriate secondary devices. But, really, that was true of the forced rescan as well, so this is a clear improvement. fixes #781 Depends-on: github.com/kata-containers/runtime#2981 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
PCI bus rescan code was added long time ago in Clear Containers due to lack of ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug in Kata Containers. A workaround to this issue is the "lazy attach" mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the PCI bus, unfourtunately some non-LBS devices are being affected too, for instance SR-IOV devices. It would not make sense to lazy-attach non-LBS devices because kata will end up lazy-attaching all the devices, having said that, the PCI bus rescan code and the "lazy attach" mechanism should be removed I'm not sure why, but this seems to expose a problem in TestStorageHandlers where in some cases it relies on sb.deviceWatchers being initialized, but it isn't. So, fix that up as well. fixes #781 fixes kata-containers/runtime#2664 [1] clearcontainers/agent#139 [2] kata-containers/runtime#2461 Signed-off-by: Julio Montes <julio.montes@intel.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
VFIO devices can be added to a Kata container and they will be passed through to the sandbox guest. However, inside the guest those devices will generally bind to a native guest driver, so they will no longer appear as VFIO devices within the guest. This behaviour differs from runc or other conventional container runtimes. This code allows the agent to match the behaviour of other runtimes, if instructed to by kata-runtime. VFIO devices its informed about with the "vfio" type instead of the existing "vfio-vm" type will be rebound to the vfio-pci driver within the guest. Fixes: #833 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Each VFIO device passed into the guest could represent a whole IOMMU group of devices on the host. Since these devices aren't DMA isolated from each other, they most appear as the same IOMMU group in the guest as well. qemu should enforce that for us, but double check it, since things can't work otherwise. This also means we determine the guest IOMMU group ID, which we'll be needing later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
updateSpecDeviceList is used to update the container configuration to change devices major/minor numbers configured by the Kata client based on host details to values suitable for the sandbox VM, which may differ. It takes a 'device' object, but the only things it actually uses from there are ContainerPath and VmPath. Refactor this as updateSpecDevice(), taking the host and guest paths to the device as explicit parameters. This makes the function more self-contained and will enable some future extensions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
…e host Currently, updateSpecDevice() assumes that the proper device path in the inner container is the same as the device path specified in the outer OCI spec on the host. Usually that's correct. However for VFIO group devices we actually need the container to see the VM's device path, since it's normal to correlate that with IOMMU group information from sysfs which will be different in the guest and which we can't namespace away. So, add an extra "finalPath" parameter to updateSpecDevice to allow callers to chose the device path that should be used for the inner container. All current callers pass the same thing as ContainerPath, but that will change in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
No really new logic here, we just need to expand the filters to allow the udev events for VFIO group devices through. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Add the vfio devices to the inner container spec so that libcontainer will make them. Requires several hacks to work around problems for now. Fixes: #833 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
c3d
left a comment
There was a problem hiding this comment.
I did not see any serious functional issue, so approving.
A few minor suggestions, like a typo in a commit message, comments, etc.
| } | ||
|
|
||
| func TestGetDeviceName(t *testing.T) { | ||
| func oneGetDeviceNameTest(t *testing.T, sysName, devName, watchFor string) { |
There was a problem hiding this comment.
nit: for consistency with the other non-exported test functions, I'd call it testGetOneDeviceName. See for example testAddDeviceSuccessful.
| for key, value := range s.sysToDevMap { | ||
| if strings.Contains(key, devID) { | ||
| devName = value | ||
| devName, found = value, true |
|
|
||
| // Timeout waiting for a device to be hotplugged | ||
| var hotplugTimeout = 3 * time.Second | ||
| var hotplugTimeout = 10 * time.Second |
There was a problem hiding this comment.
I agree with a comment by @devimc that this is high enough to have us being killed by the upper layer. I would simply add a comment to that effect, indicating that this timeout is likely to be ineffective. Or maybe drop it to 9s so that we'd have a chance to report the problem before being summarily killed.
| return nil | ||
| } | ||
|
|
||
| func vfioDeviceHandler(ctx context.Context, device pb.Device, spec *pb.Spec, s *sandbox, devIdx devIndex) error { |
There was a problem hiding this comment.
Thanks. The fact that there were errors in my description only highlights how complex the process is.
Alternatively, consider making a diagram in a doc, then adding a comment with a reference to that diagram.
| // If no ContainerPath is provided, we won't be able to match and | ||
| // update the device in the OCI spec device list. This is an error. | ||
| if device.ContainerPath == "" { | ||
| // Take the guest device with the given DDDD:BB:SS.F PCI address, |
There was a problem hiding this comment.
I guess S stands for slot? D is already taken by "Domain", so having it twice would be ambiguous.
| fieldLogger := agentLog.WithField("host-dev", device.ContainerPath) | ||
|
|
||
| var rebindToVfio bool | ||
| var group string |
There was a problem hiding this comment.
Apparently, github has no good way to comment on the commit message.
other, they most appear as the same IOMMU group in the guest as well.
Typo: s/most/must/.
| // If no hostPath is provided, we won't be able to match and | ||
| // update the device in the OCI spec device list. This is an | ||
| // error. | ||
| if hostPath == "" { |
There was a problem hiding this comment.
If it's empty, it will fail on the syscall.Stat() below with a somewhat sensible error message, so I would not create a special case for that.
| stat := syscall.Stat_t{} | ||
| if err := syscall.Stat(device.VmPath, &stat); err != nil { | ||
| if err := syscall.Stat(vmPath, &stat); err != nil { | ||
| return err |
There was a problem hiding this comment.
I wonder if it would be useful to wrap this error in some context? Something like
err = grpcStatus.Errorf(codes.Internal, "Invalid device VM path %s: %s", vmPath, err)
|
Folks, this PR has been stale and I'm closing it as we're walking towards the final 1.13.0 release of the kata-containers-1.x. Thanks a lot for your contribution, we're sorry that for some reason it got stale, and we hope to see you around for new adventures in the kata-containers-2.x land. |
This patches series implements a fix for #833.
It's based on pull requests for other issues, specifically #782 and #836, so those should be merged first.
This series on its own won't do anything without corresponding host-side fixes for kata-containers/runtime#2938 which I'm still polishing for a PR.