Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

Allow VFIO devices to be passed through as VFIO devices#845

Closed
dgibson wants to merge 11 commits into
kata-containers:masterfrom
dgibson:bug833
Closed

Allow VFIO devices to be passed through as VFIO devices#845
dgibson wants to merge 11 commits into
kata-containers:masterfrom
dgibson:bug833

Conversation

@dgibson
Copy link
Copy Markdown
Contributor

@dgibson dgibson commented Sep 15, 2020

This patches series implements a fix for #833.

It's based on pull requests for other issues, specifically #782 and #836, so those should be merged first.

This series on its own won't do anything without corresponding host-side fixes for kata-containers/runtime#2938 which I'm still polishing for a PR.

@fidencio
Copy link
Copy Markdown
Member

/test

@fidencio
Copy link
Copy Markdown
Member

Adding a comment here as a reminder to go through this in the next few days.

@devimc, regardless of how many other reviewers will review this one, I'd appreciate your input before merging this series.

Copy link
Copy Markdown
Member

@c3d c3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, with possibly one bug (ReadLink of the driver entry not taking the base path of the result). So waiting for your input on this.

Also, as discussed on Slack, check if you need to add something to free entries in the new map.

Comment thread agent.go
Comment thread device.go Outdated
Comment thread agent.go
Comment thread agent.go Outdated
Comment thread device.go Outdated
Comment thread device.go Outdated
return fmt.Errorf("Malformed VFIO option \"%s\"", opt)
}

hostBdf := l[0]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably explain the format of the BDF you expect here, and maybe validate it a little (even though it is validated indirectly by reading /sys later).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, BDF is a pretty standard concept in PCI, and it's used in other places in the code. I really can't think of what to explain that's useful without being verbose overkill.

I don't think validation makes sense here, because we don't actually care about the internal structure of this value. All we care is that it's the thing we need to replace in the PCIDEVICE_ environment variables.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BDF is standard in PCI jargon, but not necessarily for this audience.

Generally, ACRONYMs in code are OBNOXIOUS, where ACRONYM is, obviously, pretty standard among l33t for Alphabetical Code for Remembering Odd Names You Make up, and OBNOXIOUS is similarly pretty standard (among French developers with a knack for non sense) for Obfuscated Blurbs Not Ordinarily Crossing/Intersecting Our Usual Standards.

In that specific case, though, I'm really interested in specifying that we expect hostBDF to be in the form B:D.F and matching what we find in the PCIDEVICE_... environment variable, and guestPath to be in the form GuestBridge/GuestDeviceID as matching entries in guest /sys/bus/pici/devices.

Regarding validation, I agree with you it can't be done easily that early. However, that made me think that we will need (in a later commit, no need to amend this PR) to add a validation that the loop rewriting the environment actually finds matching PCIDEVICE_ environment variables. If not, then things will fail, and I think we should at least Warn of not Error out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BDF is standard in PCI jargon, but not necessarily for this audience.

Hm, maybe. I mean, I'd prefer to call it "PCI address" which is what it is. Unfortunately the Kata code badly abuses "pci address" to mean several different things in different places. "BDF" was the most succinct way I could think of to make it clear that we're talking a normal DDDD:BB:SS.F style PCI address.

Generally, ACRONYMs in code are OBNOXIOUS, where ACRONYM is, obviously, pretty standard among l33t for Alphabetical Code for Remembering Odd Names You Make up, and OBNOXIOUS is similarly pretty standard (among French developers with a knack for non sense) for Obfuscated Blurbs Not Ordinarily Crossing/Intersecting Our Usual Standards.

As a general principle I absolutely disagree. Appropriate use of jargon - including acryonyms - is essential to making code readable without verbosity (which itself diminishes readability). Now, what jargon is appropriate for the context is a non-trivial question, I'll grant you.

In that specific case, though, I'm really interested in specifying that we expect hostBDF to be in the form B:D.F and matching what we find in the PCIDEVICE_... environment variable, and guestPath to be in the form GuestBridge/GuestDeviceID as matching entries in guest /sys/bus/pici/devices.

Regarding validation, I agree with you it can't be done easily that early. However, that made me think that we will need (in a later commit, no need to amend this PR) to add a validation that the loop rewriting the environment actually finds matching PCIDEVICE_ environment variables. If not, then things will fail, and I think we should at least Warn of not Error out.

Warn.. maybe. We absolutely shouldn't fail on this. It is entirely possible to have a working setup without any PCIDEVICE variables, and in fact it's quite likely if someone is creating a container manually (e.g. with podman) rather than with k8s. For such ad-hoc cases you can either manually set the right address in the payload, or have scripts inside the payload to locate the device via heuristics like "find the only VFIO-accessible PCI device of the type(s) I care about" (my test containers for development do exactly this).

Comment thread device.go Outdated
Comment thread device.go Outdated
Comment thread device.go Outdated
return updateSpecDevice(spec, devIdx, device.ContainerPath, device.VmPath, device.VmPath)
}

func VfioTranslateEnv(s *sandbox, env []string) []string {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this is where the magic happens. Now I finally get it. Smart.

I would add a comment showing the kind of inputs that you expects. Basically, something like "the container receives information about the PCI devices through environment variables that look like PCIDEVICE_ 0000:00:14.0=... etc, and explain the replacement you are doing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that function VfioTranslateEnv has no unit test.

Comment thread device.go Outdated
newEnv := make([]string, 0, len(env))

for _, e := range env {
if strings.HasPrefix(e, "PCIDEVICE_") {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make PCIDEVICE_ configurable, although it currently seems pretty hard-coded in the rest of the stack. That function should probably take that as input. Ask around to see if this logic would not apply to other env variables.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Configurable where, though?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it clear: that would have to happen later, not in this PR. Just a note to do it later.

The agent has a config.go that lists some agent options. That would be a good starting point. I would make it a list of prefixes for the names of environment variables to rewrite, default to [ "PCIDEVICE_" ].

Also, this is something that you might want to be able to override with annotations. See https://github.com/kata-containers/runtime/blob/master/virtcontainers/pkg/annotations/annotations.go#L233. I have not checked if agent annotations work well though.

Copy link
Copy Markdown
Contributor

@bpradipt bpradipt Sep 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The environment variable is actually specific to the Kubernetes device plugin being used. For example the Intel SRIOV plugin inserts PCIDEVICE_ (https://github.com/intel/sriov-network-device-plugin). Same with OpenShift SRIOV device plugin (https://github.com/openshift/sriov-network-device-plugin).
I don't think there is any standardization.
Having this environment variable prefix part of runtime config will be preferred

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 17, 2020

New push fixes some of Christoph's comments (others I still need to look at).

I now have a first draft of the kata-runtime changes which go with this at kata-containers/runtime#2963

@c3d
Copy link
Copy Markdown
Member

c3d commented Sep 17, 2020

New push fixes some of Christoph's comments (others I still need to look at).

I now have a first draft of the kata-runtime changes which go with this at kata-containers/runtime#2963

Cool, I will look at it. I will also update the comment I made regarding the runtime part with your link.

@dgibson dgibson added needs-forward-port Changes need to be applied to a newer branch / repository no-backport-needed Changed do not need to be applied to an older branch / repository labels Sep 18, 2020
@dgibson dgibson linked an issue Sep 18, 2020 that may be closed by this pull request
@dgibson dgibson self-assigned this Sep 18, 2020
Comment thread agent.go Outdated
Comment thread device.go
// If no ContainerPath is provided, we won't be able to match and
// update the device in the OCI spec device list. This is an error.
if device.ContainerPath == "" {
// Take the guest device with the given DDDD:BB:SS.F PCI address,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the comment refer to "DDDD:BB:DD.F" ? (Domain:Bus:Dev.Func )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess S stands for slot? D is already taken by "Domain", so having it twice would be ambiguous.

Comment thread device.go
// If no hostPath is provided, we won't be able to match and
// update the device in the OCI spec device list. This is an
// error.
if hostPath == "" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a check for empty vmPath as well?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's empty, it will fail on the syscall.Stat() below with a somewhat sensible error message, so I would not create a special case for that.

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 24, 2020

New update is a substantial rework, rebased on #850

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 24, 2020

/test-ubuntu

@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 24, 2020

Codecov Report

Merging #845 into master will decrease coverage by 1.38%.
The diff coverage is 19.76%.

@@            Coverage Diff             @@
##           master     #845      +/-   ##
==========================================
- Coverage   57.72%   56.33%   -1.39%     
==========================================
  Files          17       17              
  Lines        2370     2437      +67     
==========================================
+ Hits         1368     1373       +5     
- Misses        840      904      +64     
+ Partials      162      160       -2     

dgibson and others added 7 commits October 29, 2020 11:47
Currently TestGetDeviceName checks just one pair of sysfs path with
/dev node path, with a fair bit of setup to do that.  To allow future
tests with different pairs of sys/dev paths, factor out the guts of
the test into a parameterized helper `oneGetDeviceNameTest`.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently getDeviceName() and listenToUdevEvents() which supplies it with
information ignore events which don't supply a /dev path.  However, we have
upcoming use cases where we need to wait for a device (in the broad sense)
to be ready, even though it doesn't have an actual /dev node.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
hotplugTimeout controls how long we'll wait for a uevent indicating that a
device is ready in getDeviceName().  It's currently 3s, which is plenty of
time to allow for udev processing that we're usually dealing with.

However, we have upcoming cases where we may need to wait for a complete
SHPC PCI hotplug to complete.  SHPC has a 5s delay built into the protocol
(and therefore the guest implementations).

To accomodate this case, increase the timeout to 10s.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If extra devices are given in the OCI spec, the user will expect them
to be ready and available once the (inner) container starts executing.
So, if that means hotplugging something into the Kata VM, we need to
wait for that hotplug operation to complete before executing the
container.  For most devices that's handled by getDeviceName() which
waits for a uevent indicating the device is ready.

VFIO devices, however, don't have any handler so we don't explicitly
wait for them to be ready.  We usually get away with it, because we
force a PCI rescan in in finishCreateContainer() which blocks until
the rescan is complete.  That really only works by accident though,
completing a rescan doesn't necessarily mean all the device probing
logic is complete.  Worse, in some cases the PCI rescan can collide
with the hotplug processing and cause the device to go into a broken
state.

So, instead of relying on the rescan, accept information from the
runtime about what VFIO devices we expect, and explicitly wait for
them to be ready.

VFIO devices in Kata are (for now) weird - they will bind to whatever
the Kata VM's driver for them is, so could appear as any kind of
device (char, block, network interface, etc.).  That means waiting for
just the PCI device to be ready isn't foolproof, since there could be
some extra processing time for the driver to complete probing and
create the appropriate secondary devices.

But, really, that was true of the forced rescan as well, so this is a
clear improvement.

fixes #781
Depends-on: github.com/kata-containers/runtime#2981

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
PCI bus rescan code was added long time ago in Clear Containers due to lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround to this issue is the "lazy attach"
mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus, unfourtunately some non-LBS devices are being affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices because kata will end up lazy-attaching all the devices, having said
that, the PCI bus rescan code and the "lazy attach" mechanism should be removed

I'm not sure why, but this seems to expose a problem in TestStorageHandlers
where in some cases it relies on sb.deviceWatchers being initialized, but
it isn't.  So, fix that up as well.

fixes #781
fixes kata-containers/runtime#2664

[1] clearcontainers/agent#139
[2] kata-containers/runtime#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
VFIO devices can be added to a Kata container and they will be passed
through to the sandbox guest.  However, inside the guest those devices
will generally bind to a native guest driver, so they will no longer
appear as VFIO devices within the guest.  This behaviour differs from
runc or other conventional container runtimes.

This code allows the agent to match the behaviour of other runtimes,
if instructed to by kata-runtime.  VFIO devices its informed about
with the "vfio" type instead of the existing "vfio-vm" type will be
rebound to the vfio-pci driver within the guest.

Fixes: #833

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Each VFIO device passed into the guest could represent a whole IOMMU group
of devices on the host.  Since these devices aren't DMA isolated from each
other, they most appear as the same IOMMU group in the guest as well.

qemu should enforce that for us, but double check it, since things can't
work otherwise.  This also means we determine the guest IOMMU group ID,
which we'll be needing later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
updateSpecDeviceList is used to update the container configuration to
change devices major/minor numbers configured by the Kata client based
on host details to values suitable for the sandbox VM, which may
differ.  It takes a 'device' object, but the only things it actually
uses from there are ContainerPath and VmPath.

Refactor this as updateSpecDevice(), taking the host and guest paths
to the device as explicit parameters.  This makes the function more
self-contained and will enable some future extensions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
…e host

Currently, updateSpecDevice() assumes that the proper device path in the
inner container is the same as the device path specified in the outer OCI
spec on the host.

Usually that's correct.  However for VFIO group devices we actually need
the container to see the VM's device path, since it's normal to correlate
that with IOMMU group information from sysfs which will be different in the
guest and which we can't namespace away.

So, add an extra "finalPath" parameter to updateSpecDevice to allow callers
to chose the device path that should be used for the inner container.  All
current callers pass the same thing as ContainerPath, but that will change
in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
No really new logic here, we just need to expand the filters to allow the
udev events for VFIO group devices through.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Add the vfio devices to the inner container spec so that libcontainer will
make them.  Requires several hacks to work around problems for now.

Fixes: #833

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Copy link
Copy Markdown
Member

@c3d c3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not see any serious functional issue, so approving.

A few minor suggestions, like a typo in a commit message, comments, etc.

Comment thread device_test.go
}

func TestGetDeviceName(t *testing.T) {
func oneGetDeviceNameTest(t *testing.T, sysName, devName, watchFor string) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for consistency with the other non-exported test functions, I'd call it testGetOneDeviceName. See for example testAddDeviceSuccessful.

Comment thread device.go
for key, value := range s.sysToDevMap {
if strings.Contains(key, devID) {
devName = value
devName, found = value, true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the style change 😄

Comment thread agent.go

// Timeout waiting for a device to be hotplugged
var hotplugTimeout = 3 * time.Second
var hotplugTimeout = 10 * time.Second
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with a comment by @devimc that this is high enough to have us being killed by the upper layer. I would simply add a comment to that effect, indicating that this timeout is likely to be ineffective. Or maybe drop it to 9s so that we'd have a chance to report the problem before being summarily killed.

Comment thread device.go
return nil
}

func vfioDeviceHandler(ctx context.Context, device pb.Device, spec *pb.Spec, s *sandbox, devIdx devIndex) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The fact that there were errors in my description only highlights how complex the process is.

Alternatively, consider making a diagram in a doc, then adding a comment with a reference to that diagram.

Comment thread device.go
// If no ContainerPath is provided, we won't be able to match and
// update the device in the OCI spec device list. This is an error.
if device.ContainerPath == "" {
// Take the guest device with the given DDDD:BB:SS.F PCI address,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess S stands for slot? D is already taken by "Domain", so having it twice would be ambiguous.

Comment thread device.go
fieldLogger := agentLog.WithField("host-dev", device.ContainerPath)

var rebindToVfio bool
var group string
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, github has no good way to comment on the commit message.

other, they most appear as the same IOMMU group in the guest as well.

Typo: s/most/must/.

Comment thread device.go
// If no hostPath is provided, we won't be able to match and
// update the device in the OCI spec device list. This is an
// error.
if hostPath == "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's empty, it will fail on the syscall.Stat() below with a somewhat sensible error message, so I would not create a special case for that.

Comment thread device.go
stat := syscall.Stat_t{}
if err := syscall.Stat(device.VmPath, &stat); err != nil {
if err := syscall.Stat(vmPath, &stat); err != nil {
return err
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be useful to wrap this error in some context? Something like

err = grpcStatus.Errorf(codes.Internal, "Invalid device VM path %s: %s", vmPath, err)

@fidencio
Copy link
Copy Markdown
Member

fidencio commented May 4, 2021

Folks, this PR has been stale and I'm closing it as we're walking towards the final 1.13.0 release of the kata-containers-1.x.
From now on, all the development will happen as part of https://github.com/kata-containers/kata-containers/ and fixes should be provided to that repo.

Thanks a lot for your contribution, we're sorry that for some reason it got stale, and we hope to see you around for new adventures in the kata-containers-2.x land.

@fidencio fidencio closed this May 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

needs-forward-port Changes need to be applied to a newer branch / repository no-backport-needed Changed do not need to be applied to an older branch / repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kata should be able to passthrough VFIO devices *as* VFIO devices

4 participants