Skip to content
This repository was archived by the owner on May 12, 2021. It is now read-only.

Provide information to agent to let it safely wait for VFIO devices to complete hotplug#2981

Closed
dgibson wants to merge 5 commits into
kata-containers:masterfrom
dgibson:bug2664
Closed

Provide information to agent to let it safely wait for VFIO devices to complete hotplug#2981
dgibson wants to merge 5 commits into
kata-containers:masterfrom
dgibson:bug2664

Conversation

@dgibson
Copy link
Copy Markdown
Contributor

@dgibson dgibson commented Sep 23, 2020

This is the runtime part of the fix for #2664 and kata-containers/agent#781.

The agent part is kata-containers/agent#850

@dgibson dgibson self-assigned this Sep 23, 2020
@dgibson dgibson added needs-forward-port Changes need to be applied to a newer branch / repository no-backport-needed Changed do not need to be applied to an older branch / repository labels Sep 23, 2020
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 23, 2020

/test-ubuntu

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 24, 2020

/test-ubuntu

@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 24, 2020

Codecov Report

Merging #2981 (f3f282e) into master (2a98f43) will increase coverage by 0.27%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2981      +/-   ##
==========================================
+ Coverage   50.35%   50.63%   +0.27%     
==========================================
  Files         120      119       -1     
  Lines       15918    16719     +801     
==========================================
+ Hits         8016     8466     +450     
- Misses       6812     7117     +305     
- Partials     1090     1136      +46     

Copy link
Copy Markdown

@devimc devimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @dgibson - I have some questions

Comment thread virtcontainers/device/drivers/vfio.go Outdated
Comment thread virtcontainers/qemu.go
Copy link
Copy Markdown

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dgibson - a few comments.

Comment thread virtcontainers/device/drivers/vfio.go
Comment thread virtcontainers/kata_agent.go
Comment thread virtcontainers/qemu.go Outdated
Comment thread virtcontainers/qemu.go Outdated
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 30, 2020

/test

Copy link
Copy Markdown

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dgibson.

lgtm

Comment thread virtcontainers/kata_agent.go
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Sep 30, 2020

/test

@dgibson dgibson closed this Oct 1, 2020
@dgibson dgibson reopened this Oct 1, 2020
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 1, 2020

/retest

@dgibson dgibson requested a review from amshinde October 1, 2020 05:56
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 7, 2020

/test-vfio

@dgibson dgibson added the do-not-merge PR has problems or depends on another label Oct 7, 2020
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 7, 2020

/test-vfio

Comment thread virtcontainers/qemu.go
Comment thread virtcontainers/clh.go Outdated
@dgibson dgibson marked this pull request as draft October 8, 2020 00:40
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 28, 2020

/test-vfio

1 similar comment
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 29, 2020

/test-vfio

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 29, 2020

Bother. I've fixed the failure with clh, but now there's another failure in the VFIO check which I don't have any theory to explain.

@devimc
Copy link
Copy Markdown

devimc commented Oct 29, 2020

/test-vfio

@devimc
Copy link
Copy Markdown

devimc commented Oct 29, 2020

@dgibson this is the error:

rpc error: code = DeadlineExceeded desc = Timeout reached after 10s waiting for device 0000:00:02.0/0000:01:01.0

I have restarted the job and fixed the teardown process, The runtime logs will be available here http://jenkins.katacontainers.io/job/kata-containers-runtime-vfio-PR/261/ once the job finish

@devimc
Copy link
Copy Markdown

devimc commented Oct 29, 2020

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Oct 29, 2020

@dgibson this is the error:

rpc error: code = DeadlineExceeded desc = Timeout reached after 10s waiting for device 0000:00:02.0/0000:01:01.0

Yeah, I got that far. I don't know why that error is occuring - it doesn't happen when I try it by hand, it's not obvious what's different about the CI environment, and it's far from easy to replicate exactly what the CI is doing.

@jodh-intel
Copy link
Copy Markdown

@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Nov 25, 2020

/test-vfio

dgibson and others added 5 commits December 2, 2020 14:50
Bring in the QomGet function.

shortlog:

Chenbin (1):
      typo fix

David Gibson (1):
      Add qom-get function

Jakob-Naucke (1):
      Add support for hot-plugging IBM VFIO-AP devices

Julio Montes (3):
      travis: Run coveralls after success
      github: enable github actions
      travis: disable amd64 jobs

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
hotplugVFIODevice() has several different paths depending if we're plugging
into a root port or a PCIE<->PCI bridge and if we're using a regular
or mediated VFIO device.

We're going to want some common code on the successful exit path here, so
refactor the function to allow that without duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
For several device types which correspond to a PCI device in the guest
we record the device's PCI path in the guest.  We don't currently do that
for VFIO devices, but we're going to need to for better handling of
SR-IOV devices.

To accomplish this, we have to determine the guest PCI path from the
information the VMM gives us:

For qemu, we query the slot of the device and its bridge from QMP.

For cloud-hypervisor, the device add interface gives us a guest PCI
address.  In fact this represents a design error in the clh API -
there's no way it can really know the guest PCI address in general.
It works in this case, because clh doesn't use PCI bridges, so the
device will always be on the root bus.  Based on that, the PCI path is
simply the device's slot number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We send information about several kinds of devices to the agent so that it
can apply specific handling.  We don't currently do this with VFIO devices.
However we need to do that so that the agent can properly wait for VFIO
devices to be ready (previously it did that using a PCI rescan which may
not be reliable and has some very bad side effects).

This patch collates and sends the relevant information.

Depends-on: github.com/kata-containers/agent#850
fixes #2664

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The "lazy attach" mechanism [1] was added to hotplugs LBS (Large BAR space)
devices after re-scanning the PCI bus, fixing LBS hotplug in kata containers.
Since PCI rescan is removed in kata-containers/agent#782, lazy attach is not
longer needed.

fixes #2664

[1] #2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
@dgibson
Copy link
Copy Markdown
Contributor Author

dgibson commented Dec 18, 2020

I'm no longer planning to pursue this in Kata1, I'll be following up in Kata 2 instead.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

do-not-merge PR has problems or depends on another needs-forward-port Changes need to be applied to a newer branch / repository no-backport-needed Changed do not need to be applied to an older branch / repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants