Skip to content

Qemu: use nbd sockets for multipath disks#1315

Merged
openshift-merge-robot merged 1 commit intocoreos:masterfrom
darkmuggle:feature/nbd-disks
Apr 3, 2020
Merged

Qemu: use nbd sockets for multipath disks#1315
openshift-merge-robot merged 1 commit intocoreos:masterfrom
darkmuggle:feature/nbd-disks

Conversation

@darkmuggle
Copy link
Copy Markdown
Contributor

This changes up the multipath support by using a NBD server to open the
the qcow2 disks.

This was explored due to #1296 (comment)

if err != nil {
return err
}
defer inst.Destroy()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out we were leaving a bunch of swtpm defunct processes laying around. I had ~200 on my system.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably right, but we use prctl(PR_SET_PDEATHSIG) though which should have prevented that anyways.

if err != nil {
return err
}
defer inst.Destroy()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably right, but we use prctl(PR_SET_PDEATHSIG) though which should have prevented that anyways.

Comment thread mantle/platform/qemu.go Outdated
}
if inst.qemu != nil {
// check if qemu is dead before trying to kill it
_, notFound := os.FindProcess(int(inst.Pid()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But wait, we're the only thing that should be calling waitpid() on it, so it can't have gone away - it should be safe to call os.Kill() even on a zombie right?

(really what we want are pidfds but we can't rely on those yet)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for checking that was otherwise there's an exit error for Wait() having already been called.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right of course, because the main process blocks in Wait(). So I think the right way to fix this though is to nilify the pointers after calling Wait(). We already guard against nil above:

   if inst.qemu != nil {
       inst.qemu.Wait()
       inst.qemu = nil
   }

And the same for the inst.nbdServers, clear out the array via inst.nbdServers = nil afterwards.

Copy link
Copy Markdown
Member

@cgwalters cgwalters Apr 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Rust, this would be Option<Process> and we'd do something like:

if let Some(proc) = inst.qemu.take() {
   proc.wait()
}

And the "move semantics" here is a lot clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated that. exec.cmd.Kill() already calls Wait() anyway. https://github.com/coreos/mantle/blob/cl/system/exec/exec.go#L71

Comment thread mantle/platform/qemu.go Outdated
Comment thread mantle/platform/qemu.go Outdated
Comment thread mantle/platform/qemu.go Outdated
@darkmuggle darkmuggle added WIP PR still being worked on and removed WIP PR still being worked on labels Apr 2, 2020
@darkmuggle
Copy link
Copy Markdown
Contributor Author

Triple checked this. I've ensured:

  • regular disks work
  • all the temp files and sockets are pruned
  • no orphaned qemu-nbd or swtmp left laying around.

@ashcrow ashcrow requested review from bgilbert and removed request for ashcrow April 3, 2020 14:00
This changes up the multipath support by using a NBD server to open the
the qcow2 disks.
@cgwalters
Copy link
Copy Markdown
Member

/lgtm

@openshift-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, darkmuggle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,darkmuggle]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d898552 into coreos:master Apr 3, 2020
@darkmuggle darkmuggle deleted the feature/nbd-disks branch April 27, 2020 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants