WIP: run compose in small supermin VM #124
dustymabe wants to merge 2 commits into coreos:master from dustymabe:dusty
Conversation
this is a WIP. i'm still experimenting a bit, but wanted to push my code up and socialize this a bit. I'm planning on breaking this into a few separate commits.
IOW maybe hold off on nit comments and focus on architectural ones
```shell
export manifest=${configdir}/manifest.yaml
export superminpreparedir="${workdir}/tmp/supermin-prepare.d"
export superminbuilddir="${workdir}/tmp/supermin-build.d"
export cachesimg="${workdir}/caches.qcow2"
```
Some of this could be split into a prep commit.
yep. i want to do that and also was thinking about making global variables all caps.
```shell
rpms+=' systemd'                                                # for clean reboot
rpms+=' dhcp-client bind-export-libs iproute'                   # networking
rpms+=' rpm-ostree distribution-gpg-keys'                       # to run the compose
rpms+=' selinux-policy selinux-policy-targeted policycoreutils' # selinux
```
Maybe nicer to maintain this as a `supermin.txt` file or so in our source?
One thing to think about too is that supermin also supports `--build` explicitly which we could do as part of our container build. (The whole libguestfs/supermin was designed around pulling content dynamically from the host, but in our case the container is static)
> Maybe nicer to maintain this as a `supermin.txt` file or so in our source?

i.e. just the rpm requirements in a separate file?
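A minimal sketch of what that split could look like, assuming a hypothetical `supermin-rpms.txt` with one package group per line and `#` comments (the file name and format are my invention, not something from this PR):

```shell
# Hypothetical supermin-rpms.txt; name and format are illustrative only.
rpmlist=$(mktemp)
cat > "${rpmlist}" <<'EOF'
systemd                                # for clean reboot
dhcp-client bind-export-libs iproute   # networking
rpm-ostree distribution-gpg-keys       # to run the compose
EOF

# Strip comments and blank lines, then join everything into one
# space-separated string matching the shape of the existing $rpms variable.
rpms=$(sed 's/#.*//' "${rpmlist}" | xargs)
echo "${rpms}"
```

This keeps the package list reviewable as data while the build script stays unchanged apart from reading `$rpms` from the file.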
> One thing to think about too is that supermin also supports `--build` explicitly which we could do as part of our container build. (The whole libguestfs/supermin was designed around pulling content dynamically from the host, but in our case the container is static)
yeah I do a `--build` below, right? I'd rather not do the `--build` as part of the container build because it would just add a rootfs filesystem to our container and make it larger to download with no real benefit. That's why I put it as part of the coreos-assembler init.
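For context, a rough sketch of that init-time flow using the flags from supermin(1). This is an illustration, not the PR's actual code: `$rpms` and the directory variables mirror the snippets above, and the 5G size is an arbitrary pick:

```shell
# Sketch: build the supermin appliance once, at coreos-assembler init
# time, rather than baking it into the container image.
build_appliance() {
    # Resolve the package list into a minimal supermin input tree
    supermin --prepare ${rpms} -o "${superminpreparedir}"
    # Reconstruct a bootable kernel/initrd/root for qemu from that tree
    supermin --build "${superminpreparedir}" --size 5G -f ext2 \
        -o "${superminbuilddir}"
}
```

The trade-off discussed below is exactly where this function runs: at init it keeps the container small but costs CPU on each rebuild; at container-build time it is the reverse.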
Well, the benefit is it's faster to run each time.
Note that coreos-assembler init is intended to be run only once. If the VM image is cached there, then when you pull a new container, it will be out of date.
We have `cache/`... but really every time the container changes, the VM image is going to need to change anyway. And with this, the container doesn't do much without the VM image, so... we're trading off download/disk space versus CPU usage every update.
```shell
qemu-kvm -nodefaults -nographic -m 2048 -no-reboot \
    -kernel "${superminbuilddir}/kernel" \
    -initrd "${superminbuilddir}/initrd" \
    -netdev user,id=eth0,hostname=supermin,smb="${workdir}",hostfwd=tcp:127.0.0.1:8000-:8000 \
```
Since today we run with --net=host, suddenly we'll need to worry about port conflicts. It is tempting to go back to the default bridged networking. I forget if there was a reason I added --net=host to the default arguments...oh oh right, it's because I use dns=dnsmasq on my host and podman today can't handle that. Urgh. I can deal with working around that here.
But in general...for sharing content between the container and VM I lean a bit towards using say rsync-over-ssh-over-virtio or something like that - there's no good reason to use TCP here.
actually I can remove the hostfwd I don't actually need it. I was experimenting to see if pulling via HTTP was any faster than doing the ostree pull-local over smb. it wasn't, so I'll just remove this.
> Since today we run with `--net=host`

I've actually been running without `--net=host` in my tests, so my plan was to actually remove that. Are you saying that would be problematic?
It's OK to remove if that turns out to be useful, I'll figure out how to make my current setup work with default bridged.
```shell
    -drive if=none,id=drive-scsi0-0-0-0,snapshot=on,file="${superminbuilddir}/root" \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
    -drive if=none,id=drive-scsi0-0-0-1,discard=unmap,file="${cachesimg}" \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \
```
Where did you get this list of arguments?
handcrafted. some of it I copied from other files in this repo. the scsi lines are there so I can fstrim inside the VM and have it give free space back to the host. I copied those by setting up a libvirt VM with virtio-scsi and discard=unmap and then inspecting the qemu arguments.
see: https://dustymabe.com/2013/06/11/recover-space-from-vm-disk-images-by-using-discardfstrim/
```shell
# set up networking
/usr/sbin/dhclient eth0 &
sleep 2 # wait for dhcp
```
haha, yeah it's possible I don't need that at all, I just figured I'd need to wait some amount of time :)
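One way to avoid the fixed sleep is a small poll loop; `wait_for` here is a hypothetical helper, and the `ip`/`grep` predicate is just one possible readiness check, not code from this PR:

```shell
# Poll until the given command succeeds, up to ~30s, instead of
# guessing how long dhcp will take.
wait_for() {
    local i
    for i in $(seq 1 30); do
        "$@" && return 0
        sleep 1
    done
    return 1
}

# In the appliance init, this might become:
#   /usr/sbin/dhclient eth0 &
#   wait_for sh -c "ip -4 addr show eth0 | grep -q 'inet '"
```

This returns as soon as the lease lands rather than always paying the full delay, and it fails loudly (non-zero) if networking never comes up.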
Are either of you having issues with dhclient not being able to find `libirs-export.so.160`?
Consider adding `/sbin/ldconfig` to the init file.
this is my list of rpms for "networking" that I had listed: `rpms+=' dhcp-client bind-export-libs iproute' # networking`
I have those rpms installed into the supermin appliance. However, dhclient only worked after I relinked the libraries. If you aren't having this issue, great! I'm trying to follow your script through step by step to understand.
```shell
#bash -i

# Sync over to repo in $hostworkdir
if [ "${unifiedcore}" = "1" ]; then
```
I am a bit confused by the unifiedcore conditionals here. Also in general...I would be OK for now just hard requiring unified core. We can eat the cost of making RHCOS work with it.
yeah I was hoping to still work with RHCOS - I'll drop it if we can hard require it
I'll take ownership of making --unified-core work with RHCOS.
High level... I understand the rationale for this, not having this container require

The main question is - do we support both modes? It's going to be a serious pain to maintain. One thing that will suck in this mode is things like - what happens if rpm-ostree dumps core? Getting coredumps out of these supermin VMs is going to require some hacks.

One angle I look at this from is that IMO a very important property of ostree is "it's just files". It works on top of whatever block storage you have. It was also designed from the very start to operate unprivileged - there are just some last bits to get over the hump for rpm-ostree. Building from that angle, on the client side we don't do anything with VMs - it's all again just files. Shoving the whole operation in a VM to work around recursive containerization bugs... well, it's just unfortunate. But, it is probably the right thing to do.
first off, thanks for the quick look through :)
I thought about that. There are some definite drawbacks to running in a VM. It's definitely slower. The first build takes some time. Incremental builds are much better (probably the longest operation during those is dracut initrd generation). That being said I really think we should only support one mode of operation so we don't see issues when building locally vs in the build system.
yep, this is a short term solution until we clear that hurdle
This is cool. I never actually tried out
How much slower are we talking about? Are virtualized incremental builds much slower than native incremental builds?
yeah I had never dug into it deep before. it's really slick software
Actually I'm not worried about the speed of incremental builds here as those are reasonable I think. The first build takes a while as there is a lot of file I/O going over the smb file share from the supermin VM back to the host (coreos-assembler container). I'm looking at other ways to speed this up. Still investigating :)
Are you actively hacking on this still? If so you may want to de-prioritize a bit, as for RHCOS we may face a very quick requirement to run without privileges but just to generate the oscontainers. I thought of a new approach to do this in rpm-ostree that should be less invasive/hacky.
I took a few tangents and found a few bugs, but yeah, was planning to get back here and clean things up as we had discussed in the comments above.
so basically rpm-ostree would be able to run unpriv by default?
putting this on hold until I hear back from @cgwalters on his experiment.
Allows for unprivileged runs of rpm-ostree compose tree.
This simplifies things a bit. Also the xattr copying from bare-user to archive repo over smb file share has been fixed.
Does this supersede #113?
not really. lorax is used to build ISO images. This just moves the treecompose we're already doing today to inside a VM so that we can run the process completely unprivileged. I think colin is doing some work on the ostree/rpm-ostree side to make it so we can run unprivileged even without running in a VM, so I added the hold label to this PR.
Ah ok, thank you for clarifying.
The run-in-Docker work in rpm-ostree is probably going to stall out for a while so let's see about getting this in.
yeah there will be a series of PRs for this soon
Just gonna keep track of issues I'm hitting here while testing this:
Hmm, and now I'm getting: Which I've definitely seen before but don't remember where it comes from. I think it's from
This is my real concern. The simplicity of the current approach is just so much nicer and easier to hack on. But I agree at the same time that we should really try hard to support just one mode.

Part of me wonders if this shouldn't actually be external to

Not sure how much of this makes sense, just thinking out loud.
are you using my master branch (not this PR branch?)
can't remember if I've seen that or not
no they live inside a disk attached to the supermin VM.
Yup!
Heh, yeah shortly after posting that comment I saw the
Like, we copy all of the container contents in there?
It would be something outside of the container. So the Jenkins job would launch a VM in which we pull the container-assembler image and run it as usual. E.g. something like: So basically the same workflow
I think that will work, but one tricky aspect is going to be ensuring the VM pulls the same container content, and that will definitely matter in some cases (e.g. imagine the build system is using a pinned version). It feels like it'd be better if we just copy the running container content into the VM, which is basically what supermin is helping us do.
That's just a variable substitution in that
Essentially in that workflow, there wouldn't be a parent
yeah the biggest problem I see here is "copying content". It gets a lot heavier when we're not just relying on openshift to manage our container images, but also have a VM disk image to manage (or are we using supermin again?). Where do we cache things? I'd like to minimize the I/O if possible.
OSTree was designed from the very beginning of its existence to support SELinux well instead of being something wedged on. rpm-ostree builds on that foundation. We don't want to have anything to do with librpm's SELinux code. And with unified core, we usually don't, but that `rpm-plugin-selinux` code does get loaded. Disable it here.

The main reason I'm submitting this patch is to help an effort in coreos-assembler to use a "supermin" virtual machine: coreos/coreos-assembler#124

Closes: #1647
Approved by: jlebon
Was chatting with Dusty about this, and it essentially came down to whether we want to keep the complexity out of coreos-assembler, or build it in (what this PR does). My main concern was the increased complexity for the local dev case. Though given that both FCOS and RHCOS pipelines will need fully unprivileged composes, I agree that it's odd not to bake it in. Hopefully once unprivileged rpm-ostree composes become supported we can unwind some of this stuff.

Anyway, with coreos/fedora-coreos-config#22 fixed, this works for me locally now, though I'd like to reproduce the I/O issues that Dusty was seeing when trying to compose directly into the mounts so we can avoid doing the
So I was hacking on this and getting more familiar with 9p to try hard to avoid copying data back and forth. My impressions are that it's not quite prod ready yet. There's various issues that make composing directly into the mounts an issue. The two major ones are:
Random idea: teach
yep. I hit a ton of issues when trying to "compose into" a bare user repo mounted over 9p. What I did get to work was a "pull-local" into an archive repo that is mounted over 9p so that's what I'm using now.
I think I need more details on this to understand what you are proposing. Maybe grab me in IRC.
This is a rebased rework of coreos#124 with some modifications:

- We auto-detect if we have CAP_SYS_ADMIN and if not, fall back to using supermin. My position is that both approaches will be in use in CI contexts and that the privileged case is way faster for local dev, where iterating fast on the content matters. I've also hopefully implemented things in a way that maintains almost the exact same logic build-wise between the two flows so there's not too much divergence. Anyway, totally open to revisiting this if needed!
- In the virtualized path, we only cross data over the mount point once: when we pull-local back into the archive repo. The pkgcache, dnf metadata, and build repo are all in the same filesystem.
- We drop the repo-build/ repo since it's essentially also a cache and duplicates content from the archive repo. This is also needed to ensure that the pkgcache repo and the repo we commit into are both on the same file system.
- The supermin appliance is reused if already generated; the `runvm` command just takes the command you want to run verbatim and plops it into a file the appliance is already coded to check from.

Some other minor fixes:

- We handle symlinked repos.
- Split out supermin packages into a separate file.

Originally based on a patch by: Dusty Mabe <dusty@dustymabe.com>
This is a rebased rework of coreos#124 with some modifications:

- We auto-detect if we have CAP_SYS_ADMIN and if not, fall back to using supermin. My position is that both approaches will be in use in CI contexts and that the privileged case is faster for local dev, where iterating fast on the content will matter. I've also hopefully implemented things in a way that maintains almost the exact same logic build-wise between the two flows so there's not too much divergence. Anyway, totally open to revisiting this if needed!
- In the virtualized path, `fetch` now directly populates the qcow2 cache so that the split `fetch`/`build` approach keeps working as expected.
- We drop the repo-build/ repo since it's essentially also a cache and duplicates content from the archive repo. This is also needed to ensure that the pkgcache repo and the repo we commit into are both on the same file system.
- The supermin appliance is reused if already generated; the `runvm` command just takes the command you want to run verbatim and plops it into a file the appliance is already coded to check from.

Some other minor fixes:

- We handle symlinked repos.
- Split out supermin packages into a separate file.
- Capture rc and bubble that up to the `runvm` caller.
- Add virtio-rng device.

Originally based on a patch by: Dusty Mabe <dusty@dustymabe.com>
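The "capture rc and bubble that up to the `runvm` caller" item can be sketched roughly as follows. This is an illustrative assumption about the mechanism (the file name and layout are invented, not the actual #190 code); the motivation is that qemu itself exits 0 even when the command inside the VM fails, so a side channel is needed:

```shell
# Simulate the VM side: run the compose command and record its exit
# status on the shared workdir (a tempdir stands in for the 9p/smb mount).
workdir=$(mktemp -d)
( exit 3 )                       # stand-in for a failing compose command
echo $? > "${workdir}/rc"

# Host side, after qemu returns: read the status back and propagate it
# so the runvm caller sees the real failure.
rc=$(cat "${workdir}/rc")
if [ "${rc}" != "0" ]; then
    echo "runvm: command failed with rc=${rc}" >&2
fi
```

A real implementation would also need to treat a missing `rc` file as failure, since a VM crash would skip the write entirely.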
OK, I've reworked this PR in #190! |
closing in favor of #190 |