WIP: run compose in small supermin VM #124
dustymabe wants to merge 2 commits into coreos:master from dustymabe:dusty
Conversation
this is a WIP. i'm still experimenting a bit, but wanted to push my code up and socialize this a bit. I'm planning on breaking this into a few separate commits.
IOW maybe hold off on nit comments and focus on architectural ones
```shell
export manifest=${configdir}/manifest.yaml
export superminpreparedir="${workdir}/tmp/supermin-prepare.d"
export superminbuilddir="${workdir}/tmp/supermin-build.d"
export cachesimg="${workdir}/caches.qcow2"
```
Some of this could be split into a prep commit.
yep. i want to do that and also was thinking about making global variables all caps.
```shell
rpms+=' systemd'                                                # for clean reboot
rpms+=' dhcp-client bind-export-libs iproute'                   # networking
rpms+=' rpm-ostree distribution-gpg-keys'                       # to run the compose
rpms+=' selinux-policy selinux-policy-targeted policycoreutils' # selinux
```
Maybe nicer to maintain this as a `supermin.txt` file or so in our source?
One thing to think about too is that supermin also supports `--build` explicitly which we could do as part of our container build. (The whole libguestfs/supermin was designed around pulling content dynamically from the host, but in our case the container is static)
> Maybe nicer to maintain this as a `supermin.txt` file or so in our source?

i.e. just the rpm requirements in a separate file?
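A minimal sketch of what that split could look like, assuming a hypothetical `supermin-rpms.txt` with one package group per line and `#` comments (the file name and format are my invention, not something from this PR):

```shell
# Hypothetical supermin-rpms.txt; name and format are illustrative only.
rpmlist=$(mktemp)
cat > "${rpmlist}" <<'EOF'
systemd                                # for clean reboot
dhcp-client bind-export-libs iproute   # networking
rpm-ostree distribution-gpg-keys       # to run the compose
EOF

# Strip comments and blank lines, then join everything into one
# space-separated string matching the shape of the existing $rpms variable.
rpms=$(sed 's/#.*//' "${rpmlist}" | xargs)
echo "${rpms}"
```

This keeps the package list reviewable as data while the build script stays unchanged apart from reading `$rpms` from the file.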
> One thing to think about too is that supermin also supports `--build` explicitly which we could do as part of our container build. (The whole libguestfs/supermin was designed around pulling content dynamically from the host, but in our case the container is static)
yeah I do a `--build` below, right? I'd rather not do the `--build` as part of the container build because it would just add a rootfs filesystem to our container and make it larger to download with no real benefit. That's why I put it as part of the coreos-assembler init.
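For context, a rough sketch of that init-time flow using the flags from supermin(1). This is an illustration, not the PR's actual code: `$rpms` and the directory variables mirror the snippets above, and the 5G size is an arbitrary pick:

```shell
# Sketch: build the supermin appliance once, at coreos-assembler init
# time, rather than baking it into the container image.
build_appliance() {
    # Resolve the package list into a minimal supermin input tree
    supermin --prepare ${rpms} -o "${superminpreparedir}"
    # Reconstruct a bootable kernel/initrd/root for qemu from that tree
    supermin --build "${superminpreparedir}" --size 5G -f ext2 \
        -o "${superminbuilddir}"
}
```

The trade-off discussed below is exactly where this function runs: at init it keeps the container small but costs CPU on each rebuild; at container-build time it is the reverse.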
Well, the benefit is it's faster to run each time.
Note that coreos-assembler init is intended to be run only once. If the VM image is cached there, then when you pull a new container, it will be out of date.
We have `cache/`... but really every time the container changes, the VM image is going to need to change anyway. And with this, the container doesn't do much without the VM image, so... we're trading off download/disk space versus CPU usage every update.
```shell
qemu-kvm -nodefaults -nographic -m 2048 -no-reboot \
    -kernel "${superminbuilddir}/kernel" \
    -initrd "${superminbuilddir}/initrd" \
    -netdev user,id=eth0,hostname=supermin,smb="${workdir}",hostfwd=tcp:127.0.0.1:8000-:8000 \
```
Since today we run with --net=host, suddenly we'll need to worry about port conflicts. It is tempting to go back to the default bridged networking. I forget if there was a reason I added --net=host to the default arguments...oh oh right, it's because I use dns=dnsmasq on my host and podman today can't handle that. Urgh. I can deal with working around that here.
But in general...for sharing content between the container and VM I lean a bit towards using say rsync-over-ssh-over-virtio or something like that - there's no good reason to use TCP here.
actually I can remove the hostfwd I don't actually need it. I was experimenting to see if pulling via HTTP was any faster than doing the ostree pull-local over smb. it wasn't, so I'll just remove this.
> Since today we run with `--net=host`

I've actually been running without `--net=host` in my tests, so my plan was to actually remove that. Are you saying that would be problematic?
It's OK to remove if that turns out to be useful, I'll figure out how to make my current setup work with default bridged.
```shell
    -drive if=none,id=drive-scsi0-0-0-0,snapshot=on,file="${superminbuilddir}/root" \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
    -drive if=none,id=drive-scsi0-0-0-1,discard=unmap,file="${cachesimg}" \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 \
```
Where did you get this list of arguments?
handcrafted. some of it I copied from other files in this repo. the scsi lines are there so I can fstrim inside the VM and have it give free space back to the host. I copied those by setting up a libvirt VM with virtio-scsi and discard=unmap and then inspecting the qemu arguments.
see: https://dustymabe.com/2013/06/11/recover-space-from-vm-disk-images-by-using-discardfstrim/
```shell
# set up networking
/usr/sbin/dhclient eth0 &
sleep 2 # wait for dhcp
```
haha, yeah it's possible I don't need that at all, I just figured I'd need to wait some amount of time :)
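One way to avoid the fixed sleep is a small poll loop; `wait_for` here is a hypothetical helper, and the `ip`/`grep` predicate is just one possible readiness check, not code from this PR:

```shell
# Poll until the given command succeeds, up to ~30s, instead of
# guessing how long dhcp will take.
wait_for() {
    local i
    for i in $(seq 1 30); do
        "$@" && return 0
        sleep 1
    done
    return 1
}

# In the appliance init, this might become:
#   /usr/sbin/dhclient eth0 &
#   wait_for sh -c "ip -4 addr show eth0 | grep -q 'inet '"
```

This returns as soon as the lease lands rather than always paying the full delay, and it fails loudly (non-zero) if networking never comes up.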
Are either of you having issues with dhclient not being able to find `libirs-export.so.160`?
Consider adding `/sbin/ldconfig` to the init file.
this is my list of rpms for "networking" that I had listed: `rpms+=' dhcp-client bind-export-libs iproute' # networking`
I have those rpms installed into the supermin appliance. However, dhclient only worked after I relinked the libraries. If you aren't having this issue, great! I'm trying to follow your script through step by step to understand.
```shell
#bash -i

# Sync over to repo in $hostworkdir
if [ "${unifiedcore}" = "1" ]; then
```
I am a bit confused by the unifiedcore conditionals here. Also in general...I would be OK for now just hard requiring unified core. We can eat the cost of making RHCOS work with it.
yeah I was hoping to still work with RHCOS - I'll drop it if we can hard require it
I'll take ownership of making --unified-core work with RHCOS.
High level... I understand the rationale for this, not having this container require

The main question is - do we support both modes? It's going to be a serious pain to maintain. One thing that will suck in this mode is things like - what happens if rpm-ostree dumps core? Getting coredumps out of these supermin VMs is going to require some hacks.

One angle I look at this from is that IMO a very important property of ostree is "it's just files". It works on top of whatever block storage you have. It was also designed from the very start to operate unprivileged - there are just some last bits to get over the hump for rpm-ostree. Building from that angle, on the client side we don't do anything with VMs - it's all again just files. Shoving the whole operation in a VM to work around recursive containerization bugs... well, it's just unfortunate. But, it is probably the right thing to do.
first off, thanks for the quick look through :)
I thought about that. There are some definite drawbacks to running in a VM. It's definitely slower. The first build takes some time. Incremental builds are much better (probably the longest operation during those is dracut initrd generation). That being said I really think we should only support one mode of operation so we don't see issues when building locally vs in the build system.
yep, this is a short term solution until we clear that hurdle
This is cool. I never actually tried out
How much slower are we talking about? Are virtualized incremental builds much slower than native incremental builds?
yeah I had never dug into it deep before. it's really slick software
Actually I'm not worried about the speed of incremental builds here as those are reasonable I think. The first build takes a while as there is a lot of file I/O going over the smb file share from the supermin VM back to the host (coreos-assembler container). I'm looking at other ways to speed this up. Still investigating :)
Are you actively hacking on this still? If so you may want to de-prioritize a bit, as for RHCOS we may face a very quick requirement to run without privileges but just to generate the oscontainers. I thought of a new approach to do this in rpm-ostree that should be less invasive/hacky.
I took a few tangents and found a few bugs, but yeah, was planning to get back here and clean things up as we had discussed in the comments above.
so basically rpm-ostree would be able to run unpriv by default?
putting this on hold until I hear back from @cgwalters on his experiment.
Allows for unprivileged runs of rpm-ostree compose tree.
This simplifies things a bit. Also the xattr copying from bare-user to archive repo over smb file share has been fixed.
Does this supersede #113?
not really. lorax is used to build ISO images. This just moves the treecompose we're already doing today to inside a VM so that we can run the process completely unprivileged. I think colin is doing some work on the ostree/rpm-ostree side to make it so we can run unprivileged even without running in a VM, so I added the hold label to this PR.
Ah ok, thank you for clarifying.
The run-in-Docker work in rpm-ostree is probably going to stall out for a while so let's see about getting this in.
yeah there will be a series of PRs for this soon
Just gonna keep track of issues I'm hitting here while testing this:
Hmm, and now I'm getting: Which I've definitely seen before but don't remember where it comes from. I think it's from
This is my real concern. The simplicity of the current approach is just so much nicer and easier to hack on. But I agree at the same time that we should really try hard to support just one mode.

Part of me wonders if this shouldn't actually be external to

Not sure how much of this makes sense, just thinking out loud.
are you using my master branch (not this PR branch?)
can't remember if I've seen that or not
no they live inside a disk attached to the supermin VM.
Yup!
Heh, yeah shortly after posting that comment I saw the
Like, we copy all of the container contents in there?
It would be something outside of the container. So the Jenkins job would launch a VM in which we pull the container-assembler image and run it as usual. E.g. something like: So basically the same workflow
I think that will work, but one tricky aspect is going to be ensuring the VM pulls the same container content, and that will definitely matter in some cases (e.g. imagine the build system is using a pinned version). It feels like it'd be better if we just copy the running container content into the VM, which is basically what supermin is helping us do.
That's just a variable substitution in that
Essentially in that workflow, there wouldn't be a parent
yeah the biggest problem I see here is "copying content". It gets a lot heavier when we're not just relying on openshift to manage our container images, but also have a VM disk image to manage (or are we using supermin again?). Where do we cache things? I'd like to minimize the I/O if possible.
OSTree was designed from the very beginning of its existence to support SELinux well instead of being something wedged on. rpm-ostree builds on that foundation. We don't want to have anything to do with librpm's SELinux code. And with unified core, we usually don't, but that `rpm-plugin-selinux` code does get loaded. Disable it here.

The main reason I'm submitting this patch is to help an effort in coreos-assembler to use a "supermin" virtual machine: coreos/coreos-assembler#124

Closes: #1647
Approved by: jlebon
Was chatting with Dusty about this, and it essentially came down to whether we want to keep the complexity out of coreos-assembler, or build it in (what this PR does). My main concern was the increased complexity for the local dev case. Though given that both FCOS and RHCOS pipelines will need fully unprivileged composes, I agree that it's odd not to bake it in. Hopefully once unprivileged rpm-ostree composes become supported we can unwind some of this stuff.

Anyway, with coreos/fedora-coreos-config#22 fixed, this works for me locally now, though I'd like to reproduce the I/O issues that Dusty was seeing when trying to compose directly into the mounts so we can avoid doing the
So I was hacking on this and getting more familiar with 9p to try hard to avoid copying data back and forth. My impressions are that it's not quite prod ready yet. There's various issues that make composing directly into the mounts an issue. The two major ones are:
Random idea: teach
yep. I hit a ton of issues when trying to "compose into" a bare user repo mounted over 9p. What I did get to work was a "pull-local" into an archive repo that is mounted over 9p so that's what I'm using now.
I think I need more details on this to understand what you are proposing. Maybe grab me in IRC.
This is a rebased rework of coreos#124 with some modifications:

- We auto-detect if we have CAP_SYS_ADMIN and if not, fall back to using supermin. My position is that both approaches will be in use in CI contexts and that the privileged case is way faster for local dev, where iterating fast on the content matters. I've also hopefully implemented things in a way that maintains almost the exact same logic build-wise between the two flows so there's not too much divergence. Anyway, totally open to revisiting this if needed!
- In the virtualized path, we only cross data over the mount point once: when we pull-local back into the archive repo. The pkgcache, dnf metadata, and build repo are all in the same filesystem.
- We drop the repo-build/ repo since it's essentially also a cache and duplicates content from the archive repo. This is also needed to ensure that the pkgcache repo and the repo we commit into are both on the same file system.
- The supermin appliance is reused if already generated; the `runvm` command just takes the command you want to run verbatim and plops it into a file the appliance is already coded to check from.

Some other minor fixes:

- We handle symlinked repos.
- Split out supermin packages into a separate file.

Originally based on a patch by: Dusty Mabe <dusty@dustymabe.com>
This is a rebased rework of coreos#124 with some modifications:

- We auto-detect if we have CAP_SYS_ADMIN and if not, fall back to using supermin. My position is that both approaches will be in use in CI contexts and that the privileged case is faster for local dev, where iterating fast on the content will matter. I've also hopefully implemented things in a way that maintains almost the exact same logic build-wise between the two flows so there's not too much divergence. Anyway, totally open to revisiting this if needed!
- In the virtualized path, `fetch` now directly populates the qcow2 cache so that the split `fetch`/`build` approach keeps working as expected.
- We drop the repo-build/ repo since it's essentially also a cache and duplicates content from the archive repo. This is also needed to ensure that the pkgcache repo and the repo we commit into are both on the same file system.
- The supermin appliance is reused if already generated; the `runvm` command just takes the command you want to run verbatim and plops it into a file the appliance is already coded to check from.

Some other minor fixes:

- We handle symlinked repos.
- Split out supermin packages into a separate file.
- Capture rc and bubble that up to the `runvm` caller.
- Add virtio-rng device.

Originally based on a patch by: Dusty Mabe <dusty@dustymabe.com>
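The "capture rc and bubble that up to the `runvm` caller" item can be sketched roughly as follows. This is an illustrative assumption about the mechanism (the file name and layout are invented, not the actual #190 code); the motivation is that qemu itself exits 0 even when the command inside the VM fails, so a side channel is needed:

```shell
# Simulate the VM side: run the compose command and record its exit
# status on the shared workdir (a tempdir stands in for the 9p/smb mount).
workdir=$(mktemp -d)
( exit 3 )                       # stand-in for a failing compose command
echo $? > "${workdir}/rc"

# Host side, after qemu returns: read the status back and propagate it
# so the runvm caller sees the real failure.
rc=$(cat "${workdir}/rc")
if [ "${rc}" != "0" ]; then
    echo "runvm: command failed with rc=${rc}" >&2
fi
```

A real implementation would also need to treat a missing `rc` file as failure, since a VM crash would skip the write entirely.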
OK, I've reworked this PR in #190! |
closing in favor of #190 |