Merge layering into master #3072

Closed · cgwalters wants to merge 43 commits into openshift:master from cgwalters:layering-merge-master

Conversation

@cgwalters
Member

Let's keep this PR as a running update. Goal: evaluate specific risky areas of code that might block a merge to master.

cheesesashimi and others added 30 commits March 7, 2022 16:10
…ering-test-package

adds e2e-layering test package
…layering-branch

sync layering with master
Part of openshift/enhancements#1032

We'll add the new-format image into the payload alongside the old
one until we can complete the transition.

(There may actually be a separate `rhel-coreos-extensions` image
 e.g. too, so this is just the start)

Note this PR is just laying groundwork; the new format container
will not be used by default.
Add `rhel-coreos` image in configmap and `oscontainer` in controllerconfig
This will keep layered and non-layered update logging consistent
To properly handle compression. Prep for using Butane.
Prep for using Butane APIs more directly as part of the layering
work.

The logic is also a bit reworked to generate a single Butane fragment
which we convert to Ignition in one go, instead of converting
Butane into Ignition repeatedly and using config merging.

There's only one wrinkle with doing this, which is that today in
the templates we have multiple files which are drop-ins for `crio.service`;
and we need to group those together.

(I think it would be cleaner to have them in a single file in the
 templates, but for clarity let's handle this)
We were testing this indirectly.
Update vendoring github.com/coreos/fcct → github.com/coreos/butane
Created MCONamespace constant and used in all *.go files except for
test/helpers/utils.go which would create a cyclic import
I don't know that we'll ultimately end up with imagestreams with these
names but for right now, I think these are at least the ones we think
we want that have purposes behind them.
Extends clientbuilder to build openshift Image and Build clients, as our
controllers will need to deal with those objects as part of the "build
controller" for layering.
Adds a shared ImageInformer to our shared controller context so we can watch imagestreams
in our controllers.
Allows using a label to designate a pool as a "layered" pool.

Adds an imagestream informer to watch for imagestream updates and
queue the pool again for image deployment.

Updates tests with new render controller signature.

Layered pools will create resources (buildconfigs/imagestreams owned by the pool) to build
in-cluster derived images.

Layered pools:
- Ensure that a base coreos imagestream exists
- Ensure that pool-specific imagestreams exist
- Ensure that pool-specific buildconfigs exist
- Render machineconfig into images in imagestreams
- Derive images from the coreos imagestream using the content from that
rendered-config imagestream via the aforementioned buildconfig
- Will be enqueued again (so node controller can sync them) when their corresponding imagestream gets
updated
For a "layered" pool, this makes the node controller:
- Wait for an image that is equivalent to the rendered config
- Once that image arrives, assign it as an additional (along with desiredConfig) desiredImage annotation on the
node

The "done" signal for a node at this point is still when currentConfig ==
desiredConfig, it's just that how we're going to get there now is via
the image rather than via applying the config directly.

Also refactored setDesiredMachineConfigAnnotation to a generic function that sets an arbitrary annotation
(because duplicating for image annotations made "verify" unhappy)
This adds some daemon constants for current/desired image annotations,
as well as extending our rpm-ostree "client bindings" to be able to:
- Execute a container rebase
- Parse more of the `rpm-ostree --status` JSON output so we can figure
out whether we're booted into a deployment or not (for live-apply
cases)
This gives RBAC permissions to the machine-config-controller service
account to manipulate Builds/BuildRequests and Imagestreams in the
machine-config-operator namespace, as well as push images into the internal registry.

This gives rbac permissions to the machine-config-daemon serviceaccount
to retrieve images from the internal registry.

This also updates the operator's sync so that these new registry role files will be
generated and deployed properly with the mco.
This creates a separate update flow for "layered" pools, so they will be
ignored/not be handled by the standard update function.

This will not impact any pool that is not layered.

This also tells the config drift detector to ignore OSImageURL for
layered pools because it will not match.
A lot of this is version bumps, but go-containerregistry (the thing I used to create my tiny machineconfig wrapper images without a docker build) additionally drags in the following dependencies that nothing else uses:
- github.com/spf13/viper/
- github.com/containerd/stargz-snapshotter/
- github.com/docker/cli/
- github.com/docker/docker/pkg/homedir/

This isn't terrible, so I'm leaving it for now since it works, but
at some point this might get obviated by buildah or something
…ntroller

In cluster build/"build controller" proof-of-concept
Use MCONamespace constant in getPullSecret()
Move log statement to UpdateTuningArgs
@cgwalters
Member Author

This is based on #3060 and then a git merge layering from master.

@cgwalters
Member Author

/test e2e-gcp-op
/test e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci Bot commented Apr 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci Bot added the `approved` label Apr 7, 2022
Comment thread install/image-references
from:
kind: DockerImage
name: registry.svc.ci.openshift.org/openshift:machine-os-content
- name: rhel-coreos
Member Author

Hmm, actually doing this bit blocks on https://issues.redhat.com/browse/ART-3883

@cgwalters marked this pull request as ready for review April 7, 2022 18:12
@openshift-ci Bot removed the `do-not-merge/work-in-progress` label Apr 7, 2022
@cgwalters
Member Author

/hold

@openshift-ci Bot added the `do-not-merge/hold` label Apr 7, 2022
@openshift-ci Bot requested review from mkenigs and sinnykumari April 7, 2022 18:13
Member

@jkyros jkyros left a comment


Some of the cans I know I kicked:

- apiGroups: ["operator.openshift.io"]
resources: ["etcds"]
verbs: ["get", "list", "watch"]
- apiGroups: ["build.openshift.io"]
Member


I put in a bunch of sloppy RBAC that we need to be more deliberate about.

Member Author

Hmm, right I think we can actually make these bits a namespaced role, not a clusterrole. But not a blocker IMO.

apiextinformers.WithNamespace(targetNamespace), apiextinformers.WithTweakListOptions(assignFilterLabels))
configSharedInformer := configinformers.NewSharedInformerFactory(configClient, resyncPeriod()())
operatorSharedInformer := operatorinformers.NewSharedInformerFactory(operatorClient, resyncPeriod()())
imageSharedInformer := imageinformers.NewSharedInformerFactory(imageClient, resyncPeriod()())
Member

@jkyros jkyros Apr 7, 2022


Would like to think about whether we should filter this to our namespace, there's probably a ton of image traffic. (although if we feature gate this and never start the informer, it doesn't matter)

Member Author

That's a great example of a potential risk! Let's tighten that up indeed.

Member

The number of watches and any unusual increases are being monitored so this is a good callout 👍


glog.Infof("Pool %s is s special pool, rendering config to imagestream", pool.Name)
// jkyros: Just put this here for now since it's convenient
err = ctrl.experimentalRenderToImageStream(pool, generated)
Member

Are we cool with the "push to imagestream approach" or do we need to figure out a service endpoint and retrieve this from inside the build?

Member Author

For the purposes of this PR, I'd like to just evaluate "risks to not-layered".

Member

(It does add and version-bump some dependencies, but nothing that seemed overtly dangerous.)

@openshift-ci
Contributor

openshift-ci Bot commented Apr 7, 2022

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-op | b78d77a | link | true | /test e2e-gcp-op |
| ci/prow/e2e-agnostic-upgrade | b78d77a | link | true | /test e2e-agnostic-upgrade |
| ci/prow/bootstrap-unit | b78d77a | link | false | /test bootstrap-unit |
| ci/prow/images | b78d77a | link | true | /test images |
| ci/prow/e2e-aws | b78d77a | link | true | /test e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cheesesashimi
Member

There's a more intensive series of tests from TRT that we should run this PR against. I'll figure out how to do that and follow up once I know more.

Contributor

@yuqi-zhang yuqi-zhang left a comment


Did a general read. The changes seem pretty well gated behind a few annotations existing, etc.

A few notes:

  1. There's a decent amount of refactoring of our postConfigChange action code, which is a bit hard to read from the diff. I assume, though, that the idea is we don't actually change that flow, just how the functions are used, so we should be good as long as we can validate that.
  2. The changes above (and refactors to the update functions) will likely have a minor effect on the hypershift code, but we should be able to consolidate. Just something to consider when/if we merge.

Comment thread pkg/operator/sync.go
}
newformat, ok := cm.Data["baseOperatingSystemContainer"]
if !ok {
return "", "", fmt.Errorf("Missing baseOperatingSystemContainer from configmap")
Contributor

Is baseOperatingSystemContainer mandatory to have? On a non-layered system is it just empty (but the key exists?)

@openshift-ci
Contributor

openshift-ci Bot commented Apr 17, 2022

@cgwalters: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci Bot added the `needs-rebase` label Apr 17, 2022
@cgwalters
Member Author

cgwalters commented May 3, 2022

Just a note, I think we need to:

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci Bot added the `lifecycle/stale` label Aug 1, 2022
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci Bot added the `lifecycle/rotten` label and removed the `lifecycle/stale` label Sep 1, 2022
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci Bot closed this Oct 1, 2022
@openshift-ci
Contributor

openshift-ci Bot commented Oct 1, 2022

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels: approved · do-not-merge/hold · layering · lifecycle/rotten · needs-rebase
