Merge master into layering by cgwalters · Pull Request #3060 · openshift/machine-config-operator

cgwalters · 2022-04-05T16:46:14Z

Keeping up with things

The call to TempDir a few lines above already created this directory, so this call to MkdirAll is completely unnecessary

When we added the nodeip-configuration service for None platform deployments, we broke some existing users who were relying on the (largely undefined) previous behavior Kubelet used to select its node IP. While it is possible to work around this by overriding the node IP selection logic, that's very cumbersome and not an acceptable user experience.

This change adds a KUBELET_NODEIP_HINT env variable that can be used to override the default behavior of runtimecfg when selecting a node IP. When the variable is unset, the old behavior of selecting an address on the interface of the default route will take effect. When the variable is set, its value will be passed to runtimecfg like a VIP for the IPI platforms. This will cause runtimecfg to prefer an address in the same subnet as the one provided in KUBELET_NODEIP_HINT. If no such address is found, it will fall back to the default route logic as before.

KUBELET_NODEIP_HINT can be set using a systemd environment file. The file must be named /etc/default/nodeip-configuration with contents such as (replacing the IP as appropriate):

KUBELET_NODEIP_HINT=192.0.2.1

This file should be created using a machine-config manifest that is passed to the installer so it will take effect on initial deployment. The node IP cannot be changed after the node registers initially, so this cannot be done as a day 2 operation.

Note that the IP specified in the hint does not necessarily need to exist in the environment; it just needs to be in the correct subnet. No traffic will be sent to this address.

Co-authored-by: Dan Winship <danwinship@redhat.com>
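The selection rule described in this commit message can be sketched roughly as follows. This is only an illustration, not runtimecfg's actual code: `pickNodeIP`, its parameters, and the CIDR-string representation of local addresses are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"net"
)

// pickNodeIP is a hypothetical helper, not part of runtimecfg.
// addrs holds local addresses in CIDR form (for example "192.0.2.10/24"),
// hint is the value of KUBELET_NODEIP_HINT (may be empty), and
// defaultRouteIP is the address on the default-route interface.
func pickNodeIP(addrs []string, hint, defaultRouteIP string) string {
	hintIP := net.ParseIP(hint)
	if hintIP == nil {
		// Unset or unparsable hint: keep the old default-route behavior.
		return defaultRouteIP
	}
	for _, a := range addrs {
		ip, subnet, err := net.ParseCIDR(a)
		if err != nil {
			continue // skip malformed entries in this sketch
		}
		// Prefer an address whose subnet contains the hint. The hint
		// itself does not have to be assigned to any interface.
		if subnet.Contains(hintIP) {
			return ip.String()
		}
	}
	// No local address shares a subnet with the hint: fall back.
	return defaultRouteIP
}

func main() {
	addrs := []string{"10.0.0.5/24", "192.0.2.10/24"}
	fmt.Println(pickNodeIP(addrs, "192.0.2.1", "10.0.0.5"))
	fmt.Println(pickNodeIP(addrs, "", "10.0.0.5"))
}
```

In this sketch the hint only steers which subnet wins, which matches the note that no traffic is ever sent to the hinted address.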

The machine config controller did not previously have a metrics handler, so one must be added in order for us to do any alerting/metrics work. This requires setting up:
- Cluster Roles
- Cluster Role Bindings
- ServiceMonitor for metrics
- Service for metrics
- oauth-proxy sidecar to deployment for machine-config-controller
- mcc-proxy-tls secret for machine-config-controller
- metrics handler function in machine-config-controller common

I cribbed off of: 557303f
And then to add oauth: 3ab692f

Adds certificate helper functions to:
- extract certificates from PEM bundles
- find the certificate that has the latest expiry date when provided a list
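For readers unfamiliar with the pattern, the two helpers could look something like the sketch below. The names `certsFromPEM`, `latestExpiry`, and `selfSignedPEM` are hypothetical and are not taken from the MCO source; `selfSignedPEM` exists only to give the example some input.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// certsFromPEM extracts every CERTIFICATE block from a PEM bundle.
func certsFromPEM(bundle []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	for {
		var block *pem.Block
		block, bundle = pem.Decode(bundle)
		if block == nil {
			return certs, nil // no more PEM blocks
		}
		if block.Type != "CERTIFICATE" {
			continue // ignore keys and other block types
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, err
		}
		certs = append(certs, cert)
	}
}

// latestExpiry returns the certificate with the latest NotAfter,
// or nil when the list is empty.
func latestExpiry(certs []*x509.Certificate) *x509.Certificate {
	var latest *x509.Certificate
	for _, c := range certs {
		if latest == nil || c.NotAfter.After(latest.NotAfter) {
			latest = c
		}
	}
	return latest
}

// selfSignedPEM builds a throwaway self-signed certificate for the demo.
func selfSignedPEM(cn string, validDays int) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(int64(validDays)),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(0, 0, validDays),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	short, err := selfSignedPEM("short", 1)
	if err != nil {
		panic(err)
	}
	long, err := selfSignedPEM("long", 30)
	if err != nil {
		panic(err)
	}
	certs, err := certsFromPEM(append(short, long...))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(certs), latestExpiry(certs).Subject.CommonName)
}
```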

Adds functionality to the node controller such that:
1.) when a paused machine config pool attempts to sync
2.) if the kubelet-ca has been updated in the pool's 'spec' config
3.) the MCC will set a metric to the NotAfter date of the kube-apiserver-to-kubelet-signer certificate
4.) once the pool is unpaused, that metric will be reset to zero
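The rule in those four steps boils down to a small decision, sketched here with hypothetical, simplified types. The `pool` struct and `pausedCAExpiryMetric` are illustrative stand-ins rather than the controller's real types, and representing the NotAfter date as Unix seconds is an assumption.

```go
package main

import "fmt"

// pool is a hypothetical, simplified stand-in for a MachineConfigPool.
type pool struct {
	Paused           bool
	KubeletCAChanged bool  // kubelet-ca differs in the pool's spec config
	SignerNotAfter   int64 // NotAfter of the signer cert, Unix seconds (assumed encoding)
}

// pausedCAExpiryMetric returns the value a gauge would be set to on sync:
// the signer's expiry while a paused pool is holding back a CA update,
// and zero otherwise (including once the pool is unpaused).
func pausedCAExpiryMetric(p pool) float64 {
	if p.Paused && p.KubeletCAChanged {
		return float64(p.SignerNotAfter)
	}
	return 0
}

func main() {
	p := pool{Paused: true, KubeletCAChanged: true, SignerNotAfter: 1700000000}
	fmt.Println(pausedCAExpiryMetric(p))
	p.Paused = false
	fmt.Println(pausedCAExpiryMetric(p))
}
```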

The testutil package from the Prometheus client is used in the node_controller tests, so it needed to be added as a dependency. Commands run:
```
$ go mod tidy
$ go mod vendor
$ make verify
```

Adds an e2e test that steps through the rotation of the kube-apiserver-to-kubelet-signer by:
- pausing a pool
- rotating the certificate
- checking that the proper metric is emitted
- unpausing the pool
- checking that the metric stops being emitted

Node controller now requires a MachineConfigInformer as part of its New() function, updates bootstrap_tests to match

As we now tear down and reconfigure br-ex on every reboot, we must provide a means to stabilize interface selection in scenarios with multiple default route interfaces. Signed-off-by: Andreas Karis <ak.karis@gmail.com>

Signed-off-by: Andreas Karis <ak.karis@gmail.com>

Update controllerconfig CRD and relevant switch statements in pkg to handle Nutanix platform. Also update install/0000_80_machine-config-operator_00_namespace.yaml to add `openshift-nutanix-infra` to the list of namespaces.

Right now Fedora doesn't ship Go 1.17, only Go 1.18beta. That version emits a different error message for incompatible TLS versions. Adjust our unit test to handle both. (Also, a motivation for me is to cross-check the new CI configuration after openshift/release#27015.)

server/api_test: Adjust expected error message for Go 1.18

Created MCONamespace constant and used in all *.go files except for test/helpers/utils.go which would create a cyclic import

…-certificate Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool

Remove the restriction on the runtime-request-timeout option in the kubeletconfig. Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>

…nodes in paused pool"

…-74-controller-alert-certificate Revert "Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool"

…2802-mco-74-controller-alert-certificate" This reverts commit b80e6a1, reversing changes made to 57267b7. This "un-reverts" the reversion so we can put PR 2802 back in with the fix to resourcemerge.

Resourcemerge did not previously merge a container's Resources.Requests in ensureContainer(), which meant that in upgrade cases where we update the container object directly with changes (instead of applying/re-applying the manifests), Resources.Requests changes would not propagate to the updated object. This makes ensureContainer update Resources.Requests if it has changed, which keeps that structure from getting stripped off when we update. (This will keep us from failing tests, since at least cpu and memory in that structure are required fields.)
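A rough sketch of the merge step described above, using hypothetical simplified types: the `container` struct and `ensureRequests` here are illustrative only and do not reproduce the real resourcemerge signatures or the Kubernetes API types.

```go
package main

import (
	"fmt"
	"reflect"
)

// container is a hypothetical, pared-down stand-in for a container spec;
// Requests maps a resource name (such as "cpu") to a quantity string.
type container struct {
	Name     string
	Requests map[string]string
}

// ensureRequests copies required.Requests onto existing when they differ
// and records that a change was made, in the style of a merge helper.
func ensureRequests(modified *bool, existing *container, required container) {
	if len(existing.Requests) == 0 && len(required.Requests) == 0 {
		return // nil and empty maps are treated as equal
	}
	if reflect.DeepEqual(existing.Requests, required.Requests) {
		return
	}
	existing.Requests = make(map[string]string, len(required.Requests))
	for k, v := range required.Requests {
		existing.Requests[k] = v
	}
	*modified = true
}

func main() {
	modified := false
	existing := container{Name: "mcc", Requests: map[string]string{"cpu": "10m"}}
	required := container{Name: "mcc", Requests: map[string]string{"cpu": "20m", "memory": "50Mi"}}
	ensureRequests(&modified, &existing, required)
	fmt.Println(modified, existing.Requests)
}
```

Copying the map rather than aliasing it keeps later changes to the required object from silently mutating the existing one.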

Make our resourcemerge fork update a container's Resources.Requests, un-revert openshift#2802

This will keep layered and non-layered update logging consistent

bootstrap_test.go: remove unused constants

The main motivation here is to work around coreos/rpm-ostree#3523 (which is itself a workaround for a RHEL8 systemd bug). Basically, this e2e is invoking `rpm-ostree kargs` in a pretty tight loop, which triggers that bug. To read the kernel command line, we can just read `/proc/cmdline` instead. (Now, this is the *actual* cmdline instead of just rpm-ostree's view of it, but it should be fine.)
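As a minimal sketch of the approach, a check for a kernel argument can be done by splitting the contents of `/proc/cmdline` on whitespace. The helper name `hasKarg` is hypothetical, and the e2e test itself may read the file differently (for example, through a command run on the node).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasKarg reports whether karg appears as a whole argument in cmdline.
func hasKarg(cmdline, karg string) bool {
	for _, arg := range strings.Fields(cmdline) {
		if arg == karg {
			return true
		}
	}
	return false
}

func main() {
	// /proc/cmdline holds the running kernel's actual command line.
	data, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		fmt.Println("could not read /proc/cmdline:", err)
		return
	}
	fmt.Println(hasKarg(string(data), "ro"))
}
```

Matching whole fields avoids false positives from arguments that merely share a prefix.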

…latform Add Nutanix Platform to Machine Config Operator

Today, typing `make` does nothing, which is not very useful. By listing this rule first, `make` will default to `make binaries`.

Fix description typo in osImageURL CRD parameter

e2e: Use `/proc/cmdline` instead of `rpm-ostree kargs`

openshift-ci · 2022-04-05T17:01:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kikisdeliveryservice · 2022-04-05T19:21:51Z

retesting all of these (single node didnt bootstrap)
/retest

kikisdeliveryservice · 2022-04-05T23:33:13Z

ok ignore the single node comment above bc they apparently never worked in this branch, but the gcp-op failures seem.. not like flakes? for ex:

        	Messages:   	failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/01-ctrcfg-logLevel] on node ci-op-5zk9gyv0-b7394-v48gf-worker-c-kgchx: 
....

kikisdeliveryservice · 2022-04-05T23:33:59Z

also that bot report above seems wrong?
/skip

cgwalters · 2022-04-07T17:34:36Z

/retest

cgwalters · 2022-04-07T17:52:34Z

/retest

cgwalters · 2022-04-07T18:18:15Z

That's...weird, it's like that job somehow lost our override adding rhel-coreos in CI. I wonder if there's some confusion based on the git merges causing the CI setup to be reused across master/layering branches?
/test images

cgwalters · 2022-04-07T18:39:06Z

/retest

cgwalters · 2022-04-13T00:15:33Z

/retest

openshift-ci · 2022-04-13T02:09:24Z

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-op-single-node | `b78d77a` | link | false | `/test e2e-gcp-op-single-node` |
| ci/prow/okd-images | `fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5` | link | false | `/test okd-images` |
| ci/prow/verify | `fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5` | link | true | `/test verify` |
| ci/prow/unit | `fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5` | link | true | `/test unit` |
| ci/prow/bootstrap-unit | `fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5` | link | false | `/test bootstrap-unit` |
| ci/prow/e2e-agnostic-upgrade | `fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5` | link | true | `/test e2e-agnostic-upgrade` |
| ci/prow/e2e-gcp-single-node | `b78d77a` | link | false | `/test e2e-gcp-single-node` |
| ci/prow/e2e-gcp-op | `b78d77a` | link | true | `/test e2e-gcp-op` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

cgwalters · 2022-04-29T16:41:52Z

OK let's do #3126 first, then I think it may make sense to instead force-rebase layering on master.

cgwalters · 2022-05-03T16:44:41Z

Trying a rebase of layering on current master, there are some conflicts to work through. Split out one bit in #3133

openshift-bot · 2022-08-01T19:27:18Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-merge-robot · 2022-08-01T19:27:27Z

@cgwalters: PR needs rebase.


openshift-bot · 2022-09-01T00:30:27Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2022-10-01T08:00:35Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2022-10-01T08:01:19Z

@openshift-bot: Closed this PR.


mkenigs and others added 30 commits December 23, 2021 14:25

MCD: remove redundant MkdirAll call in update.go

1afbe15

The call to TempDir a few lines above already created this directory, so this call to MkdirAll is completely unnecessary

common/helpers: add certificate functions

35b4e81

Adds certificate helper functions to:
- extract certificates from PEM bundles
- find the certificate that has the latest expiry date when provided a list

Update vendor/modules with prometheus testutil

dc42079

The testutil package from the Prometheus client is used in the node_controller tests, so it needed to be added as a dependency. Commands run:
```
$ go mod tidy
$ go mod vendor
$ make verify
```

test/e2e-boostrap: node controller mcLister

f371cf0

Node controller now requires a MachineConfigInformer as part of its New() function, updates bootstrap_tests to match

configure-ovs.sh: Provide store hint for default route interface

a8754fa

As we now tear down and reconfigure br-ex on every reboot, we must provide a means to stabilize interface selection in scenarios with multiple default route interfaces. Signed-off-by: Andreas Karis <ak.karis@gmail.com>

configure-ovs-network: Use lower metric for br-ex than for br-ex1

95ec36a

Signed-off-by: Andreas Karis <ak.karis@gmail.com>

Add Nutanix Platform to Machine Config Operator

d2b2442

Update controllerconfig CRD and relevant switch statements in pkg to handle Nutanix platform. Also update install/0000_80_machine-config-operator_00_namespace.yaml to add `openshift-nutanix-infra` to the list of namespaces.

bootstrap_test.go: remove unused constants

1ca9adc

Merge pull request openshift#3019 from cgwalters/go118-api-unit

d4b1a8c

server/api_test: Adjust expected error message for Go 1.18

Create MCONamespace constant

943350e

Created MCONamespace constant and used in all *.go files except for test/helpers/utils.go which would create a cyclic import

Merge pull request openshift#2802 from jkyros/mco-74-controller-alert…

57267b7

…-certificate Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool

Remove runtime request timeout restriction

b326856

Remove the restriction on the runtime-request-timeout option in the kubeletconfig. Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>

Revert "Send alert when MCO can't safely apply updated Kubelet CA on …

6144a92

…nodes in paused pool"

Merge pull request openshift#3027 from DennisPeriquet/revert-2802-mco…

b80e6a1

…-74-controller-alert-certificate Revert "Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool"

Revert "Merge pull request openshift#3027 from DennisPeriquet/revert-…

a0c0b2e

…2802-mco-74-controller-alert-certificate" This reverts commit b80e6a1, reversing changes made to 57267b7. This "un-reverts" the reversion so we can put PR 2802 back in with the fix to resourcemerge.

Fix description typo in osImageURL CRD parameter

52c1a5b

Merge pull request openshift#3028 from jkyros/unrevert-pr-2802

5ad20c3

Make our resourcemerge fork update a container's Resources.Requests, un-revert openshift#2802

Move log statement to UpdateTuningArgs

0e37c4a

This will keep layered and non-layered update logging consistent

Merge pull request openshift#3023 from mkenigs/unused-constants

5070577

bootstrap_test.go: remove unused constants

Merge pull request openshift#2942 from nutanix-cloud-native/nutanix-p…

fce8f7c

…latform Add Nutanix Platform to Machine Config Operator

build-sys: Default to make binaries

41100ba

Today, typing `make` does nothing, which is not very useful. By listing this rule first, `make` will default to `make binaries`.

Merge pull request openshift#3029 from javipolo/fix_crd_description_typo

d94d193

Fix description typo in osImageURL CRD parameter

Merge pull request openshift#3034 from cgwalters/config-drift-no-kargs

0528d71

e2e: Use `/proc/cmdline` instead of `rpm-ostree kargs`

openshift-ci Bot requested a review from sinnykumari April 5, 2022 16:46

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 5, 2022

cgwalters changed the base branch from master to layering April 5, 2022 16:47

Merge branch 'master' into layering

b78d77a

cgwalters force-pushed the layering branch from fbcbb7e to b78d77a Compare April 5, 2022 16:59

cgwalters mentioned this pull request Apr 7, 2022

Merge layering into master #3072

Closed

cgwalters marked this pull request as draft April 29, 2022 16:42

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2022

openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2022

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 1, 2022

openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 1, 2022

openshift-ci Bot closed this Oct 1, 2022


14 participants