Start of centralizing node writes via nodeWriter #3141

Closed
cgwalters wants to merge 7 commits into openshift:master from cgwalters:nodewriter-rework

Conversation

@cgwalters
Member

Part of #3135


daemon: Have nodeWriter maintain ref to lister and node name

The NodeWriter is effectively an instance of the "actor" pattern;
it processes messages to update its internal state.

However, it's a bit redundant to have every update take the node
lister and node name - we only use this on the daemon side,
where there is exactly one node we will be updating.

Have the clusterNodeWriter instance of this interface cache
references to that data and avoid passing it on every message.

Prep for reworking the NodeWriter to have its own clientset.
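
A minimal sketch of the shape this commit describes, with hypothetical names (`clusterNodeWriter` and `writeRequest` here are illustrative, not the exact MCO types): the writer captures the lister and node name once at construction, so messages no longer carry them.

```go
package daemon

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeLister stands in for the cached informer lister (assumed shape).
type nodeLister interface {
	Get(name string) (*corev1.Node, error)
}

// writeRequest no longer carries the lister or node name.
type writeRequest struct {
	annotations map[string]string
	result      chan error
}

type clusterNodeWriter struct {
	nodeName string     // cached once at construction
	lister   nodeLister // cached once at construction
	requests chan writeRequest
}

func newClusterNodeWriter(nodeName string, lister nodeLister) *clusterNodeWriter {
	return &clusterNodeWriter{
		nodeName: nodeName,
		lister:   lister,
		requests: make(chan writeRequest),
	}
}

// Run is the actor loop: all node mutations are funneled through this
// single goroutine, which processes messages serially.
func (w *clusterNodeWriter) Run(stop <-chan struct{}) {
	for {
		select {
		case req := <-w.requests:
			node, err := w.lister.Get(w.nodeName)
			if err != nil {
				req.result <- err
				continue
			}
			// A real implementation would patch via a clientset here.
			fmt.Printf("would update annotations on node %s: %v\n", node.Name, req.annotations)
			req.result <- nil
		case <-stop:
			return
		}
	}
}

// SetAnnotations is the message-sending side; callers pass only the payload.
func (w *clusterNodeWriter) SetAnnotations(annos map[string]string) error {
	result := make(chan error)
	w.requests <- writeRequest{annotations: annos, result: result}
	return <-result
}
```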


controller: Export default resync period function

So I can use this from the daemon code.
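
A hedged sketch of what the export could look like; the jittered-interval pattern is common in Kubernetes controllers, and the exact name and base duration here are assumptions, not taken from this PR.

```go
package common

import (
	"math/rand"
	"time"
)

// DefaultResyncPeriod returns a function producing a jittered resync
// interval, so informers in different components don't all resync at once.
func DefaultResyncPeriod() func() time.Duration {
	return func() time.Duration {
		base := 24 * time.Hour
		factor := rand.Float64() + 1 // uniform in [1.0, 2.0)
		return time.Duration(float64(base.Nanoseconds()) * factor)
	}
}
```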


daemon: Have nodewriter use kubelet credentials


daemon: Change hypershift path to use nodewriter

Ensure that this code path is using kubelet credentials.


cgwalters added 2 commits May 9, 2022 19:29
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2022
@cgwalters
Member Author

Digging into this, there's a fair amount of stuff that's needed:

  • Move all Event notifications to use nodeWriter
  • Fix up a few places that are calling setAnnotations directly to use nodeWriter
  • Drop privs from the RBAC

However, I think after that it would really make the most sense to change nodeWriter into "nodeActor" and have it also be the sole thing providing read access to the node object for the rest of the MCD. This would obviously touch more code. But one immediate thing it would clean up is that today the MCD (I think) watches all nodes via informers, which is clearly just wrong. We should only watch our own node, and then the nodeActor provides a "node changed" event to the rest of the MCD.
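
For the "watch only our own node" piece, a sketch under the assumption that client-go's shared informer factory is in use; the helper name is made up, but the field-selector option is the standard mechanism:

```go
package daemon

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newSingleNodeInformerFactory restricts the node informer's list/watch
// to exactly one node instead of every node in the cluster.
func newSingleNodeInformerFactory(client kubernetes.Interface, nodeName string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(
		client,
		time.Hour, // resync period; arbitrary for this sketch
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = fields.OneTermEqualSelector("metadata.name", nodeName).String()
		}),
	)
}
```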

Ensure that this code path is using kubelet credentials.
@cgwalters cgwalters marked this pull request as draft May 10, 2022 00:49
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 10, 2022
@cgwalters
Member Author

Ah yes, I forgot to start the informer; classic. Will look at fixing that up.

@cgwalters cgwalters force-pushed the nodewriter-rework branch from 9b7f193 to b8683ac Compare May 10, 2022 15:53
@cgwalters
Member Author

/test e2e-gcp-op

@cgwalters
Member Author

Does this direction make sense to everyone?

@cgwalters
Member Author

To back up to a high level on this - ideally we want to use the kubelet credentials for everything, but the problem is that kubelet can't read e.g. MachineConfig objects today. One option is to just add that power - then we use kubelet creds for everything. It's a bit tempting.

This patch splits our node mutations out to the nodeWriter, which uses kubelet credentials; our service account is then mostly just there to read machineconfigs (and our node - though a future enhancement to this work could also change the nodeWriter into a daemonNodeController or something that mediates access to the node).
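
A sketch of the "separate clientset with kubelet credentials" idea; the kubeconfig path is the conventional one on OpenShift nodes but is an assumption here, as is the function name:

```go
package daemon

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// kubeletKubeconfigPath is the conventional location on OpenShift nodes
// (assumed here, not taken from this PR).
const kubeletKubeconfigPath = "/var/lib/kubelet/kubeconfig"

// newKubeletClientset builds a clientset that authenticates as the node's
// kubelet, for use only by the nodeWriter.
func newKubeletClientset() (kubernetes.Interface, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeletKubeconfigPath)
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(config)
}
```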

@cgwalters
Member Author

I do kind of feel the right long-term direction is to change things so that we unify the hypershift and self-driving cases.

Probably we should move away from a daemonset for self-driving too. Once we do that, it then becomes natural to e.g. put machineconfig in a configmap (or really, a secret) like hypershift is doing. Then the daemon/pod doesn't need any special roles.

Things like ssh access monitoring and config drift detection could instead be implemented with a periodically scheduled pod rather than a daemonset (although at the cost of polling versus being event driven).

Another possible in-between model is daemonset-per-pool - there we can mount the current/desired machineconfig into the pods specifically, rather than having the daemonset fetch them.

@cgwalters
Member Author

> One option is to just add that power - then we use kubelet creds for everything. It's a bit tempting.

I looked at this briefly; I think it'd require us to carry a patch to the apiserver to do that. Definitely ugly. I think I'd lean towards going this direction instead, but open to debate.

@cgwalters
Member Author

/approve cancel
Per my SOP for big changes: big PRs should get another approver and an lgtm.

@openshift-ci openshift-ci Bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2022
@kikisdeliveryservice
Contributor

/test unit
/test verify

Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment


As a first pass (without reviewing the discussion point on the PR), I think this is a solid optimization that yields more readable and centralized code. It always felt odd to have to pass node names through daemon functions because, well... shouldn't it know where it is? 😆

Comment thread cmd/machine-config-daemon/start.go Outdated
Contributor


q: there's a similar block above, but that one leads to a ctrl.common.WriteTerminationError while this one is a fatalf with a write-termination error for the block below. Is this consistent? (It might be; I might be misunderstanding.)

Member Author


(Forgot to look at this one, will do in the next pass)

@cgwalters
Member Author

OK cool. I took the first two commits and put them in #3143 - I think we can get that in now.

I'll add some more work on top of this one for the followup listed above.

@cgwalters cgwalters force-pushed the nodewriter-rework branch from b8683ac to 2ba5dd8 Compare May 11, 2022 20:54
We should be able to rely on `nodeWriter` using kubelet credentials
now.
@cgwalters
Member Author

/test unit
/test verify
/test e2e-gcp-op

@cgwalters
Member Author

/retest

I originally missed that `loadAnnotations` was actually calling
this API, which should have been internal to nodeWriter.

But when I went to fix it, I hit the problem that the code flow
in the daemon relied on getting an updated node object back.

Fixing that required some code churn to also pass the mutated
node object back through the response message.

I think what we *really* want is to wait for the informer
to update and pass that back instead, which then leads
to making this stuff part of a control loop on the daemon
rather than blocking randomly here.

For now though, let's keep the daemon code as is and just change
this bit to use nodeWriter for kubelet credentials.
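
Illustrating the "pass the mutated node back through the response message" shape this commit message describes; the type and field names here are hypothetical:

```go
package daemon

import corev1 "k8s.io/api/core/v1"

// setAnnotationsResponse carries the mutated node back to the caller, so
// daemon code that relied on receiving an updated node object keeps working.
type setAnnotationsResponse struct {
	node *corev1.Node
	err  error
}

type setAnnotationsRequest struct {
	annotations map[string]string
	response    chan setAnnotationsResponse
}

type annotationWriter struct {
	requests chan setAnnotationsRequest
}

// SetAnnotations blocks on the actor's reply and returns the node object
// as the API server saw it after the patch.
func (w *annotationWriter) SetAnnotations(annos map[string]string) (*corev1.Node, error) {
	respCh := make(chan setAnnotationsResponse)
	w.requests <- setAnnotationsRequest{annotations: annos, response: respCh}
	resp := <-respCh
	return resp.node, resp.err
}
```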
@cgwalters cgwalters marked this pull request as ready for review May 11, 2022 22:16
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2022
@cgwalters
Member Author

OK, let's see how this goes in a CI run; only compile tested locally.

Contributor

@yuqi-zhang yuqi-zhang left a comment


Approving the general direction. Also rebased #3135 onto this

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2022
@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 11, 2022

level=fatal msg=failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Quota Check": error(MissingQuota): compute.googleapis.com/n2_cpus is not available in us-central1 because the required number of resources (12) is more than remaining quota of 8

Trying this again; will report if I see the same failure.
/test e2e-gcp-op

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented May 11, 2022

new error, will report:

level=error msg=Error: Error waiting for instance to create: Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region us-central1.
level=error

Reported; it's an issue with some deprovisioning hitting us today.

@cgwalters
Member Author

cgwalters commented May 12, 2022

> {Operator progressing (ProfileProgressing): Waiting for 1/6 Profiles to be applied Operator progressing (ProfileProgressing): Waiting for 1/6 Profiles to be applied}

Hmm, this was a bit concerning at first because I thought NTO might be generating some machineconfig, but AFAICS that isn't the case. I looked at their logs but didn't see the problem.
/test e2e-agnostic-upgrade

EDIT: OK, looks like this is a common failure

@yuqi-zhang
Contributor

The installer PR merged. Let's try

/test e2e-gcp-op

@sinnykumari
Contributor

/test e2e-gcp-op

@openshift-ci
Contributor

openshift-ci Bot commented May 13, 2022

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-agnostic-upgrade | de3fa84 | link | true | /test e2e-agnostic-upgrade |
| ci/prow/e2e-gcp-op | de3fa84 | link | true | /test e2e-gcp-op |


@cgwalters
Member Author

E0513 12:40:14.615182 1711 writer.go:199] Marking Degraded due to: failed to cordon node (5 tries): [timed out waiting for the condition, cordon error: nodes "ci-op-bsr23qd1-1354f-4tpmg-worker-b-6dh2l" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:machine-config-daemon" cannot patch resource "nodes" in API group "" at the cluster scope]

OK, I think we need to take this PR together with #3135

@cgwalters cgwalters closed this May 13, 2022