pkg/daemon: add a current config on disk check by runcom · Pull Request #612 · openshift/machine-config-operator

runcom · 2019-04-09T13:40:31Z

Signed-off-by: Antonio Murdaca <runcom@linux.com>

- What I did

Add an on-disk fingerprint of what our current config is on a node (avoiding resync if we changed it or restored from an etcd backup). We check what we have on disk against what we have in the annotations before triggering a sync.

- How to verify it

- Description for the changelog

runcom · 2019-04-09T13:41:04Z

/hold

Was just discussing this with @derekwaynecarr but thought to start experimenting if this is needed/working

runcom · 2019-04-09T15:34:49Z

this is wrong (note to self)

cgwalters · 2019-04-09T20:47:06Z

This isn't a fingerprint, it's the whole config right? I find this idea that we serialize all of the files back into JSON as a file in the filesystem kind of funny. But I guess it's not a huge amount of data...

$ oc get -o yaml machineconfigs/rendered-worker-74e29c89eb37d7aaa30c33c481fae174 | gzip | wc -c

10197

We can certainly assume 10k of data is free.

On the kind-of bikeshed topic, I think /var is probably a better place for this. The ostree model is that /etc is data humans might edit (e.g. "ssh to node and vi").

fixed naming and moved under /var/machine-config-daemon

cgwalters · 2019-04-09T21:00:53Z

Don't we need to do this before we load currentConfig?

uhm, we may, I can't see why though, would you explain (I feel dumb)

If the current config isn't available in the apiserver, won't the Get above fail before we can find our cached version?

oh gosh yeah

No wait, the scenario here is that after a restore we still have something in the apiserver, which is really what we want on the machine itself, if current isn't on the apiserver, there's not much we can do anyway (which is the BZ we got from David yesterday but this PR isn't tackling that).

This is something I wrote to understand this better:

- current=configA == desired=configA
- upgrade: current=configA != desired=configB (desired being generated by upgrade and it's being applied as new current)
- roll out new configs and MCD syncs: current=configB == desired=configB
- ...
- trigger a backup restore pre-upgrade which isn't calling informers: current=configA == desired=configA but now what's on disk is still configB
- we now need something which retriggers the MCD to sync to configA

configA is what we need anyway, so it must be in the apiserver after a restore
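The restore scenario above can be walked through with a small model. This is only a sketch under assumptions: `nodeState` and both check functions are hypothetical, and `annotationsSaySync` is a simplified stand-in for an annotation-only trigger, not the MCD's actual logic.

```go
package main

import "fmt"

// nodeState is a hypothetical model of the three places a config name lives.
type nodeState struct {
	annotationCurrent string // currentConfig annotation on the node
	annotationDesired string // desiredConfig annotation on the node
	onDisk            string // config actually applied to the machine
}

// annotationsSaySync fires only when the two annotations differ.
func annotationsSaySync(s nodeState) bool {
	return s.annotationCurrent != s.annotationDesired
}

// diskSaysSync fires when the machine's real state has drifted from what
// the current-config annotation claims.
func diskSaysSync(s nodeState) bool {
	return s.onDisk != s.annotationCurrent
}

func main() {
	timeline := []struct {
		label string
		state nodeState
	}{
		{"steady state", nodeState{"configA", "configA", "configA"}},
		{"upgrade rendered", nodeState{"configA", "configB", "configA"}},
		{"rollout done", nodeState{"configB", "configB", "configB"}},
		{"after restore", nodeState{"configA", "configA", "configB"}},
	}
	for _, step := range timeline {
		fmt.Printf("%-17s annotations=%v disk=%v\n",
			step.label, annotationsSaySync(step.state), diskSaysSync(step.state))
	}
}
```

In the last row the annotations agree with each other, so only the disk comparison notices that the machine is still on configB.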

can i ask: when is a backup restore triggered? i feel like i understand what this pr is supposed to do, but not fully why?

there's still work ongoing by others to spec out how/when/where an etcd backup is performed (and a subsequent restore)

@cgwalters can you check my answer above? I think this is working as intended but I'm waiting on a way to test this out

runcom · 2019-04-16T15:05:12Z

This is good to review and get in - as it's treated as a bug. The testing goes later once the other teams/steps are complete

kikisdeliveryservice · 2019-04-16T20:52:07Z

aws route errors:
/test e2e-aws-op

kikisdeliveryservice · 2019-04-16T21:01:55Z

I know that this is a bug, but I can't find a BZ/something else for it? Could we add Bug to title?

runcom · 2019-04-16T21:04:40Z

> I know that this is a bug, but I can't find a BZ/something else for it? Could we add Bug to title?

there's no BZ for this, came out from a conversation with Derek on Slack :( not sure we need one tho

kikisdeliveryservice · 2019-04-16T21:09:16Z

@runcom cool! just wanted to make sure I didn't miss it somewhere.

kikisdeliveryservice · 2019-04-16T21:10:21Z

weird test didn't re-run

/retest

cgwalters · 2019-04-16T22:05:27Z

So I read your comment a few times...I am finding it really hard to capture the flow/state of things in my head right now.

Edit: ignore this
It feels like this would all be a lot simpler to reason about if basically we had /var/machine-config-daemon/config<hash>.json and we kept those for every config version that was relevant to us (current, desired) and used it if it was present instead of talking to the api server.

~~This would also help address the other BZ about having things be deleted from the apiserver.~~

That said...after thinking about this more I think you're right, it will work. What I was worried about is "if we're saying currentConfig is desiredConfig, how do we then actually honor the real desiredConfig?". But if they are different then we'll do a config transition and then on the next boot we should transition to the real desiredConfig?

OK here's another way to look at it - we're basically saying we can't trust our annotations to describe our current state, the real state is a file on disk. So why are we looking at the annotation at all, versus setting it to the right thing if it's different from what's on disk?

> So why are we looking at the annotation at all, versus setting it to the right thing if it's different from what's on disk?

Right, I had the same thought indeed, assuming we're talking about the currentConfig (desiredConfig has to come from somewhere/apiserver otherwise we can't really progress right?). So I guess a natural follow up to this would be to avoid talking to the apiserver when querying the currentConfig and just use what's reflected on disk, right?

> That said...after thinking about this more I think you're right, it will work. What I was worried about is "if we're saying currentConfig is desiredConfig, how do we then actually honor the real desiredConfig?". But if they are different then we'll do a config transition and then on the next boot we should transition to the real desiredConfig?

right, that's my understanding of what will happen if there's a real drift between currentOnAPI vs currentOnDisk.

Basically, the first sync would reconcile currentOnAPI with currentOnDisk, after we're done with that, there will be another immediate sync to honor desiredConfig (which, in our disaster recovery story is gonna be almost always the same as currentOnAPI, unless you backup an etcd with current != desired at the time you actually backup, right?)
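The two-step reconciliation described here can be sketched as a plan of transitions. Everything below is hypothetical naming for illustration (`transition`, `conservativePlan`); it models the reasoning in the comment, not the daemon's code.

```go
package main

import "fmt"

// transition is a hypothetical record of one sync from one config to another.
type transition struct {
	from, to string
}

// conservativePlan first reconciles what is on disk with currentOnAPI and
// only then moves on to desired, skipping any step that would be a no-op.
func conservativePlan(currentOnDisk, currentOnAPI, desired string) []transition {
	var plan []transition
	at := currentOnDisk
	if at != currentOnAPI {
		plan = append(plan, transition{at, currentOnAPI})
		at = currentOnAPI
	}
	if at != desired {
		plan = append(plan, transition{at, desired})
	}
	return plan
}

func main() {
	// Usual disaster-recovery case: the backup had current == desired,
	// so a single sync is enough.
	fmt.Println(conservativePlan("configB", "configA", "configA"))

	// Backup taken while current != desired: two syncs are needed.
	fmt.Println(conservativePlan("configC", "configA", "configB"))
}
```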

> It feels like this would all be a lot simpler to reason about if basically we had /var/machine-config-daemon/config<hash>.json and we kept those for every config version that was relevant to us (current, desired) and used it if it was present instead of talking to the api server.

I'm not sure how we could keep desired on disk 🤔 that always has to come from the apiserver, right? I feel I'm missing something

or are we saying that in this case the currentConfig is the desired from the fingerprint? would it go from fingerprint->current-> desired? or....?

is there any case where if we have fingerprint, current and desired we would not want to ultimately end up in the desiredConfig?

> fingerprint->current-> desired? or....?

exactly that but in the DR case, fingerprint=configB,current=configA,desired=configA. So it's just one real sync+reboot

> is there any case where if we have fingerprint, current and desired we would not want to ultimately end up in the desiredConfig?

nope, there shouldn't be any such case afaict. Desired is always evaluated at sync+1 if fingerprint and current drift

One thing bugging me too is...why don't we

return fingerprintMC, desiredConfig, nil ?

Uhm, yeah, so using my flow from above comments, would that work in this case where I take a snapshot exactly in the middle of a sync where current != desired (cause yolo!):

- current=configA != desired=configB
- snapshot
- upgrade: current=configB != desired=configC
- roll out new configs and MCD syncs: current=configC == desired=configC
- ...
- trigger a backup restore pre-upgrade which isn't calling informers: current=configA != desired=configB but now what's on disk is still configC

should we first go from configC on disk to configA on disk, and only then configB? Your code would go straight from configC on disk to configB (which avoids a sync, but should we anyway or do we risk missing something?)

> (which avoids a sync, but should we anyway or do we risk missing something?)

I am not sure what we'd miss. It sounds like you're trying to be more conservative here just on general principle? I don't object to that to be clear. But I don't understand how this would be different from any other config change.

yup, it's really me being just conservative indeed, I guess it would be fine anyway to avoid a sync, so yeah, changing
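The outcome agreed on here, treating the on-disk config as current and going straight to desired, might look roughly like the sketch below. `resolveConfigs` is a hypothetical helper working on config names only; it merely mirrors the shape of the `return fingerprintMC, desiredConfig, nil` suggestion and assumes an empty string means no on-disk record exists.

```go
package main

import "fmt"

// resolveConfigs picks the pair of configs to sync between. When an on-disk
// record exists and disagrees with the annotation, the disk wins as the
// starting point; the target is always the desired config.
func resolveConfigs(onDisk, currentOnAPI, desired string) (current, target string) {
	if onDisk != "" && onDisk != currentOnAPI {
		return onDisk, desired
	}
	return currentOnAPI, desired
}

func main() {
	// Mid-rollout snapshot scenario: disk=configC, current=configA,
	// desired=configB. The intermediate configA step is skipped.
	current, target := resolveConfigs("configC", "configA", "configB")
	fmt.Printf("sync from %s to %s\n", current, target)
}
```

Compared with reconciling to the annotated current config first, this saves one sync in the mid-rollout case and behaves the same when the backup had current == desired.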

Signed-off-by: Antonio Murdaca <runcom@linux.com>

runcom · 2019-04-17T23:09:41Z

alrighty, let's see what tests say and then pull the trigger on this for the DR story

runcom · 2019-04-17T23:09:52Z

/hold cancel

kikisdeliveryservice · 2019-04-18T15:33:15Z

/lgtm

openshift-ci-robot · 2019-04-18T15:33:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom · 2019-04-18T17:28:56Z

/retest

Mimic (and copy from) pkg/controller testing, hopefully one day I'll spend a week trying to abstract all the duplicate code in pkg/controller tests and now pkg/daemon...but not today. This patch adds tests for openshift#612, which would greatly benefit from some testing. This can be used as a start to add more tests *hint hint*

openshift-ci-robot added the do-not-merge/work-in-progress and size/S labels Apr 9, 2019

openshift-ci-robot requested review from cgwalters and kikisdeliveryservice April 9, 2019 13:40

openshift-ci-robot added the do-not-merge/hold and approved labels Apr 9, 2019

runcom commented on pkg/daemon/daemon.go (outdated) Apr 9, 2019: this is wrong (note to self)

runcom force-pushed the current-fingerprint branch from 301a880 to 381e8c0 Compare April 9, 2019 16:15

openshift-ci-robot added the size/M label and removed the size/S label Apr 9, 2019

runcom force-pushed the current-fingerprint branch 2 times, most recently from 485a3b6 to 6f5f21a Compare April 9, 2019 19:19

cgwalters reviewed Apr 9, 2019


runcom changed the title ~~WIP: pkg/daemon: add a current config on disk check~~ pkg/daemon: add a current config on disk check Apr 16, 2019

openshift-ci-robot removed the do-not-merge/work-in-progress label Apr 16, 2019

runcom force-pushed the current-fingerprint branch from 6f5f21a to 686b0e5 Compare April 16, 2019 15:04

runcom added the jira label Apr 16, 2019

cgwalters reviewed Apr 16, 2019


runcom removed the jira label Apr 16, 2019

pkg/daemon: add a current config on disk check

070dedd

Signed-off-by: Antonio Murdaca <runcom@linux.com>

runcom force-pushed the current-fingerprint branch from 686b0e5 to 070dedd Compare April 17, 2019 23:09

openshift-ci-robot removed the do-not-merge/hold label Apr 17, 2019

openshift-ci-robot assigned kikisdeliveryservice Apr 18, 2019

openshift-ci-robot added the lgtm label Apr 18, 2019

openshift-merge-robot merged commit 0314083 into openshift:master Apr 18, 2019

runcom deleted the current-fingerprint branch April 18, 2019 20:47

This was referenced Apr 18, 2019:

- pkg/daemon: add MCD skeleton tests and test prepUpdateFromCluster #645 (Merged)
- pkg/daemon: sync using current MC on disk if current on etcd not found #647 (Closed)
- move etcd to openshift-etcd #648 (Merged)
