Bug 1868158: gcp, azure: Handle azure vips similar to GCP#2011
openshift-merge-robot merged 1 commit into openshift:master from squeed:azure-routes-controller
Conversation
|
@squeed: This pull request references Bugzilla bug 1868158, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/test e2e-azure |
|
/cc @sttts |
|
/test e2e-gcp-op |
|
oops, typo'd the image key. Good thing the tests mostly failed... |
|
/refresh |
|
/test e2e-azure |
|
/retest |
(This isn't a behavior change, just removing some dead code and a corresponding re-indentation)
It could happen if we somehow switch to RRDNS.
You rely on a fresh base image (i.e. a reboot) to remove the old static pod?
We always reboot for config changes today, yes.
Though this gets into #1190; in fact, due to the way the MCO works today, there will unfortunately be a window where both are running.
We probably need to change the new code to at least detect the case where the old static pod exists and exit.
I could also keep it as the same filename; the filename definitely doesn't matter.
The old static pod doesn't matter; it writes to /run/gcp-routes while the new one writes to /run/cloud-routes, so they can happily coexist (and should, until the service is swapped).
A separate commit with just the file copied from GCP would help with reviewing the differences.
It's pretty different from GCP, so it needs a review.
|
Azure quota limits.
/retest |
|
/hold Holding so this doesn't merge until it looks like Azure does what we want. |
yuqi-zhang
left a comment
I don't really have the background knowledge to validate the functionality, so I think I should defer the lgtm to someone with more networking knowledge.
In terms of the operation here I suppose we're really just extending the existing GCP watcher to also work on Azure, which seems fine to me.
Did you mean to continue here?
It'd be nice to also add some platform-specific descriptions of how the service operates on each platform, so it's clearer how the differences are handled.
Not sure what you mean exactly; apiserver-watcher is identical on Azure and GCP. I did add pointers to the cloud-provider-specific scripts, so maybe that's helpful?
Please always use http://redsymbol.net/articles/unofficial-bash-strict-mode/
Also, it's really unfortunate that we keep accumulating this nontrivial bash code; like I said in the OVS review, it is possible today to have this in the MCD, since we pull that binary and execute it on the host.
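For reference, the strict-mode header from that article looks like this (a minimal sketch; the trailing echo is just a demo, not part of the convention):

```shell
#!/usr/bin/env bash
# "Unofficial bash strict mode": abort on errors (-e), on use of unset
# variables (-u), and on a failure in any stage of a pipeline (pipefail).
set -euo pipefail
# Split words only on newlines and tabs, not spaces.
IFS=$'\n\t'

echo "strict mode enabled"
```

With this header, a typo'd variable or a silently failing pipeline stage aborts the script instead of letting it limp along.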
I agree; I don't like adding all this bash. If it helps, I extract it and run it through shellcheck automatically. I could probably add that to `make verify`.
For 4.7, should we add an item to rewrite all this in Go?
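A rough sketch of how such a check could be wired into `make verify` (the `lint_shell` helper, the directory layout, and the graceful skip are all assumptions for illustration, not what the MCO actually does):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run shellcheck over every *.sh file under a directory, skipping
# gracefully when shellcheck isn't installed so minimal CI images
# don't fail. Word-splitting on $files is fine for paths without spaces.
lint_shell() {
    local dir="$1" files
    if ! command -v shellcheck >/dev/null 2>&1; then
        echo "shellcheck not found; skipping shell lint"
        return 0
    fi
    files=$(find "$dir" -name '*.sh')
    if [ -z "$files" ]; then
        return 0
    fi
    # shellcheck disable=SC2086
    shellcheck $files
}

# Demo: an empty directory lints cleanly.
lint_shell "$(mktemp -d)"
```

A `make verify` target would then just call this script and fail the build on any shellcheck finding.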
I would love this to be in Go! I defer to @cgwalters / @runcom on whether waiting for 4.7 makes sense.
|
/approve |
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/lgtm cancel I'll look at fixing it. |
This PR does the following things:
- Rename gcp-routes-controller to apiserver-watcher, since it is generic
- Remove the obsolete service-management mode from gcp-routes-controller
- Change the downfile directory from /run/gcp-routes to /run/cloud-routes
- Write $VIP.up as well as $VIP.down
- Add an Azure routes script that fixes hairpin

Background: Azure hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic for the apiserver VIP to ourselves via iptables. However, we should only do this when our local apiserver is running. The apiserver-watcher drops a $VIP.up or $VIP.down file, accordingly, depending on the state of the apiserver. Then, we add or remove iptables rules that short-circuit the load balancer. Unlike GCP, we don't need to do this for external traffic, only local clients.
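The up/down-file handoff described above could be sketched roughly like this (an illustration only, not the actual script from this PR; the `decide_action` helper, the demo VIP, and the temp-directory stand-in for /run/cloud-routes are assumptions, and the real iptables commands appear only in comments):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Decide what to do with the hairpin-redirect rule for one VIP, based on
# the $VIP.up / $VIP.down files the apiserver-watcher drops.
decide_action() {
    local vip="$1" rundir="$2"
    if [ -e "${rundir}/${vip}.up" ]; then
        # Local apiserver is healthy: short-circuit the load balancer,
        # e.g. iptables -t nat -A <chain> -d "$vip" -j REDIRECT
        echo "add"
    elif [ -e "${rundir}/${vip}.down" ]; then
        # Local apiserver is down: remove the rule so traffic reaches
        # another master via the load balancer,
        # e.g. iptables -t nat -D <chain> -d "$vip" -j REDIRECT
        echo "remove"
    else
        echo "none"
    fi
}

# Demo with a temporary directory standing in for /run/cloud-routes.
dir=$(mktemp -d)
touch "${dir}/10.0.0.4.up"
decide_action "10.0.0.4" "${dir}"    # prints "add"
rm "${dir}/10.0.0.4.up"
touch "${dir}/10.0.0.4.down"
decide_action "10.0.0.4" "${dir}"    # prints "remove"
rm -rf "${dir}"
```

The point of the `.up`/`.down` files is that the watcher and the routes script stay decoupled: one process observes apiserver health, the other only reacts to files.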
|
I'm holding the mutex 🔒 around force pushing updates here. |
|
/test e2e-azure |
|
/retest |
Thanks |
|
OK we have a green azure run here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2011/pull-ci-openshift-machine-config-operator-master-e2e-azure/1304091886448807936 |
|
Confirmed we fixed the ordering cycle by looking at the journal from the current run versus the previous: /approve |
|
@squeed: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
Eh, we had prior approvals on the old code and the new one just fixes systemd ordering issues, so… |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, mfojtik, squeed. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@squeed: All pull requests linked via external trackers have merged: Bugzilla bug 1868158 has been moved to the MODIFIED state.
Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
YAY! |
The original introduction of this service probably used `gcpRoutesController`, which happens to be the same as the MCO image, because we didn't have a reference to it, and plumbing the image substitution through all the abstraction layers in the code is certainly not obvious. Prep for openshift#2011, which wants to abstract the GCP work to also handle Azure; it was confusing that `machine-config-daemon-pull.service` was referencing an image with a GCP name.
This PR does the following things:
- Rename gcp-routes-controller to apiserver-watcher, since it is generic
- Remove the obsolete service-management mode from gcp-routes-controller
- Change the downfile directory from /run/gcp-routes to /run/cloud-routes
- Write $VIP.up as well as $VIP.down
- Add an Azure routes script that fixes hairpin

Background: Azure hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic for the apiserver VIP to ourselves via iptables. However, we should only do this when our local apiserver is running.
The apiserver-watcher drops a $VIP.up and $VIP.down file, accordingly, depending on the state of the apiserver. Then, we add or remove iptables rules that short-circuit the load balancer.
Unlike GCP, we don't need to do this for external traffic, only local clients.
- How to verify it
Install on Azure; ensure connections to the internal API load balancer are reliable, both when the local apiserver process is running and when it is stopped.
- Description for the changelog
Masters on Azure can now reliably connect to the apiserver service without encountering hairpin issues.