
Bug 1823950: Reverse haproxy and keepalived check timings #2075

Merged: openshift-merge-robot merged 1 commit into openshift:master from cybertron:keepalived-haproxy-timing on Sep 16, 2020

Conversation

@cybertron (Member)

Since we moved to keepalived health checking against haproxy, we want
haproxy to handle most failures so the VIP doesn't have to move.
Previously, however, haproxy took longer to recognize an outage on a
node than keepalived did, so the VIP moved before haproxy removed the
failing backend.

This change sets the haproxy check interval to 1 second, so with a
fall value of 2 (two consecutive failed checks) it should notice
outages in 2 seconds or less. The keepalived interval is changed to
2 seconds, which with its own fall value of 2 means it detects
failures in 2 to 4 seconds. As a result, haproxy should deal with API
outages before keepalived does, allowing the VIP to stay on the same
node.
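
To make the arithmetic concrete, here is a minimal sketch of how these timings are typically expressed in haproxy and keepalived configuration. The backend names, addresses, ports, and check script are hypothetical placeholders rather than the actual templates this PR touches; only the interval and fall values mirror the description above.

    # haproxy.cfg (illustrative): inter 1s with fall 2 requires two
    # consecutive failed checks, so a dead backend leaves the rotation
    # after roughly 1-2 seconds.
    backend masters
        option httpchk GET /readyz
        server master-0 192.0.2.10:6443 check inter 1s fall 2 rise 3
        server master-1 192.0.2.11:6443 check inter 1s fall 2 rise 3

    # keepalived.conf (illustrative): interval 2 with fall 2 requires
    # the check script to fail twice, 2 seconds apart, so the VIP moves
    # only after roughly 2-4 seconds, by which time haproxy has already
    # dropped the failing backend.
    vrrp_script chk_haproxy {
        script "/usr/bin/curl -o /dev/null -kLs https://localhost:9445/readyz"
        interval 2
        fall 2
        rise 2
    }

    vrrp_instance API {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        virtual_ipaddress {
            192.0.2.100/24
        }
        track_script {
            chk_haproxy
        }
    }

The ordering is the whole point: a hard outage trips haproxy's two 1-second checks before keepalived's two 2-second checks can both fail, so the unhealthy backend is removed while the VIP stays on the same node.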


@kikisdeliveryservice added the 4.7 (Work deferred for 4.7) label on Sep 11, 2020
@cybertron (Member Author)

/retest

@cybertron changed the title from "Reverse haproxy and keepalived check timings" to "Bug 1823950: Reverse haproxy and keepalived check timings" on Sep 11, 2020
@openshift-ci-robot (Contributor)

@cybertron: This pull request references Bugzilla bug 1823950, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)


@openshift-ci-robot added the bugzilla/severity-low (referenced Bugzilla bug's severity is low for the branch this PR is targeting) and bugzilla/valid-bug (referenced Bugzilla bug is valid for the branch this PR is targeting) labels on Sep 11, 2020
@cybertron (Member Author)

/retest

@cybertron (Member Author)

Hmm, I should change this timeout too.

@cybertron force-pushed the keepalived-haproxy-timing branch from 119aacb to 0f7517f on Sep 11, 2020 at 21:27
@bcrochet (Member)

/approve
/lgtm

@openshift-ci-robot added the lgtm (indicates that a PR is ready to be merged) label on Sep 11, 2020
@yboaron (Contributor)

yboaron commented Sep 13, 2020

@cybertron Don't we need this fix for 4.6? I can see that the PR was labeled with 4.7.

@cybertron (Member Author)

> @cybertron Don't we need this fix for 4.6? I can see that the PR was labeled with 4.7.

Yes. I think the label got added because I pushed it without a bug reference initially. @kikisdeliveryservice Are you okay with dropping the 4.7 label? This is needed to complete a 4.6 bug fix.

@cybertron (Member Author)

/test e2e-metal-ipi

1 similar comment

@kikisdeliveryservice removed the 4.7 (Work deferred for 4.7) label on Sep 15, 2020
@kikisdeliveryservice (Contributor)

go for it! 😄

(there originally wasn't a bz attached to this)

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bcrochet, cybertron, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved (indicates a PR has been approved by an approver from all required OWNERS files) label on Sep 15, 2020
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments

@openshift-ci-robot (Contributor)

@cybertron: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-workers-rhel7
Commit: 0f7517f
Rerun command: /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.


@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 5b5d261 into openshift:master on Sep 16, 2020
@openshift-ci-robot (Contributor)

@cybertron: All pull requests linked via external trackers have merged:

Bugzilla bug 1823950 has been moved to the MODIFIED state.


mandre added a commit to mandre/machine-config-operator that referenced this pull request on Sep 16, 2020, porting openshift#2075 to the OpenStack platform.
mandre added a commit to mandre/machine-config-operator that referenced this pull request on Sep 23, 2020, porting openshift#2075 to the OpenStack platform.
vrutkovs pushed a commit to vrutkovs/machine-config-operator that referenced this pull request on Oct 13, 2020, porting openshift#2075 to the OpenStack platform.