Fix AROBrokenDNSMasq #4528
Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master on Dec 16, 2023
Conversation
On non-ARO clusters the query was emitting multiple zero values. Also, the ARO operator only exists on ARO clusters, so there is no need to check whether or not the cluster is on Azure. Also extend to 4.13.27 and 4.14.7.
Contributor
[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval. This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Contributor
@sdodson: all tests passed!

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
wking added a commit to wking/cluster-version-operator that referenced this pull request on Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates round that does anything useful has a fresh Cincinnati pull" to "some syncAvailableUpdates rounds have a fresh Cincinnati pull, but others just re-eval some Recommended=Unknown conditional updates". Then syncAvailableUpdates calls setAvailableUpdates. However, until this commit, setAvailableUpdates had been bumping LastAttempt every time, even in the "just re-eval conditional updates" case. That meant we never tripped the:

    } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
        klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull. That could lead to deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate it, and fail. The re-eval-and-fail loop continues.
3. The Cincinnati PromQL is fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up the fixed PromQL, try to evaluate it, and start succeeding. Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. The re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO pod:

    $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull. I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
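To make the timing interaction above concrete, here is a minimal Go sketch of the described behavior. It is illustrative only: the availableUpdates struct, syncRound, bumpOnReEval, and the intervals are simplified stand-ins invented for this example, not the actual CVO types or field names; the real logic lives in pkg/cvo/availableupdates.

    package main

    import (
        "fmt"
        "time"
    )

    // availableUpdates is a simplified stand-in for the CVO's bookkeeping;
    // only the LastAttempt timestamp matters for this sketch.
    type availableUpdates struct {
        LastAttempt time.Time
    }

    // RecentlyChanged mirrors the guard quoted above: true while less than
    // interval has elapsed since the last recorded attempt.
    func (a *availableUpdates) RecentlyChanged(interval time.Duration) bool {
        return time.Since(a.LastAttempt) < interval
    }

    // syncRound models one syncAvailableUpdates pass. With bumpOnReEval=true
    // (the pre-fix behavior), LastAttempt is refreshed even on rounds that only
    // re-evaluate cached conditional-update PromQL, so the fresh-pull branch
    // below is never reached while failed evaluations keep requeueing quickly.
    func syncRound(a *availableUpdates, minimumUpdateCheckInterval time.Duration, bumpOnReEval bool) string {
        if !a.RecentlyChanged(minimumUpdateCheckInterval) {
            a.LastAttempt = time.Now() // a real Cincinnati pull is an attempt either way
            return "fresh Cincinnati pull (picks up fixed PromQL)"
        }
        if bumpOnReEval {
            a.LastAttempt = time.Now() // pre-fix: resets the clock on every re-eval round
        }
        return "re-eval cached PromQL only"
    }

    func main() {
        const minimumUpdateCheckInterval = 40 * time.Millisecond
        for _, bumpOnReEval := range []bool{true, false} {
            a := &availableUpdates{LastAttempt: time.Now()} // time of the last real pull
            fmt.Printf("bump LastAttempt on re-eval rounds: %v\n", bumpOnReEval)
            for i := 0; i < 4; i++ {
                time.Sleep(minimumUpdateCheckInterval / 2) // failed evals requeue faster than the interval
                fmt.Println("  round", i, "->", syncRound(a, minimumUpdateCheckInterval, bumpOnReEval))
            }
        }
    }

Run as written, the bumpOnReEval=true pass prints only re-eval rounds (the stuck loop in case 4.b above), while the false pass reaches the fresh-pull branch once the check interval elapses, which is the behavior this commit restores.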
wking added a commit to wking/cluster-version-operator that referenced this pull request on Dec 20, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 2, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 12, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 20, 2024