Fix AROBrokenDNSMasq #4528
Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master on Dec 16, 2023
Conversation
On non-ARO clusters the query was emitting multiple zero values. Also, the ARO operator only exists on ARO clusters, so there is no need to check whether or not the cluster is on Azure. Also extend to 4.13.27 and 4.14.7.
Contributor
[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval. This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Contributor
@sdodson: all tests passed!

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
wking added a commit to wking/cluster-version-operator that referenced this pull request on Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates round that does anything useful has a fresh Cincinnati pull" to "some syncAvailableUpdates rounds have a fresh Cincinnati pull, but others just re-eval some Recommended=Unknown conditional updates". Then syncAvailableUpdates calls setAvailableUpdates. However, until this commit, setAvailableUpdates had been bumping LastAttempt every time, even in the "just re-eval conditional updates" case. That meant we never tripped the:

    } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
        klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull. That could lead to deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate it, and fail. The re-eval-and-fail loop continues.
3. The Cincinnati PromQL is fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up the fixed PromQL, try to evaluate it, and start succeeding. Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. The re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO pod:

    $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull. I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
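To make the timing interaction above concrete, here is a minimal Go sketch of the described behavior. It is illustrative only: the availableUpdates struct, syncRound, bumpOnReEval, and the intervals are simplified stand-ins invented for this example, not the actual CVO types or field names; the real logic lives in pkg/cvo/availableupdates.

    package main

    import (
        "fmt"
        "time"
    )

    // availableUpdates is a simplified stand-in for the CVO's bookkeeping;
    // only the LastAttempt timestamp matters for this sketch.
    type availableUpdates struct {
        LastAttempt time.Time
    }

    // RecentlyChanged mirrors the guard quoted above: true while less than
    // interval has elapsed since the last recorded attempt.
    func (a *availableUpdates) RecentlyChanged(interval time.Duration) bool {
        return time.Since(a.LastAttempt) < interval
    }

    // syncRound models one syncAvailableUpdates pass. With bumpOnReEval=true
    // (the pre-fix behavior), LastAttempt is refreshed even on rounds that only
    // re-evaluate cached conditional-update PromQL, so the fresh-pull branch
    // below is never reached while failed evaluations keep requeueing quickly.
    func syncRound(a *availableUpdates, minimumUpdateCheckInterval time.Duration, bumpOnReEval bool) string {
        if !a.RecentlyChanged(minimumUpdateCheckInterval) {
            a.LastAttempt = time.Now() // a real Cincinnati pull is an attempt either way
            return "fresh Cincinnati pull (picks up fixed PromQL)"
        }
        if bumpOnReEval {
            a.LastAttempt = time.Now() // pre-fix: resets the clock on every re-eval round
        }
        return "re-eval cached PromQL only"
    }

    func main() {
        const minimumUpdateCheckInterval = 40 * time.Millisecond
        for _, bumpOnReEval := range []bool{true, false} {
            a := &availableUpdates{LastAttempt: time.Now()} // time of the last real pull
            fmt.Printf("bump LastAttempt on re-eval rounds: %v\n", bumpOnReEval)
            for i := 0; i < 4; i++ {
                time.Sleep(minimumUpdateCheckInterval / 2) // failed evals requeue faster than the interval
                fmt.Println("  round", i, "->", syncRound(a, minimumUpdateCheckInterval, bumpOnReEval))
            }
        }
    }

Run as written, the bumpOnReEval=true pass prints only re-eval rounds (the stuck loop in case 4.b above), while the false pass reaches the fresh-pull branch once the check interval elapses, which is the behavior this commit restores.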
wking added a commit to wking/cluster-version-operator that referenced this pull request on Dec 20, 2023
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 2, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 12, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Jan 20, 2024