
Fix AROBrokenDNSMasq #4528

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from sdodson:fixup-AROBrokenDNSMasq
Dec 16, 2023

Conversation

@sdodson (Member) commented Dec 16, 2023

On non-ARO clusters the query was emitting multiple zero values. Also, the ARO operator only exists on ARO clusters, so there is no need to check whether or not the cluster is on Azure.

Also extend to 4.13.27 and 4.14.7.

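For context on why multiple zero values matter, here is a minimal Go sketch. It assumes the cluster-version operator's conditional-update evaluation expects the risk's PromQL expression to return exactly one 0-or-1 sample; the type and function names below are invented for illustration and are not the actual operator code.

    package main

    import (
    	"errors"
    	"fmt"
    )

    // sample is a simplified stand-in for one series returned by the risk's PromQL query.
    type sample struct {
    	value float64
    }

    // evaluateRiskQuery mimics (as an assumption, not the CVO's actual code) a risk
    // evaluation that expects exactly one sample: 1 when the cluster is exposed to the
    // risk, 0 when it is not. A query that matches several series on non-ARO clusters
    // returns multiple zero values and errors instead of cleanly reporting "not exposed".
    func evaluateRiskQuery(results []sample) (exposed bool, err error) {
    	if len(results) != 1 {
    		return false, fmt.Errorf("expected exactly one sample, got %d", len(results))
    	}
    	switch results[0].value {
    	case 0:
    		return false, nil
    	case 1:
    		return true, nil
    	default:
    		return false, errors.New("sample value must be 0 or 1")
    	}
    }

    func main() {
    	// Broken shape: on a non-ARO cluster the old query matched several series,
    	// all zero, so evaluation errored rather than returning "not exposed".
    	broken := []sample{{value: 0}, {value: 0}}
    	if _, err := evaluateRiskQuery(broken); err != nil {
    		fmt.Println("broken query:", err)
    	}

    	// Fixed shape: a query keyed only on the ARO operator's presence yields a
    	// single zero sample on a non-ARO cluster and evaluates cleanly.
    	fixed := []sample{{value: 0}}
    	exposed, err := evaluateRiskQuery(fixed)
    	fmt.Println("fixed query:", exposed, err)
    }

Under that assumption, keying the query only on the ARO operator's presence gives a single sample on every cluster, so non-ARO clusters evaluate to "not exposed" instead of failing.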
@sdodson added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) and lgtm (Indicates that a PR is ready to be merged) labels on Dec 16, 2023
openshift-ci bot (Contributor) commented Dec 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot (Contributor) commented Dec 16, 2023

@sdodson: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot[bot] merged commit 2a5c7ab into openshift:master on Dec 16, 2023
wking added a commit to wking/cluster-version-operator that referenced this pull request Dec 20, 2023
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on
failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates
round that does anything useful has a fresh Cincinnati pull" to "some
syncAvailableUpdates rounds have a fresh Cincinnati pull, but others
just re-eval some Recommended=Unknown conditional updates".  Then
syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping
LastAttempt every time, even in the "just re-eval conditional updates"
case.  That meant we never tripped the:

        } else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
                klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull, which could lead to
deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate, and fail.
   Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up
      the fixed PromQL, try to evaluate, and start succeeding.  Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been
      a long time since we pulled fresh Cincinnati information, but it
      has not been long since my last attempt to eval this broken
      PromQL, so let me skip the Cincinnati pull and re-eval that old
      PromQL", which fails.  Re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO
pod:

  $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull.
I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
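To make the starvation concrete, here is a self-contained Go sketch of the loop this commit message describes. It is a simplified model, not the CVO's code: RecentlyChanged, LastAttempt, and minimumUpdateCheckInterval echo the quoted snippet, while syncOnce, the bumpOnReEval flag, and the simulated clock are assumptions made purely for illustration.

    package main

    import (
    	"fmt"
    	"time"
    )

    // availableUpdates is a simplified model (an assumption, not the CVO's actual
    // struct) of the state the quoted snippet consults.
    type availableUpdates struct {
    	LastAttempt time.Time
    }

    // RecentlyChanged reports whether the last attempt was within interval of the
    // (simulated) current time.
    func (a availableUpdates) RecentlyChanged(now time.Time, interval time.Duration) bool {
    	return now.Sub(a.LastAttempt) < interval
    }

    const minimumUpdateCheckInterval = 5 * time.Minute

    // syncOnce models one syncAvailableUpdates round at simulated time "now". With
    // bumpOnReEval=true it reproduces the pre-fix behavior: a round that only
    // re-evaluates stale conditional-update PromQL still refreshes LastAttempt, so
    // the "!RecentlyChanged" branch that would trigger a fresh Cincinnati pull never
    // fires, and the broken PromQL is re-evaluated forever.
    func syncOnce(state *availableUpdates, now time.Time, bumpOnReEval bool) string {
    	if !state.RecentlyChanged(now, minimumUpdateCheckInterval) {
    		state.LastAttempt = now
    		return "fresh Cincinnati pull"
    	}
    	if bumpOnReEval {
    		state.LastAttempt = now // pre-fix: starves the fresh pull
    	}
    	return "re-eval cached (possibly broken) PromQL"
    }

    func main() {
    	start := time.Now()
    	for _, bump := range []bool{true, false} {
    		state := &availableUpdates{LastAttempt: start}
    		fmt.Println("bump LastAttempt on re-eval:", bump)
    		now := start
    		// Requeue a re-eval every minute; watch whether a fresh pull ever happens.
    		for round := 0; round < 8; round++ {
    			now = now.Add(time.Minute)
    			fmt.Printf("  t+%v: %s\n", now.Sub(start), syncOnce(state, now, bump))
    		}
    	}
    }

With bumpOnReEval true (the pre-fix behavior) the fresh-pull branch never runs; with it false, a fresh Cincinnati pull happens once minimumUpdateCheckInterval has elapsed, which is how a cluster eventually picks up repaired PromQL like [2].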
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 2, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 4, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 12, 2024
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jan 20, 2024