Bug 1952576: csv_succeeded metric not present #2213
josefkarasek wants to merge 1 commit into operator-framework:master
Conversation
|
@josefkarasek: This pull request references Bugzilla bug 1952576, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Hi @josefkarasek. Thanks for your PR. I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected on the PR. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: josefkarasek. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing a comment. |
The `csv_succeeded` metric is lost between pod restarts, because it is only emitted when CSV.Status changes. Signed-off-by: Josef Karasek <jkarasek@redhat.com>
```go
	}

	// always emit csv metrics
	metrics.EmitCSVMetric(clusterServiceVersion, outCSV)
```
Are there any cardinality concerns with always emitting CSV metrics?
I think the metric uses a good, qualified name whose label set is always unique for one CSV:

```
csv_succeeded{name="etcdoperator.v0.9.4",namespace="olm",version="0.9.4"} 1
```
Agreed, that doesn't look crazy to me, but it looks like we're emitting other metrics besides csv_succeeded when I was poking around that metrics package. I'm trying to wrap my head around whether all those CSV-related metrics, emitted on a per-step basis, could lead to problems affecting the core monitoring stack.
Another approach to fixing this bug would be to emit the metric for all CSVs during pod startup and then update it only when a change happens.
@timflannagan I don't think there's any cardinality concern here. csv_succeeded is a Prometheus gauge, and within EmitCSVMetric we always first delete the old metric for the CSV being synced, then emit a new metric and set the gauge value to 1 or 0 (succeeded / did not succeed). Even if we were not deleting the old metric, IIRC metric points for a unique set of label values are only emitted once, i.e. they're always unique data points in the set of emitted metrics, which is what @josefkarasek clarified in the first comment.
However, I'm not convinced this actually solves the problem. @josefkarasek, the original problem was that we were only edge-triggering this metric, i.e. whenever the controller syncs a ClusterServiceVersion (syncClusterServiceVersion holds the logic for when that happens), and that happens only when there's a change in the CSV object on the cluster. But we need some way to level-drive this metric too, which is what the first part of your last comment suggests:
- "update it only when a change happens" → edge triggered
- "emit the metric for all CSVs during pod startup" → level driven
I'm assuming that all CSVs are queued up during pod start and reconciled, so my assumption was that this approach is edge- and level-driven at the same time. From what you're saying, it sounds like that assumption doesn't hold.
Although CSVs are queued up during pod start, that is still edge triggering; the trigger here is the queuing of the CSV. True level-driven behavior means querying the state of the cluster and reconciling it with the desired state. With edge triggers there's always a chance we'll miss an event, so querying for existing CSVs and emitting metrics for them on pod restart is the most foolproof way to solve this problem.
|
/ok-to-test
|
How can I fix it? |
|
@josefkarasek The bug bot is complaining about the current state of the BZ: #2213 (comment). In order to fix this, update the BZ's "Target Release" dropdown to 4.9.0 instead of the default value (empty release, "---"), save the BZ, and then comment:
|
/bugzilla refresh |
|
@josefkarasek: This pull request references Bugzilla bug 1952576, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: jianzhangbjz. Note that only operator-framework members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: |
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
anik120
left a comment
We'd also want to add a test for this as a proof of concept.
|
Also, @timflannagan, we don't need the bug number in the PR title, right? We'll only need it when we downstream the PR?
|
Closing in favor of #2216
|
@josefkarasek: This pull request references Bugzilla bug 1952576. The bug has been updated to no longer refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Description of the change:
Emit `csv_succeeded`/`csv_abnormal` metrics during every CSV sync loop.

Motivation for the change:
The `csv_succeeded` metric is lost between pod restarts, because it is only emitted when CSV.Status changes.

Reviewer Checklist