feat(metrics) Limit Cardinality of CSV metrics by awgreene · Pull Request #1099 · operator-framework/operator-lifecycle-manager

awgreene · 2019-10-30T17:06:56Z

This commit introduces a change that limits the number of metrics that
an OLM cluster reports at any given time for a CSV.

The first metric introduced is called csv_succeeded, which tracks CSVs that
have reached the succeeded phase. The following information is
provided about the CSV via labels: name, version. The value of this
metric will always be 0 or 1.

The second metric introduced is called csv_abnormal, which is reported
whenever the CSV is updated and has not reached the succeeded phase. The
following information is provided about the CSV via labels: name,
version, phase, reason. Whenever a CSV is updated, the existing
timeseries is deleted and replaced by an updated version.
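
A minimal Go sketch of the bookkeeping described above, using only the standard library. The `gaugeVec` map stands in for a labelled Prometheus gauge family. `csvInfo`, `csvMetrics`, `labelKey`, and `update` are hypothetical names for illustration and are not taken from this PR's code. It assumes a CSV is identified by name and version only, as in the label sets above.

```go
package main

import "strings"

// csvInfo is a hypothetical, trimmed-down view of a CSV for this sketch.
type csvInfo struct {
	Name, Version, Phase, Reason string
}

// gaugeVec stands in for a labelled gauge family: label tuple -> value.
type gaugeVec map[string]float64

// labelKey joins label values into a single map key.
func labelKey(labels ...string) string { return strings.Join(labels, "\x00") }

// csvMetrics holds the two metric families described above.
type csvMetrics struct {
	succeeded    gaugeVec          // labels: name, version
	abnormal     gaugeVec          // labels: name, version, phase, reason
	lastAbnormal map[string]string // name+version -> abnormal series currently exported
}

func newCSVMetrics() *csvMetrics {
	return &csvMetrics{
		succeeded:    gaugeVec{},
		abnormal:     gaugeVec{},
		lastAbnormal: map[string]string{},
	}
}

// update records the latest state of a CSV. Any abnormal series previously
// exported for the same CSV is deleted first, so each CSV contributes at most
// one csv_succeeded series and one csv_abnormal series at a time.
func (m *csvMetrics) update(csv csvInfo) {
	id := labelKey(csv.Name, csv.Version)

	if old, ok := m.lastAbnormal[id]; ok {
		delete(m.abnormal, old)
		delete(m.lastAbnormal, id)
	}

	if csv.Phase == "Succeeded" {
		m.succeeded[id] = 1
		return
	}

	m.succeeded[id] = 0
	k := labelKey(csv.Name, csv.Version, csv.Phase, csv.Reason)
	m.abnormal[k] = 1
	m.lastAbnormal[id] = k
}
```

With the Prometheus Go client, the map writes would presumably become `GaugeVec.WithLabelValues(...).Set(...)` calls and the `delete` would become `GaugeVec.DeleteLabelValues(...)`. The point of the sketch is only that cardinality stays bounded per CSV no matter how many phases it passes through.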

Reviewer Checklist

Implementation matches the proposed design, or proposal is updated to match implementation
Sufficient unit test coverage
Sufficient end-to-end test coverage
Docs updated or added to /docs
Commit messages sensible and descriptive

awgreene · 2019-10-30T17:16:23Z

Example Cases:

When a CSV is updated but not in the succeeded phase

# HELP csv_abnormal Successful CSV Install
# TYPE csv_abnormal gauge
csv_abnormal{name="etcd-operator.v1.0.0",phase="Failed",reason="UnsupportedOperatorGroup",version="1.0.0"} 1.0
...
...
...
# HELP csv_succeeded Successful CSV Install
# TYPE csv_succeeded gauge
csv_succeeded{name="packageserver",version="1.0.0"} 1.0
csv_succeeded{name="etcd-operator.v1.0.0",version="1.0.0"} 0.0

Note: When a CSV is updated, the old metric is deleted.

When a CSV reaches the succeeded phase

# HELP csv_succeeded Successful CSV Install
# TYPE csv_succeeded gauge
csv_succeeded{name="packageserver",version="1.0.0"} 1.0
csv_succeeded{name="etcd-operator.v1.0.0",version="1.0.0"} 1.0

Note: The csv_abnormal timeseries has been removed.

awgreene · 2019-10-30T17:17:21Z

@ecordell it should be noted that this data will get messy if multiple instances of the same operator are installed.

awgreene · 2019-10-30T19:23:18Z

/test e2e-gcp-upgrade

awgreene · 2019-10-31T11:49:32Z

/retest

ecordell

/lgtm

ecordell · 2019-10-31T11:59:10Z

/hold

while I spin up a cluster to look at this :)

awgreene · 2019-10-31T13:02:22Z

> /hold
>
> while I spin up a cluster to look at this :)

Sure - let me know if you find anything interesting!

ecordell · 2019-10-31T13:08:36Z

Checking CSV succeeded:
blue is packageserver, which is always good
green is etcd-operator - I killed its deployment to watch it go unhealthy

This is csv_abnormal - I only see it for InstallReady, when OLM detected that the deployment was deleted and reconciled from the CSV.

This looks good, but I did expect to see more values for "abnormal". I think what's happening is that we're deleting the abnormal states from prometheus before prometheus can scrape them.

Knowing when to delete them is tricky, since the scrape time is configurable. I think it's 20s by default. We may need to do something where we keep the "old" timeseries around for a while before deleting them.

This commit introduces a change that limits the number of metrics that an OLM cluster reports at any given time for a CSV.

The first metric introduced is called csv_up, which tracks CSVs that have reached the succeeded phase. The following information is provided about the CSV via labels: namespace, name, version. The value of this metric will always be 0 or 1.

The second metric introduced is called csv_abnormal, which is reported whenever the CSV is updated and has not reached the succeeded phase. The following information is provided about the CSV via labels: namespace, name, version, phase, reason. Whenever a CSV is updated, the existing timeseries is deleted and replaced by an updated version.

kevinrizza · 2019-10-31T16:01:56Z

> Checking CSV succeeded:
> blue is packageserver, which is always good
> green is etcd-operator - I killed its deployment to watch it go unhealthy
>
> This is csv_abnormal - I only see it for InstallReady, when OLM detected that the deployment was deleted and reconciled from the CSV.
>
> This looks good, but I did expect to see more values for "abnormal". I think what's happening is that we're deleting the abnormal states from prometheus before prometheus can scrape them.
>
> Knowing when to delete them is tricky, since the scrape time is configurable. I think it's 20s by default. We may need to do something where we keep the "old" timeseries around for a while before deleting them.

@ecordell Maybe this is a silly question, but what value do we get from seeing every step here rather than just seeing "it's in a good state" or "it's not in a good state" ? Is there some unique knowledge we would gain from seeing all of those steps? I was thinking that we care more about "what step are we stuck in" in the same way that I can learn that by kubectl getting the CSV

My concern is that being aware of the state of these things and deleting them later seems like it adds a medium amount of complexity to managing the metric endpoint's state for the value that it adds.

ecordell · 2019-10-31T17:08:13Z

That's a good point, and I do think it would be reasonable to merge this as-is and accept that not all states will necessarily get picked up by prometheus.

It would be nice to see the full lifecycle of operators reported if possible though. Perhaps we do that as a follow up.

ecordell · 2019-10-31T17:10:11Z

/lgtm
/hold cancel

openshift-ci-robot · 2019-10-31T17:10:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, ecordell

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ecordell]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ecordell · 2019-10-31T17:11:24Z

/retest

awgreene · 2019-10-31T18:17:07Z

/retest

awgreene · 2019-10-31T19:20:22Z

/retest

openshift-ci-robot added the do-not-merge/work-in-progress and size/M labels Oct 30, 2019

openshift-ci-robot requested review from ecordell and njhale October 30, 2019 17:07

awgreene force-pushed the limit-cardinality-on-csv-metrics branch from fba91bb to f8c01b9 October 30, 2019 17:09

awgreene force-pushed the limit-cardinality-on-csv-metrics branch 6 times, most recently from 37278d8 to c2adb1c October 30, 2019 18:44

awgreene force-pushed the limit-cardinality-on-csv-metrics branch from c2adb1c to 571ec70 October 30, 2019 19:27

awgreene changed the title ~~WIP: feat(metrics) Limit Cardinality of CSV metrics~~ feat(metrics) Limit Cardinality of CSV metrics Oct 30, 2019

openshift-ci-robot removed the do-not-merge/work-in-progress label Oct 30, 2019

ecordell reviewed Oct 30, 2019

Comment thread pkg/metrics/metrics.go Outdated

awgreene force-pushed the limit-cardinality-on-csv-metrics branch from 571ec70 to 288425b October 31, 2019 02:09

ecordell reviewed Oct 31, 2019

openshift-ci-robot assigned ecordell Oct 31, 2019

openshift-ci-robot added the lgtm and approved labels Oct 31, 2019

openshift-ci-robot added the do-not-merge/hold label Oct 31, 2019

awgreene force-pushed the limit-cardinality-on-csv-metrics branch from 288425b to 2a93602 October 31, 2019 14:54

openshift-ci-robot removed the lgtm label Oct 31, 2019

awgreene force-pushed the limit-cardinality-on-csv-metrics branch from 2a93602 to 5884308 October 31, 2019 15:23

openshift-ci-robot added the lgtm label and removed the do-not-merge/hold label Oct 31, 2019

openshift-merge-robot merged commit eb9a999 into operator-framework:master Oct 31, 2019

awgreene mentioned this pull request Nov 11, 2019

Bug 1774621: Add OLM CSV metrics openshift/telemeter#253

Merged

timflannagan mentioned this pull request Jun 23, 2021

Bug 1952576: csv_succeeded metric not present #2213

Closed

