
Implement Continuous Profiling #112

Merged
openshift-merge-robot merged 2 commits into openshift:master from awgreene:continuous-profiling
Jul 22, 2021

Conversation

@awgreene
Contributor

@awgreene awgreene commented Jul 1, 2021

No description provided.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2021
@openshift-ci
Contributor

openshift-ci Bot commented Jul 1, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from benluddy and gallettilance July 1, 2021 19:09
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2021
Comment thread manifests/0000_50_olm_02-services.yaml Outdated
@awgreene awgreene force-pushed the continuous-profiling branch 10 times, most recently from a3e879d to c41b3ad on July 2, 2021 05:01
@awgreene awgreene force-pushed the continuous-profiling branch 2 times, most recently from a784f83 to 904608f on July 9, 2021 18:05
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 14, 2021
@awgreene awgreene force-pushed the continuous-profiling branch from 904608f to f2b8fba on July 14, 2021 18:40
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 14, 2021
@awgreene awgreene force-pushed the continuous-profiling branch 7 times, most recently from c3523ca to 2d5d275 on July 19, 2021 13:15
Comment thread manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
Comment thread operator-lifecycle-manager.Dockerfile Outdated
COPY . .
RUN make build/olm bin/cpb

RUN chmod +x /build/bin/collect-profiles
Contributor

I'm surprised that this would be necessary. What's special about how this binary is being produced?

Contributor Author

I was too. Now that the PR is in a working state, I will attempt to remove it and diagnose why the binary could not be run.

Comment thread manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2021
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2021
Comment thread manifests/0000_50_olm_00-pprof-rbac.yaml
Comment thread cmd/collect-profiles/main.go Outdated
@awgreene awgreene force-pushed the continuous-profiling branch 2 times, most recently from eef9e1e to f413010 on July 20, 2021 18:45
@awgreene
Contributor Author

/retest

Contributor

@timflannagan timflannagan left a comment

Nice, these changes look pretty far along. I'll take another pass later.

I had a couple of comments/questions/etc. before I had to context switch to something else.

I'll play around with this work locally later.

Comment thread cmd/collect-profiles/main.go
Comment thread cmd/collect-profiles/main.go Outdated
var cfg config.Configuration
return &cobra.Command{
Use: "collect-profiles configMapName:url",
Short: "Retrieves the pprof data from a UNL and stores it in a configMap.",
Contributor

s/UNL/URL

Comment thread cmd/collect-profiles/main.go Outdated
Short: "Retrieves the pprof data from a UNL and stores it in a configMap.",
Long: `The collect-profiles command makes https requests against pprof URLs
provided as arguments and stores that information in immutable configMaps.`,
Version: "0.0.1",
Contributor

I don't see any precedent for versioning OLM binaries using this flag. Any reason we would want to include this flag going forward?

Contributor Author

Let's drop it then!

Comment thread cmd/collect-profiles/main.go Outdated
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"k8s.io/klog/v2"
Contributor

Would it be possible to avoid vendoring another logging client and use one of the existing ones? IIRC logrus should at least be available.

Contributor Author

Sure

Contributor Author

I like klog though :(

Contributor

On a recheck of these changes, it looks like klog is already present, and it's only being marked as an explicit dependency in the go.sum now. I don't have a strong opinion and was namely interested in reducing the vendor surface area, so it should be fine to continue using this package.

if err != nil {
return err
}
if fi.Size() == 0 {
Contributor

Trying to wrap my head around when a file would exist but contain zero bytes. A $ touch <file> kind of operation?

Contributor Author

@awgreene awgreene Jul 21, 2021

It's a bit of a chicken-and-egg problem.

The service-ca operator cannot generate certs for use with client authentication. We elected to have the CronJob generate a self-signed cert and rotate it with each run. The tls secret containing the self-signed cert must be mounted to the pod, so it has to exist before the pod can populate it. You cannot omit the tls.key or tls.crt values from the data of a TLS Secret, so they had to be set to "". The initial run of the CronJob will therefore see two empty files named tls.key and tls.crt.

The entire point of the above is to handle that initial run: update the secret and exit gracefully. The OLM and Catalog Operator Deployments will then notice the updated secret and use the new cert to validate client requests, allowing subsequent CronJob runs to use that secret to retrieve the pprof data.
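
As an illustration of the Secret shape described above (in the shipped manifests the placeholder is created by CVO from YAML, not by code; the client-go call and helper name here are assumptions):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPlaceholderTLSSecret is a hypothetical helper showing the
// placeholder TLS Secret: kubernetes.io/tls Secrets must carry both
// tls.crt and tls.key, so they start as "" and are mounted into the
// pod as zero-byte files until the CronJob's first run populates them.
func createPlaceholderTLSSecret(ctx context.Context, client kubernetes.Interface) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "pprof-cert", // name taken from the review thread below
			Namespace: "openshift-operator-lifecycle-manager",
		},
		Type: corev1.SecretTypeTLS,
		Data: map[string][]byte{
			corev1.TLSCertKey:       {}, // "tls.crt"
			corev1.TLSPrivateKeyKey: {}, // "tls.key"
		},
	}
	_, err := client.CoreV1().Secrets(secret.Namespace).Create(ctx, secret, metav1.CreateOptions{})
	return err
}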

Comment on lines +23 to +25
restConfig, err := rest.InClusterConfig()
if err != nil {
return err
}
Contributor

Any hesitance towards always relying on the in-cluster config? The only immediate concern would be attempting to develop locally and running the binary from a remote host, which should error/produce a panic if it cannot load the in-cluster configuration. The latter can be implemented as a follow-up later down the line, but trying to wrap my head around whether there are similar situations to look out for here.

Contributor Author

It seems reasonable if people decide to use this binary outside of containers, but I'd like to see this work tacked on post code freeze.
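
If that follow-up happens, one common pattern is to fall back to a local kubeconfig when the in-cluster config is unavailable. A sketch only; the fallback is an assumption, not part of this PR, which uses rest.InClusterConfig() exclusively:

package main

import (
	"os"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// loadRESTConfig prefers the in-cluster config and, for local
// development only, falls back to the kubeconfig named by $KUBECONFIG.
func loadRESTConfig() (*rest.Config, error) {
	if cfg, err := rest.InClusterConfig(); err == nil {
		return cfg, nil
	}
	return clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
}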

Comment on lines +77 to +91
PersistentPreRunE: func(*cobra.Command, []string) error {
return cfg.Load()
},
Contributor

Had a quick question as to why we couldn't bake this into the RunE handler. Does it affect startup performance in the latter case?

Contributor Author

I simply thought that retrieving the KUBECONFIG outside of the command logic made more sense.
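
For reference, a minimal cobra sketch (illustrative, not the PR's code) showing the ordering guarantee that makes PersistentPreRunE a natural home for config loading: it runs before RunE, and before any subcommand's RunE as well:

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	var loaded bool
	cmd := &cobra.Command{
		Use: "collect-profiles",
		// Runs before RunE (and before any subcommand's RunE), so the
		// command logic can assume the configuration is already loaded.
		PersistentPreRunE: func(*cobra.Command, []string) error {
			loaded = true // stand-in for cfg.Load()
			return nil
		},
		RunE: func(*cobra.Command, []string) error {
			fmt.Println("config loaded:", loaded) // prints: config loaded: true
			return nil
		},
	}
	_ = cmd.Execute()
}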

Comment thread cmd/collect-profiles/main.go
return nil
}

jobConfig, err := config.GetConfig(configMountPath)
Contributor

What happens when the configMountPath variable (or even the default /etc/config location) doesn't exist? It seems like it would produce an error attempting to open that file when reading the code in the config package. Is that the expected behavior we'd like, or would we rather handle that error and attempt to create that path for them?

Contributor Author

So when shipped with OpenShift, using the manifests generated by make manifests, CVO will always ensure that the default jobConfig configMap exists. If a user chose to delete this configMap, it would be restored by a subsequent sync.

Contributor Author

It seems reasonable to return a default jobConfig if the file isn't found.

Contributor Author

@awgreene awgreene Jul 22, 2021

On second thought, it seems better for the job to exit if a user provides an invalid configuration.

Consider the only configuration available to users, disabled = true. When this field is set to true, the container "completes" and exits gracefully. If a user "disabled" the job with an invalid value (i.e., false instead of False), defaulting would cause the container to collect the profiles and exit gracefully without the user noticing that the job had not been disabled. Alternatively, the existing implementation raises a container Error.
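
A minimal sketch of that strict behavior, assuming the per-key file layout discussed below (./path/disabled); the helper name is illustrative:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseDisabled fails on anything that is not a recognizable boolean,
// so a mistyped value errors the job rather than being silently
// replaced by a default.
func parseDisabled(configMountPath string) (bool, error) {
	raw, err := os.ReadFile(filepath.Join(configMountPath, "disabled"))
	if err != nil {
		return false, err
	}
	disabled, err := strconv.ParseBool(strings.TrimSpace(string(raw)))
	if err != nil {
		return false, fmt.Errorf("invalid value for disabled: %w", err)
	}
	return disabled, nil
}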


func newCmd() *cobra.Command {
var cfg config.Configuration
return &cobra.Command{
Contributor

Do we also want to include the SilenceErrors: true field? https://pkg.go.dev/github.com/spf13/cobra

When encountering an error while running this binary, it would still surface the error message, but avoid printing out the --help prompt for this command.

Contributor Author

This is a great idea!

Contributor

@timflannagan timflannagan Jul 21, 2021

Err, sorry I commented the wrong option - I meant the SilenceUsage: true field.
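
For illustration, the corrected suggestion applied to a command like this PR's (a sketch; only the SilenceUsage field comes from the thread):

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	cmd := &cobra.Command{
		Use:          "collect-profiles configMapName:url",
		SilenceUsage: true, // suppress the --help/usage text on runtime errors
		RunE: func(*cobra.Command, []string) error {
			return fmt.Errorf("example failure") // the error is still printed; usage is not
		},
	}
	_ = cmd.Execute()
}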

Comment thread pkg/profiling/config/config.go Outdated
}

func GetConfig(path string) (*config, error) {
file, err := os.Open(filepath.Join(path, "disabled"))
Contributor

What kind of hierarchy do we expect with the profiling configuration file? I would expect we'd specify something like disable: true in a single file vs. checking the presence of the ./path/disabled file. Maybe this kind of information would be better served in the long description format of the rootCmd structure.

Contributor Author

Sure

Comment thread cmd/collect-profiles/main.go Outdated
Namespace: "openshift-operator-lifecycle-manager",
},
}
err := client.Get(ctx, types.NamespacedName{Namespace: "openshift-operator-lifecycle-manager", Name: "pprof-cert"}, secret)
Contributor

Would it make sense for either of these namespace or secret name values to be configurable, defaulting to the OLM and "pprof-secret" values? If not, these values could live as constants in this package.

Contributor Author

Let's go with constants given that no other project will probably ever use this binary.
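
A sketch of the constants approach (assuming the controller-runtime client that the diff appears to use; the constant names are illustrative, the values come from the diff above):

package profiling

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Constants for the values hard-coded in the diff above.
const (
	olmNamespace    = "openshift-operator-lifecycle-manager"
	pprofSecretName = "pprof-cert"
)

// getPprofSecret fetches the pprof cert Secret using the constants.
func getPprofSecret(ctx context.Context, c client.Client) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	err := c.Get(ctx, types.NamespacedName{Namespace: olmNamespace, Name: pprofSecretName}, secret)
	return secret, err
}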

if err := verifyCertAndKeyExist(certPath, keyPath); err != nil {
klog.Infof("error verifying provided cert and key: %v", err)
klog.Info("generating a new cert and key")
return populateServingCert(cmd.Context(), cfg.Client)
Contributor

Should we be returning here, or just attempting to create/update/etc. this serving cert secret? What happens if this cert/key path doesn't exist? Do we run this binary once, those files are created, and we run this binary again?

Contributor Author

@awgreene awgreene Jul 22, 2021

Should we be returning here, or just attempting to create/update/etc.

We should return here. The CronJob runs every 15 minutes, and mounted secrets are only refreshed every 15 minutes. Even if we update the tls secret, the OLM and Catalog pods will take at least a minute to receive a notification that the mounted files have changed and accept the new cert.

Contributor Author

@awgreene awgreene Jul 22, 2021

What happens if this cert/key path doesn't exist?

If the secret doesn't exist, an error is returned.
If the files are empty (which happens on the first run), this method is called earlier and initializes the values; no error is returned.

Do we run this binary once, those files are created, and we run this binary again?

Correct. First run is just to initialize the secret. This happens once on a cluster. Subsequent runs (even on upgrades) do not require initialization.
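
Pieced together from the zero-byte discussion above, a sketch of what a check like verifyCertAndKeyExist might do (illustrative, not the PR's exact code):

package main

import (
	"fmt"
	"os"
)

// verifyCertAndKeyExist errors if either mounted file is missing or is
// zero bytes (the state of tls.crt/tls.key on the very first run).
func verifyCertAndKeyExist(certPath, keyPath string) error {
	for _, path := range []string{certPath, keyPath} {
		fi, err := os.Stat(path)
		if err != nil {
			return err
		}
		if fi.Size() == 0 {
			return fmt.Errorf("%s is empty", path)
		}
	}
	return nil
}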

Comment thread cmd/collect-profiles/main.go
Comment thread pkg/profiling/config/config.go
@awgreene
Contributor Author

Waiting on #129
/hold

@openshift-ci openshift-ci Bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 22, 2021
@awgreene awgreene force-pushed the continuous-profiling branch from 3a2e00f to c24b792 on July 22, 2021 12:56
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2021
@awgreene awgreene force-pushed the continuous-profiling branch from c24b792 to 85ff1f8 on July 22, 2021 15:18
awgreene and others added 2 commits July 22, 2021 11:54
This commit introduces a cronJob which extracts heap profiles
from the OLM and Catalog Operator deployments via their exposed services.
These heap profiles are then saved in configMaps in the
openshift-operator-lifecycle-manager namespace.

Requests against the aforementioned services are made over HTTPS.
request. The client certificate used by the cronJob is recycled with
each run.

Co-authored-by: Vu Dinh <vudinh@outlook.com>
Co-authored-by: Ben Luddy <bluddy@redhat.com>
Signed-off-by: Alexander Greene <greene.al1991@gmail.com>
@awgreene awgreene force-pushed the continuous-profiling branch from 85ff1f8 to a212166 on July 22, 2021 15:56
@benluddy
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2021
@benluddy
Contributor

/test e2e-gcp

@timflannagan
Contributor

Removing the hold as the downstream port has landed.

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2021
@timflannagan
Contributor

/retest

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit e5ee153 into openshift:master Jul 22, 2021
@lilic
Contributor

lilic commented Aug 10, 2021

Hey 👋 @awgreene since there is not much description here, I am curious whether you measured the performance impact of this on the cluster, as continuous profiling and storing the results in configMaps seems very expensive to me. Or is there a proposal somewhere where I can read more about this and the impact it will have? Thanks!

@awgreene
Contributor Author

awgreene commented Aug 10, 2021

Hello @lilic - It should not be expensive: the solution generates at most two configMaps' worth of data, and configMap size has a hard limit set by the object store (I believe it's 2 MB per configMap, so a maximum of 4 MB).

Continuous profiling was implemented with a CronJob which essentially does three things:

  1. The job attempts to clean up all configMaps it has created (identified with a label) that are not the most recent versions. The job attempts to delete each such configMap and, if any errors are encountered, exits before creating any new configMaps. This prevents the job from creating new configMaps when it has issues deleting old ones, and it leaves the most recent configMap available if the job has any issues creating a new one in step 2.
  2. The job then gathers and stores the new profiling data in a configMap.
  3. The job deletes the previous most recent configMap identified in step 1 (see the sketch below).
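
A sketch of those three steps (assuming a controller-runtime client; the label key is an assumption for illustration, the namespace comes from the thread above):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mostRecent returns the configMap with the newest creationTimestamp.
func mostRecent(items []corev1.ConfigMap) *corev1.ConfigMap {
	newest := &items[0]
	for i := range items {
		if items[i].CreationTimestamp.After(newest.CreationTimestamp.Time) {
			newest = &items[i]
		}
	}
	return newest
}

// collectAndRotate sketches the three steps described above.
func collectAndRotate(ctx context.Context, c client.Client, fresh *corev1.ConfigMap) error {
	// Step 1: find the configMaps this job created, identified by a label,
	// and delete all but the most recent. Abort on any deletion error so
	// new configMaps are never created while old ones are accumulating.
	var owned corev1.ConfigMapList
	if err := c.List(ctx, &owned,
		client.InNamespace("openshift-operator-lifecycle-manager"),
		client.MatchingLabels{"olm.openshift.io/pprof": "true"}, // assumed label
	); err != nil {
		return err
	}
	var latest *corev1.ConfigMap
	if len(owned.Items) > 0 {
		latest = mostRecent(owned.Items)
	}
	for i := range owned.Items {
		if latest != nil && owned.Items[i].Name == latest.Name {
			continue // keep the previous snapshot until the new one exists
		}
		if err := c.Delete(ctx, &owned.Items[i]); err != nil {
			return err
		}
	}
	// Step 2: store the freshly gathered profiling data.
	if err := c.Create(ctx, fresh); err != nil {
		return err
	}
	// Step 3: only now remove the previous most recent configMap.
	if latest != nil {
		return c.Delete(ctx, latest)
	}
	return nil
}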

@awgreene
Contributor Author

awgreene commented Aug 10, 2021

Some additional context @lilic

Historically the OLM and Catalog operators haven't had profiling data available when an issue was submitted. Profiling can only be enabled by:

  • Marking the operator as unmanaged by CVO
  • Modifying the operator's deployment to include a -profiling argument

This meant that in order to collect profiling information a customer would need to:

  • Submit a ticket
  • Have a member of the OLM team request profiling data and mention the steps above
  • The customer performs the steps above, attempts to recreate the incident, and reports back to us
  • OLM checks whether the customer managed to capture the issue in the profiling data

In an effort to improve that workflow, OLM now enables profiling by default and saves the latest successful scrape. ConfigMaps are automatically collected by must-gather.
