
Implement Continuous Profiling #112

Merged
openshift-merge-robot merged 2 commits into openshift:master from awgreene:continuous-profiling
Jul 22, 2021

Conversation

@awgreene
Contributor

@awgreene awgreene commented Jul 1, 2021

No description provided.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2021
@openshift-ci
Contributor

openshift-ci Bot commented Jul 1, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from benluddy and gallettilance July 1, 2021 19:09
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 1, 2021
Comment thread manifests/0000_50_olm_02-services.yaml Outdated
@awgreene awgreene force-pushed the continuous-profiling branch 10 times, most recently from a3e879d to c41b3ad on July 2, 2021 05:01
@awgreene awgreene force-pushed the continuous-profiling branch 2 times, most recently from a784f83 to 904608f on July 9, 2021 18:05
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 14, 2021
@awgreene awgreene force-pushed the continuous-profiling branch from 904608f to f2b8fba on July 14, 2021 18:40
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 14, 2021
@awgreene awgreene force-pushed the continuous-profiling branch 7 times, most recently from c3523ca to 2d5d275 on July 19, 2021 13:15
Comment thread manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
Comment thread operator-lifecycle-manager.Dockerfile Outdated
COPY . .
RUN make build/olm bin/cpb

RUN chmod +x /build/bin/collect-profiles
Contributor

I'm surprised that this would be necessary. What's special about how this binary is being produced?

Contributor Author

I was too. Now that the PR is in a working state, I will attempt to remove it and diagnose why the binary could not be run.

Comment thread manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2021
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2021
Comment thread manifests/0000_50_olm_00-pprof-rbac.yaml
Comment thread cmd/collect-profiles/main.go Outdated
@awgreene awgreene force-pushed the continuous-profiling branch 2 times, most recently from eef9e1e to f413010 on July 20, 2021 18:45
@awgreene
Contributor Author

/retest

Contributor

@timflannagan timflannagan left a comment

Nice, these changes look pretty far along. I'll take another pass later.

I had a couple of comments/questions/etc. before I had to context switch to something else.

I'll play around with this work locally later.

Comment thread cmd/collect-profiles/main.go
Comment thread cmd/collect-profiles/main.go Outdated
var cfg config.Configuration
return &cobra.Command{
Use: "collect-profiles configMapName:url",
Short: "Retrieves the pprof data from a UNL and stores it in a configMap.",
Contributor

s/UNL/URL

Comment thread cmd/collect-profiles/main.go Outdated
Short: "Retrieves the pprof data from a UNL and stores it in a configMap.",
Long: `The collect-profiles command makes https requests against pprof URLs
provided as arguments and stores that information in immutable configMaps.`,
Version: "0.0.1",
Contributor

I don't see any precedent for versioning OLM binaries using this flag. Any reason we would want to include this flag going forward?

Contributor Author

Let's drop it then!

Comment thread cmd/collect-profiles/main.go Outdated
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"k8s.io/klog/v2"
Contributor

Would it be possible to avoid vendoring another logging client and use one of the existing ones? IIRC logrus should at least be available.

Contributor Author

Sure

Contributor Author

I like klog though :(

Contributor

On a recheck of these changes, it looks like klog is already present, and it's only being marked as an explicit dependency in the go.sum now. I don't have a strong opinion and was namely interested in reducing the vendor surface area, so it should be fine to continue using this package.

if err != nil {
return err
}
if fi.Size() == 0 {
Contributor

Trying to wrap my head around when a file would exist but contain zero bytes. A $ touch <file> kind of operation?

Contributor Author

@awgreene awgreene Jul 21, 2021

It's a bit of a chicken-and-egg problem.

The service-ca operator cannot generate certs for use with client authentication. We elected to have the CronJob generate a self-signed cert and rotate it with each run. The tls secret containing the self-signed cert must be mounted to the pod, so it has to exist before the pod can populate it. You cannot omit the tls.key or tls.crt values from the data of a TLS Secret, so they had to be set to "". The initial run of the CronJob will therefore see two empty files named tls.key and tls.crt.

The entire point of the above is to handle that initial run: update the secret and exit gracefully. The OLM and Catalog Operator Deployments will then notice the updated secret and use the new cert to validate client requests, allowing subsequent CronJob runs to use that secret to retrieve the pprof data.
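
As an illustration of the Secret shape described above (in the shipped manifests the placeholder is created by CVO from YAML, not by code; the client-go call and helper name here are assumptions):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPlaceholderTLSSecret is a hypothetical helper showing the
// placeholder TLS Secret: kubernetes.io/tls Secrets must carry both
// tls.crt and tls.key, so they start as "" and are mounted into the
// pod as zero-byte files until the CronJob's first run populates them.
func createPlaceholderTLSSecret(ctx context.Context, client kubernetes.Interface) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "pprof-cert", // name taken from the review thread below
			Namespace: "openshift-operator-lifecycle-manager",
		},
		Type: corev1.SecretTypeTLS,
		Data: map[string][]byte{
			corev1.TLSCertKey:       {}, // "tls.crt"
			corev1.TLSPrivateKeyKey: {}, // "tls.key"
		},
	}
	_, err := client.CoreV1().Secrets(secret.Namespace).Create(ctx, secret, metav1.CreateOptions{})
	return err
}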

Comment on lines +23 to +25
restConfig, err := rest.InClusterConfig()
if err != nil {
return err
}
Contributor

Any hesitance towards always relying on the in-cluster config? The only immediate concern would be attempting to develop locally and running the binary from a remote host, which should error/produce a panic if it cannot load the in-cluster configuration. The latter can be implemented as a follow-up later down the line, but trying to wrap my head around whether there are similar situations to look out for here.

Contributor Author

It seems reasonable if people decide to use this binary outside of containers, but I'd like to see this work tacked on post code freeze.
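
If that follow-up happens, one common pattern is to fall back to a local kubeconfig when the in-cluster config is unavailable. A sketch only; the fallback is an assumption, not part of this PR, which uses rest.InClusterConfig() exclusively:

package main

import (
	"os"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// loadRESTConfig prefers the in-cluster config and, for local
// development only, falls back to the kubeconfig named by $KUBECONFIG.
func loadRESTConfig() (*rest.Config, error) {
	if cfg, err := rest.InClusterConfig(); err == nil {
		return cfg, nil
	}
	return clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
}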

Comment on lines +77 to +91
PersistentPreRunE: func(*cobra.Command, []string) error {
return cfg.Load()
},
Contributor

Had a quick question as to why we couldn't bake this into the RunE handler. Does it affect startup performance in the latter case?

Contributor Author

I simply thought that retrieving the KUBECONFIG outside of the command logic made more sense.
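
For reference, a minimal cobra sketch (illustrative, not the PR's code) showing the ordering guarantee that makes PersistentPreRunE a natural home for config loading: it runs before RunE, and before any subcommand's RunE as well:

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	var loaded bool
	cmd := &cobra.Command{
		Use: "collect-profiles",
		// Runs before RunE (and before any subcommand's RunE), so the
		// command logic can assume the configuration is already loaded.
		PersistentPreRunE: func(*cobra.Command, []string) error {
			loaded = true // stand-in for cfg.Load()
			return nil
		},
		RunE: func(*cobra.Command, []string) error {
			fmt.Println("config loaded:", loaded) // prints: config loaded: true
			return nil
		},
	}
	_ = cmd.Execute()
}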

Comment thread cmd/collect-profiles/main.go
return nil
}

jobConfig, err := config.GetConfig(configMountPath)
Contributor

What happens when the configMountPath variable (or even the default /etc/config location) doesn't exist? It seems like it would produce an error attempting to open that file when reading the code in the config package. Is that the expected behavior we'd like, or would we rather handle that error and attempt to create that path for them?

Contributor Author

So when shipped with OpenShift, using the manifests generated by make manifests, CVO will always ensure that the default jobConfig configMap exists. If a user chose to delete this configMap, it would be restored by a subsequent sync.

Contributor Author

It seems reasonable to return a default jobConfig if the file isn't found.

Contributor Author

@awgreene awgreene Jul 22, 2021

On second thought, it seems better for the job to exit if a user provides an invalid configuration.

Consider the only configuration available to users, disabled = true. When this field is set to true, the container "completes" and exits gracefully. If a user "disabled" the job with an invalid value (i.e., false instead of False), defaulting would cause the container to collect the profiles and exit gracefully without the user noticing that the job had not been disabled. Alternatively, the existing implementation raises a container Error.
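
A minimal sketch of that strict behavior, assuming the per-key file layout discussed below (./path/disabled); the helper name is illustrative:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseDisabled fails on anything that is not a recognizable boolean,
// so a mistyped value errors the job rather than being silently
// replaced by a default.
func parseDisabled(configMountPath string) (bool, error) {
	raw, err := os.ReadFile(filepath.Join(configMountPath, "disabled"))
	if err != nil {
		return false, err
	}
	disabled, err := strconv.ParseBool(strings.TrimSpace(string(raw)))
	if err != nil {
		return false, fmt.Errorf("invalid value for disabled: %w", err)
	}
	return disabled, nil
}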


func newCmd() *cobra.Command {
var cfg config.Configuration
return &cobra.Command{
Contributor

Do we also want to include the SilenceErrors: true field? https://pkg.go.dev/github.com/spf13/cobra

When encountering an error while running this binary, it would still surface the error message, but avoid printing out the --help prompt for this command.

Contributor Author

This is a great idea!

Contributor

@timflannagan timflannagan Jul 21, 2021

Err, sorry I commented the wrong option - I meant the SilenceUsage: true field.
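
For illustration, the corrected suggestion applied to a command like this PR's (a sketch; only the SilenceUsage field comes from the thread):

package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	cmd := &cobra.Command{
		Use:          "collect-profiles configMapName:url",
		SilenceUsage: true, // suppress the --help/usage text on runtime errors
		RunE: func(*cobra.Command, []string) error {
			return fmt.Errorf("example failure") // the error is still printed; usage is not
		},
	}
	_ = cmd.Execute()
}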

Comment thread pkg/profiling/config/config.go Outdated
}

func GetConfig(path string) (*config, error) {
file, err := os.Open(filepath.Join(path, "disabled"))
Contributor

What kind of hierarchy do we expect with the profiling configuration file? I would expect we'd specify something like disable: true in a single file vs. checking the presence of the ./path/disabled file. Maybe this kind of information would be better served in the long description format of the rootCmd structure.

Contributor Author

Sure

Comment thread cmd/collect-profiles/main.go Outdated
Namespace: "openshift-operator-lifecycle-manager",
},
}
err := client.Get(ctx, types.NamespacedName{Namespace: "openshift-operator-lifecycle-manager", Name: "pprof-cert"}, secret)
Contributor

Would it make sense for either of these namespace or secret name values to be configurable, defaulting to the OLM and "pprof-secret" values? If not, these values could live as constants in this package.

Contributor Author

Let's go with constants given that no other project will probably ever use this binary.
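
A sketch of the constants approach (assuming the controller-runtime client that the diff appears to use; the constant names are illustrative, the values come from the diff above):

package profiling

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Constants for the values hard-coded in the diff above.
const (
	olmNamespace    = "openshift-operator-lifecycle-manager"
	pprofSecretName = "pprof-cert"
)

// getPprofSecret fetches the pprof cert Secret using the constants.
func getPprofSecret(ctx context.Context, c client.Client) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	err := c.Get(ctx, types.NamespacedName{Namespace: olmNamespace, Name: pprofSecretName}, secret)
	return secret, err
}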

if err := verifyCertAndKeyExist(certPath, keyPath); err != nil {
klog.Infof("error verifying provided cert and key: %v", err)
klog.Info("generating a new cert and key")
return populateServingCert(cmd.Context(), cfg.Client)
Contributor

Should we be returning here, or just attempting to create/update/etc. this serving cert secret? What happens if this cert/key path doesn't exist? Do we run this binary once, those files are created, and we run this binary again?

Contributor Author

@awgreene awgreene Jul 22, 2021

Should we be returning here, or just attempting to create/update/etc.

We should return here. The CronJob runs every 15 minutes, and mounted secrets are only refreshed every 15 minutes. Even if we update the tls secret, the OLM and Catalog pods will take at least a minute to receive a notification that the mounted files have changed and accept the new cert.

Contributor Author

@awgreene awgreene Jul 22, 2021

What happens if this cert/key path doesn't exist?

If the secret doesn't exist, an error is returned.
If the files are empty (which happens on the first run), this method is called earlier and initializes the values; no error is returned.

Do we run this binary once, those files are created, and we run this binary again?

Correct. First run is just to initialize the secret. This happens once on a cluster. Subsequent runs (even on upgrades) do not require initialization.
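
Pieced together from the zero-byte discussion above, a sketch of what a check like verifyCertAndKeyExist might do (illustrative, not the PR's exact code):

package main

import (
	"fmt"
	"os"
)

// verifyCertAndKeyExist errors if either mounted file is missing or is
// zero bytes (the state of tls.crt/tls.key on the very first run).
func verifyCertAndKeyExist(certPath, keyPath string) error {
	for _, path := range []string{certPath, keyPath} {
		fi, err := os.Stat(path)
		if err != nil {
			return err
		}
		if fi.Size() == 0 {
			return fmt.Errorf("%s is empty", path)
		}
	}
	return nil
}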

Comment thread cmd/collect-profiles/main.go
Comment thread pkg/profiling/config/config.go
@awgreene
Contributor Author

Waiting on #129
/hold

@openshift-ci openshift-ci Bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 22, 2021
@awgreene awgreene force-pushed the continuous-profiling branch from 3a2e00f to c24b792 on July 22, 2021 12:56
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2021
@awgreene awgreene force-pushed the continuous-profiling branch from c24b792 to 85ff1f8 on July 22, 2021 15:18
awgreene and others added 2 commits July 22, 2021 11:54
This commit introduces a cronJob which extracts heap profiles
from the OLM and Catalog Operator deployments via their exposed services.
These heap profiles are then saved in configMaps in the
openshift-operator-lifecycle-manager namespace.

Requests against the aforementioned services are made over HTTPS.
request. The client certificate used by the cronJob is recycled with
each run.

Co-authored-by: Vu Dinh <vudinh@outlook.com>
Co-authored-by: Ben Luddy <bluddy@redhat.com>
Signed-off-by: Alexander Greene <greene.al1991@gmail.com>
@awgreene awgreene force-pushed the continuous-profiling branch from 85ff1f8 to a212166 on July 22, 2021 15:56
@benluddy
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2021
@benluddy
Contributor

/test e2e-gcp

@timflannagan
Contributor

Removing the hold as the downstream port has landed.

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2021
@timflannagan
Contributor

/retest

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit e5ee153 into openshift:master Jul 22, 2021
@lilic
Contributor

lilic commented Aug 10, 2021

Hey 👋 @awgreene since there is not much description here, I am curious whether you measured the performance impact of this on the cluster, as continuous profiling and storing the results in configMaps seems very expensive to me. Or is there a proposal somewhere where I can read more about this and the impact it will have? Thanks!

@awgreene
Contributor Author

awgreene commented Aug 10, 2021

Hello @lilic - It should not be expensive: the solution generates at most two configMaps' worth of data, and configMap size has a hard limit set by the object store (I believe it's 2 MB per configMap, so a maximum of 4 MB).

Continuous profiling was implemented with a CronJob which essentially does three things:

  1. The job attempts to clean up all configMaps it has created (identified with a label) that are not the most recent versions. The job attempts to delete each such configMap and, if any errors are encountered, exits before creating any new configMaps. This prevents the job from creating new configMaps when it has issues deleting old ones, and it leaves the most recent configMap available if the job has any issues creating a new one in step 2.
  2. The job then gathers and stores the new profiling data in a configMap.
  3. The job deletes the previous most recent configMap identified in step 1 (see the sketch below).
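
A sketch of those three steps (assuming a controller-runtime client; the label key is an assumption for illustration, the namespace comes from the thread above):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mostRecent returns the configMap with the newest creationTimestamp.
func mostRecent(items []corev1.ConfigMap) *corev1.ConfigMap {
	newest := &items[0]
	for i := range items {
		if items[i].CreationTimestamp.After(newest.CreationTimestamp.Time) {
			newest = &items[i]
		}
	}
	return newest
}

// collectAndRotate sketches the three steps described above.
func collectAndRotate(ctx context.Context, c client.Client, fresh *corev1.ConfigMap) error {
	// Step 1: find the configMaps this job created, identified by a label,
	// and delete all but the most recent. Abort on any deletion error so
	// new configMaps are never created while old ones are accumulating.
	var owned corev1.ConfigMapList
	if err := c.List(ctx, &owned,
		client.InNamespace("openshift-operator-lifecycle-manager"),
		client.MatchingLabels{"olm.openshift.io/pprof": "true"}, // assumed label
	); err != nil {
		return err
	}
	var latest *corev1.ConfigMap
	if len(owned.Items) > 0 {
		latest = mostRecent(owned.Items)
	}
	for i := range owned.Items {
		if latest != nil && owned.Items[i].Name == latest.Name {
			continue // keep the previous snapshot until the new one exists
		}
		if err := c.Delete(ctx, &owned.Items[i]); err != nil {
			return err
		}
	}
	// Step 2: store the freshly gathered profiling data.
	if err := c.Create(ctx, fresh); err != nil {
		return err
	}
	// Step 3: only now remove the previous most recent configMap.
	if latest != nil {
		return c.Delete(ctx, latest)
	}
	return nil
}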

@awgreene
Contributor Author

awgreene commented Aug 10, 2021

Some additional context @lilic

Historically the OLM and Catalog operators haven't had profiling data available when an issue was submitted. Profiling can only be enabled by:

  • Marking the operator as unmanaged by CVO
  • Modifying the operator's deployment to include a -profiling argument

This meant that in order to collect profiling information a customer would need to:

  • Submit a ticket
  • Have a member of the OLM team request profiling data and mention the steps above
  • The customer performs the steps above, attempts to recreate the incident, and reports back to us
  • OLM checks whether the customer managed to capture the issue in the profiling data

In an effort to improve that workflow, OLM now enables profiling by default and saves the latest successful scrape. ConfigMaps are automatically collected by must-gather.
