Implement Continuous Profiling #112
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: awgreene.
| COPY . . |
| RUN make build/olm bin/cpb |
| RUN chmod +x /build/bin/collect-profiles |
I'm surprised that this would be necessary. What's special about how this binary is being produced?
I was too. Now that the PR is in a working state, I will attempt to remove it and diagnose why the binary could not be run.
/retest
timflannagan left a comment
Nice, these changes look pretty far along. I'll take another pass later.
I had a couple of comments/questions/etc. before I had to context switch to something else.
I'll play around with this work locally later.
| var cfg config.Configuration |
| return &cobra.Command{ |
| Use: "collect-profiles configMapName:url", |
| Short: "Retrieves the pprof data from a URL and stores it in a configMap.", |
| Long: `The collect-profiles command makes https requests against pprof URLs |
| provided as arguments and stores that information in immutable configMaps.`, |
| Version: "0.0.1", |
I don't see any precedent for versioning OLM binaries using this flag. Any reason we would want to include this flag going forward?
Let's drop it then!
| corev1 "k8s.io/api/core/v1" |
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" |
| "k8s.io/apimachinery/pkg/types" |
| "k8s.io/klog/v2" |
Would it be possible to avoid vendoring another logging client and use one of the existing ones? IIRC logrus should at least be available.
I like klog though :(
On a recheck of these changes, it looks like klog is already present, and it's only being marked as an explicit dependency in the go.sum now. I don't have a strong opinion and was mainly interested in reducing the vendor surface area, so it should be fine to continue using this package.
| if err != nil { |
| return err |
| } |
| if fi.Size() == 0 { |
Trying to wrap my head around when a file would exist but contain zero bytes. A $ touch <file> kind of operation?
It's a bit of a chicken-and-egg problem.
The service-ca operator cannot generate certs for use with client authentication. We elected to have the CronJob generate a self-signed cert and rotate it with each run. The tls secret containing the self-signed cert must be mounted to the pod, so it has to exist before the pod can populate it. You cannot omit the tls.key or tls.crt values from the data of a TLS Secret, so they had to be set to "". The initial run of the CronJob will therefore see two empty files named tls.key and tls.crt.
The entire point of the above is to handle that initial run: update the secret and exit gracefully. The OLM/Catalog Operator Deployments will then notice the updated secret and use the new cert to validate client requests, allowing subsequent CronJob runs to use that secret to retrieve the pprof data.
| restConfig, err := rest.InClusterConfig() |
| if err != nil { |
| return err |
| } |
Any hesitance towards always relying on the in-cluster config? The only immediate concern would be attempting to develop locally and running the binary from a remote host, which should error/produce a panic if it cannot load the in-cluster configuration. The latter can be implemented as a follow-up later down the line, but trying to wrap my head around whether there are similar situations to look out for here.
It seems reasonable if people decide to use this binary outside of containers, but I'd like to see this work tacked on post code freeze.
| PersistentPreRunE: func(*cobra.Command, []string) error { |
| return cfg.Load() |
| }, |
Had a quick question as to why we wouldn't be able to bake this into the RunE handler. Affects startup performance in the latter case?
I simply thought that retrieving the KUBECONFIG outside of the command logic made more sense.
| return nil |
| } |
| jobConfig, err := config.GetConfig(configMountPath) |
What happens when the configMountPath variable (or even the default /etc/config location) doesn't exist? It seems like it would produce an error attempting to open that file when reading the code in the config package. Is that the expected behavior we'd like, or would we rather handle that error and attempt to create that path for them?
So when shipped with openshift and using the manifests generated by make manifests, CVO will always ensure that the default jobConfig configMap exists. If a user chose to delete this map, it would normalize with a subsequent run.
It seems reasonable to return a default jobConfig if the file isn't found.
On second thought, it seems better for the job to exit if a user provides an invalid configuration.
Consider the only configuration available to users, disabled = true. When this field is set to true, the container "completes" and exits gracefully. If a user "disabled" the job incorrectly (i.e., false instead of False), falling back to default values would mean the container collects the profiles and exits gracefully without the user noticing that the job had not been disabled. Alternatively, the existing implementation raises a container Error.
| func newCmd() *cobra.Command { |
| var cfg config.Configuration |
| return &cobra.Command{ |
Do we also want to include the SilenceErrors: true field? https://pkg.go.dev/github.com/spf13/cobra
When encountering an error while running this binary, it would still surface the error message, but avoid printing out the --help prompt for this command.
This is a great idea!
Err, sorry I commented the wrong option - I meant the SilenceUsage: true field.
| } |
| func GetConfig(path string) (*config, error) { |
| file, err := os.Open(filepath.Join(path, "disabled")) |
What kind of hierarchy do we expect with the profiling configuration file? I would expect we'd specify something like disable: true in a single file vs. checking the presence of the ./path/disabled file. Maybe this kind of information would be better served in the long description format of the rootCmd structure.
| Namespace: "openshift-operator-lifecycle-manager", |
| }, |
| } |
| err := client.Get(ctx, types.NamespacedName{Namespace: "openshift-operator-lifecycle-manager", Name: "pprof-cert"}, secret) |
Would it make sense for either of these namespace or secret name values to be configurable, defaulting to the OLM and "pprof-secret" values? If not, these values could live as constants in this package.
Let's go with constants given that no other project will probably ever use this binary.
| if err := verifyCertAndKeyExist(certPath, keyPath); err != nil { |
| klog.Infof("error verifying provided cert and key: %v", err) |
| klog.Info("generating a new cert and key") |
| return populateServingCert(cmd.Context(), cfg.Client) |
Should we be returning here, or just attempting to create/update/etc. this serving cert secret? What happens if this cert/key path doesn't exist? Do we run this binary once, those files are created, and we run this binary again?
Should we be returning here, or just attempting to create/update/etc.
We should return here. The CronJob runs every 15 minutes. Mounted secrets are only refreshed every 15 minutes. Even if we update the tls secret, the OLM and Catalog pods will take at least a minute to receive a notification that the mounted files have changed and accept the new cert.
What happens if this cert/key path doesn't exist?
If the secret doesn't exist an error is returned.
If the files are empty (which happens on the first run) this method is called earlier and initializes the values. No error is returned.
Do we run this binary once, those files are created, and we run this binary again?
Correct. First run is just to initialize the secret. This happens once on a cluster. Subsequent runs (even on upgrades) do not require initialization.
Waiting on #129
This commit introduces a CronJob which extracts heap profiles from the OLM and Catalog Operator deployments via exposed services. These heap profiles are then saved in configMaps in the openshift-operator-lifecycle-manager namespace. Requests against the aforementioned services are made over HTTPS. The client certificate used by the CronJob is recycled with each run.
Co-authored-by: Vu Dinh <vudinh@outlook.com>
Co-authored-by: Ben Luddy <bluddy@redhat.com>
Signed-off-by: Alexander Greene <greene.al1991@gmail.com>
/lgtm

/test e2e-gcp

Removing the hold as the downstream port has landed. /hold cancel

/retest

/retest-required Please review the full test history for this PR and help us cut down flakes.
Hey 👋 @awgreene since there is not much description here, I am curious whether you did any performance-impact testing on the cluster, as continuous profiling and storing to configMaps seems very expensive to me. Or is there a proposal for this somewhere where I can read more about it and the impact it will have? Thanks!
Hello @lilic - It should not be expensive, as the solution generates at most 2 configMaps worth of data, which has a hard limit set by the object store (I believe it's 2 MB per configMap, so a maximum of 4 MB). Continuous profiling was implemented with a CronJob which essentially does three things:
Some additional context @lilic: Historically the OLM and Catalog operators haven't had profiling data available when an issue was submitted. Profiling can only be enabled by:
This meant that in order to collect profiling information a customer would need to:
In an effort to improve that workflow, OLM now enables profiling by default and saves the latest successful scrape. ConfigMaps are automatically collected by must-gather.