
run bundle: Verify operator is installed/upgraded #3766

Closed
bharathi-tenneti wants to merge 0 commits into operator-framework:master from bharathi-tenneti:runbundle-verify

Conversation

@bharathi-tenneti (Contributor)

Description:
Verification logic will verify that the operator is installed beginning with the starting CSV and upgraded through the ending CSV.
Log what happened and the final state (e.g. the operator is running in namespace foo, watching namespace bar).
Use the CSV status and deployment status to verify the operator.
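The flow described above can be sketched roughly as follows. This is not the PR's actual implementation: `csvPhase`, `verifyInstalled`, and `fetch` are simplified stand-ins for the real OLM `v1alpha1` phase type and a client-go status lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// csvPhase is a simplified stand-in for OLM's
// v1alpha1.ClusterServiceVersionPhase.
type csvPhase string

const (
	phaseInstalling csvPhase = "Installing"
	phaseSucceeded  csvPhase = "Succeeded"
	phaseFailed     csvPhase = "Failed"
)

// verifyInstalled repeatedly asks fetch for the current CSV phase and
// reports the final state. fetch is a hypothetical helper standing in
// for a client-go lookup of the CSV's status.
func verifyInstalled(fetch func() (csvPhase, error), ns string) error {
	for {
		phase, err := fetch()
		if err != nil {
			return fmt.Errorf("getting CSV phase: %v", err)
		}
		switch phase {
		case phaseSucceeded:
			// Log what happened and the final state.
			fmt.Printf("operator is running in namespace %s\n", ns)
			return nil
		case phaseFailed:
			return errors.New("CSV reached phase Failed")
		}
	}
}
```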

@bharathi-tenneti bharathi-tenneti changed the title run bundle: Verify operator is installed/upgraded <WIP> run bundle: Verify operator is installed/upgraded Aug 20, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2020
}

if csvKey.Name == "" {
return v1alpha1.InstallPlanFailed, fmt.Errorf("could not find installed CSV in install plan")
Contributor

Seems like this should return the CSV phase, not the InstallPlan's.

return true, nil
}
return false, nil
}, ctx.Done()); err != nil {
Contributor

I'm confused here: when we fail to get the CSV, why do we still need to check the Deployment? Based on my understanding, if both the CSV and the deployment work well, we can say the operator installed successfully; if either of them fails, we can say the operator failed to install.

Contributor Author

As I understand it, this is to get the deployment's reason for the failure.
@joelanford Is that what you were suggesting?

Member

The idea is that if the CSV never succeeds, we should try to be helpful and tell a user why the CSV didn't succeed.

One common reason is that the operator deployment hasn't rolled out.

I think we should only check the deployment status if the error from the wait.PollImmediateUntil call is context.DeadlineExceeded.

Contributor

Thank you! I see now; yeah, we should try to return the helpful info. Besides the Deployment object, RBAC policy (Role, ClusterRole, RoleBinding, ...) leads to many failures.

@jianzhangbjz (Contributor)

@bharathi-tenneti Could you help add the unit test or e2e test for this feature? Thanks!

@camilamacedo86 (Contributor)

Hi @bharathi-tenneti,

@bharathi-tenneti Could you help add the unit test or e2e test for this feature? Thanks!
+1

IMO, we ought to call the command:

  • here, to ensure that it works for Ansible.
  • here, to ensure that it works for Helm.
  • here, to ensure that it works for Go.

@@ -17,9 +17,13 @@ package internal
import (
Contributor

A changelog fragment is missing, since this is an addition of a new command.

@bharathi-tenneti (Contributor Author)

@bharathi-tenneti Could you help add the unit test or e2e test for this feature? Thanks!

e2e tests and unit tests are coming in this sprint.

@bharathi-tenneti bharathi-tenneti changed the title <WIP> run bundle: Verify operator is installed/upgraded run bundle: Verify operator is installed/upgraded Aug 26, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2020
Comment on lines +125 to +129
for _, s := range dep.Status.Conditions {
if s.Type != "Progressing" && s.Type != "Available" {
return nil, fmt.Errorf("Deployment failed: %v", s.Message)
}
}
Member

I ran a quick test by creating a deployment with an image I know doesn't exist. The deployment conditions I see are:

  conditions:
  - lastTransitionTime: "2020-08-26T17:11:29Z"
    lastUpdateTime: "2020-08-26T17:11:29Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2020-08-26T17:11:29Z"
    lastUpdateTime: "2020-08-26T17:11:29Z"
    message: ReplicaSet "test-dep-97bfdcd79" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing

And the underlying pod has this:

  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-08-26T17:11:29Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-08-26T17:11:29Z"
    message: 'containers with unready status: [test-fake-jwl-image]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-08-26T17:11:29Z"
    message: 'containers with unready status: [test-fake-jwl-image]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-08-26T17:11:29Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: test-fake-jwl-image
    imageID: ""
    lastState: {}
    name: test-fake-jwl-image
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "test-fake-jlanford-image"
        reason: ImagePullBackOff

We probably need to be checking all of the deployment conditions and looking for any of them that are indicative of failure. And then if we find one, perhaps we could try digging further, such that we eventually end up being able to tell the user something like:

failed to run operator: timed out waiting for csv success: deployment unavailable: pod is Pending with ImagePullBackOff: Back-off pulling image "test-fake-jlanford-image"
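The "dig deeper" idea above could be sketched like this. The `deploymentCondition` and `containerWaiting` types are simplified stand-ins for `appsv1.DeploymentCondition` and the waiting state in `corev1.ContainerStatus` (field names mirror the real types, but this is not the PR's code): scan all conditions for one indicating failure, then dig into container waiting states to build the user-facing message.

```go
package main

import "fmt"

// Simplified stand-in for appsv1.DeploymentCondition.
type deploymentCondition struct {
	Type, Status, Reason, Message string
}

// Simplified stand-in for corev1.ContainerStateWaiting.
type containerWaiting struct {
	Reason, Message string
}

// rootCause looks for a deployment condition indicative of failure
// (Available=False) and, if found, digs into container waiting states
// (e.g. ImagePullBackOff) to report why the pod is stuck.
func rootCause(conds []deploymentCondition, waiting []containerWaiting) string {
	for _, c := range conds {
		if c.Type == "Available" && c.Status == "False" {
			for _, w := range waiting {
				if w.Reason != "" {
					return fmt.Sprintf("deployment unavailable: pod is Pending with %s: %s",
						w.Reason, w.Message)
				}
			}
			// No container-level detail; fall back to the
			// condition's own message.
			return fmt.Sprintf("deployment unavailable: %s", c.Message)
		}
	}
	return ""
}
```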

@bharathi-tenneti (Contributor Author), Aug 27, 2020

@joelanford So, DeploymentConditions of Progressing and Available are not considered failures. In the current code, we consider it a failure if the condition is anything else:

		for _, s := range dep.Status.Conditions {
			if s.Type != "Progressing" && s.Type != "Available" {
				return nil, fmt.Errorf("Deployment failed: %v", s.Message)
			}
		}
Currently, I am returning the Message field in that case. However, as you suggested, to dig deeper we need to look into the pods and grab the error message from there. Is that what you meant?

Member

So, DeploymentConditions of Progressing and Available are not considered failures

We have to look at the condition status too. For example, Available = False and Progressing = False mean something's not working correctly.

return nil, fmt.Errorf("Deployment failed: %v", s.Message)
}
}
}
Member

If the error is not nil, but we were unable to determine a root cause, we should still return the error, possibly wrapped with some extra context if we know anything else about it.

Comment on lines +97 to +103
csvKey := types.NamespacedName{
Namespace: o.cfg.Namespace,
}

if csvKey.Name == "" {
return nil, fmt.Errorf("could not find installed CSV in install plan")
}
Member

This will always result in the error being returned right? Should we pass the install plan to this function so we can find the CSV key?
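Passing the install plan in could look like the sketch below. `installPlanSpec` is a stand-in for the relevant part of OLM's `v1alpha1.InstallPlanSpec`, whose `ClusterServiceVersionNames` field lists the CSVs the plan installs; the single-CSV assumption is mine, not the PR's.

```go
package main

import "fmt"

// installPlanSpec is a simplified stand-in for v1alpha1.InstallPlanSpec.
type installPlanSpec struct {
	ClusterServiceVersionNames []string
}

// csvKeyFromPlan derives the CSV key from the install plan itself, so
// the "could not find installed CSV" branch is reachable only when the
// plan genuinely lists no CSVs. Assumes a single CSV per plan, as is
// the case when running one bundle.
func csvKeyFromPlan(spec installPlanSpec, namespace string) (name, ns string, err error) {
	if len(spec.ClusterServiceVersionNames) == 0 {
		return "", "", fmt.Errorf("could not find installed CSV in install plan")
	}
	return spec.ClusterServiceVersionNames[0], namespace, nil
}
```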

Contributor Author

Moved the csvKey logic into the install plan function.

@openshift-ci-robot

@bharathi-tenneti: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2020
@bharathi-tenneti bharathi-tenneti deleted the runbundle-verify branch August 28, 2020 18:50

5 participants