
Conversation

@benjaminapetersen (Contributor) commented Feb 15, 2019

Based on #139

  • Squashed various fixes
  • Refactored into cleaner function calls
  • Renumbered manifests to ensure the clusteroperator/console is created after the operator. This should avoid a race that may cause the clusteroperator/console to report failure status simply because the operator does not yet exist.

Some screenshots:
screen shot 2019-02-19 at 3 49 03 pm
screen shot 2019-02-19 at 11 28 34 am
screen shot 2019-02-19 at 11 28 10 am

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 15, 2019
operatorsv1 "github.com/openshift/api/operator/v1"
"github.com/openshift/library-go/pkg/operator/v1helpers"
)

@benjaminapetersen (Contributor Author):

Detailing out the purpose behind the status/conditions here to ensure we get it right.

@benjaminapetersen (Contributor Author)

/retest

@benjaminapetersen (Contributor Author)

I think tests are frozen right now on:

level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 3.11.0-673-gadf12809-dirty because: Get https://172.30.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/controllerconfigs.machineconfiguration.openshift.io: dial tcp 172.30.0.1:443: connect: connection refused"

@benjaminapetersen (Contributor Author)

/retest

@benjaminapetersen (Contributor Author)

level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed to resync 3.11.0-676-g745693cd-dirty because: error syncing: request declared a Content-Length of 483 but only wrote 0 bytes"

@benjaminapetersen (Contributor Author)

/retest

@benjaminapetersen (Contributor Author) commented Feb 18, 2019

level=fatal msg="failed to initialize the cluster: Cluster operator network has not yet reported success"

level=fatal msg="failed to initialize the cluster: Cluster operator console has not yet reported success"

The network operator failed in one test, the console operator in the other. Not sure if these are flakes or if they are related.

The console and console-operator pods have no logs; apparently they never came up.

@benjaminapetersen (Contributor Author)

/retest

@benjaminapetersen (Contributor Author)

No console-operator logs at all on that run.

@benjaminapetersen (Contributor Author)

/retest

2 similar comments
@benjaminapetersen (Contributor Author)

/retest

@spadgett (Member)

/retest

@benjaminapetersen (Contributor Author)

error: unable to read image registry.svc.ci.openshift.org/ci-op-8yj29n81/stable@sha256:0d6c76c0d202665f7a16b899f4c62d94c9ac8a14c7540c841a8e802a91775253: received unexpected HTTP status: 504 Gateway Time-out

@benjaminapetersen (Contributor Author)

/retest

5 similar comments (/retest from @benjaminapetersen, Contributor Author)

@benjaminapetersen benjaminapetersen force-pushed the operator-status-revisions-squash branch from 67b79d7 to 09284a8 Compare February 19, 2019 16:39
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 19, 2019
@benjaminapetersen (Contributor Author)

/assign @zherman0 @jhadvig @spadgett

I'd welcome some feedback. If the tests finally pass at some point, I'd rather not have to make changes and restart the whole process 😄

@benjaminapetersen (Contributor Author)

rebased

// To use when another more specific status function is not sufficient.
// examples:
// setStatusCondition(operatorConfig, Failing, True, "SyncLoopError", "Sync loop failed to complete successfully")
func (c *consoleOperator) SetStatusCondition(operatorConfig *operatorsv1.Console, conditionType string, conditionStatus operatorsv1.ConditionStatus, conditionReason string, conditionMessage string) *operatorsv1.Console {
Member:

I don't see how this is better than calling v1helpers.SetOperatorCondition directly. It actually seems worse to me, since it would be easy to mix up the order of the arguments.

Member:

I also don't see where this is used.

Contributor Author:

It's not used at this point; it's been factored out. I'll remove it.

Contributor Author:

It was an initial step to condense the inline noise:

if aBadThing {
	logrus.Errorf("Bad things are happening")
	// logic...
	v1helpers.SetOperatorCondition(&operatorConfig.Status.Conditions, operatorsv1.OperatorCondition{
		Type:               BadType,
		Status:             BadStatus,
		Reason:             "ABadThing",
		Message:            "a bad thing",
		LastTransitionTime: metav1.Now(),
	})
	// ...and again for each additional condition
	// do other logic
}
// down to a one-liner
setStatusCondition(operatorConfig, Failing, True, "SyncLoopError", "Sync loop failed to complete successfully")

That said, I agree, it was still too many things to take care of. That's when I moved to

// one condition or 3, still one line... not 7, 14, 21, etc.
co.ConditionABadThing(operatorConfig)

@benjaminapetersen (Contributor Author) commented Feb 21, 2019:

The right way to do this is probably:

// operator.go ideally could deal with all the status stuff when it calls sync, rather than having it
// scattered across multiple files:
structuredThing, err := sync_v400()
// then we could handle it all in one place in operator.go:
handleConditions(structuredThing, err)

Tech debt...

v1helpers.SetOperatorCondition(&operatorConfig.Status.Conditions, operatorsv1.OperatorCondition{
	Type:               operatorsv1.OperatorStatusTypeFailing,
	Status:             operatorsv1.ConditionFalse,
	LastTransitionTime: metav1.Now(),
Member:

Ideally we'd add messages to all of these, but OK as a follow-on.

Contributor Author:

Failing: False is the desired state, along with Progressing: False and Available: True. I assumed that in the desired state no other information should be needed, but I'm open to adding info if we feel it is necessary or helpful.

Member:

> I assumed if in the desired state no other information should be needed, but I'm open to adding info if we feel it is necessary or helpful.

I think it would make things clearer if we add a message saying things are good, particularly because you have to think through a double negative like failing: false. The messages are displayed on the cluster settings page in the UI.

(For a follow-on)

return operatorConfig
}

func (c *consoleOperator) ConditionResourceSyncSuccess(operatorConfig *operatorsv1.Console) *operatorsv1.Console {
Member:

How is this different than ConditionNotFailing?

Contributor Author:

It is not, at this point. Initially I was considering putting the rest of this logic (sync_v400, line 110) in this function, as it may have been necessary to set more than one status.

I may be willing to eliminate this wrapper, as I believe it won't be a "set multiple statuses" kind of function.

@benjaminapetersen benjaminapetersen force-pushed the operator-status-revisions-squash branch from 0787a25 to fdc153a Compare February 21, 2019 20:59
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 21, 2019
@benjaminapetersen benjaminapetersen force-pushed the operator-status-revisions-squash branch from fdc153a to 873f74b Compare February 21, 2019 21:14
@benjaminapetersen (Contributor Author)

/retest

2 similar comments
@benjaminapetersen (Contributor Author)

/retest

@spadgett (Member)

/retest

@spadgett (Member) left a comment:

/lgtm

// To use when another more specific status function is not sufficient.
// examples:
// setStatusCondition(operatorConfig, Failing, True, "SyncLoopError", "Sync loop failed to complete successfully")
func (c *consoleOperator) SetStatusCondition(operatorConfig *operatorsv1.Console, conditionType string, conditionStatus operatorsv1.ConditionStatus, conditionReason string, conditionMessage string) *operatorsv1.Console {
Member:

We should remove this if unused, but we can do that as a follow-on if this passes CI.

// the operand is in a transitional state if any of the above resources changed
// or if we have not settled on the desired number of replicas
if toUpdate || actualDeployment.Status.ReadyReplicas != deploymentsub.ConsoleReplicas {
co.ConditionResourceSyncProgressing(operatorConfig, "Changes made during sync updates, additional sync expected.")
Member:

We might revisit this message, but OK for now.

Contributor Author:

Agree, it's not the best message. Progressing is quite fast, so it's unlikely to be seen, but we can definitely revisit.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benjaminapetersen, spadgett

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [benjaminapetersen,spadgett]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spadgett (Member)

level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating cluster..."
level=info msg="Waiting up to 30m0s for the Kubernetes API..."
level=fatal msg="waiting for Kubernetes API: context deadline exceeded"

/retest

@benjaminapetersen (Contributor Author)

/retest

Died before generating any artifacts due to:
received unexpected HTTP status: 504 Gateway Time-out

@spadgett (Member)

/retest

@benjaminapetersen (Contributor Author)

woohoo, one set succeeded this time...

@benjaminapetersen (Contributor Author)

/retest

1 similar comment
@spadgett (Member)

/retest

@benjaminapetersen (Contributor Author)

Nice.

@spadgett (Member)

level=error msg="\t* module.vpc.aws_route_table_association.route_net[3]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.3: timeout while waiting for state to become 'success' (timeout: 

/retest

@benjaminapetersen (Contributor Author)

Bah, looked like they passed.

@spadgett (Member)

> Bah, looked like they passed.

Yeah, they have to run again, since another PR merged to master in between.

@spadgett (Member)

e2e-aws passed, waiting on e2e-aws-operator

@spadgett (Member)

all green!

@openshift-merge-robot openshift-merge-robot merged commit cc814fa into openshift:master Feb 23, 2019
@benjaminapetersen (Contributor Author)

Fantastic.
