Report transient deployment trigger errors via API field #4941
openshift-bot merged 1 commit into openshift:master from 0xmichalis:report-trigger-errors-via-api-field
Conversation
Discussed on IRC, but to recap: the change needs to originate in the deployment config generator, and the caller (e.g. the configChangeController) needs to be able to interpret that error and generate a more meaningful message. The current approach won't handle reporting the actual causes for a non-existent initial deployment, which is the primary use case.
Talked more on IRC, and I'm now thinking it might make more sense to have the deploymentController do the state inspection you've got here every time it reconciles a DC. This would ensure the message is set upon initial DC creation (which is the primary use case), and also ensure the message is cleared/updated as the triggers mutate the config. So basically, when handling a DC, check to see "does a deployment for this DC.LatestVersion exist?" and if not, figure out some possible reasons why and update the message. If there is a deployment for the current version, clear the message.
@ironcladlou makes sense, will update accordingly
Thinking about this more, I believe we should do that kind of inspection not only when there is no respective deployment for the current dc version but also when the latest deployment has failed. WDYT? ps. In your first paragraph, you probably meant the deploymentConfig controller.
Makes sense to me... if the deployment is visible and we see it failed, I don't see why not try and add more info to the config status. That might even be a good time to go look up the deployer pod and if it still exists, get some info from that as well to add to the message. The more clues we can collect and provide to the user, the better IMO.
You're right, I meant to say this work should happen in the deploymentConfig controller.
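Condensing the two ideas above into a rough sketch (the types, the deployment naming convention, and the helper functions here are assumptions for illustration, not the actual origin code): on every reconciliation, check whether a deployment exists for the config's latest version; if it exists but failed, collect clues from it (and its deployer pod); otherwise clear the message.

```go
// Illustrative only: hypothetical types standing in for the real API objects.
package sketch

import "fmt"

type DeploymentConfig struct {
	Name          string
	LatestVersion int
	Details       string // stands in for the status details message
}

type Deployment struct {
	Failed         bool
	DeployerPodLog string // whatever info we can pull from the deployer pod
}

// reconcile runs every time the controller handles a config. The lookup
// function would be backed by a client query for the replication controller
// named "<config>-<version>" in the real controller.
func reconcile(config *DeploymentConfig, lookup func(name string) *Deployment) {
	name := fmt.Sprintf("%s-%d", config.Name, config.LatestVersion)
	deployment := lookup(name)

	switch {
	case deployment == nil:
		// No deployment for the latest version yet: report possible causes
		// (unresolved image change trigger, missing image stream tag, ...).
		config.Details = "no deployment for latest version; waiting on image or config change trigger"
	case deployment.Failed:
		// The latest deployment failed: add whatever clues we can collect,
		// e.g. from the deployer pod, to the message.
		config.Details = "latest deployment failed: " + deployment.DeployerPodLog
	default:
		// A healthy deployment exists for the current version: clear the message.
		config.Details = ""
	}
}
```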
@ironcladlou is this closer to what you were thinking? What other failures can we find out based on a dc? Inspect the template maybe? I could also see an unschedulable node as a possible failure but I am not sure about having a node client in here. Thoughts?
How about making this function just return the message and do the update from the calling side? Makes it easier to test and less side-effecty. I don't like doing the update in multiple places within the function, via defer, etc.
Agreed, the code is cleaner now that the inspection happens once at start. Updated
Need to add tests
Closer... there are a few mods to fit the general reconciliation model of other controllers. Something like:

    details := c.findDetails(config, client)
    c.updateDetails(config, details)

The findDetails function could immediately check to see if there's a deployment for config, and if so, return an empty string (to cause details to be cleared). It could then do all its other checks. Then findDetails needs to only be called once, from here (the one on L94 would not be necessary).
The updateDetails function would update config if details != config.Details.Message. That handles setting new messages and clearing outdated ones, and no-ops if there's no difference.
You should also ditch all those Namespacers and just use a single client.Interface. The test support works just as well with those but with a lot less code.
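A minimal sketch of that split, assuming hypothetical types and a single narrow client interface in place of the Namespacers (none of these names come from the actual origin code): findDetails only computes the message, updateDetails persists it when it actually changed.

```go
package sketch

type DeploymentConfig struct {
	Name    string
	Details string // stands in for config.Details.Message
}

// configClient is the single client used for all lookups and updates.
type configClient interface {
	LatestDeploymentExists(config *DeploymentConfig) (bool, error)
	UpdateConfig(config *DeploymentConfig) error
}

type controller struct {
	client configClient
}

// findDetails returns the warning to report for config, or "" when the latest
// deployment exists and any previous message should be cleared. Client errors
// are returned to the caller rather than swallowed.
func (c *controller) findDetails(config *DeploymentConfig) (string, error) {
	exists, err := c.client.LatestDeploymentExists(config)
	if err != nil {
		return "", err
	}
	if exists {
		return "", nil
	}
	// ... inspect triggers, image stream tags, etc. and build a message ...
	return "deployment waiting on image or update", nil
}

// updateDetails is a no-op unless the message differs from what is stored.
func (c *controller) updateDetails(config *DeploymentConfig, details string) error {
	if details == config.Details {
		return nil
	}
	config.Details = details
	return c.client.UpdateConfig(config)
}
```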
> Then findDetails needs to only be called once, from here (the one on L94 would not be necessary).
I was aware of it but wanted to keep population and cleaning in separate places. I guess though it doesn't really matter. Updating...
@ironcladlou I am not sure what to do with errors returned from the clients when querying for resources. Pass those as the details into the DC? Let the controller handle them? Or ignore them?
@ironcladlou comments addressed, unit tests for findDetails added, PTAL
@ironcladlou updated the messages and tests (please use table tests with test names :) ).
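For reference, the requested style is a table-driven test where every case carries a name, roughly as below; the findDetails behavior is faked here and its signature is hypothetical, mirroring the sketch above.

```go
package sketch

import "testing"

// fakeFindDetails stands in for the real findDetails wired up to fake clients.
func fakeFindDetails(deploymentExists bool) string {
	if deploymentExists {
		return ""
	}
	return "deployment waiting on image or update"
}

func TestFindDetails(t *testing.T) {
	tests := []struct {
		name             string
		deploymentExists bool
		expected         string
	}{
		{
			name:             "deployment exists, message cleared",
			deploymentExists: true,
			expected:         "",
		},
		{
			name:             "no deployment, warning reported",
			deploymentExists: false,
			expected:         "deployment waiting on image or update",
		},
	}

	for _, test := range tests {
		// The test name makes failures identify the exact scenario.
		if got := fakeFindDetails(test.deploymentExists); got != test.expected {
			t.Errorf("%s: expected %q, got %q", test.name, test.expected, got)
		}
	}
}
```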
I don't see a reason to block this pull on |
The nesting here renders like a flock of geese on my screen
lol
I will add mock funcs to reduce it
lol.
I was just kidding, don't sweat it
@ironcladlou @deads2k I think this is ready. The oc status fix will be my next PR.
This LGTM, thanks a lot for working on it.
Added the warning to oc status too. We could definitely traverse the graph and find out the cause sooner than the dc controller does, but then we would be inconsistent. I think the right way is for status to follow what's reported by the controller.
This means that we have to report multiple bad triggers. Separate line for each trigger?
I do agree that this work really belongs in the graph code - I brought this up before we started, and @smarterclayton and I decided to do it ASAP inside the controller to get started. Please correct me if I misunderstood the preferred initial direction. My impression is that since we're building a freeform message, it shouldn't be a big deal to swap in a different rendering from the graph library later.
Controller doesn't have to use the graph library - especially if the controller is checking conditions directly as part of its core logic.
You want to do status as a followup? I do think we should handle multiple bad triggers as markers for warnings and then separately mark a deployment that can't succeed on the same line. Something like "#1 deployment waiting on image or update (image stream doesn't exist)"
@ironcladlou what's your opinion on reporting multiple bad triggers in the details field, or the can't-succeed message due to no image? I have updated findDetails accordingly.
I think adding some sort of line per trigger would be just fine... I see no harm in making the message longer/better formatted for that.
Removed status changes and squashed down
Don't use "istag/" syntax in messages from a controller. Say "image stream tag foo:bar"
[test]
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/5761/)
Not trying to re-open this issue for this pull, but why not keep the istag/ syntax?
Because istag/ is a specific construct of one client (the CLI) and has no meaning outside of that client.
Squashed down to one commit and updated the commit message |
This commit adds failure inspection in the deploymentConfig controller. Every time a deploymentConfig is reconciled (every two minutes), we try to find any details about possible failures and report them to the user via the status.details.message field in the deploymentConfig. Users can use `oc describe dc/config` and see if there is any warning in their config. `oc status` should follow up these changes and report likewise.
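For orientation, the field the commit message refers to has roughly this shape; the type and field names below are paraphrased from the description ("the status.details.message field"), not copied from the origin source.

```go
package sketch

// DeploymentDetails is a rough stand-in for the details carried on a
// deploymentConfig's status.
type DeploymentDetails struct {
	// Message is the free-form warning the controller populates on each
	// reconciliation and clears once a deployment for the latest version
	// exists. It is what `oc describe dc/config` surfaces to the user.
	Message string
}
```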
Evaluated for origin test up to 0b38d81
This still LGTM but I think @deads2k and @smarterclayton will need to take another look to see if they're okay with the latest verbiage.
I'm fine with the messages.
[merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/3621/) (Image: devenv-fedora_2465)
Evaluated for origin merge up to 0b38d81
Merged by openshift-bot
This PR adds failure inspection in the deploymentConfig controller.
Every time a deploymentConfig is reconciled (every two minutes), we try
to find any details about possible failures and report them to the user
via the status.details.message field in the deploymentConfig. Users can
use `oc describe dc/config` and see if there is any warning in their config.
`oc status` should follow up these changes and report likewise.
@ironcladlou @deads2k @smarterclayton PTAL
Fixes #3526
Fixes #3311