Controllers do not requeue events on transient errors

All of the controllers currently have a serious flaw: none of them retries or requeues an event when processing that event fails. This means that certain operations will not complete until the controller is restarted.

Stories need to be created for beta2, one for deployments and one for builds, to handle errors correctly (probably by simply requeueing the event that triggered the error). This should be done in a way that does not overly complicate the controller loops (for example, by moving the retry logic up a level). See the scheduler for ideas about how these events can be requeued. It's important to note that some controller loops depend on order, and requeueing may deliver operations out of order, so controller loops need to check that their action is still valid (i.e., that they have not already processed a newer event). It's also important that the controllers can make progress in the face of transient errors.

It would be good for one person to look at the problem as a whole and help ensure the logic is right. More unit and integration tests will need to be added for the controllers, covering a wider range of failure conditions. Logging should also be improved so that failures are easy to diagnose.

Given the nature of the current behavior, we may need to delay features from beta2 in favor of ensuring the controller loops are sound.

@bparees @ironcladlou


Controllers do not requeue events on transient errors #824
