
Controllers do not requeue events on transient errors #824

@smarterclayton


All of the controllers currently have a serious flaw: none of them retry or requeue an event when processing it fails. This means that certain operations will not complete until the controller is restarted.

Stories need to be created for beta2, one for deployments and one for builds, that correctly handle errors (probably by simply requeueing the event that triggered the error). This should be done in a way that does not overly complicate the controller loops, i.e. by moving the retry logic up a level. See the scheduler for ideas about how these events can be requeued. Note that some controller loops depend on ordering, and requeueing may deliver operations out of order, so controller loops need to check that their action is still valid (e.g. that they have not already processed a newer event). It's also important that the controllers can make progress in the face of transient errors.
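A minimal sketch of the requeue idea, in Go, to make the shape of the problem concrete. Every name here (`Event`, `Handler`, `RetryQueue`, `ErrTransient`) is hypothetical and is not taken from the existing controllers or the scheduler. It also assumes each event carries a per-resource version number that only increases, which is what lets a requeued event be recognized as stale.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTransient is a hypothetical stand-in for any recoverable failure.
var ErrTransient = errors.New("transient error")

// Event is a hypothetical change notification for a named resource.
type Event struct {
	Name    string
	Version int // assumed to increase monotonically per resource
	retries int // how many times this event has already been requeued
}

// Handler is the controller's own logic; it only reports success or failure.
type Handler func(Event) error

// RetryQueue keeps the retry logic one level above the controller loop.
type RetryQueue struct {
	pending    []Event
	handled    map[string]int // newest version successfully processed per resource
	maxRetries int
	handle     Handler
}

func NewRetryQueue(maxRetries int, h Handler) *RetryQueue {
	return &RetryQueue{handled: map[string]int{}, maxRetries: maxRetries, handle: h}
}

// Add appends an event to the tail of the queue.
func (q *RetryQueue) Add(e Event) { q.pending = append(q.pending, e) }

// Drain processes events until the queue is empty and returns the events
// that were given up on after exhausting their retries.
func (q *RetryQueue) Drain() []Event {
	var dropped []Event
	for len(q.pending) > 0 {
		e := q.pending[0]
		q.pending = q.pending[1:]

		// A requeued event may arrive after a newer one was processed; skip it.
		if v, ok := q.handled[e.Name]; ok && v >= e.Version {
			continue
		}
		if err := q.handle(e); err != nil {
			if e.retries < q.maxRetries {
				e.retries++
				q.pending = append(q.pending, e) // requeue at the tail
			} else {
				dropped = append(dropped, e)
			}
			continue
		}
		q.handled[e.Name] = e.Version
	}
	return dropped
}

func main() {
	attempts := 0
	var applied []int
	q := NewRetryQueue(3, func(e Event) error {
		attempts++
		if attempts == 1 {
			return ErrTransient // simulate one transient failure
		}
		applied = append(applied, e.Version)
		return nil
	})
	q.Add(Event{Name: "build-1", Version: 1})
	q.Add(Event{Name: "build-1", Version: 2})
	dropped := q.Drain()
	fmt.Println("applied versions:", applied)
	fmt.Println("dropped events:", len(dropped))
}
```

In this sketch the handler never sees retry counts or queue state, so the controller loop itself stays simple. A real implementation would presumably add a delay or backoff between retries and log dropped events rather than returning them silently.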

It would be good for one person to look at the problem as a whole and help ensure the logic is right. More unit and integration tests will need to be added to the controllers, and a wider range of failure conditions detected. Better logging should also be added, so that failures are easy to diagnose.

Given the nature of the current behavior, we may need to delay features from beta2 in favor of ensuring the controller loops are sound.

@bparees @ironcladlou


Labels: kind/bug
