Skip to content

Sporadically incorrect remediation actions #120

@hiddeco

Description

@hiddeco

I have noticed that one of our end-to-end tests is flaky (install-test-fail), and sometimes acts in such a way that the failures counter suddenly resets and the wrong remediation strategy is taken.

From a first observation of the logs this problem does not seem to just exist for the end-to-end tests, but may also occur in real environments. I have a suspicion that it may have to do with the fact that we do not retry failed status updates, and seem to be running into the object has been modified; please apply your changes to the latest version and try again errors quote often, especially for the install-test-fail scenario.

This may be an indication that:

  1. Multiple processes are touching the same resource
  2. The (during reconciliation) modified object is not passed on correctly everywhere, and we have a split brain problem (aggravated by the fact that we do not perform retries on failed updates)

34_Debug failure.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions