OCPBUGS-6861: Allow TopoLVM initial grace period while service-ca attempts to start #1378
|
@copejon: This pull request references Jira Issue OCPBUGS-6861, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
2b419d3 to 28885c9
|
/pj-rehearse ack |
|
/pj-rehearse |
|
Do you have some more details on the race condition? In the logs from the bug I see some trouble here and there in both pods, but the cause is not clear to me. |
|
/pj-rehearse |
|
/pj-rehearse |
|
/pj-rehearse |
|
/jira refresh |
|
@ggiguash: This pull request references Jira Issue OCPBUGS-6861, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
|
@copejon: No Jira issue is referenced in the title of this pull request.
|
Hey @pacevedom, I've updated the description to provide a bit more context. I've also removed the Jira reference, as it's been determined that the cause of that bug is not related to this patch. |
|
/jira refresh |
|
@copejon: No Jira issue is referenced in the title of this pull request.
|
This looks OK, except for the messy history. @pmtk and @pacevedom, we could set the merge method to squash to take the changes. Thoughts? |
|
@copejon: This pull request references Jira Issue OCPBUGS-6861, which is valid. 3 validation(s) were run on this bug.
7f7a0d9 to 1d58125
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: copejon, dhellmann The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@copejon: all tests passed! Full PR test history. Your PR dashboard. |
|
@copejon: Jira Issue OCPBUGS-6861: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-6861 has been moved to the MODIFIED state.
|
/cherry-pick release-4.13 |
|
@copejon: new pull request created: #1537
TopoLVM on MicroShift suffers from flaky behavior that stems in part from a race window with the service-ca. This PR pads TopoLVM's startup time to give the service-ca a chance to issue certs by setting the initial startupProbe values to a 5-minute timeout (30 retries at 10-second intervals).
The race window was exposed by flaky service-ca startup behavior, which is being addressed separately. That work will narrow the race window but will not eliminate it. The window appears during (re)boot because TopoLVM depends on the service-ca to issue a cert. In this scenario, TopoLVM and the service-ca are started concurrently; their boot order is not controlled by MicroShift. Both services' retry loops therefore begin at the same time, and the race window is the gap between the service-ca issuing a cert and TopoLVM reaching its retry maximum. Giving TopoLVM a startup grace period allows the service-ca to stabilize and improves the user experience, since TopoLVM will no longer enter CrashLoopBackOff as a result of timing out.
Note that this grace period does not delay TopoLVM's startup; it only delays the point at which TopoLVM startup failures are considered true failures.
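For readers unfamiliar with startup probes, the grace period described above could look roughly like the following container fragment. This is a minimal sketch, not the manifest changed by this PR: the container name, the probe endpoint path, and the port are assumptions made for illustration. Only the 30 retries at 10-second intervals come from the description.

```yaml
# Illustrative fragment only; container name, path, and port are hypothetical.
containers:
  - name: topolvm-controller      # hypothetical container name
    startupProbe:
      httpGet:
        path: /healthz            # assumed health endpoint
        port: 9808                # assumed port
      failureThreshold: 30        # tolerate up to 30 failed attempts...
      periodSeconds: 10           # ...spaced 10 s apart: 30 x 10 s = 300 s (5 minutes)
```

With no `initialDelaySeconds` set, probing begins as soon as the container starts, so the container itself is not held back. Kubernetes suspends liveness and readiness checks until the startup probe succeeds, and restarts the container only after `failureThreshold` consecutive startup failures, which is why the window postpones failure handling rather than startup.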
Which issue(s) this PR addresses: