OSD-25179 - Avoid deadlock when lockfile is present #476
Tafhim wants to merge 3 commits into openshift:master from
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #476 +/- ##
==========================================
- Coverage 54.22% 53.86% -0.37%
==========================================
Files 122 122
Lines 5953 5993 +40
==========================================
Hits 3228 3228
- Misses 2523 2563 +40
Partials 202 202
Flags with carried forward coverage won't be shown.
Tests conducted

Scenario #1: No pre-existing lock file

ts=2024-10-04T04:36:51.642269253Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:36:51.643078303Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:36:51.654621698Z level=error logger=setup msg="Could not find a lock with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" error="configmaps \"managed-upgrade-operator-lock\" not found" stacktrace="main.cleanupLockForLeader\n\t/workdir/main.go:130\nmain.main\n\t/workdir/main.go:205\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:271"
ts=2024-10-04T04:36:51.654672297Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:36:51.654979545Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.740944065Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.744960788Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:36:51.750775161Z level=info logger=leader msg="Became the leader."

Scenario #2: Lock file exists already

ts=2024-10-04T04:38:14.342253208Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:38:14.342967346Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:38:14.355400982Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator, attempting to delete"
ts=2024-10-04T04:38:14.367709954Z level=info logger=setup msg="Lock deleted with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
ts=2024-10-04T04:38:14.367722464Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:38:14.368020958Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.372204868Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.374331127Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:38:14.440535791Z level=info logger=leader msg="Became the leader."
Another scenario: the lock initially had a different owner, but when that owner pod was deleted, the existing pod took over leadership.

ts=2024-10-11T08:30:35.112964071Z level=info logger=cmd msg="Operator Version: 1.0.0"
ts=2024-10-11T08:30:35.112996891Z level=info logger=cmd msg="Go Version: go1.22.4 (Red Hat 1.22.4-1.module+el8.10.0+21981+729b5bf8) X:boringcrypto,strictfipsruntime"
ts=2024-10-11T08:30:35.113000995Z level=info logger=cmd msg="Go OS/Arch: linux/amd64"
ts=2024-10-11T08:30:35.113004066Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:35.113945252Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:35.310814746Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:35.313873601Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:35.313887197Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:35.314202447Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-52nr2
ts=2024-10-11T08:30:35.317899721Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-52nr2
ts=2024-10-11T08:30:35.319785785Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:35.334807308Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:36.393479566Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:38.654971987Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:43.061261542Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:52.420026904Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:31:10.910719652Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:31:27.496689267Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:31:45.082753841Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:03.419040735Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:21.773196599Z level=info logger=leader msg="Became the leader."
ts=2024-10-11T08:32:21.773605014Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
Meanwhile, the other pod stayed waiting to become the leader:

ts=2024-10-11T08:30:37.694627426Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:37.695468461Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:37.707641854Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:37.79454308Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:37.794646038Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:37.893467151Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.996775964Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.998917257Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:38.013081751Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:39.093839453Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:41.171560483Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:45.826541009Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:55.342271994Z level=info logger=leader msg="Not the leader. Waiting."
/lgtm
@Tafhim: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
github.com/openshift/operator-custom-metrics v0.5.1
github.com/openshift/osde2e-common v0.0.0-20240531074950-36a7055798ae
// Note: Please revisit the stop-gap in ./main.go > cleanupLockForLeader when updating the operator-lib module.
// If the issue has been fixed upstream, the stop-gap should be removed.
Thanks Tafhim!
These are great!
@Tafhim - I think it's better to have such notes added in markdown or in the code package files for reference. However, if we want to ensure that a go.mod version stays pinned due to a known issue and isn't accidentally updated, it might be better to comment it with the upstream issue/bug under the replace directive. Example:
replace (
    github.com/operator-framework/operator-lib => github.com/operator-framework/operator-lib v0.12.0 // Refer to upstream issue <link>
)
It would also be fine to have a "maintenance" section added to this repo's docs - https://github.com/openshift/managed-upgrade-operator/blob/master/docs/development.md
We can have a separate maintenance.md file covering the overall process and highlighting this dependency requirement and the reason for it there.
I suggested putting it here so it would be noticed if the version was updated.
As far as I am aware, we don't have a link for the upstream issue yet, so a more generic note was suggested.
Putting this somewhere it won't be noticed may mean the patched release happens without anyone noticing, and we would end up with two services updating the same resource, which may cause issues.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: devppratik, rendhalver, Tafhim
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/hold
Thanks for the reviews so far! Adding a temporary hold on this PR until it's further reviewed. I did have some comments
// Note: This is a stop-gap solution until a fix has been implemented
// upstream in operator-framework/operator-lib. This function and its
// usage should be removed once a fix is implemented upstream.
Instead of a Note, could this be replaced with a TODO?
/close
Based on discussion with @Tafhim, we will go ahead with PR #483 changes for now and observe whether it helps. We believe that the operator-lib bump should address the issue we were facing (at least partially). In case it still continues to impact us, we will open a new PR based on this PR's work (we might need to align the solution a bit based on the exact problem we are trying to address). However, this PR has helped us revise some of the implementation details for the lockfile, which was great and will serve as a good reference for the future too. Thanks for your help and work on this, Tafhim 👍 Closing this PR for now.
PR needs rebase.

@ravitri: Closed this PR.
What type of PR is this?
bug
What this PR does / why we need it?
MUO can sometimes get stuck in a deadlock if a pre-existing lockfile has an ownerReference mismatch. This PR pre-emptively removes the lock file to avoid such a scenario.
Which Jira/Github issue(s) this PR fixes?
OSD-25179
Special notes for your reviewer:
Pre-checks (if applicable):