OSD-25179 - Avoid deadlock when lockfile is presetn by Tafhim · Pull Request #476 · openshift/managed-upgrade-operator

Tafhim · 2024-10-04T04:01:19Z

What type of PR is this?

bug

What this PR does / why we need it?

MUO can sometimes get stuck in a deadlock if a pre-existing lockfile has ownerReference mismatch. This PR pre-emtively removes the lock file to avoid such scenario.

Which Jira/Github issue(s) this PR fixes?

OSD-25179

Special notes for your reviewer:

Pre-checks (if applicable):

Tested latest changes against a cluster

codecov-commenter · 2024-10-04T04:20:44Z

Codecov Report

Attention: Patch coverage is 0% with 44 lines in your changes missing coverage. Please review.

Project coverage is 53.86%. Comparing base (b887c87) to head (6d67af1).
Report is 7 commits behind head on master.

Files with missing lines	Patch %	Lines
main.go	0.00%	44 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
- Coverage   54.22%   53.86%   -0.37%     
==========================================
  Files         122      122              
  Lines        5953     5993      +40     
==========================================
  Hits         3228     3228              
- Misses       2523     2563      +40     
  Partials      202      202

Flag	Coverage Δ
	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
main.go	`0.00% <0.00%> (ø)`

Tafhim · 2024-10-04T04:41:47Z

Tests conducted

Scenario #1: No pre-existing lock file

ts=2024-10-04T04:36:51.642269253Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:36:51.643078303Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:36:51.654621698Z level=error logger=setup msg="Could not find a lock with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" error="configmaps \"managed-upgrade-operator-lock\" not found" stacktrace="main.cleanupLockForLeader\n\t/workdir/main.go:130\nmain.main\n\t/workdir/main.go:205\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:271"
ts=2024-10-04T04:36:51.654672297Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:36:51.654979545Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.740944065Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.744960788Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:36:51.750775161Z level=info logger=leader msg="Became the leader."

Scenario #2: Lock file exists already

ts=2024-10-04T04:38:14.342253208Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:38:14.342967346Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:38:14.355400982Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator, attempting to delete"
ts=2024-10-04T04:38:14.367709954Z level=info logger=setup msg="Lock deleted with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
ts=2024-10-04T04:38:14.367722464Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:38:14.368020958Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.372204868Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.374331127Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:38:14.440535791Z level=info logger=leader msg="Became the leader."

Tafhim · 2024-10-11T08:34:57Z

Another scenario where lock initially had different owner. But when the owner got deleted, the existing Pod took leadership.

ts=2024-10-11T08:30:35.112964071Z level=info logger=cmd msg="Operator Version: 1.0.0"                                                                                  
ts=2024-10-11T08:30:35.112996891Z level=info logger=cmd msg="Go Version: go1.22.4 (Red Hat 1.22.4-1.module+el8.10.0+21981+729b5bf8) X:boringcrypto,strictfipsruntime"  
ts=2024-10-11T08:30:35.113000995Z level=info logger=cmd msg="Go OS/Arch: linux/amd64"                                                                                  
ts=2024-10-11T08:30:35.113004066Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"                                                                         
ts=2024-10-11T08:30:35.113945252Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator                                             
ts=2024-10-11T08:30:35.310814746Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" 
Looking in test-managed-upgrade-operator                                                                                                                               
ts=2024-10-11T08:30:35.313873601Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"            
ts=2024-10-11T08:30:35.313887197Z level=info logger=leader msg="Trying to become the leader."                                                                          
ts=2024-10-11T08:30:35.314202447Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-52nr2                                      
ts=2024-10-11T08:30:35.317899721Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-52n
r2                                                                                                                                                                     
ts=2024-10-11T08:30:35.319785785Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m                                
ts=2024-10-11T08:30:35.334807308Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:36.393479566Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:38.654971987Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:43.061261542Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:52.420026904Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:10.910719652Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:27.496689267Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:45.082753841Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:32:03.419040735Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:21.773196599Z level=info logger=leader msg="Became the leader." 
ts=2024-10-11T08:32:21.773605014Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator

While the other pod stayed waiting to become leader

ts=2024-10-11T08:30:37.694627426Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:37.695468461Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:37.707641854Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:37.79454308Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:37.794646038Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:37.893467151Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.996775964Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.998917257Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:38.013081751Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:39.093839453Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:41.171560483Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:45.826541009Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:55.342271994Z level=info logger=leader msg="Not the leader. Waiting."
t

devppratik · 2024-10-18T07:16:57Z

/lgtm

openshift-ci · 2024-11-01T06:47:58Z

@Tafhim: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rendhalver · 2024-11-04T21:41:18Z

 	github.com/openshift/operator-custom-metrics v0.5.1
 	github.com/openshift/osde2e-common v0.0.0-20240531074950-36a7055798ae
+	// Note: Please revisit the stop-gap in ./main.go > cleanupLockForLeader when updating operator-lib module.
+	// If the issue has been fixed upstream, the stop-gap should be removed.


Thanks Tafhim!
These are great!

@Tafhim - I think it's better to have such notes added in markdown or the code pkg files for reference. However, if we want to ensure that a go.mod version is tied to version due to a known issue etc and we don't accidentally update it then it might be better to have it commented with the upstream issue/bug as follows under the replace directive. Example:

replace ( github.com/operator-framework/operator-lib v0.12.0 // Refer upstream issue <link> )

It would also be fine to have a "maintenance" section added in this repo's docs - https://github.com/openshift/managed-upgrade-operator/blob/master/docs/development.md

We can have a separate maintenance.md file covering overall process and highlight this dependency requirement/reason there.

I suggested putting it here so it would be noticed if the version was updated.
As far as I am aware we don't have the link for the upstream issue yet so a more generic note was suggested.
Putting this somewhere it's not going to be noticed may mean the patched release will happen without anyone noticing and we will have two services updating the same resource which may cause issues.

rendhalver · 2024-11-04T21:43:03Z

/lgtm

openshift-ci · 2024-11-04T21:43:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: devppratik, rendhalver, Tafhim
Once this PR has been reviewed and has the lgtm label, please assign ravitri for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ravitri · 2024-11-05T02:00:26Z

/hold

Thanks for reviews so far! Adding temporary hold on this PR till it's further reviewed. I did have some comments

ravitri · 2024-11-05T02:29:41Z

+// Note: This is a stop-gap solution until a fix has been implemented
+// upstream in operator-framework/operator-lib. This function and it's
+// usage should be removed once a fix is implemented upstream.


Instead of Note, can it be replaced as a TODO point?

ravitri · 2024-11-08T08:21:46Z

/close

Based on discussion with @Tafhim, we will go ahead with PR #483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).

However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim 👍

Closing this PR for now.

openshift-merge-robot · 2024-11-08T08:21:55Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2024-11-08T08:23:07Z

@ravitri: Closed this PR.

Details

In response to this:

/close

Based on discussion with @Tafhim, we will go ahead with PR #483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).

However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim 👍

Closing this PR for now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci Bot requested review from TheUndeadKing and shitaljante October 4, 2024 04:01

OSD-25179 - Avoid deadlock when lockfile is present

5d5668e

Tafhim force-pushed the OSD-25179-lockfile branch from 0f9e35d to 5d5668e Compare October 4, 2024 04:57

OSD-25179 - Handle multi pod scenarios

491d011

openshift-ci Bot assigned devppratik Oct 18, 2024

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2024

OSD-25179 - Add comment to clarify the stop-gap implementation

6d67af1

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 1, 2024

rendhalver reviewed Nov 4, 2024

View reviewed changes

openshift-ci Bot assigned rendhalver Nov 4, 2024

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2024

openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2024

ravitri reviewed Nov 5, 2024

View reviewed changes

ravitri mentioned this pull request Nov 6, 2024

Bump operator-lib to v0.13.0 #483

Merged

2 tasks

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 8, 2024

openshift-ci Bot closed this Nov 8, 2024

Conversation

Tafhim commented Oct 4, 2024

What type of PR is this?

What this PR does / why we need it?

Which Jira/Github issue(s) this PR fixes?

Special notes for your reviewer:

Pre-checks (if applicable):

Uh oh!

codecov-commenter commented Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Tafhim commented Oct 4, 2024

Uh oh!

Tafhim commented Oct 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devppratik commented Oct 18, 2024

Uh oh!

openshift-ci Bot commented Nov 1, 2024

Uh oh!

rendhalver Nov 4, 2024

Choose a reason for hiding this comment

Uh oh!

ravitri Nov 5, 2024

Choose a reason for hiding this comment

Uh oh!

rendhalver Nov 5, 2024

Choose a reason for hiding this comment

Uh oh!

rendhalver commented Nov 4, 2024

Uh oh!

openshift-ci Bot commented Nov 4, 2024

Uh oh!

ravitri commented Nov 5, 2024

Uh oh!

ravitri Nov 5, 2024

Choose a reason for hiding this comment

Uh oh!

ravitri commented Nov 8, 2024

Uh oh!

openshift-merge-robot commented Nov 8, 2024

Uh oh!

openshift-ci Bot commented Nov 8, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codecov-commenter commented Oct 4, 2024 •

edited

Loading

Tafhim commented Oct 11, 2024 •

edited

Loading