Skip to content

OSD-25179 - Avoid deadlock when lockfile is presetn#476

Closed
Tafhim wants to merge 3 commits intoopenshift:masterfrom
Tafhim:OSD-25179-lockfile
Closed

OSD-25179 - Avoid deadlock when lockfile is presetn#476
Tafhim wants to merge 3 commits intoopenshift:masterfrom
Tafhim:OSD-25179-lockfile

Conversation

@Tafhim
Copy link
Copy Markdown
Contributor

@Tafhim Tafhim commented Oct 4, 2024

What type of PR is this?

bug

What this PR does / why we need it?

MUO can sometimes get stuck in a deadlock if a pre-existing lockfile has ownerReference mismatch. This PR pre-emtively removes the lock file to avoid such scenario.

Which Jira/Github issue(s) this PR fixes?

OSD-25179

Special notes for your reviewer:

Pre-checks (if applicable):

  • Tested latest changes against a cluster

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Oct 4, 2024

Codecov Report

Attention: Patch coverage is 0% with 44 lines in your changes missing coverage. Please review.

Project coverage is 53.86%. Comparing base (b887c87) to head (6d67af1).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
main.go 0.00% 44 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
- Coverage   54.22%   53.86%   -0.37%     
==========================================
  Files         122      122              
  Lines        5953     5993      +40     
==========================================
  Hits         3228     3228              
- Misses       2523     2563      +40     
  Partials      202      202              
Flag Coverage Δ
?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
main.go 0.00% <0.00%> (ø)

@Tafhim
Copy link
Copy Markdown
Contributor Author

Tafhim commented Oct 4, 2024

Tests conducted

Scenario #1: No pre-existing lock file

ts=2024-10-04T04:36:51.642269253Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:36:51.643078303Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:36:51.654621698Z level=error logger=setup msg="Could not find a lock with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" error="configmaps \"managed-upgrade-operator-lock\" not found" stacktrace="main.cleanupLockForLeader\n\t/workdir/main.go:130\nmain.main\n\t/workdir/main.go:205\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:271"
ts=2024-10-04T04:36:51.654672297Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:36:51.654979545Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.740944065Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-crgs9
ts=2024-10-04T04:36:51.744960788Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:36:51.750775161Z level=info logger=leader msg="Became the leader."

Scenario #2: Lock file exists already

ts=2024-10-04T04:38:14.342253208Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-04T04:38:14.342967346Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-04T04:38:14.355400982Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator, attempting to delete"
ts=2024-10-04T04:38:14.367709954Z level=info logger=setup msg="Lock deleted with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
ts=2024-10-04T04:38:14.367722464Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-04T04:38:14.368020958Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.372204868Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-845b4486dc-p92gg
ts=2024-10-04T04:38:14.374331127Z level=info logger=leader msg="No pre-existing lock was found."
ts=2024-10-04T04:38:14.440535791Z level=info logger=leader msg="Became the leader."

@Tafhim Tafhim force-pushed the OSD-25179-lockfile branch from 0f9e35d to 5d5668e Compare October 4, 2024 04:57
@Tafhim
Copy link
Copy Markdown
Contributor Author

Tafhim commented Oct 11, 2024

Another scenario where lock initially had different owner. But when the owner got deleted, the existing Pod took leadership.

ts=2024-10-11T08:30:35.112964071Z level=info logger=cmd msg="Operator Version: 1.0.0"                                                                                  
ts=2024-10-11T08:30:35.112996891Z level=info logger=cmd msg="Go Version: go1.22.4 (Red Hat 1.22.4-1.module+el8.10.0+21981+729b5bf8) X:boringcrypto,strictfipsruntime"  
ts=2024-10-11T08:30:35.113000995Z level=info logger=cmd msg="Go OS/Arch: linux/amd64"                                                                                  
ts=2024-10-11T08:30:35.113004066Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"                                                                         
ts=2024-10-11T08:30:35.113945252Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator                                             
ts=2024-10-11T08:30:35.310814746Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator" 
Looking in test-managed-upgrade-operator                                                                                                                               
ts=2024-10-11T08:30:35.313873601Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"            
ts=2024-10-11T08:30:35.313887197Z level=info logger=leader msg="Trying to become the leader."                                                                          
ts=2024-10-11T08:30:35.314202447Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-52nr2                                      
ts=2024-10-11T08:30:35.317899721Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-52n
r2                                                                                                                                                                     
ts=2024-10-11T08:30:35.319785785Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m                                
ts=2024-10-11T08:30:35.334807308Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:36.393479566Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:38.654971987Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:43.061261542Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:30:52.420026904Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:10.910719652Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:27.496689267Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:31:45.082753841Z level=info logger=leader msg="Not the leader. Waiting."                                                                              
ts=2024-10-11T08:32:03.419040735Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:32:21.773196599Z level=info logger=leader msg="Became the leader." 
ts=2024-10-11T08:32:21.773605014Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator

While the other pod stayed waiting to become leader

ts=2024-10-11T08:30:37.694627426Z level=info logger=cmd msg="Version of operator-sdk: v1.21.0"
ts=2024-10-11T08:30:37.695468461Z level=debug logger=k8sutil msg="Found namespace" Namespace=test-managed-upgrade-operator
ts=2024-10-11T08:30:37.707641854Z level=info logger=setup msg="Lock already exists with name managed-upgrade-operator-lock in namespace test-managed-upgrade-operator"
Looking in test-managed-upgrade-operator
ts=2024-10-11T08:30:37.79454308Z level=info logger=setup msg="Lock already has existing owner managed-upgrade-operator-5c477fcfd-fx22m, skipping deletion"
ts=2024-10-11T08:30:37.794646038Z level=info logger=leader msg="Trying to become the leader."
ts=2024-10-11T08:30:37.893467151Z level=debug logger=leader msg="Found podname" Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.996775964Z level=debug logger=leader msg="Found Pod" Pod.Namespace=test-managed-upgrade-operator Pod.Name=managed-upgrade-operator-5c477fcfd-gt5v5
ts=2024-10-11T08:30:37.998917257Z level=info logger=leader msg="Found existing lock" LockOwner=managed-upgrade-operator-5c477fcfd-fx22m
ts=2024-10-11T08:30:38.013081751Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:39.093839453Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:41.171560483Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:45.826541009Z level=info logger=leader msg="Not the leader. Waiting."
ts=2024-10-11T08:30:55.342271994Z level=info logger=leader msg="Not the leader. Waiting."
t

@devppratik
Copy link
Copy Markdown

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2024
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 1, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Nov 1, 2024

@Tafhim: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment thread go.mod
github.com/openshift/operator-custom-metrics v0.5.1
github.com/openshift/osde2e-common v0.0.0-20240531074950-36a7055798ae
// Note: Please revisit the stop-gap in ./main.go > cleanupLockForLeader when updating operator-lib module.
// If the issue has been fixed upstream, the stop-gap should be removed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Tafhim!
These are great!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tafhim - I think it's better to have such notes added in markdown or the code pkg files for reference. However, if we want to ensure that a go.mod version is tied to version due to a known issue etc and we don't accidentally update it then it might be better to have it commented with the upstream issue/bug as follows under the replace directive. Example:

replace (
    github.com/operator-framework/operator-lib v0.12.0 // Refer upstream issue <link>
)

It would also be fine to have a "maintenance" section added in this repo's docs - https://github.com/openshift/managed-upgrade-operator/blob/master/docs/development.md

We can have a separate maintenance.md file covering overall process and highlight this dependency requirement/reason there.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested putting it here so it would be noticed if the version was updated.
As far as I am aware we don't have the link for the upstream issue yet so a more generic note was suggested.
Putting this somewhere it's not going to be noticed may mean the patched release will happen without anyone noticing and we will have two services updating the same resource which may cause issues.

@rendhalver
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Nov 4, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: devppratik, rendhalver, Tafhim
Once this PR has been reviewed and has the lgtm label, please assign ravitri for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ravitri
Copy link
Copy Markdown
Member

ravitri commented Nov 5, 2024

/hold

Thanks for reviews so far! Adding temporary hold on this PR till it's further reviewed. I did have some comments

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2024
Comment thread main.go
Comment on lines +125 to +127
// Note: This is a stop-gap solution until a fix has been implemented
// upstream in operator-framework/operator-lib. This function and it's
// usage should be removed once a fix is implemented upstream.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of Note, can it be replaced as a TODO point?

@ravitri ravitri mentioned this pull request Nov 6, 2024
2 tasks
@ravitri
Copy link
Copy Markdown
Member

ravitri commented Nov 8, 2024

/close

Based on discussion with @Tafhim, we will go ahead with PR #483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).

However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim 👍

Closing this PR for now.

@openshift-merge-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 8, 2024
@openshift-ci openshift-ci Bot closed this Nov 8, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Nov 8, 2024

@ravitri: Closed this PR.

Details

In response to this:

/close

Based on discussion with @Tafhim, we will go ahead with PR #483 changes for now and observe if it helps ahead. We believe that the operator-lib bump would help address the issue we were facing (atleast partially). Incase it still continues to impact us, we will open a new PR based on this PR's work (might need to align the solution a bit based on what exact problem we are trying to address).

However, this PR has helped us revise some of the implementation details for the lockfile which was great and will serve good reference for future too. Thanks for your help and work on this Tafhim 👍

Closing this PR for now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants