Skip to content

Fix potential deadlock when k8s client is used#2031

Merged
knative-prow-robot merged 5 commits into
knative:masterfrom
skonto:fix_deadlock
Feb 25, 2021
Merged

Fix potential deadlock when k8s client is used#2031
knative-prow-robot merged 5 commits into
knative:masterfrom
skonto:fix_deadlock

Conversation

@skonto
Copy link
Copy Markdown
Contributor

@skonto skonto commented Feb 19, 2021

Changes

  • Removes the global metrics config mutex to avoid a potential deadlock when k8s client is used due to the
    registered metrics hooks in the client.
  • This PR preserves the previous semantics when managing the metrics config etc so it should be transparent from a user point of view.
  • It follows the worker approach in the opencensus lib.

/kind bug

Fixes #807

/cc @vagababov @evankanderson

@knative-prow-robot knative-prow-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 19, 2021
@google-cla google-cla Bot added the cla: yes Indicates the PR's author has signed the CLA. label Feb 19, 2021
@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 19, 2021
@skonto skonto changed the title remove global metrics config lock Fix potential deadlock when k8s client is used Feb 19, 2021
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 19, 2021

Codecov Report

Merging #2031 (3ca75a4) into master (29092fe) will increase coverage by 0.03%.
The diff coverage is 85.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2031      +/-   ##
==========================================
+ Coverage   67.15%   67.19%   +0.03%     
==========================================
  Files         213      214       +1     
  Lines        8991     9019      +28     
==========================================
+ Hits         6038     6060      +22     
- Misses       2682     2686       +4     
- Partials      271      273       +2     
Impacted Files Coverage Δ
metrics/metrics_worker.go 71.42% <71.42%> (ø)
metrics/exporter.go 84.61% <100.00%> (+6.35%) ⬆️
metrics/testing.go 100.00% <100.00%> (ø)
test/gcs/mock/mock.go 91.39% <0.00%> (-2.16%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29092fe...908eed3. Read the comment docs.

Stavros Kontopoulos added 2 commits February 19, 2021 16:33
@skonto
Copy link
Copy Markdown
Contributor Author

skonto commented Feb 19, 2021

TestMetricsExport fails again.

Comment thread metrics/testing.go Outdated

// InitForTesting initialize the necessary global variables for unit tests.
func InitForTesting() {
setupMetricsWorker()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already called due to init?

Copy link
Copy Markdown
Contributor Author

@skonto skonto Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope that was really weird because I thought init from the metrics package (since it is imported) it would be call first but it does not. It fails with a nil pointer exception when it tries to access the mWorker instance.
Let me double check maybe something messed up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cant reproduce it will update.

@skonto
Copy link
Copy Markdown
Contributor Author

skonto commented Feb 22, 2021

@evankanderson gentle ping.

Copy link
Copy Markdown
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the unit test failure, this looks reasonable to me.

Comment thread metrics/exporter.go
Comment on lines +167 to +168
err = <-updateCmd.done
return err
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be return <-updateCmd.done

@evankanderson
Copy link
Copy Markdown
Member

Running main and your branch 84 times, I see about an equivalent number of failures (8 vs 9(, but there seems to be a lot more output on failure with this PR, and I see a timeout/missing data failure in the OpenCensus export in addition to the e2e_test.go:320 timeout on Prometheus errors. I'm willing to believe that the OpenCensus failure was a fluke and that this doesn't make the Prometheus failures any worse (though they seem much higher than I'd expect).

/retest
/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2021
@knative-prow-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2021
@skonto
Copy link
Copy Markdown
Contributor Author

skonto commented Feb 25, 2021

/retest

@skonto
Copy link
Copy Markdown
Contributor Author

skonto commented Feb 25, 2021

@evankanderson I have seen that test failing multiple times in my other pr #2005, left a comment there, so I am not sure if this is any different. I will try to debug that and see if it can be fixed in another PR (I am out this week btw).

@knative-prow-robot knative-prow-robot merged commit 5367a43 into knative:master Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Potential deadlock in knative.dev/pkg/metrics when using k8s.io/client-go/kubernetes

4 participants