fix: Race with Prometheus Operator CR conversion #1999
ringerc wants to merge 1 commit into VictoriaMetrics:master
Due to a race in the operator, conversion of Prometheus Operator CRs was unreliable and could fail to occur when the operator was first installed in a kube cluster. Fixes VictoriaMetrics#1998
The LLM-generated explanation for the issue, given this prompt:
was:

Prometheus CR Conversion Race Condition Fix

Symptom

On first install of the VictoriaMetrics Operator, existing Prometheus CRs

Root Cause Analysis

Two independent bugs combine to produce the symptom.

Bug 1 —

| Test | Verifies |
|---|---|
| `Test_sharedAPIDiscoverer_allSubscribersNotified` | All concurrent subscribers for a group are notified when the API becomes available |
| `Test_sharedAPIDiscoverer_lateSubscriberGetsNotified` | After the first subscriber is notified and the group entry deleted, a late subscriber still gets a fresh polling goroutine and is notified |

Files changed

| File | Change |
|---|---|
| `internal/controller/operator/factory/k8stools/client_utils.go` | `NewObjectWatcherForNamespaces` accepts and forwards `opts ...metav1.ListOptions` |
| `internal/controller/operator/vmprometheusconverter_controller.go` | All 6 `WatchFunc` closures pass options; `startPollFor` rewritten; `discoveryClient` interface and `pollInterval` field added |
| `internal/controller/operator/vmprometheusconverter_controller_test.go` | Two new unit tests |

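
To illustrate the "closures pass options" change listed above, here is a small sketch of the difference between a watch closure that drops the options it is given and one that forwards them. It deliberately avoids the real client-go types: `listOptions`, `watchFunc`, `watchBackend`, `droppingWatchFunc` and `forwardingWatchFunc` are all hypothetical stand-ins, and the sketch makes no claim about which `metav1.ListOptions` field the actual bug depended on.

```go
package main

import "fmt"

// listOptions is a hypothetical stand-in for metav1.ListOptions,
// reduced to two fields for illustration.
type listOptions struct {
	ResourceVersion     string
	AllowWatchBookmarks bool
}

// watchFunc mirrors the shape of an informer watch callback:
// the caller supplies the options it wants applied to the request.
type watchFunc func(opts listOptions) string

// watchBackend stands in for a typed client's Watch call and simply
// describes the request it would have sent.
func watchBackend(resource string, opts listOptions) string {
	return fmt.Sprintf("watch %s rv=%q bookmarks=%t", resource, opts.ResourceVersion, opts.AllowWatchBookmarks)
}

// droppingWatchFunc ignores the caller's options (the assumed old behaviour).
func droppingWatchFunc(resource string) watchFunc {
	return func(_ listOptions) string {
		return watchBackend(resource, listOptions{})
	}
}

// forwardingWatchFunc passes the caller's options through unchanged.
func forwardingWatchFunc(resource string) watchFunc {
	return func(opts listOptions) string {
		return watchBackend(resource, opts)
	}
}

func main() {
	opts := listOptions{ResourceVersion: "42", AllowWatchBookmarks: true}
	fmt.Println(droppingWatchFunc("servicemonitors")(opts))
	fmt.Println(forwardingWatchFunc("servicemonitors")(opts))
}
```

In the dropping variant the backend never sees the resource version or bookmark flag the caller asked for, so whatever semantics the caller relies on are silently lost.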
hey @ringerc |
@AndrewChubatiuk That's a way more intrusive change, and I honestly don't know the innards of the kube API and the conventions for operator implementation well enough to have a reasonable chance of usefully reviewing it. It sounds like it addresses the same issue - but it's also been open for a month+ and touches a lot of files so presumably it's not considered an easy merge. I'm trying out a private build off this branch for now to work around the issue anyway.
I'd prefer Andrew's patch - it also allows us to resolve a few long-standing problems and remove several hacks. Sorry for keeping the review of it on the backburner - now that you've filed the issue I'll bump it up my review queue.
As for this PR - I think it only fixes the issue on 1.35+ but doesn't address the problem on other k8s versions. Is it addressed by |
@vrutkovs As far as I can tell, this issue does not exist on pre-1.35 so there is nothing to fix for older versions. If I've understood it correctly, the I'm happy to close this as there's a preferred approach prepared by someone who knows much more than I do. |
Yes, I think so. Could you open a separate PR to fix that? |
Due to a race in the operator, conversion of Prometheus Operator CRs was unreliable and could fail to occur when the operator was first installed in a kube cluster.
This LLM-assisted patch fixes the issue, and adds a test case to validate the fix. I'll review the change more closely then mark this non-draft and ready for review, but quite frankly it would've taken me weeks to find and identify this issue so the effectiveness of my review may be questionable.
Details added in comment to keep LLM material separate: #1999 (comment)
Fixes #1998