fix: Race with Prometheus Operator CR conversion #1999

Closed

ringerc wants to merge 1 commit into VictoriaMetrics:master from ringerc:fix-prom-conversion-race

Conversation

@ringerc commented Mar 25, 2026

Due to a race in the operator, conversion of Prometheus Operator CRs was unreliable and could fail to occur when the operator was first installed in a kube cluster.

This LLM-assisted patch fixes the issue and adds a test case to validate the fix. I'll review the change more closely, then mark this non-draft and ready for review; quite frankly, it would've taken me weeks to find and identify this issue, so the effectiveness of my review may be questionable.

Details added in comment to keep LLM material separate: #1999 (comment)

Fixes #1998

@ringerc (Author) commented Mar 25, 2026

The LLM-generated explanation for the issue, given this prompt:

There is a race condition in VictoriaMetrics Operator's support for converting Prometheus CRs like PodMonitor into the VictoriaMetrics equivalent CRs. When the operator is first installed and started, it does not detect PodMonitor and ServiceMonitor resources and convert them. When the operator is restarted, the resources are correctly detected and converted. Investigate this and prepare a patch to fix it.

was:

Prometheus CR Conversion Race Condition Fix

Symptom

On first install of the VictoriaMetrics Operator, existing Prometheus CRs
(PodMonitor, ServiceMonitor, Probe, PrometheusRule,
AlertmanagerConfig, ScrapeConfig) are not detected and converted to their
VM-native equivalents. After a pod restart the resources are correctly detected
and converted.


Root Cause Analysis

Two independent bugs combine to produce the symptom.

Bug 1 — WatchFunc ignores metav1.ListOptions (primary)

Background

client-go v0.35.3 enables the WatchListClient feature gate by default for
Kubernetes ≥ 1.35. When this gate is on, the Reflector switches from the classic
List → Watch bootstrap sequence to WatchList mode:

  1. WatchFunc is called with options.SendInitialEvents = ptr.To(true) and
    options.AllowWatchBookmarks = true.
  2. The API server streams all existing objects as ADDED events followed by a
    special BOOKMARK event (isInitialEventsEndBookmark).
  3. The Reflector buffers those events in a temporary store and only promotes them
    to the main store after the BOOKMARK is received — marking the cache as
    "synced".

If the BOOKMARK never arrives, the informer never marks itself synced and no
AddFunc handlers fire for existing objects.
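
For reference, the options the Reflector hands to WatchFunc in WatchList mode look roughly like this (an illustrative sketch of client-go behavior, not code from this patch; assumes the usual aliases metav1 = k8s.io/apimachinery/pkg/apis/meta/v1 and ptr = k8s.io/utils/ptr):

// Illustration only: approximately what the Reflector passes to WatchFunc
// when the WatchListClient feature gate is enabled.
func watchListOptions() metav1.ListOptions {
    return metav1.ListOptions{
        SendInitialEvents:    ptr.To(true), // stream existing objects as ADDED events first
        AllowWatchBookmarks:  true,         // required so the end-of-initial-events BOOKMARK can be sent
        ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
    }
}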

The bug

NewObjectWatcherForNamespaces previously ignored the metav1.ListOptions
argument that the Reflector passes to WatchFunc:

// BROKEN — options is silently dropped
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
    return k8stools.NewObjectWatcherForNamespaces[promv1.PodMonitorList](
        ctx, rclient, "pod_monitors", baseConf.WatchNamespaces)
},

Because SendInitialEvents was never forwarded to the underlying
rclient.Watch call, the API server started a plain watch with no initial
event delivery. The Reflector waited indefinitely for the BOOKMARK that would
never come, so existing Prometheus CRs were invisible to the operator on first
start.

On restart the Reflector retried via the classic List fallback path (it falls
back after a configurable timeout), which does deliver existing objects — hence
the restart workaround.

Fix

NewObjectWatcherForNamespaces was given a variadic opts ...metav1.ListOptions
parameter. Each per-namespace rclient.Watch call now sets Raw on its
client.ListOptions:

// client_utils.go
func NewObjectWatcherForNamespaces[T any, PT listing[T]](
    ctx context.Context,
    rclient client.WithWatch,
    crdTypeName string,
    namespaces []string,
    opts ...metav1.ListOptions,   // forwarded from Reflector
) (watch.Interface, error) {
    ...
    addWatcher := func(ns string) error {
        listOpt := &client.ListOptions{Namespace: ns}
        if len(opts) > 0 {
            rawOpts := opts[0]          // copy to avoid aliasing
            listOpt.Raw = &rawOpts
        }
        w, err := rclient.Watch(localCtx, dst, listOpt)
        ...
    }
}

All six WatchFunc closures in vmprometheusconverter_controller.go were
updated to pass options:

// FIXED
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
    return k8stools.NewObjectWatcherForNamespaces[promv1.PodMonitorList](
        ctx, rclient, "pod_monitors", baseConf.WatchNamespaces, options)
},

Bug 2 — startPollFor holds mutex during channel send (secondary)

Background

Before starting any Prometheus-CR informer the operator must confirm that the
Prometheus CRDs are actually registered in the cluster. sharedAPIDiscoverer
polls the discovery API and notifies waiting goroutines via unbuffered channels.

The bug

startPollFor held s.mu while sending on each subscriber channel:

// BROKEN
s.mu.Lock()
for _, r := range api.APIResources {
    if notify, ok := s.kindReadyByGroup[group][r.Kind]; ok {
        notify <- struct{}{}        // blocks while mutex is held!
        delete(s.kindReadyByGroup[group], r.Kind)
    }
}
if len(s.kindReadyByGroup[group]) == 0 {
    // group entry NOT deleted here — omission!
}
s.mu.Unlock()

Two failure modes result:

  1. Deadlock under concurrent subscribers. A goroutine calling
    subscribeForGroupKind must first acquire s.mu. The polling goroutine holds
    s.mu while blocked on a channel send to an earlier subscriber. If that
    earlier subscriber is the very goroutine now blocked inside
    subscribeForGroupKind, neither side can make progress and the system
    deadlocks (a minimal illustration of this pattern follows the list).

  2. Late-subscriber never notified. After all registered subscribers are
    notified the group entry in kindReadyByGroup was never deleted. A goroutine
    subscribing after the polling goroutine has exited adds itself to the now-
    orphaned map entry. No polling goroutine is started (because the entry already
    exists) and the new subscriber waits forever.
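
A minimal, standalone illustration of failure mode 1 (plain Go, not operator code): one goroutine wants the mutex before it will ever receive, while the other sends on an unbuffered channel with the mutex still held.

// Illustration only: reproduces the lock-then-send deadlock pattern in isolation.
package main

import "sync"

func main() {
    var mu sync.Mutex
    ch := make(chan struct{}) // unbuffered, like the subscriber channels

    go func() {
        mu.Lock() // "subscriber": needs the lock before it will receive
        defer mu.Unlock()
        <-ch
    }()

    mu.Lock()
    ch <- struct{}{} // "poller": sends while still holding the lock
    mu.Unlock()
    // Whichever goroutine takes the lock first, the other blocks waiting for it
    // while the holder blocks on the channel, so the program deadlocks.
}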

Fix

Collect the channels to notify while holding the mutex, delete the group entry
atomically when it becomes empty, release the mutex, then send:

// FIXED
func (s *sharedAPIDiscoverer) startPollFor(ctx context.Context, group string) {
    interval := s.pollInterval
    if interval == 0 {
        interval = 5 * time.Second
    }
    tick := time.NewTicker(interval)
    defer tick.Stop()
    for {
        select {
        case <-tick.C:
            api, err := s.baseClient.ServerResourcesForGroupVersion(group)
            if err != nil {
                if !k8serrors.IsNotFound(err) {
                    converterLogger.Error(err, "cannot get server resource for api group version")
                }
                continue
            }
            s.mu.Lock()
            var toNotify []chan struct{}
            for _, r := range api.APIResources {
                if notify, ok := s.kindReadyByGroup[group][r.Kind]; ok {
                    toNotify = append(toNotify, notify)
                    delete(s.kindReadyByGroup[group], r.Kind)
                }
            }
            remaining := len(s.kindReadyByGroup[group])
            if remaining == 0 {
                delete(s.kindReadyByGroup, group)   // prevent orphaned entry
            }
            s.mu.Unlock()
            // send without holding the lock
            for _, ch := range toNotify {
                select {
                case ch <- struct{}{}:
                case <-ctx.Done():
                    return
                }
            }
            if remaining == 0 {
                return
            }
        case <-ctx.Done():
            return
        }
    }
}

Testability improvements

A discoveryClient interface was introduced so the polling logic can be tested
without a real API server:

type discoveryClient interface {
    ServerResourcesForGroupVersion(groupVersion string) (*metav1.APIResourceList, error)
}

type sharedAPIDiscoverer struct {
    baseClient       discoveryClient
    pollInterval     time.Duration   // 0 → default 5 s
    mu               sync.Mutex
    kindReadyByGroup map[string]map[string]chan struct{}
}
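
With that seam in place, a simple stub is enough to drive the polling logic in tests. A minimal sketch (the stub's name and fields are illustrative, not taken from the actual test file; assumes k8serrors = k8s.io/apimachinery/pkg/api/errors and schema = k8s.io/apimachinery/pkg/runtime/schema):

// Illustrative test stub satisfying discoveryClient; the real tests may differ.
type fakeDiscovery struct {
    mu        sync.Mutex
    resources map[string]*metav1.APIResourceList // keyed by groupVersion
}

func (f *fakeDiscovery) ServerResourcesForGroupVersion(gv string) (*metav1.APIResourceList, error) {
    f.mu.Lock()
    defer f.mu.Unlock()
    if rl, ok := f.resources[gv]; ok {
        return rl, nil
    }
    return nil, k8serrors.NewNotFound(schema.GroupResource{Resource: "apiresourcelists"}, gv)
}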

Tests added

internal/controller/operator/vmprometheusconverter_controller_test.go

Test_sharedAPIDiscoverer_allSubscribersNotified: verifies that all concurrent subscribers for a group are notified when the API becomes available.
Test_sharedAPIDiscoverer_lateSubscriberGetsNotified: verifies that, after the first subscriber is notified and the group entry deleted, a late subscriber still gets a fresh polling goroutine and is notified.
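
Using the fakeDiscovery stub sketched above, the shape of such a test can be quite small; the following is a hedged sketch rather than the exact test code from the PR:

// Sketch only: drives startPollFor directly with a pre-registered subscriber.
func Test_sharedAPIDiscoverer_notifies_sketch(t *testing.T) {
    ready := make(chan struct{})
    d := &sharedAPIDiscoverer{
        baseClient: &fakeDiscovery{resources: map[string]*metav1.APIResourceList{
            "monitoring.coreos.com/v1": {APIResources: []metav1.APIResource{{Kind: "PodMonitor"}}},
        }},
        pollInterval: 10 * time.Millisecond, // keep the test fast
        kindReadyByGroup: map[string]map[string]chan struct{}{
            "monitoring.coreos.com/v1": {"PodMonitor": ready},
        },
    }
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    go d.startPollFor(ctx, "monitoring.coreos.com/v1")

    select {
    case <-ready:
        // the subscriber was notified once the kind appeared in discovery
    case <-ctx.Done():
        t.Fatal("subscriber was never notified")
    }
}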

Files changed

internal/controller/operator/factory/k8stools/client_utils.go: NewObjectWatcherForNamespaces accepts and forwards opts ...metav1.ListOptions.
internal/controller/operator/vmprometheusconverter_controller.go: all six WatchFunc closures pass options; startPollFor rewritten; discoveryClient interface and pollInterval field added.
internal/controller/operator/vmprometheusconverter_controller_test.go: two new unit tests.

@AndrewChubatiuk (Contributor) commented Mar 26, 2026

hey @ringerc
thanks for the PR
we already have one related PR open. Could you please check if it covers your issue as well?

@ringerc (Author) commented Mar 30, 2026

@AndrewChubatiuk That's a way more intrusive change, and I honestly don't know the innards of the kube API and the conventions for operator implementation well enough to have a reasonable chance of usefully reviewing it.

It sounds like it addresses the same issue - but it's also been open for a month+ and touches a lot of files, so presumably it's not considered an easy merge.

I'm trying out a private build off this branch for now to work around the issue anyway.

ringerc marked this pull request as ready for review March 30, 2026 00:43

@cubic-dev-ai (Bot) left a comment

No issues found across 3 files

@vrutkovs (Collaborator) commented Mar 30, 2026

I'd prefer Andrew's patch - it also allows us to resolve a few long-standing problems and remove several hacks. Sorry for keeping the review of it on the back burner - now that you've filed the issue I'll bump it up in my review queue.

@vrutkovs (Collaborator) commented
As for this PR - I think it only fixes the issue on 1.35+ but doesn't address the problem on other k8s versions. Is it addressed by startPollFor too? I think we want a few more e2e tests to cover possible issues here.

@ringerc (Author) commented Mar 31, 2026

@vrutkovs As far as I can tell, this issue does not exist on pre-1.35, so there is nothing to fix for older versions. If I've understood it correctly, the startPollFor issue you referenced looks like it's a separate concurrency bug?

I'm happy to close this as there's a preferred approach prepared by someone who knows much more than I do.

ringerc closed this Mar 31, 2026
@vrutkovs (Collaborator) commented
> If I've understood it correctly, the startPollFor issue you referenced looks like it's a separate concurrency bug?

Yes, I think so. Could you open a separate PR to fix that?

Development

Successfully merging this pull request may close these issues.

VMOperator sometimes fails to convert pre-existing Prometheus Operator CRs on first install
