fix: Race with Prometheus Operator CR conversion #1999

Closed

ringerc wants to merge 1 commit into VictoriaMetrics:master from ringerc:fix-prom-conversion-race

Conversation

@ringerc commented Mar 25, 2026

Due to a race in the operator, conversion of Prometheus Operator CRs was unreliable and could fail to occur when the operator was first installed in a kube cluster.

This LLM-assisted patch fixes the issue and adds a test case to validate the fix. I'll review the change more closely, then mark this non-draft and ready for review; quite frankly, it would've taken me weeks to find and identify this issue, so the effectiveness of my review may be questionable.

Details added in comment to keep LLM material separate: #1999 (comment)

Fixes #1998

@ringerc (Author) commented Mar 25, 2026

The LLM-generated explanation for the issue, given this prompt:

There is a race condition in VictoriaMetrics Operator's support for converting Prometheus CRs like PodMonitor into the VictoriaMetrics equivalent CRs. When the operator is first installed and started, it does not detect PodMonitor and ServiceMonitor resources and convert them. When the operator is restarted, the resources are correctly detected and converted. Investigate this and prepare a patch to fix it.

was:

Prometheus CR Conversion Race Condition Fix

Symptom

On first install of the VictoriaMetrics Operator, existing Prometheus CRs
(PodMonitor, ServiceMonitor, Probe, PrometheusRule,
AlertmanagerConfig, ScrapeConfig) are not detected and converted to their
VM-native equivalents. After a pod restart the resources are correctly detected
and converted.


Root Cause Analysis

Two independent bugs combine to produce the symptom.

Bug 1 — WatchFunc ignores metav1.ListOptions (primary)

Background

client-go v0.35.3 enables the WatchListClient feature gate by default for
Kubernetes ≥ 1.35. When this gate is on, the Reflector switches from the classic
List → Watch bootstrap sequence to WatchList mode:

  1. WatchFunc is called with options.SendInitialEvents = ptr.To(true) and
    options.AllowWatchBookmarks = true.
  2. The API server streams all existing objects as ADDED events followed by a
    special BOOKMARK event (isInitialEventsEndBookmark).
  3. The Reflector buffers those events in a temporary store and only promotes them
    to the main store after the BOOKMARK is received — marking the cache as
    "synced".

If the BOOKMARK never arrives, the informer never marks itself synced and no
AddFunc handlers fire for existing objects.
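
For reference, the options the Reflector hands to WatchFunc in WatchList mode look roughly like this (an illustrative sketch of client-go behavior, not code from this patch; assumes the usual aliases metav1 = k8s.io/apimachinery/pkg/apis/meta/v1 and ptr = k8s.io/utils/ptr):

// Illustration only: approximately what the Reflector passes to WatchFunc
// when the WatchListClient feature gate is enabled.
func watchListOptions() metav1.ListOptions {
    return metav1.ListOptions{
        SendInitialEvents:    ptr.To(true), // stream existing objects as ADDED events first
        AllowWatchBookmarks:  true,         // required so the end-of-initial-events BOOKMARK can be sent
        ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
    }
}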

The bug

NewObjectWatcherForNamespaces previously ignored the metav1.ListOptions
argument that the Reflector passes to WatchFunc:

// BROKEN — options is silently dropped
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
    return k8stools.NewObjectWatcherForNamespaces[promv1.PodMonitorList](
        ctx, rclient, "pod_monitors", baseConf.WatchNamespaces)
},

Because SendInitialEvents was never forwarded to the underlying
rclient.Watch call, the API server started a plain watch with no initial
event delivery. The Reflector waited indefinitely for the BOOKMARK that would
never come, so existing Prometheus CRs were invisible to the operator on first
start.

On restart the Reflector retried via the classic List fallback path (it falls
back after a configurable timeout), which does deliver existing objects — hence
the restart workaround.

Fix

NewObjectWatcherForNamespaces was given a variadic opts ...metav1.ListOptions
parameter. Each per-namespace rclient.Watch call now sets Raw on its
client.ListOptions:

// client_utils.go
func NewObjectWatcherForNamespaces[T any, PT listing[T]](
    ctx context.Context,
    rclient client.WithWatch,
    crdTypeName string,
    namespaces []string,
    opts ...metav1.ListOptions,   // forwarded from Reflector
) (watch.Interface, error) {
    ...
    addWatcher := func(ns string) error {
        listOpt := &client.ListOptions{Namespace: ns}
        if len(opts) > 0 {
            rawOpts := opts[0]          // copy to avoid aliasing
            listOpt.Raw = &rawOpts
        }
        w, err := rclient.Watch(localCtx, dst, listOpt)
        ...
    }
}

All six WatchFunc closures in vmprometheusconverter_controller.go were
updated to pass options:

// FIXED
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
    return k8stools.NewObjectWatcherForNamespaces[promv1.PodMonitorList](
        ctx, rclient, "pod_monitors", baseConf.WatchNamespaces, options)
},

Bug 2 — startPollFor holds mutex during channel send (secondary)

Background

Before starting any Prometheus-CR informer the operator must confirm that the
Prometheus CRDs are actually registered in the cluster. sharedAPIDiscoverer
polls the discovery API and notifies waiting goroutines via unbuffered channels.

The bug

startPollFor held s.mu while sending on each subscriber channel:

// BROKEN
s.mu.Lock()
for _, r := range api.APIResources {
    if notify, ok := s.kindReadyByGroup[group][r.Kind]; ok {
        notify <- struct{}{}        // blocks while mutex is held!
        delete(s.kindReadyByGroup[group], r.Kind)
    }
}
if len(s.kindReadyByGroup[group]) == 0 {
    // group entry NOT deleted here — omission!
}
s.mu.Unlock()

Two failure modes result:

  1. Deadlock under concurrent subscribers. A goroutine calling
    subscribeForGroupKind must first acquire s.mu. The polling goroutine holds
    s.mu while blocked on a channel send to an earlier subscriber. If that
    earlier subscriber is the very goroutine now blocked inside
    subscribeForGroupKind, neither side can make progress and the system
    deadlocks (a minimal illustration of this pattern follows the list).

  2. Late-subscriber never notified. After all registered subscribers are
    notified the group entry in kindReadyByGroup was never deleted. A goroutine
    subscribing after the polling goroutine has exited adds itself to the now-
    orphaned map entry. No polling goroutine is started (because the entry already
    exists) and the new subscriber waits forever.
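
A minimal, standalone illustration of failure mode 1 (plain Go, not operator code): one goroutine wants the mutex before it will ever receive, while the other sends on an unbuffered channel with the mutex still held.

// Illustration only: reproduces the lock-then-send deadlock pattern in isolation.
package main

import "sync"

func main() {
    var mu sync.Mutex
    ch := make(chan struct{}) // unbuffered, like the subscriber channels

    go func() {
        mu.Lock() // "subscriber": needs the lock before it will receive
        defer mu.Unlock()
        <-ch
    }()

    mu.Lock()
    ch <- struct{}{} // "poller": sends while still holding the lock
    mu.Unlock()
    // Whichever goroutine takes the lock first, the other blocks waiting for it
    // while the holder blocks on the channel, so the program deadlocks.
}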

Fix

Collect the channels to notify while holding the mutex, delete the group entry
atomically when it becomes empty, release the mutex, then send:

// FIXED
func (s *sharedAPIDiscoverer) startPollFor(ctx context.Context, group string) {
    interval := s.pollInterval
    if interval == 0 {
        interval = 5 * time.Second
    }
    tick := time.NewTicker(interval)
    defer tick.Stop()
    for {
        select {
        case <-tick.C:
            api, err := s.baseClient.ServerResourcesForGroupVersion(group)
            if err != nil {
                if !k8serrors.IsNotFound(err) {
                    converterLogger.Error(err, "cannot get server resource for api group version")
                }
                continue
            }
            s.mu.Lock()
            var toNotify []chan struct{}
            for _, r := range api.APIResources {
                if notify, ok := s.kindReadyByGroup[group][r.Kind]; ok {
                    toNotify = append(toNotify, notify)
                    delete(s.kindReadyByGroup[group], r.Kind)
                }
            }
            remaining := len(s.kindReadyByGroup[group])
            if remaining == 0 {
                delete(s.kindReadyByGroup, group)   // prevent orphaned entry
            }
            s.mu.Unlock()
            // send without holding the lock
            for _, ch := range toNotify {
                select {
                case ch <- struct{}{}:
                case <-ctx.Done():
                    return
                }
            }
            if remaining == 0 {
                return
            }
        case <-ctx.Done():
            return
        }
    }
}

Testability improvements

A discoveryClient interface was introduced so the polling logic can be tested
without a real API server:

type discoveryClient interface {
    ServerResourcesForGroupVersion(groupVersion string) (*metav1.APIResourceList, error)
}

type sharedAPIDiscoverer struct {
    baseClient       discoveryClient
    pollInterval     time.Duration   // 0 → default 5 s
    mu               sync.Mutex
    kindReadyByGroup map[string]map[string]chan struct{}
}
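
With that seam in place, a simple stub is enough to drive the polling logic in tests. A minimal sketch (the stub's name and fields are illustrative, not taken from the actual test file; assumes k8serrors = k8s.io/apimachinery/pkg/api/errors and schema = k8s.io/apimachinery/pkg/runtime/schema):

// Illustrative test stub satisfying discoveryClient; the real tests may differ.
type fakeDiscovery struct {
    mu        sync.Mutex
    resources map[string]*metav1.APIResourceList // keyed by groupVersion
}

func (f *fakeDiscovery) ServerResourcesForGroupVersion(gv string) (*metav1.APIResourceList, error) {
    f.mu.Lock()
    defer f.mu.Unlock()
    if rl, ok := f.resources[gv]; ok {
        return rl, nil
    }
    return nil, k8serrors.NewNotFound(schema.GroupResource{Resource: "apiresourcelists"}, gv)
}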

Tests added

internal/controller/operator/vmprometheusconverter_controller_test.go

Test_sharedAPIDiscoverer_allSubscribersNotified: verifies that all concurrent subscribers for a group are notified when the API becomes available.
Test_sharedAPIDiscoverer_lateSubscriberGetsNotified: verifies that, after the first subscriber is notified and the group entry deleted, a late subscriber still gets a fresh polling goroutine and is notified.
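
Using the fakeDiscovery stub sketched above, the shape of such a test can be quite small; the following is a hedged sketch rather than the exact test code from the PR:

// Sketch only: drives startPollFor directly with a pre-registered subscriber.
func Test_sharedAPIDiscoverer_notifies_sketch(t *testing.T) {
    ready := make(chan struct{})
    d := &sharedAPIDiscoverer{
        baseClient: &fakeDiscovery{resources: map[string]*metav1.APIResourceList{
            "monitoring.coreos.com/v1": {APIResources: []metav1.APIResource{{Kind: "PodMonitor"}}},
        }},
        pollInterval: 10 * time.Millisecond, // keep the test fast
        kindReadyByGroup: map[string]map[string]chan struct{}{
            "monitoring.coreos.com/v1": {"PodMonitor": ready},
        },
    }
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()
    go d.startPollFor(ctx, "monitoring.coreos.com/v1")

    select {
    case <-ready:
        // the subscriber was notified once the kind appeared in discovery
    case <-ctx.Done():
        t.Fatal("subscriber was never notified")
    }
}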

Files changed

internal/controller/operator/factory/k8stools/client_utils.go: NewObjectWatcherForNamespaces accepts and forwards opts ...metav1.ListOptions.
internal/controller/operator/vmprometheusconverter_controller.go: all six WatchFunc closures pass options; startPollFor rewritten; discoveryClient interface and pollInterval field added.
internal/controller/operator/vmprometheusconverter_controller_test.go: two new unit tests.

@AndrewChubatiuk (Contributor) commented Mar 26, 2026

hey @ringerc
thanks for the PR
we already have one related PR open. Could you please check if it covers your issue as well?

@ringerc (Author) commented Mar 30, 2026

@AndrewChubatiuk That's a way more intrusive change, and I honestly don't know the innards of the kube API and the conventions for operator implementation well enough to have a reasonable chance of usefully reviewing it.

It sounds like it addresses the same issue - but it's also been open for a month+ and touches a lot of files, so presumably it's not considered an easy merge.

I'm trying out a private build off this branch for now to work around the issue anyway.

ringerc marked this pull request as ready for review March 30, 2026 00:43

@cubic-dev-ai (Bot) left a comment

No issues found across 3 files

@vrutkovs (Collaborator) commented Mar 30, 2026

I'd prefer Andrew's patch - it also allows us to resolve a few long-standing problems and remove several hacks. Sorry for keeping the review of it on the back burner - now that you've filed the issue I'll bump it up in my review queue.

@vrutkovs (Collaborator) commented
As for this PR - I think it only fixes the issue on 1.35+ but doesn't address the problem on other k8s versions. Is it addressed by startPollFor too? I think we want a few more e2e tests to cover possible issues here.

@ringerc (Author) commented Mar 31, 2026

@vrutkovs As far as I can tell, this issue does not exist on pre-1.35, so there is nothing to fix for older versions. If I've understood it correctly, the startPollFor issue you referenced looks like it's a separate concurrency bug?

I'm happy to close this as there's a preferred approach prepared by someone who knows much more than I do.

ringerc closed this Mar 31, 2026
@vrutkovs (Collaborator) commented
> If I've understood it correctly, the startPollFor issue you referenced looks like it's a separate concurrency bug?

Yes, I think so. Could you open a separate PR to fix that?

Development

Successfully merging this pull request may close these issues.

VMOperator sometimes fails to convert pre-existing Prometheus Operator CRs on first install
