
Update rule oom handler #275

Merged
Tzvonimir merged 4 commits into main from tzvonimir/oom-events-feedback-loop on Feb 15, 2026

Conversation

@Tzvonimir
Contributor

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which components does this change affect? Mark all that apply with an x.)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread internal/collector/workload_rule_collector.go
@Tzvonimir force-pushed the tzvonimir/oom-events-feedback-loop branch from 2e95b17 to 080783a on February 13, 2026 13:48
@gitar-bot

gitar-bot Bot commented Feb 15, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Clean implementation of WorkloadRuleCollector following established patterns. Both previous race condition findings (double-close of batchChan) are resolved via sync.Once and RWMutex guards. OOM detection improvements in pod and container collectors are correct.

✅ 3 resolved
Bug: Goroutine can re-enter Stop() causing double-close of batchChan

📄 internal/collector/workload_rule_collector.go:126
The background goroutine in Start() (lines 127-136) differs from the template WorkloadRecommendationCollector in a problematic way. When stopCh is closed by an external Stop() call, the goroutine checks ctx.Err() != nil and calls Stop() again (line 133). This creates a guaranteed re-entrant call path:

  1. External caller invokes Stop() → closes stopCh, closes batchChan, sets it to nil
  2. Goroutine receives on <-stopCh (line 131)
  3. If context is also cancelled (ctx.Err() != nil), goroutine calls Stop() again (line 133)
  4. Second Stop() reads c.batchChan as nil (set in step 1), skips the close — this specific path is safe

However, there's a race window: if Stop() is called externally and context cancels simultaneously, both the goroutine's ctx.Done() branch and the external caller run Stop() concurrently. Neither call is protected by a mutex, so both can see batchChan != nil, and both can attempt close(batchChan), causing a panic.

The WorkloadRecommendationCollector avoids this by simply returning when stopCh fires (line 154-155 of that file), without calling Stop() again. The fix is to either:

  1. Match the template pattern: remove the re-entrant Stop() call on the <-stopCh branch
  2. Add a sync.Once to guard the teardown logic in Stop()
Bug: Race: concurrent Stop() calls can double-close batchChan

📄 internal/collector/workload_rule_collector.go:176
The Stop() method can be called concurrently from the background goroutine (lines 127-136) and from an external caller. The batchChan nil-check-and-close at lines 187-191 is not protected by a mutex, so two goroutines can both see c.batchChan != nil and both attempt close(c.batchChan), causing a panic: close of closed channel.

Additionally, the informer's event handler (line 150) can still be executing c.batchChan <- ... after Stop() closes the channel, causing a panic: send on closed channel. There's no synchronization between the informer shutdown (async via stopCh) and the immediate close(c.batchChan).

Impact: Under graceful shutdown or context cancellation, the collector can panic and crash the process.

Suggested fix: Add a sync.Once to protect the channel close, or use a mutex around both the send and close paths. For example:

type WorkloadRuleCollector struct {
    // ... existing fields
    stopOnce sync.Once
}

func (c *WorkloadRuleCollector) Stop() error {
    c.stopOnce.Do(func() {
        close(c.stopCh)
        // wait briefly for informer handlers to drain
        close(c.batchChan)
    })
    if c.batcher != nil {
        c.batcher.stop()
    }
    return nil
}

Note: The WorkloadRecommendationCollector has the same issue, but that's pre-existing.

Edge Case: oomEventsChanged misses mutations when event count is unchanged

📄 internal/collector/workload_rule_collector.go:165
The oomEventsChanged method only compares len(oldEvents) != len(newEvents). This means if an OOM event is removed and a different one is added in the same update (count stays the same), or if an existing event's fields are modified, the change is silently dropped.

While the 60-second resync partially mitigates this (it re-sends if any OOM events exist), there's a window where the control plane has stale data. More importantly, if the CRD controller ever implements event rotation (removing old events while adding new ones to cap the list), this logic would permanently miss those updates until a resync.

Consider using reflect.DeepEqual for a content comparison, consistent with how workloadRecommendationChanged compares status maps in the sibling collector:

func (c *WorkloadRuleCollector) oomEventsChanged(oldWR, newWR *unstructured.Unstructured) bool {
	oldEvents, _, _ := unstructured.NestedSlice(oldWR.Object, "status", "oomEvents")
	newEvents, _, _ := unstructured.NestedSlice(newWR.Object, "status", "oomEvents")
	return !reflect.DeepEqual(oldEvents, newEvents)
}

This is cheap for small slices and catches all mutations.


@Tzvonimir Tzvonimir merged commit a7ab236 into main Feb 15, 2026
25 checks passed
@Tzvonimir Tzvonimir deleted the tzvonimir/oom-events-feedback-loop branch February 15, 2026 20:45
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* Update rule oom handler

* Fix rule collector and startup check

* Fix rbac and dist installer

* Fix gitar issues
