
Update rule oom handler #275

Merged
Tzvonimir merged 4 commits into main from tzvonimir/oom-events-feedback-loop on Feb 15, 2026

Conversation

@Tzvonimir
Contributor

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which components does this change affect? Mark all that apply with an x.)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Comment thread internal/collector/workload_rule_collector.go
@Tzvonimir force-pushed the tzvonimir/oom-events-feedback-loop branch from 2e95b17 to 080783a on February 13, 2026 13:48
@gitar-bot

gitar-bot Bot commented Feb 15, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Clean implementation of WorkloadRuleCollector following established patterns. Both previous race condition findings (double-close of batchChan) are resolved via sync.Once and RWMutex guards. OOM detection improvements in pod and container collectors are correct.

✅ 3 resolved
Bug: Goroutine can re-enter Stop() causing double-close of batchChan

📄 internal/collector/workload_rule_collector.go:126
The background goroutine in Start() (lines 127-136) differs from the template WorkloadRecommendationCollector in a problematic way. When stopCh is closed by an external Stop() call, the goroutine checks ctx.Err() != nil and calls Stop() again (line 133). This creates a guaranteed re-entrant call path:

  1. External caller invokes Stop() → closes stopCh, closes batchChan, sets it to nil
  2. Goroutine receives on <-stopCh (line 131)
  3. If context is also cancelled (ctx.Err() != nil), goroutine calls Stop() again (line 133)
  4. Second Stop() reads c.batchChan as nil (set in step 1), skips the close — this specific path is safe

However, there's a race window: if Stop() is called externally and context cancels simultaneously, both the goroutine's ctx.Done() branch and the external caller run Stop() concurrently. Neither call is protected by a mutex, so both can see batchChan != nil, and both can attempt close(batchChan), causing a panic.

The WorkloadRecommendationCollector avoids this by simply returning when stopCh fires (line 154-155 of that file), without calling Stop() again. The fix is to either:

  1. Match the template pattern: remove the re-entrant Stop() call on the <-stopCh branch
  2. Add a sync.Once to guard the teardown logic in Stop()
Bug: Race: concurrent Stop() calls can double-close batchChan

📄 internal/collector/workload_rule_collector.go:176
The Stop() method can be called concurrently from the background goroutine (lines 127-136) and from an external caller. The batchChan nil-check-and-close at lines 187-191 is not protected by a mutex, so two goroutines can both see c.batchChan != nil and both attempt close(c.batchChan), causing a panic: close of closed channel.

Additionally, the informer's event handler (line 150) can still be executing c.batchChan <- ... after Stop() closes the channel, causing a panic: send on closed channel. There's no synchronization between the informer shutdown (async via stopCh) and the immediate close(c.batchChan).

Impact: Under graceful shutdown or context cancellation, the collector can panic and crash the process.

Suggested fix: Add a sync.Once to protect the channel close, or use a mutex around both the send and close paths. For example:

type WorkloadRuleCollector struct {
    // ... existing fields
    stopOnce sync.Once
}

func (c *WorkloadRuleCollector) Stop() error {
    c.stopOnce.Do(func() {
        close(c.stopCh)
        // wait briefly for informer handlers to drain
        close(c.batchChan)
    })
    if c.batcher != nil {
        c.batcher.stop()
    }
    return nil
}

Note: The WorkloadRecommendationCollector has the same issue, but that's pre-existing.

Edge Case: oomEventsChanged misses mutations when event count is unchanged

📄 internal/collector/workload_rule_collector.go:165
The oomEventsChanged method only compares len(oldEvents) != len(newEvents). This means if an OOM event is removed and a different one is added in the same update (count stays the same), or if an existing event's fields are modified, the change is silently dropped.

While the 60-second resync partially mitigates this (it re-sends if any OOM events exist), there's a window where the control plane has stale data. More importantly, if the CRD controller ever implements event rotation (removing old events while adding new ones to cap the list), this logic would permanently miss those updates until a resync.

Consider using reflect.DeepEqual for a content comparison, consistent with how workloadRecommendationChanged compares status maps in the sibling collector:

func (c *WorkloadRuleCollector) oomEventsChanged(oldWR, newWR *unstructured.Unstructured) bool {
	oldEvents, _, _ := unstructured.NestedSlice(oldWR.Object, "status", "oomEvents")
	newEvents, _, _ := unstructured.NestedSlice(newWR.Object, "status", "oomEvents")
	return !reflect.DeepEqual(oldEvents, newEvents)
}

This is cheap for small slices and catches all mutations.


@Tzvonimir Tzvonimir merged commit a7ab236 into main Feb 15, 2026
25 checks passed
@Tzvonimir Tzvonimir deleted the tzvonimir/oom-events-feedback-loop branch February 15, 2026 20:45
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* Update rule oom handler

* Fix rule collector and startup check

* Fix rbac and dist installer

* Fix gitar issues
