Skip to content

better oom detection#370

Merged
Parthiba-Hazra merged 2 commits into
mainfrom
ph/oom
Apr 29, 2026
Merged

better oom detection#370
Parthiba-Hazra merged 2 commits into
mainfrom
ph/oom

Conversation

@Parthiba-Hazra
Copy link
Copy Markdown
Collaborator

@Parthiba-Hazra Parthiba-Hazra commented Apr 28, 2026

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which component does this change affect? - put x for all components)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Summary by Gitar

  • New OOM detection system:
    • Introduced OOMReconciler for periodic K8s API sweeps to capture OOM events missed by informer-based collectors.
    • Added deduplication logic between the real-time informer path and the periodic sweep to prevent duplicate events.
  • Refactored OOM reporting:
    • Added direct MPA stream publishing for OOM events to ensure low-latency delivery, bypassing the lossy combinedChannel.
    • Standardized OOM termination reasons using ReasonOOMKilled and ReasonStartError constants.
  • Server integration:
    • Updated MpaServer broadcast logic to correctly populate the OomKillCount field in gRPC metric items.

This will update automatically on new commits.

Comment thread internal/collector/pod_collector.go
Comment thread internal/collector/oom_reconciler.go Outdated
Comment thread internal/collector/pod_collector.go Outdated
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 28, 2026

CI failed: The CI build failed due to a transient infrastructure error during the setup of the Kind environment, unrelated to the code changes.

Overview

The CI pipeline encountered a failure during the environment setup phase where the sha256sum verification for the Kind binary installation failed. This appears to be a transient issue with the external dependency installation script in the GitHub Actions runner environment.

Failures

Kind Installation Checksum Failure (confidence: high)

  • Type: infrastructure
  • Affected jobs: 73414639177, 73414639133
  • Related to change: no
  • Root cause: The sha256sum utility failed to find a valid checksum for the Kind binary during installation, causing the script to exit prematurely.
  • Suggested fix: This is an infrastructure-level flake. Re-running the CI job is recommended to resolve the transient download or verification issue.

Summary

  • Change-related failures: 0
  • Infrastructure/flaky failures: 1 (Kind installation checksum failure)
  • Recommended action: Re-run the failed CI jobs. If the error persists, check the health of the Kind installation step in the workflow configuration.
Code Review ✅ Approved 3 resolved / 3 findings

Improves OOM detection logic by resolving data races on reconciler fields, implementing pagination for pod sweeps, and centralizing snapshot construction. No issues found.

✅ 3 resolved
Bug: Data race on mpaPublisher and oomReconciler fields

📄 internal/collector/pod_collector.go:536-538 📄 internal/collector/pod_collector.go:541-543 📄 internal/collector/pod_collector.go:549 📄 internal/collector/pod_collector.go:592
Both SetMpaPublisher and SetOOMReconciler write to PodCollector fields without any synchronization, while publishOOMToMpaStream reads them from informer callbacks running in a different goroutine. This is a classic data race.

The setters are called from wireMpaPublisherIntoPodCollector/wireOOMReconcilerIntoPodCollector in the controller, which can run during restartCollectors while the informer is actively delivering events.

Performance: Sweep lists all pods without filtering or pagination

📄 internal/collector/oom_reconciler.go:120-134
Every 30 seconds, sweepNamespace calls Pods(ns).List() with empty ListOptions, fetching every pod in the namespace. In large clusters with thousands of pods this creates significant API server load and memory pressure in the zxporter process, especially when watching all namespaces.

Only pods with OOM terminations need inspection, so a field selector can dramatically reduce the response size.

Quality: Duplicate OOM snapshot construction in two places

📄 internal/collector/pod_collector.go:556-570 📄 internal/collector/oom_reconciler.go:196-210
The ContainerMetricsSnapshot construction logic for OOM events is nearly identical in PodCollector.publishOOMToMpaStream (pod_collector.go:556-589) and OOMReconciler.publishOOM (oom_reconciler.go:196-228). This duplication makes it easy for future changes to diverge between the two paths.

Tip

Comment Gitar fix CI to trigger a fix.

Was this helpful? React with 👍 / 👎 | Gitar

@Parthiba-Hazra Parthiba-Hazra enabled auto-merge (squash) April 28, 2026 15:39
auto-merge was automatically disabled April 28, 2026 15:40

Pull request was closed

@Parthiba-Hazra Parthiba-Hazra merged commit 58db2bf into main Apr 29, 2026
46 of 48 checks passed
@Parthiba-Hazra Parthiba-Hazra deleted the ph/oom branch April 29, 2026 07:23
Parthiba-Hazra added a commit that referenced this pull request May 5, 2026
* better oom detection

* fix data race on mpaPublisher/oomReconciler
Parthiba-Hazra added a commit that referenced this pull request May 5, 2026
* better oom detection

* fix data race on mpaPublisher/oomReconciler
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants