fix: make OOM kill count cumulative in container metrics timeseries #377

Open: debot-macmini1 wants to merge 1 commit into main from fix/oom-timeseries-cumulative

Problem

OOM events are visible via the new container_oom_event resource path, but the regular container metrics timeseries (the MPA stream's ContainerMetricItem) does not reliably reflect them.

Root cause in current implementation:

  • SubscriptionManager.Broadcast sets oom_kill_count to 0/1 based only on the current sample's LastTerminationReason.
  • The stream send is non-blocking and drops metrics when the per-client channel is full.
  • OOM is a rare, single-sample signal, so a single drop can cause the timeseries to miss it.
  • The proto comment on oom_kill_count says the value should be cumulative if available, which the current 0/1 behavior does not satisfy.
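The drop behavior described above can be illustrated with a minimal sketch. The `sample` type and `trySend` helper are hypothetical names for illustration only, not the actual SubscriptionManager code; the sketch only assumes the send uses Go's usual `select`/`default` non-blocking pattern on a buffered per-client channel.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one ContainerMetricItem.
type sample struct {
	Seq          int
	OOMKillCount int
}

// trySend attempts a non-blocking send and reports whether the
// sample was queued. When the buffer is full the sample is dropped.
func trySend(ch chan<- sample, s sample) bool {
	select {
	case ch <- s:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan sample, 1) // per-client buffer of size 1

	fmt.Println(trySend(ch, sample{Seq: 1}))                  // true: buffer had room
	fmt.Println(trySend(ch, sample{Seq: 2, OOMKillCount: 1})) // false: buffer full, sample dropped
	fmt.Println((<-ch).Seq)                                   // 1: only the non-OOM sample arrives
}
```

With a 0/1 flag that is set only on the OOM-bearing sample, the dropped send above is the only chance the client had to see the OOM.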

Fix

Make oom_kill_count a cumulative, sticky count per (namespace/pod/container) inside SubscriptionManager:

  • Track {count, lastRestart} in-memory.
  • Increment count only when a sample has LastTerminationReason == OOMKilled and its RestartCount is greater than the stored lastRestart, so repeated samples for the same restart are not double-counted.
  • Always emit the cumulative oom_kill_count on every subsequent utilization sample.

This means even if the OOM-bearing sample is dropped due to backpressure, the counter persists and later samples still reflect the OOM.

Tests (negative/robustness)

Added unit tests in internal/server/mpa_server_test.go:

  • Verifies oom_kill_count is cumulative and sticky across non-OOM samples.
  • Verifies no double-count when the same restartCount is observed repeatedly.
  • Negative test: fills the client channel to force a dropped OOM sample, then confirms the next sample still carries oom_kill_count=1.
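The dropped-sample scenario from the negative test can be sketched in a self-contained form. This is not the test in internal/server/mpa_server_test.go; `broadcaster`, `item`, and `runDropScenario` are hypothetical names, and the sketch reduces the manager to a single client and a single container, so it needs no locking.

```go
package main

import "fmt"

// item is a hypothetical stand-in for ContainerMetricItem.
type item struct {
	RestartCount int32
	OOMKillCount int64
}

// broadcaster is a single-client, single-container reduction of the
// subscription manager, used only to illustrate the scenario.
type broadcaster struct {
	client      chan item
	count       int64
	lastRestart int32
}

// broadcast updates the sticky counter first, then does a
// non-blocking send. It reports whether the item was delivered.
func (b *broadcaster) broadcast(reason string, restartCount int32) bool {
	if restartCount > b.lastRestart {
		if reason == "OOMKilled" {
			b.count++
		}
		b.lastRestart = restartCount
	}
	select {
	case b.client <- item{RestartCount: restartCount, OOMKillCount: b.count}:
		return true
	default:
		return false
	}
}

// runDropScenario fills the client buffer, forces the OOM-bearing
// sample to be dropped, then returns the next delivered item.
func runDropScenario() (dropped bool, next item) {
	b := &broadcaster{client: make(chan item, 1)}

	b.broadcast("", 0)                     // fills the buffer
	dropped = !b.broadcast("OOMKilled", 1) // buffer full: this sample is dropped
	<-b.client                             // client catches up
	b.broadcast("", 1)                     // ordinary follow-up sample
	next = <-b.client
	return dropped, next
}

func main() {
	dropped, next := runDropScenario()
	fmt.Println(dropped, next.OOMKillCount) // true 1
}
```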

Notes

  • go test ./... also runs the e2e suite (test/e2e), which can hang locally. To run only the unit tests:
    • go test ./internal/server -run TestSubscriptionManager -count=1
