
NO-JIRA: Fix flaky observability and telemetry test teardown #6578

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from agullon:fix-flaky-observability-telemetry-tests
Apr 24, 2026

@agullon
Contributor

@agullon agullon commented Apr 24, 2026

Summary

  • Add retry logic to Loki queries in observability tests to handle race condition between OTEL collector restart and data ingestion
  • Add healthcheck to telemetry suite teardown to prevent vg-manager CrashLoopBackOff from accumulating across rapid MicroShift restarts

Test plan

  • el98-lrel@optional observability Loki tests pass consistently
  • el98-lrel@storage-telemetry storage reboot test passes consistently

🤖 Generated with Claude Code

Add retry logic to Loki queries in observability tests to handle the
race condition between OTEL collector restart and Loki data ingestion.

Add healthcheck to telemetry suite teardown to prevent vg-manager
CrashLoopBackOff from accumulating across rapid MicroShift restarts,
which caused intermittent storage test failures in shared scenarios.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

pre-commit.check-secrets: ENABLED
@openshift-ci-robot

@agullon: This pull request explicitly references no jira issue.

Details

In response to this:

Summary

  • Add retry logic (Wait Until Keyword Succeeds 10x 5s) to Loki queries in observability tests to handle the race condition between OTEL collector restart and Loki data ingestion
  • Add Wait For MicroShift Healthcheck Success to telemetry suite teardown to prevent vg-manager CrashLoopBackOff from accumulating across rapid MicroShift restarts, which caused intermittent storage test failures in shared scenarios (e.g. el98-lrel@storage-telemetry)

Root Cause Analysis

Observability (Loki tests): The test setup restarts microshift-observability with a test OTEL config, then queries Loki immediately with zero retry logic. If Loki hasn't ingested data yet, the query returns empty and the test fails.
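
The retry behaviour described for the Loki queries can be illustrated outside Robot Framework. The Python sketch below mimics a `Wait Until Keyword Succeeds 10x 5s` wrapper, assuming the count covers total attempts and that an empty result is reported by raising `AssertionError`. The names `wait_until_succeeds` and `make_flaky_query` are hypothetical and are not part of the MicroShift test suite.

```python
import time
from typing import Callable, List


def wait_until_succeeds(
    check: Callable[[], List[str]],
    attempts: int = 10,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run `check` until it stops raising AssertionError or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except AssertionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            sleep(interval_s)  # give ingestion time to catch up
    raise AssertionError("unreachable")


def make_flaky_query(empty_calls: int) -> Callable[[], List[str]]:
    """Hypothetical stand-in for a Loki query that returns nothing at first."""
    calls = 0

    def query() -> List[str]:
        nonlocal calls
        calls += 1
        if calls <= empty_calls:
            raise AssertionError("Loki returned no entries yet")
        return ["journald log line"]

    return query
```

Passing a recording function as `sleep` (for example `slept.append` on a list) makes it possible to exercise the loop without actually waiting between attempts.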

Storage (reboot test): Each telemetry test restarts MicroShift without waiting for health. The vg-manager container gets killed mid-initialization on every restart, and CRI-O accumulates the restart count across kubelet restarts. By the time the storage suite runs, vg-manager is in CrashLoopBackOff with escalating backoff (up to 1m20s), so PVCs can't be provisioned within the 120s timeout.
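
The escalating backoff can be made concrete with a little arithmetic. The sketch below assumes the kubelet's documented restart backoff (starting at 10 seconds, doubling after each crash, capped at five minutes); the function name `crashloop_backoff_delays` is hypothetical.

```python
from typing import List


def crashloop_backoff_delays(
    restarts: int, base_s: int = 10, cap_s: int = 300
) -> List[int]:
    """Delay in seconds before each restart, assuming doubling with a cap."""
    return [min(base_s * 2**i, cap_s) for i in range(restarts)]


# Four accumulated crashes lead to the 1m20s delay mentioned above.
delays = crashloop_backoff_delays(4)
print(delays)       # [10, 20, 40, 80]
print(delays[-1])   # 80 seconds, i.e. 1m20s
```

Under these assumptions, a single 80-second wait already consumes two thirds of the 120s provisioning timeout, which is consistent with the intermittent failures described.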

Test plan

  • Verify el98-lrel@optional observability Loki tests pass consistently
  • Verify el98-lrel@storage-telemetry storage reboot test passes consistently
  • Verify telemetry tests still pass (healthcheck adds time but shouldn't affect results)

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot added the jira/valid-reference label ("Indicates that this PR references a valid Jira ticket of any type.") Apr 24, 2026
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Walkthrough

Two robot test suites updated with resilience improvements. Observability suite now retries Loki query verification up to 10 times over 50 seconds. Telemetry suite adds ostree-health resource helper and waits for MicroShift healthcheck in teardown sequence.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Observability Test Suite: test/suites/optional/observability.robot | Added retry logic using Wait Until Keyword Succeeds 10x 5s wrapper around Loki query checks for service_name="journald" and service_name="kube_events" to tolerate transient failures. |
| Telemetry Test Suite: test/suites/telemetry/telemetry.robot | Integrated ostree-health.resource helper and updated suite-level teardown to wait for successful MicroShift healthcheck before logout and kubeconfig cleanup. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding retry logic to observability tests and fixing telemetry test teardown to address flakiness. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Custom check validates Ginkgo test names, but PR modifies Robot Framework test suites with static descriptive names, making the check inapplicable. |
| Test Structure And Quality | ✅ Passed | The custom check targets Ginkgo test code, but the PR modifies Robot Framework test suites (.robot files) using entirely different syntax and patterns. |
| Microshift Test Compatibility | ✅ Passed | This check is not applicable to the PR because the modified files are Robot Framework test suites (.robot files), not Ginkgo e2e tests written in Go. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This pull request does not add or modify any Ginkgo e2e tests written in Go. The changes are exclusively to Robot Framework test files, which are unrelated to Ginkgo. The check passes because there are no Ginkgo tests to evaluate for SNO compatibility. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes only modify Robot Framework test files with test logic; no deployment manifests, operator code, or Kubernetes scheduling constraints are modified. |
| Ote Binary Stdout Contract | ✅ Passed | OTE Binary Stdout Contract check is not applicable to Robot Framework test files, which are declarative scripts without process-level entry points. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies only Robot Framework test files (.robot), not Ginkgo e2e tests. The custom check applies only to Go-based Ginkgo tests with IPv6 and disconnected network compatibility concerns. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@openshift-ci openshift-ci Bot requested review from copejon and pacevedom April 24, 2026 10:53
@openshift-ci Bot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") Apr 24, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/suites/telemetry/telemetry.robot`:
- Around line 112-116: Teardown currently calls "Wait For MicroShift Healthcheck
Success" directly which can abort the teardown when it fails; update the
teardown to run the healthcheck with continue-on-failure semantics (e.g., wrap
"Wait For MicroShift Healthcheck Success" with a continue-on-failure keyword
such as "Run Keyword And Ignore Error" or "Run Keyword And Continue On Failure")
so that "Logout MicroShift Host" and "Remove Kubeconfig" always execute; modify
the Teardown sequence to call the healthcheck via that wrapper while leaving
"Logout MicroShift Host" and "Remove Kubeconfig" unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: df48e8af-9f98-48cf-8dac-80204ac18a81

📥 Commits

Reviewing files that changed from the base of the PR and between 1320339 and 4660692.

📒 Files selected for processing (2)
  • test/suites/optional/observability.robot
  • test/suites/telemetry/telemetry.robot

Comment thread test/suites/telemetry/telemetry.robot
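
The inline review comment on this file asks that a failing healthcheck must not prevent the remaining teardown steps from running. A minimal Python sketch of that continue-on-failure pattern follows; it is an illustration only, not the change made in the PR. `run_teardown` and the stub steps are hypothetical stand-ins named after the Robot Framework keywords quoted in the review.

```python
from typing import Callable, List, Tuple


def run_teardown(steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every step in order, recording failures instead of stopping."""
    failures: List[str] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    return failures


def demo() -> Tuple[List[str], List[str]]:
    """Simulate a teardown whose healthcheck step fails (stubs only)."""
    executed: List[str] = []

    def healthcheck() -> None:
        executed.append("healthcheck")
        raise TimeoutError("MicroShift not healthy")

    def logout() -> None:
        executed.append("logout")

    def remove_kubeconfig() -> None:
        executed.append("remove_kubeconfig")

    failures = run_teardown(
        [
            ("Wait For MicroShift Healthcheck Success", healthcheck),
            ("Logout MicroShift Host", logout),
            ("Remove Kubeconfig", remove_kubeconfig),
        ]
    )
    return executed, failures
```

In this sketch the cleanup steps still run after the simulated healthcheck failure, and the failure is reported at the end rather than swallowed.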
@agullon
Contributor Author

agullon commented Apr 24, 2026

/cherrypick release-4.22

@openshift-cherrypick-robot

@agullon: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.

Details

In response to this:

/cherrypick release-4.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kasturinarra
Contributor

/lgtm

@openshift-ci Bot added the lgtm label ("Indicates that a PR is ready to be merged.") Apr 24, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: agullon, kasturinarra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [agullon,kasturinarra]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@agullon
Contributor Author

agullon commented Apr 24, 2026

/verified by CI

@openshift-ci-robot added the verified label ("Signifies that the PR passed pre-merge verification criteria") Apr 24, 2026
@openshift-ci-robot

@agullon: This PR has been marked as verified by CI.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 744e3a3 and 2 for PR HEAD 4660692 in total

@agullon
Contributor Author

agullon commented Apr 24, 2026

/retest

@openshift-ci-robot

@agullon: This pull request explicitly references no jira issue.

@agullon
Contributor Author

agullon commented Apr 24, 2026

/retest

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 11289b3 and 1 for PR HEAD 4660692 in total

@agullon
Contributor Author

agullon commented Apr 24, 2026

/retest

@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

@agullon: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit c5c9dd4 into openshift:main Apr 24, 2026
13 checks passed
@openshift-cherrypick-robot

@agullon: new pull request created: #6579

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

4 participants