Skip to content

Bug 2086728: Improves Config Drift Monitor e2e tests#3146

Merged
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
cheesesashimi:zzlotnik/config-drift-reboot-check
May 16, 2022
Merged

Bug 2086728: Improves Config Drift Monitor e2e tests#3146
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
cheesesashimi:zzlotnik/config-drift-reboot-check

Conversation

@cheesesashimi
Copy link
Copy Markdown
Member

- What I did

The Config Drift Monitor tests were broken by #3141. The breakage was determined to be an issue with how we determine whether the Config Drift Monitor has started: We were grabbing the logs and indiscriminately searching for the startup text. Instead, we should grab the logs and scan from the bottom up, searching for the startup message absent a similar shutdown message. This PR was tested against master as well as the aforementioned PR.

Additionally, to provide better signal around the Config Drift Monitor, we should check whether the node reboots as a result of the test. Use of the Forcefile should cause a reboot, whereas file content reversion should not.

- How to verify it

Run the attached e2e test suite.

- Description for the changelog
Get better signal from Config Drift Monitor tests

@openshift-ci openshift-ci Bot requested review from cgwalters and raisaat May 11, 2022 20:57
Copy link
Copy Markdown
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good, some comments inline

Comment thread test/e2e-shared-tests/helpers.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is problematic if the logs if empty. We can probably work around it with i<=0 though

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

Comment thread test/helpers/utils.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do something similar in mcd_test and sno_mcd_test. Doesn't have to be part of this PR, but we should maybe share the implementation

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agreed. I wrote these helpers with that in mind. Once this lands, I'll open a separate PR to fix those.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/config-drift-reboot-check branch from 04d7507 to c039d7a Compare May 11, 2022 21:15
@cheesesashimi
Copy link
Copy Markdown
Member Author

Looks like an infra issue occurred.

/test e2e-gcp-op

@cheesesashimi
Copy link
Copy Markdown
Member Author

This should also run against an SNO installation.

/test e2e-gcp-op-single-node

Copy link
Copy Markdown
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More infra issues

/test e2e-gcp-op

Also soft approval pending some test data

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2022
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

There's a gcp issue, no need to retest that job until resolved.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

kikisdeliveryservice commented May 11, 2022

Also this PR sounds like a fix so we'll get it in regardless ;)

splitLogs := strings.Split(string(logs), "\n")

// Scan the logs from the bottom up, looking for either a shutdown or startup message.
for i := len(splitLogs) - 1; i >= 0; i-- {
Copy link
Copy Markdown
Contributor

@kikisdeliveryservice kikisdeliveryservice May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I might be missing context here..)

Q: is there no other way to check if configdrifmonitor is running other than checking logs? Can we not use IsRunning() since that's the canonical way to test if it's up what we use in other tests and in the daemon itself.

This is dependent on the logs/other parts of the cluster working correctly as opposed to directly checking the thing we're interested in.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if I understand correctly, this is from the perspective of the main testing pod which spun up a cluster, and is now interacting with it to e2e test. We're in the perspective of a user and not the actual MCD pod itself.

Or do you mean that we can debug into the MCD pod and check the running processes to see if the config drift monitor is running?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuqi-zhang re: perspective, you're right. 👍

second point is more what I'm generally asking - is there any way to actually just check if the configdriftmonitor is running other than checking for a pod log that it started? Reading the func comments it seems that this may run after mcd is done so we can't rely on mcd state but overall wondering if there is a more direct vs indirect way to check if it's running.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presently, there isn't a more direct way. That said, I wish we had a more direct way of detecting that such as with a node annotation or via a /livez endpoint.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, that settles that for now then. thanks @cheesesashimi

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

retesting to see if gcp is cooperating

/e2e-gcp-op

@cheesesashimi
Copy link
Copy Markdown
Member Author

I think they need the /test in front of them:

/test e2e-gcp-op
/test e2e-gcp-op-single-node

@cheesesashimi
Copy link
Copy Markdown
Member Author

Looks like infra is still unhappy.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

I think they need the /test in front of them:

/test e2e-gcp-op /test e2e-gcp-op-single-node

yeah that was my typo 😅

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

still waiting for gcp issues to resolve.. i think we need openshift/installer#5898 to merge. =/

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

gcp is still having issues, but since this is a fix the deadline isn't relevant.

@sinnykumari
Copy link
Copy Markdown
Contributor

Since gcp-op test is green now, retrying
/test e2e-gcp-op

@cgwalters
Copy link
Copy Markdown
Member

/retest
/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

kikisdeliveryservice commented May 13, 2022

@cheesesashimi can you link this to the related bug so it can merge?

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

6 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@sinnykumari
Copy link
Copy Markdown
Contributor

Let's also make sure that single node test is green as well
/test e2e-gcp-op-single-node

@cgwalters cgwalters changed the title Improves Config Drift Monitor e2e tests BZ 2086728: Improves Config Drift Monitor e2e tests May 16, 2022
@cgwalters
Copy link
Copy Markdown
Member

/hold
for the single node test I guess

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2022
@cgwalters
Copy link
Copy Markdown
Member

/bugzilla refresh

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 16, 2022

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Details

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters cgwalters changed the title BZ 2086728: Improves Config Drift Monitor e2e tests Bug 2086728: Improves Config Drift Monitor e2e tests May 16, 2022
@openshift-ci openshift-ci Bot added bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 16, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 16, 2022

@cheesesashimi: This pull request references Bugzilla bug 2086728, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (rioliu@redhat.com), skipping review request.

Details

In response to this:

Bug 2086728: Improves Config Drift Monitor e2e tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Copy link
Copy Markdown
Contributor

cluster failed during bootstrap, retrying
/test e2e-gcp-op-single-node

@cheesesashimi
Copy link
Copy Markdown
Member Author

I'm seeing two classes of failures for the single-node test:

  1. Infra issues.
  2. e2e test suite timeouts.

For 1, theres not much we can do. For 2, I observed that by the time the Config Drift tests run, we've used 80 minutes of our 90 minute timeout. Looking at other runs of the same job, this has been an issue for a while.

However, there are two mitigations: 1) Bump our e2e single node timeout from 90 minutes to 120 minutes. While a simple fix, I'm reluctant to increase that timeout. 2) Split the e2e-single-node tests across multiple SNO instances, each with its own 90 minute timeout.

@sinnykumari
Copy link
Copy Markdown
Contributor

For 1, theres not much we can do. For 2, I observed that by the time the Config Drift tests run, we've used 80 minutes of our 90 minute timeout. Looking at other runs of the same job, this has been an issue for a while.

Agree, we should improve e2e-gcp-op test failure on SNO if it ha been hitting timeout for a while.
Meanwhile, let's not block this PR if test is failing because of timeout issue because regular gcp-op test is green

@sinnykumari
Copy link
Copy Markdown
Contributor

@cgwalters SNO gcp-op test failed again due to timeout. All tests ran successfully until TestRunShared (which is last in the queue). Since, regular gcp-op tests has passed successfully, should we merge this PR and handle timeout issue separately?

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

can we run the tests with clusterbot and link the must gather to the PR? Alternatively, @cheesesashimi ran the tests on his own cluster so im fine with this testimony 😆

Comment thread test/helpers/utils.go
// system has been up.
// See: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-proc-uptime
func GetNodeUptime(t *testing.T, cs *framework.ClientSet, node corev1.Node) float64 {
output := ExecCmdOnNode(t, cs, node, "cat", "/rootfs/proc/uptime")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker but something to address in followup: parsing /proc/sys/kernel/random/boot_id is a better version of this, it's already used by the MCD; see getBootID(). That approach will handle the case where e.g. the node is up longer than the first uptime by the time we manage to log back in.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. That is a much better way to accomplish this and I'll get that fixed in a subsequent PR.

@cgwalters
Copy link
Copy Markdown
Member

I think we should land all this stuff sooner rather than later to have more time to deal with other fallout which may or may not include SNO. If anyone else agrees then please cancel hold and let's get to testing #3135

@cheesesashimi
Copy link
Copy Markdown
Member Author

I agree, let's get this merged.

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 16, 2022

@cheesesashimi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-single-node c039d7a link false /test e2e-gcp-op-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit ae01659 into openshift:master May 16, 2022
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 16, 2022

@cheesesashimi: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 2086728 has not been moved to the MODIFIED state.

Details

In response to this:

Bug 2086728: Improves Config Drift Monitor e2e tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cheesesashimi cheesesashimi deleted the zzlotnik/config-drift-reboot-check branch March 21, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants