Bug 2086728: Improves Config Drift Monitor e2e tests #3146
Conversation
yuqi-zhang left a comment: Generally looks good, some comments inline.
I think this is problematic if the logs are empty. We can probably work around it with i <= 0, though.
Good catch, thanks!
We do something similar in mcd_test and sno_mcd_test. It doesn't have to be part of this PR, but we should maybe share the implementation.
100% agreed. I wrote these helpers with that in mind. Once this lands, I'll open a separate PR to fix those.
(Force-pushed from 04d7507 to c039d7a.)
Looks like an infra issue occurred. /test e2e-gcp-op

This should also run against an SNO installation. /test e2e-gcp-op-single-node
yuqi-zhang left a comment: More infra issues.
/test e2e-gcp-op
Also, soft approval pending some test data.
There's a GCP issue; no need to retest that job until it's resolved.

Also, this PR sounds like a fix, so we'll get it in regardless ;)
Code under review:

```go
splitLogs := strings.Split(string(logs), "\n")

// Scan the logs from the bottom up, looking for either a shutdown or startup message.
for i := len(splitLogs) - 1; i >= 0; i-- {
```
(I might be missing context here...)
Q: Is there no other way to check whether the configdriftmonitor is running other than checking logs? Can we not use IsRunning(), since that's the canonical way to test whether it's up, and it's what we use in other tests and in the daemon itself?
This is dependent on the logs/other parts of the cluster working correctly, as opposed to directly checking the thing we're interested in.
Hmm, if I understand correctly, this is from the perspective of the main testing pod, which spun up a cluster and is now interacting with it to run the e2e tests. We're acting from the perspective of a user, not the actual MCD pod itself.
Or do you mean that we can debug into the MCD pod and check the running processes to see if the config drift monitor is running?
@yuqi-zhang re: perspective, you're right. 👍
The second point is more what I'm generally asking: is there any way to actually just check whether the configdriftmonitor is running, other than checking for a pod log saying that it started? Reading the func comments, it seems that this may run after the MCD is done, so we can't rely on MCD state; overall, I'm wondering whether there is a more direct (vs. indirect) way to check if it's running.
Presently, there isn't a more direct way. That said, I wish we had one, such as a node annotation or a /livez endpoint.
Ah, OK, that settles that for now then. Thanks @cheesesashimi
Retesting to see if GCP is cooperating. /e2e-gcp-op

I think it needs the /test prefix: /test e2e-gcp-op

Looks like infra is still unhappy.

Yeah, that was my typo 😅

Still waiting for the GCP issues to resolve... I think we need openshift/installer#5898 to merge. =/

GCP is still having issues, but since this is a fix, the deadline isn't relevant.

Since the gcp-op test is green now, retrying.

/retest
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, cheesesashimi, yuqi-zhang.
@cheesesashimi can you link this to the related bug so it can merge?

/retest-required Please review the full test history for this PR and help us cut down flakes.
(Multiple similar /retest-required comments followed.)
Let's also make sure that the single-node test is green as well.

/hold

/bugzilla refresh
@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
@cheesesashimi: This pull request references Bugzilla bug 2086728, which is valid. The bug has been moved to the POST state and has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (rioliu@redhat.com); skipping review request.
Cluster failed during bootstrap; retrying.
I'm seeing two classes of failures for the single-node test:

For 1, there's not much we can do. For 2, I observed that by the time the Config Drift tests run, we've used 80 minutes of our 90-minute timeout. Looking at other runs of the same job, this has been an issue for a while. However, there are two mitigations:
1) Bump our e2e single-node timeout from 90 minutes to 120 minutes. While a simple fix, I'm reluctant to increase that timeout.
2) Split the e2e-single-node tests across multiple SNO instances, each with its own 90-minute timeout.
Agreed, we should improve the e2e-gcp-op test failures on SNO if it has been hitting the timeout for a while.
@cgwalters The SNO gcp-op test failed again due to the timeout. All tests ran successfully until TestRunShared (which is last in the queue). Since the regular gcp-op tests have passed successfully, should we merge this PR and handle the timeout issue separately?
Can we run the tests with clusterbot and link the must-gather to the PR? Alternatively, @cheesesashimi ran the tests on his own cluster, so I'm fine with this testimony 😆
Code under review:

```go
// system has been up.
// See: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-proc-uptime
func GetNodeUptime(t *testing.T, cs *framework.ClientSet, node corev1.Node) float64 {
	output := ExecCmdOnNode(t, cs, node, "cat", "/rootfs/proc/uptime")
```
Not a blocker, but something to address in a follow-up: parsing /proc/sys/kernel/random/boot_id is a better version of this; it's already used by the MCD (see getBootID()). That approach will handle the case where, e.g., the node has been up longer than the first uptime reading by the time we manage to log back in.
I agree. That is a much better way to accomplish this and I'll get that fixed in a subsequent PR.
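As a rough illustration of the suggested approach, here is a minimal sketch built on the ExecCmdOnNode helper shown above (GetNodeBootID is a hypothetical name for this discussion, not the MCD's actual getBootID()):

```go
// GetNodeBootID returns the node's current boot ID. The kernel generates a
// fresh random boot_id on every boot, so comparing values sampled before and
// after an action detects a reboot regardless of how much time has elapsed,
// unlike the uptime comparison above.
func GetNodeBootID(t *testing.T, cs *framework.ClientSet, node corev1.Node) string {
	output := ExecCmdOnNode(t, cs, node, "cat", "/rootfs/proc/sys/kernel/random/boot_id")
	return strings.TrimSpace(output)
}
```

A test would then record the boot ID before the action under test and treat any change in the value as evidence of a reboot.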
I think we should land all this stuff sooner rather than later to have more time to deal with other fallout, which may or may not include SNO. If anyone else agrees, then please cancel the hold and let's get to testing #3135.
I agree, let's get this merged. /unhold
@cheesesashimi: The following test failed, say /retest to rerun all failed tests.
@cheesesashimi: Some pull requests linked via external trackers have merged, but the following pull requests linked via external trackers have not merged. These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh. Bugzilla bug 2086728 has not been moved to the MODIFIED state.
- What I did
The Config Drift Monitor tests were broken by #3141. The breakage came down to how we determine whether the Config Drift Monitor has started: we were grabbing the logs and indiscriminately searching for the startup text. Instead, we should grab the logs and scan them from the bottom up, searching for the startup message in the absence of a later shutdown message. This PR was tested against master as well as the aforementioned PR.
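A minimal sketch of that bottom-up scan, consistent with the snippet in the review thread (the function name and the message parameters are illustrative assumptions, not the PR's exact helpers):

```go
package helpers

import "strings"

// isConfigDriftMonitorStarted reports whether the most recent Config Drift
// Monitor lifecycle message in the given logs is a startup message. Scanning
// from the bottom up ensures that an earlier startup followed by a shutdown
// is not mistaken for a currently running monitor.
func isConfigDriftMonitorStarted(logs []byte, startedMsg, shutdownMsg string) bool {
	splitLogs := strings.Split(string(logs), "\n")

	// Scan the logs from the bottom up, looking for either a shutdown or
	// startup message; the first one encountered is the most recent.
	for i := len(splitLogs) - 1; i >= 0; i-- {
		if strings.Contains(splitLogs[i], shutdownMsg) {
			return false
		}
		if strings.Contains(splitLogs[i], startedMsg) {
			return true
		}
	}

	// Neither message was found (this also covers the empty-log case raised
	// in review): the monitor never started.
	return false
}
```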
Additionally, to provide better signal around the Config Drift Monitor, we should check whether the node reboots as a result of the test. Use of the Forcefile should cause a reboot, whereas file content reversion should not.
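As an illustration, a reboot assertion built on the GetNodeUptime helper from the review thread might look like the following (assertRebootBehavior is a hypothetical wrapper, not the PR's code; as noted in the review, a boot-ID comparison would be more robust):

```go
// assertRebootBehavior runs an action and checks whether the node rebooted,
// relying on the fact that a reboot resets /proc/uptime, so the second
// sample should be smaller than the first shortly after a reboot.
func assertRebootBehavior(t *testing.T, cs *framework.ClientSet, node corev1.Node, wantReboot bool, action func()) {
	t.Helper()

	before := GetNodeUptime(t, cs, node)
	action()
	after := GetNodeUptime(t, cs, node)

	rebooted := after < before
	if rebooted != wantReboot {
		t.Fatalf("expected reboot=%v, got reboot=%v (uptime before: %.2f, after: %.2f)",
			wantReboot, rebooted, before, after)
	}
}
```

With this, the Forcefile case would be exercised with wantReboot=true and the file-content-reversion case with wantReboot=false.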
- How to verify it
Run the attached e2e test suite.
- Description for the changelog
Get better signal from Config Drift Monitor tests