bug 1861201: make must-gather test resilient to failures and disk timing #25336
Conversation
deads2k force-pushed from 43dd939 to db460b4.
@deads2k: This pull request references Bugzilla bug 1861201, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug.
	})

	g.It("runs successfully for audit logs", func() {
		// On IBM ROKS, events will not be part of the output, since audit logs do not include control plane logs.
Why is this new here?
it was down below, but it nerfed the test, so this just nerfs it earlier.
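For context, here is a minimal sketch of what that early skip could look like, assuming a Ginkgo/Gomega layout like origin's; the environment variable, package name, and skip message below are illustrative stand-ins, not the test's real platform detection:

```go
package mustgather_test

import (
	"os"
	"testing"

	g "github.com/onsi/ginkgo"
	o "github.com/onsi/gomega"
)

func TestMustGatherSketch(t *testing.T) {
	o.RegisterFailHandler(g.Fail)
	g.RunSpecs(t, "must-gather audit log sketch")
}

var _ = g.It("runs successfully for audit logs", func() {
	// Hypothetical platform check: on IBM ROKS the control plane runs outside
	// the cluster, so audit logs never show up in the must-gather output.
	// Skipping up front avoids weakening the assertions for every platform.
	if os.Getenv("TEST_PLATFORM") == "ibm-roks" { // assumed env var, not origin's real detection
		g.Skip("audit logs do not include control plane logs on IBM ROKS")
	}
	// ... audit log assertions would follow here ...
})
```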
test/extended/cli/mustgather.go (outdated)
	if eventsChecked <= expectedNumberOfAuditEntries {
		lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d", expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
		return false, nil
	} else {
Nit: kill this else.
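A standalone sketch of the shape the reviewer is asking for: the failure path returns early, so the success path needs no else branch. The names mirror the diff, but the path, threshold, and counts are illustrative only:

```go
package main

import "fmt"

func main() {
	var lastErr error
	auditDirectory := "audit_logs/kube-apiserver" // example path only
	expectedNumberOfAuditEntries := 1000          // example threshold only
	eventsChecked := 10

	// Poll-style condition: record the shortfall via lastErr and keep waiting;
	// the early return on failure makes an else around the success path redundant.
	condition := func() (bool, error) {
		if eventsChecked <= expectedNumberOfAuditEntries {
			lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d",
				expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
			return false, nil
		}
		lastErr = nil
		return true, nil
	}

	done, _ := condition()
	fmt.Println("done:", done, "lastErr:", lastErr)
}
```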
			continue // ignore truncated data
		// it will happen that the audit files are sometimes empty, we can
		// safely ignore these files since they don't provide valuable information
		if fi.Size() == 0 {
because no events within that window?
Pre-existing. I don't know yet; I think it's related to the wacky code from before that waited 10 seconds.
oh wait, it's because their control plane is unreachable from pods, so audit logs are unreachable
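A minimal sketch of the skip-empty-files behavior being discussed, assuming a flat directory of audit files; the helper name and paths are hypothetical:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listNonEmptyAuditFiles ignores zero-size audit files, which can legitimately
// occur, for example when the control plane is unreachable from pods (as on
// IBM ROKS) and no audit entries were ever written.
func listNonEmptyAuditFiles(auditDirectory string) ([]string, error) {
	entries, err := os.ReadDir(auditDirectory)
	if err != nil {
		return nil, err
	}
	var files []string
	for _, entry := range entries {
		fi, err := entry.Info()
		if err != nil {
			return nil, err
		}
		if fi.Size() == 0 {
			continue // empty files carry no information, so skip them
		}
		files = append(files, filepath.Join(auditDirectory, entry.Name()))
	}
	return files, nil
}

func main() {
	files, err := listNonEmptyAuditFiles("must-gather/audit_logs") // hypothetical layout
	fmt.Println(files, err)
}
```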
Question and nit, otherwise looks fine.
Apparently it takes a while for logs to appear, so we retry in a loop to avoid failing before they land. The openshift-apiserver will have many fewer requests than the kube-apiserver, so use two different counts. Update the error message to more clearly indicate which audit log is broken, for better debugging in the future.
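A sketch of that retry, assuming k8s.io/apimachinery's wait.Poll and a hypothetical countAuditEvents helper; the intervals, thresholds, and directory names are examples, not the test's real values:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForAuditEntries polls until the audit directory contains more than the
// expected number of events, remembering the most recent shortfall so the
// caller can report which audit log came up short.
func waitForAuditEntries(auditDirectory string, expected int, countAuditEvents func(string) (int, error)) error {
	var lastErr error
	err := wait.Poll(5*time.Second, 2*time.Minute, func() (bool, error) {
		eventsChecked, err := countAuditEvents(auditDirectory)
		if err != nil {
			return false, err
		}
		if eventsChecked <= expected {
			lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d", expected, auditDirectory, eventsChecked)
			return false, nil // logs may still be flushing to disk, keep polling
		}
		return true, nil
	})
	if err != nil {
		return fmt.Errorf("audit check for %q failed: %v (last: %v)", auditDirectory, err, lastErr)
	}
	return nil
}

func main() {
	countStub := func(dir string) (int, error) { return 1500, nil } // stand-in for parsing audit files
	// Per-apiserver thresholds: openshift-apiserver serves far fewer requests
	// than kube-apiserver, so its expected count is much lower (example numbers).
	for dir, expected := range map[string]int{
		"audit_logs/kube-apiserver":      1000,
		"audit_logs/openshift-apiserver": 10,
	} {
		fmt.Println(dir, waitForAuditEntries(dir, expected, countStub))
	}
}
```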
deads2k force-pushed from db460b4 to 07a5fee.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deads2k, smarterclayton.
			return true, nil
		})
		o.Expect(lastErr).NotTo(o.HaveOccurred()) // print the last error first if we have one
		o.Expect(err).NotTo(o.HaveOccurred())     // otherwise be sure we fail on the timeout if it happened
"context expired" or whatever we get here is probably not going to be all that useful. Can we include "no audit directories found" or some such in this error message?
/retest
/hold There is more than one test failure blocking the merging of this change. Please prioritize merging #25314 instead, which fixes and skips a number of issues introduced by the 1.19 rebase. This PR will then need to unskip those tests.
@deads2k: The following test failed, say /retest to rerun all failed tests. Full PR test history. Your PR dashboard.
@deads2k: All pull requests linked via external trackers have merged: openshift/origin#25336. Bugzilla bug 1861201 has been moved to the MODIFIED state.