
Conversation

@deads2k (Contributor) commented Jul 28, 2020

Apparently it takes a while for logs to appear, so we retry in a loop to avoid flakes from that.
The openshift-apiserver will see many fewer requests than the kube-apiserver, so use two different expected counts.
Update the error message to indicate more clearly which audit log is broken, for easier debugging in the future.
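
For context, the retry described above amounts to polling the gathered audit directory until enough events show up, rather than checking once. Below is a minimal sketch of that pattern; it reuses the auditDirectory, expectedNumberOfAuditEntries, eventsChecked, and lastErr names visible in the diff snippets later in this thread, while countAuditEvents, the package name, and the poll interval/timeout are assumptions for illustration, not the PR's actual values.

```go
package audit_test

import (
	"fmt"
	"time"

	o "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/util/wait"
)

// countAuditEvents is a hypothetical helper: walk the gathered audit
// directory and report how many audit entries could be parsed from it.
func countAuditEvents(auditDirectory string) (int, error) { return 0, nil }

// waitForAuditEvents polls until the directory contains more than the
// expected number of audit events, instead of failing on the first look.
func waitForAuditEvents(auditDirectory string, expectedNumberOfAuditEntries int) {
	var lastErr error
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		eventsChecked, readErr := countAuditEvents(auditDirectory)
		if readErr != nil {
			lastErr = readErr
			return false, nil // treat as transient: logs may still be landing on disk
		}
		if eventsChecked <= expectedNumberOfAuditEntries {
			lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d",
				expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
			return false, nil
		}
		lastErr = nil
		return true, nil
	})
	o.Expect(lastErr).NotTo(o.HaveOccurred()) // print the last error first if we have one
	o.Expect(err).NotTo(o.HaveOccurred())     // otherwise fail on the timeout itself
}
```

The "two different counts" point then just means the caller passes a much larger expected count for the kube-apiserver audit directory than for the openshift-apiserver one.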

@deads2k force-pushed the must-gather-test branch from 43dd939 to db460b4 on Jul 28, 2020 18:04
@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jul 28, 2020
@deads2k changed the title from "make must-gather test resilient to failures and disk timing" to "bug 1861201: make must-gather test resilient to failures and disk timing" on Jul 28, 2020
@openshift-ci-robot added the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels on Jul 28, 2020
@openshift-ci-robot

@deads2k: This pull request references Bugzilla bug 1861201, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

bug 1861201: make must-gather test resilient to failures and disk timing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

})

g.It("runs successfully for audit logs", func() {
// On IBM ROKS, events will not be part of the output, since audit logs do not include control plane logs.

Contributor:

Why is this new here?

@deads2k (Contributor, Author):

> Why is this new here?

It was down below, but it nerfed the test, so this just nerfs it earlier.

if eventsChecked <= expectedNumberOfAuditEntries {
	lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d", expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
	return false, nil
} else {

Contributor:

nit, kill this else
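
What the nit amounts to, as a sketch: the if branch already returns, so the else in the snippet above can be dropped and its body dedented. Variable names are reused from that snippet; this is illustrative, not the exact code that landed.

```go
if eventsChecked <= expectedNumberOfAuditEntries {
	lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d",
		expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
	return false, nil
}
// former else body continues here, one indentation level shallower
```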

continue // ignore truncated data
// It will happen that the audit files are sometimes empty; we can
// safely ignore these files since they don't provide valuable information.
if fi.Size() == 0 {

Contributor:

because no events within that window?

@deads2k (Contributor, Author):

> because no events within that window?

Pre-existing. I don't know yet. I think it's related to the wacky code from before that waited 10 seconds.

Contributor:

oh wait, it's because their control plane is unreachable from pods, so audit logs are unreachable

@smarterclayton (Contributor):

Question and nit, otherwise looks fine.

@deads2k force-pushed the must-gather-test branch from db460b4 to 07a5fee on Jul 28, 2020 18:23
@deads2k added the lgtm label (Indicates that a PR is ready to be merged.) on Jul 28, 2020
@smarterclayton (Contributor):

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

	return true, nil
})
o.Expect(lastErr).NotTo(o.HaveOccurred()) // print the last error first if we have one
o.Expect(err).NotTo(o.HaveOccurred())     // otherwise be sure we fail on the timeout if it happened

Member:

"context expired" or whatever we get here is probably not going to be all that useful. Can we include "no audit directories found" or some such in this error message?

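One way the suggestion could look (a sketch only: describeAuditWaitError, its wording, and its parameters are hypothetical, not what the PR actually landed) is to wrap the poll error with what the test was waiting for before asserting on it:

```go
package audit_test

import "fmt"

// describeAuditWaitError is a hypothetical helper: it turns a bare poll
// timeout ("timed out waiting for the condition") into a message that says
// what the test was actually waiting for.
func describeAuditWaitError(auditDirectory string, lastErr, pollErr error) error {
	if pollErr == nil {
		return nil
	}
	if lastErr != nil {
		// the most recent condition failure is usually the informative part
		return fmt.Errorf("waiting for audit logs under %q: %v (%v)", auditDirectory, lastErr, pollErr)
	}
	return fmt.Errorf("waiting for audit logs under %q: no audit directories found: %v", auditDirectory, pollErr)
}
```
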
@marun (Contributor) commented Jul 28, 2020:

/retest

@marun (Contributor) commented Jul 28, 2020:

/hold

More than one test failure is currently blocking the merge of this change. Please prioritize merging #25314 instead, which fixes and skips a number of issues introduced by the 1.19 rebase; this PR will then need to unskip those tests.

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jul 28, 2020
@openshift-ci-robot

@deads2k: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-fips
Commit: 07a5fee
Rerun command: /test e2e-aws-fips

Full PR test history. Your PR dashboard.



@deads2k (Contributor, Author) commented Jul 29, 2020:

This fixes the must-gather problem. TRT is prioritizing unblocking the payload promotion, and this resolves one of the two issues; #25335 / #25336 resolves the other. We will merge these.

@deads2k merged commit e615de4 into openshift:master on Jul 29, 2020
@openshift-ci-robot

@deads2k: All pull requests linked via external trackers have merged: openshift/origin#25336. Bugzilla bug 1861201 has been moved to the MODIFIED state.


In response to this:

bug 1861201: make must-gather test resilient to failures and disk timing

