
Conversation

@deads2k (Contributor) commented Jul 28, 2020

Apparently it takes a while for logs to appear, so we retry in a loop to avoid flakes from that.
The openshift-apiserver will see many fewer requests than the kube-apiserver, so use two different expected counts.
Update the error message to indicate more clearly which audit log is broken, for easier debugging in the future.
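
For context, the retry described above amounts to polling the gathered audit directory until enough events show up, rather than checking once. Below is a minimal sketch of that pattern; it reuses the auditDirectory, expectedNumberOfAuditEntries, eventsChecked, and lastErr names visible in the diff snippets later in this thread, while countAuditEvents, the package name, and the poll interval/timeout are assumptions for illustration, not the PR's actual values.

```go
package audit_test

import (
	"fmt"
	"time"

	o "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/util/wait"
)

// countAuditEvents is a hypothetical helper: walk the gathered audit
// directory and report how many audit entries could be parsed from it.
func countAuditEvents(auditDirectory string) (int, error) { return 0, nil }

// waitForAuditEvents polls until the directory contains more than the
// expected number of audit events, instead of failing on the first look.
func waitForAuditEvents(auditDirectory string, expectedNumberOfAuditEntries int) {
	var lastErr error
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		eventsChecked, readErr := countAuditEvents(auditDirectory)
		if readErr != nil {
			lastErr = readErr
			return false, nil // treat as transient: logs may still be landing on disk
		}
		if eventsChecked <= expectedNumberOfAuditEntries {
			lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d",
				expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
			return false, nil
		}
		lastErr = nil
		return true, nil
	})
	o.Expect(lastErr).NotTo(o.HaveOccurred()) // print the last error first if we have one
	o.Expect(err).NotTo(o.HaveOccurred())     // otherwise fail on the timeout itself
}
```

The "two different counts" point then just means the caller passes a much larger expected count for the kube-apiserver audit directory than for the openshift-apiserver one.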

@deads2k force-pushed the must-gather-test branch from 43dd939 to db460b4 on Jul 28, 2020 18:04
@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jul 28, 2020
@deads2k changed the title from "make must-gather test resilient to failures and disk timing" to "bug 1861201: make must-gather test resilient to failures and disk timing" on Jul 28, 2020
@openshift-ci-robot added the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels on Jul 28, 2020
@openshift-ci-robot

@deads2k: This pull request references Bugzilla bug 1861201, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

bug 1861201: make must-gather test resilient to failures and disk timing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

})

g.It("runs successfully for audit logs", func() {
// On IBM ROKS, events will not be part of the output, since audit logs do not include control plane logs.

Contributor:

Why is this new here?

@deads2k (Contributor, Author):

> Why is this new here?

It was down below, but it nerfed the test, so this just nerfs it earlier.

if eventsChecked <= expectedNumberOfAuditEntries {
	lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d", expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
	return false, nil
} else {

Contributor:

nit, kill this else
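
What the nit amounts to, as a sketch: the if branch already returns, so the else in the snippet above can be dropped and its body dedented. Variable names are reused from that snippet; this is illustrative, not the exact code that landed.

```go
if eventsChecked <= expectedNumberOfAuditEntries {
	lastErr = fmt.Errorf("expected %d audit events for %q, but only got %d",
		expectedNumberOfAuditEntries, auditDirectory, eventsChecked)
	return false, nil
}
// former else body continues here, one indentation level shallower
```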

continue // ignore truncated data
// It will happen that the audit files are sometimes empty; we can
// safely ignore these files since they don't provide valuable information.
if fi.Size() == 0 {

Contributor:

because no events within that window?

@deads2k (Contributor, Author):

> because no events within that window?

Pre-existing. I don't know yet. I think it's related to the wacky code from before that waited 10 seconds.

Contributor:

oh wait, it's because their control plane is unreachable from pods, so audit logs are unreachable

@smarterclayton (Contributor):

Question and nit, otherwise looks fine.

@deads2k force-pushed the must-gather-test branch from db460b4 to 07a5fee on Jul 28, 2020 18:23
@deads2k added the lgtm label (Indicates that a PR is ready to be merged.) on Jul 28, 2020
@smarterclayton (Contributor):

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

	return true, nil
})
o.Expect(lastErr).NotTo(o.HaveOccurred()) // print the last error first if we have one
o.Expect(err).NotTo(o.HaveOccurred())     // otherwise be sure we fail on the timeout if it happened

Member:

"context expired" or whatever we get here is probably not going to be all that useful. Can we include "no audit directories found" or some such in this error message?

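One way the suggestion could look (a sketch only: describeAuditWaitError, its wording, and its parameters are hypothetical, not what the PR actually landed) is to wrap the poll error with what the test was waiting for before asserting on it:

```go
package audit_test

import "fmt"

// describeAuditWaitError is a hypothetical helper: it turns a bare poll
// timeout ("timed out waiting for the condition") into a message that says
// what the test was actually waiting for.
func describeAuditWaitError(auditDirectory string, lastErr, pollErr error) error {
	if pollErr == nil {
		return nil
	}
	if lastErr != nil {
		// the most recent condition failure is usually the informative part
		return fmt.Errorf("waiting for audit logs under %q: %v (%v)", auditDirectory, lastErr, pollErr)
	}
	return fmt.Errorf("waiting for audit logs under %q: no audit directories found: %v", auditDirectory, pollErr)
}
```
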
@marun (Contributor) commented Jul 28, 2020:

/retest

@marun (Contributor) commented Jul 28, 2020:

/hold

More than one test failure is currently blocking the merge of this change. Please prioritize merging #25314 instead, which fixes and skips a number of issues introduced by the 1.19 rebase; this PR will then need to unskip those tests.

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Jul 28, 2020
@openshift-ci-robot

@deads2k: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-fips
Commit: 07a5fee
Rerun command: /test e2e-aws-fips

Full PR test history. Your PR dashboard.



@deads2k (Contributor, Author) commented Jul 29, 2020:

This fixes the must-gather problem. TRT is prioritizing unblocking the payload promotion, and this resolves one of the two issues; #25335 / #25336 resolves the other. We will merge these.

@deads2k merged commit e615de4 into openshift:master on Jul 29, 2020
@openshift-ci-robot

@deads2k: All pull requests linked via external trackers have merged: openshift/origin#25336. Bugzilla bug 1861201 has been moved to the MODIFIED state.


In response to this:

bug 1861201: make must-gather test resilient to failures and disk timing

