
add kubepodcrashlooping to list of ignored alerts #25094

Merged
openshift-merge-robot merged 1 commit into openshift:master from bparees:alerts on Jun 12, 2020


@bparees (Contributor) commented Jun 10, 2020

this is only being added temporarily until
https://bugzilla.redhat.com/show_bug.cgi?id=1842002 is resolved,
after which this PR must be reverted.

@openshift-ci-robot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") Jun 10, 2020
@bparees (Contributor, Author) commented Jun 10, 2020

/cherrypick release-4.5

@openshift-cherrypick-robot commented:

@bparees: once the present PR merges, I will cherry-pick it on top of release-4.5 in a new PR and assign it to you.


In response to this:

/cherrypick release-4.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking (Member) commented Jun 10, 2020

Won't you need to link a bug from the PR subject to backport to 4.5? Also, my preference for a test-side fix would be a six-minute post-install sleep (or a kube-apiserver Progressing=True sleep, capped at 6m) to give the API server time to settle down. But I'm fine with whatever in the short term ;)

@bparees (Contributor, Author) commented Jun 10, 2020

> Won't you need to link a bug from the PR subject to backport to 4.5?

Not if I just override the Bugzilla requirement, which is what I plan to do.

> Also, my preference for a test-side fix would be a six-minute post-install sleep (or a kube-apiserver Progressing=True sleep, capped at 6m) to give the API server time to settle down. But I'm fine with whatever in the short term

This seemed to be the approach @smarterclayton preferred; post-install sleeps were apparently rejected previously.

@smarterclayton (Contributor) commented:

A post-install sleep won't fix this: this test runs AFTER the rest of the tests and looks back 2 hours. Unfortunately, adding 6m won't help unless the upgrade happens to take longer than 2 hours, which is not something to rely on. I originally thought it was the wait, but the test is much smarter now and our product is regressing a bit.

@wking (Member) commented Jun 10, 2020

/lgtm

GCP failed to build with:

Pulling image registry.svc.ci.openshift.org/ocp/builder:golang-1.13 ...
error: build error: failed to pull image: Error: image ocp/builder:golang-1.13 not found

/retest

@openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the lgtm label ("Indicates that a PR is ready to be merged.") Jun 10, 2020
@bparees (Contributor, Author) commented Jun 10, 2020

> Pulling image registry.svc.ci.openshift.org/ocp/builder:golang-1.13 ...
> error: build error: failed to pull image: Error: image ocp/builder:golang-1.13 not found

that seems like an error that should not happen... @stevekuznetsov

@wking (Member) commented Jun 10, 2020

> that seems like an error that should not happen...

Seems rare. But yeah, would be nice if it didn't happen at all :).

@bparees (Contributor, Author) commented Jun 11, 2020

@openshift/openshift-team-monitoring: not that this test is really your concern, but FYI.

@wking (Member) commented Jun 11, 2020

In the FIPS job, the Kube API went dark before bootstrap completed, and the install failed. e2e-cmd failed in setup with:

Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.0.4:6443/.well-known/oauth-authorization-server endpoint data

serial made it to the tests, but failed on:

[sig-api-machinery] Namespaces [Serial] should delete fast enough (90 percent of 100 namespaces in 150 seconds) [Suite:openshift/conformance/serial]

upgrade failed with: Service was unreachable during disruption for at least 48s.

/retest

@wking (Member) commented Jun 11, 2020

Poking in the FIPS failure's log bundle:

$ grep '"attempt": [^0]' bootstrap/containers/*.inspect 
bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.inspect:      "attempt": 11,
bootstrap/containers/cluster-version-operator-1f6115911bbfa402033b04dcbee0fee34760f69d80d236666a3d7c712efb9cd1.inspect:      "attempt": 1,
bootstrap/containers/kube-apiserver-de38bbc76573b2c14da39eb8d966dc8698dd0286d75acc0827901dce64405bf6.inspect:      "attempt": 1,
bootstrap/containers/kube-apiserver-insecure-readyz-7526e4441fd4413f7fc6fbc29785f486342621c74dc7c7639725328ed9834008.inspect:      "attempt": 2,
bootstrap/containers/kube-apiserver-insecure-readyz-bc4f1fa5dda3d31498b9cc06bbe86318697aa3e36567948b1d592180979ebb46.inspect:      "attempt": 1,
bootstrap/containers/kube-controller-manager-c1d4b4e90e251f9e20e1afa9ab9a7adc648230fb16ce667325e847daa6a52126.inspect:      "attempt": 1,
bootstrap/containers/kube-scheduler-ba28db102059ca25af13633430885cdfa3f9c4efafdf3dc6f02bf4fd15294a57.inspect:      "attempt": 1,
bootstrap/containers/setup-4affb608eb4017587629de44dcf89e31b8a8e40fc52884027c8e061e8b573b73.inspect:      "attempt": 1,
$ tail -n2 bootstrap/containers/cloud-credential-operator-ad0c8f95af28155d280a26416ede6a94adc46f57b724f6b54f08905018d6fb83.log 
time="2020-06-11T01:02:38Z" level=info msg="setting up AWS pod identity controller"
time="2020-06-11T01:02:38Z" level=fatal msg="unable to register controllers to the manager" error="AWS_POD_IDENTITY_WEBHOOK_IMAGE is not set"

I don't see anyone talking about that yet; spun off to rhbz#1846200.

@wking (Member) commented Jun 11, 2020

This time upgrade died on install with _WellKnownNotReady, so I filed rhbz#1846203 for that one too.

/test e2e-gcp-upgrade

@lilic (Contributor) commented Jun 11, 2020

/retest

@wking (Member) commented Jun 11, 2020

We're backsliding :p. Verify:

2020/06/11 08:15:41 Building src
2020/06/11 08:17:56 Build src failed, printing logs:
2020/06/11 08:17:56 error: Unable to retrieve logs from failed build: build src is in an error state. No logs are available.

/retest

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@openshift-cherrypick-robot commented:

@bparees: new pull request created: #25099


In response to this:

/cherrypick release-4.5


@bparees deleted the alerts branch March 29, 2021 21:16


8 participants