
Conversation

@damemi
Contributor

@damemi damemi commented Feb 23, 2021

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label Feb 23, 2021
@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1896558, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1896558: bump(openshift/kubernetes): multi-az spreading e2e flakes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 23, 2021
@openshift-ci-robot openshift-ci-robot added the vendor-update Touching vendor dir or related files label Feb 23, 2021
@damemi
Contributor Author

damemi commented Feb 23, 2021

/bugzilla refresh

@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1896558, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh


@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 23, 2021
@damemi
Contributor Author

damemi commented Feb 24, 2021

/retest

@damemi damemi force-pushed the bump-multiaz-fixes branch from 13f2b0a to 91d4ca0 Compare March 3, 2021 15:42
@damemi
Contributor Author

damemi commented Mar 3, 2021

Updated to bump all Kubernetes deps; the build failures looked like an incompatibility from bumping only o/k.

github.com/davecgh/go-spew v1.1.1
github.com/docker/distribution v2.7.1+incompatible
- github.com/fsouza/go-dockerclient v0.0.0-20171004212419-da3951ba2e9e
+ github.com/fsouza/go-dockerclient v1.6.6-0.20200611205848-6aaf6c2d625c
Contributor Author


Need to bump this as well, otherwise go mod tidy fails:

go: finding module for package github.com/docker/libnetwork/ipamutils
go: found github.com/docker/libnetwork/ipamutils in github.com/docker/libnetwork v0.5.6
go: finding module for package github.com/Sirupsen/logrus
go: found github.com/Sirupsen/logrus in github.com/Sirupsen/logrus v1.8.0
go: github.com/openshift/origin/test/extended/images imports
	github.com/fsouza/go-dockerclient imports
	github.com/docker/docker/opts imports
	github.com/docker/libnetwork/ipamutils imports
	github.com/docker/libnetwork/osl imports
	github.com/Sirupsen/logrus: github.com/Sirupsen/logrus@v1.8.0: parsing go.mod:
	module declares its path as: github.com/sirupsen/logrus
	        but was required as: github.com/Sirupsen/logrus
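The `Sirupsen` vs `sirupsen` failure above is the classic logrus module-path casing mismatch. One common way to resolve it (a sketch of the usual fix, not necessarily what this PR does) is a `replace` directive in go.mod that redirects the legacy capitalized import path to the canonical lowercase module:

```
// go.mod (sketch): map the legacy capitalized path to the canonical module
// so `go mod tidy` can resolve it; v1.8.0 is the version from the error above.
replace github.com/Sirupsen/logrus => github.com/sirupsen/logrus v1.8.0
```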

@damemi damemi force-pushed the bump-multiaz-fixes branch from ab9cea6 to 376ce44 Compare March 3, 2021 17:52
@damemi
Contributor Author

damemi commented Mar 3, 2021

vendor/github.com/openshift/build-machinery-go/make/targets/golang/../../lib/golang.mk:22: *** `go` is required with minimal version "1.15.2", detected version "1.14.6". You can override this check by using `make GO_REQUIRED_MIN_VERSION:=`.  Stop.

Origin needs its Go version updated in order to bump the openshift deps. @soltysh is this something we can do?

@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1896558, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @kasturinarra


In response to this:

Bug 1896558: bump(openshift/kubernetes): multi-az spreading e2e flakes


@damemi
Contributor Author

damemi commented Mar 9, 2021

Ok, I don't actually need the build-machinery-go bump, but this does depend on bumping library-go (openshift/library-go#1008) in order to react to openshift/kubernetes#558.

edit: this actually needs openshift/kubernetes#616 to fix o/k, so that library-go doesn't depend on our forked version

@damemi
Contributor Author

damemi commented Mar 9, 2021

/hold
for the above lib-go PR

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2021
@damemi
Contributor Author

damemi commented Mar 10, 2021

/retest

@stbenjam
Member

What's the status of this? I just tried myself on #25997 and ran into the things you seem to have already solved :)

Any chance to bump openshift/kubernetes here to include openshift/kubernetes#628?

@damemi
Contributor Author

damemi commented Mar 22, 2021

@stbenjam after talking to @sttts, this depends on fixing a change made to o/k (openshift/kubernetes#616). Without it, the version of Kubernetes vendored by origin does not compile against the vendored library-go.

AFAICT the dockerclient version pinning is still necessary, and origin will need its CI updated to Go 1.15 if we want to bump build-machinery-go.

@stbenjam
Member

openshift/kubernetes#616 has landed!

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

11 similar comments

@damemi
Contributor Author

damemi commented Mar 29, 2021

/hold
looks like some kind of new, relevant error:

STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-scheduling] Multi-AZ Clusters
  k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/ubernetes_lite.go:47
STEP: Checking for multi-zone cluster.  Zone count = 2
Mar 29 13:08:19.339: INFO: Waiting up to 1m0s for all nodes to be ready
Mar 29 13:09:19.738: INFO: ComputeCPUMemFraction for node: ip-10-0-138-145.us-east-2.compute.internal
Mar 29 13:09:19.813: INFO: Pod for on the node: aws-ebs-csi-driver-node-prq8s, Cpu: 30, Mem: 157286400
Mar 29 13:09:19.813: INFO: Pod for on the node: tuned-qb8zb, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.813: INFO: Pod for on the node: dns-default-p868p, Cpu: 65, Mem: 137363456
Mar 29 13:09:19.813: INFO: Pod for on the node: image-registry-8689cf4bdc-zmlkd, Cpu: 100, Mem: 268435456
Mar 29 13:09:19.813: INFO: Pod for on the node: node-ca-lhhjp, Cpu: 10, Mem: 10485760
Mar 29 13:09:19.813: INFO: Pod for on the node: ingress-canary-4dtnh, Cpu: 10, Mem: 20971520
Mar 29 13:09:19.813: INFO: Pod for on the node: router-default-68bd96b689-lclb2, Cpu: 100, Mem: 268435456
Mar 29 13:09:19.813: INFO: Pod for on the node: machine-config-daemon-mj56z, Cpu: 40, Mem: 104857600
Mar 29 13:09:19.813: INFO: Pod for on the node: certified-operators-dr2f2, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.813: INFO: Pod for on the node: redhat-marketplace-gwm8f, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.813: INFO: Pod for on the node: redhat-operators-gl29g, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.813: INFO: Pod for on the node: alertmanager-main-0, Cpu: 8, Mem: 283115520
Mar 29 13:09:19.813: INFO: Pod for on the node: alertmanager-main-2, Cpu: 8, Mem: 283115520
Mar 29 13:09:19.813: INFO: Pod for on the node: grafana-85bf4567dc-qt8zx, Cpu: 5, Mem: 125829120
Mar 29 13:09:19.813: INFO: Pod for on the node: kube-state-metrics-56cb64c9d8-ggqhb, Cpu: 4, Mem: 125829120
Mar 29 13:09:19.813: INFO: Pod for on the node: node-exporter-nshn9, Cpu: 9, Mem: 220200960
Mar 29 13:09:19.813: INFO: Pod for on the node: prometheus-adapter-5c64f698-nkl87, Cpu: 1, Mem: 26214400
Mar 29 13:09:19.813: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 76, Mem: 1262485504
Mar 29 13:09:19.813: INFO: Pod for on the node: thanos-querier-5d9585c755-kg5vr, Cpu: 9, Mem: 96468992
Mar 29 13:09:19.813: INFO: Pod for on the node: multus-wclfs, Cpu: 10, Mem: 157286400
Mar 29 13:09:19.813: INFO: Pod for on the node: network-metrics-daemon-zn6zq, Cpu: 20, Mem: 125829120
Mar 29 13:09:19.813: INFO: Pod for on the node: network-check-target-wxnbl, Cpu: 10, Mem: 15728640
Mar 29 13:09:19.813: INFO: Pod for on the node: ovs-gqnld, Cpu: 15, Mem: 419430400
Mar 29 13:09:19.813: INFO: Pod for on the node: sdn-cmsw4, Cpu: 110, Mem: 230686720
Mar 29 13:09:19.813: INFO: Node: ip-10-0-138-145.us-east-2.compute.internal, totalRequestedCPUResource: 780, cpuAllocatableMil: 3500, cpuFraction: 0.22285714285714286
Mar 29 13:09:19.813: INFO: Node: ip-10-0-138-145.us-east-2.compute.internal, totalRequestedMemResource: 4654628864, memAllocatableVal: 15623766016, memFraction: 0.29791977550312027
Mar 29 13:09:19.813: INFO: ComputeCPUMemFraction for node: ip-10-0-171-116.us-east-2.compute.internal
Mar 29 13:09:19.894: INFO: Pod for on the node: aws-ebs-csi-driver-node-rwt4r, Cpu: 30, Mem: 157286400
Mar 29 13:09:19.894: INFO: Pod for on the node: tuned-2ppkx, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.894: INFO: Pod for on the node: dns-default-gb99z, Cpu: 65, Mem: 137363456
Mar 29 13:09:19.894: INFO: Pod for on the node: node-ca-2ld9b, Cpu: 10, Mem: 10485760
Mar 29 13:09:19.894: INFO: Pod for on the node: ingress-canary-wqg2s, Cpu: 10, Mem: 20971520
Mar 29 13:09:19.894: INFO: Pod for on the node: machine-config-daemon-hh9ft, Cpu: 40, Mem: 104857600
Mar 29 13:09:19.894: INFO: Pod for on the node: community-operators-xdnff, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.894: INFO: Pod for on the node: redhat-marketplace-2wtbl, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.894: INFO: Pod for on the node: node-exporter-lw9v2, Cpu: 9, Mem: 220200960
Mar 29 13:09:19.894: INFO: Pod for on the node: openshift-state-metrics-54d764868c-dltdk, Cpu: 3, Mem: 199229440
Mar 29 13:09:19.894: INFO: Pod for on the node: telemeter-client-7c6f857fc5-429dw, Cpu: 3, Mem: 73400320
Mar 29 13:09:19.894: INFO: Pod for on the node: multus-xsx76, Cpu: 10, Mem: 157286400
Mar 29 13:09:19.894: INFO: Pod for on the node: network-metrics-daemon-qcb7n, Cpu: 20, Mem: 125829120
Mar 29 13:09:19.894: INFO: Pod for on the node: network-check-source-57c69dc7cc-9z8hl, Cpu: 10, Mem: 41943040
Mar 29 13:09:19.894: INFO: Pod for on the node: network-check-target-lhtk2, Cpu: 10, Mem: 15728640
Mar 29 13:09:19.894: INFO: Pod for on the node: ovs-4snlv, Cpu: 15, Mem: 419430400
Mar 29 13:09:19.894: INFO: Pod for on the node: sdn-7ncgf, Cpu: 110, Mem: 230686720
Mar 29 13:09:19.894: INFO: Node: ip-10-0-171-116.us-east-2.compute.internal, totalRequestedCPUResource: 475, cpuAllocatableMil: 3500, cpuFraction: 0.1357142857142857
Mar 29 13:09:19.894: INFO: Node: ip-10-0-171-116.us-east-2.compute.internal, totalRequestedMemResource: 2176843776, memAllocatableVal: 15623766016, memFraction: 0.1393290051688393
Mar 29 13:09:19.894: INFO: ComputeCPUMemFraction for node: ip-10-0-206-201.us-east-2.compute.internal
Mar 29 13:09:19.970: INFO: Pod for on the node: aws-ebs-csi-driver-node-2njfw, Cpu: 30, Mem: 157286400
Mar 29 13:09:19.970: INFO: Pod for on the node: tuned-28jpp, Cpu: 10, Mem: 52428800
Mar 29 13:09:19.970: INFO: Pod for on the node: dns-default-hvdks, Cpu: 65, Mem: 137363456
Mar 29 13:09:19.970: INFO: Pod for on the node: image-registry-8689cf4bdc-5whpt, Cpu: 100, Mem: 268435456
Mar 29 13:09:19.970: INFO: Pod for on the node: node-ca-vl8tc, Cpu: 10, Mem: 10485760
Mar 29 13:09:19.970: INFO: Pod for on the node: ingress-canary-777ht, Cpu: 10, Mem: 20971520
Mar 29 13:09:19.970: INFO: Pod for on the node: router-default-68bd96b689-kp5j6, Cpu: 100, Mem: 268435456
Mar 29 13:09:19.970: INFO: Pod for on the node: machine-config-daemon-x6kgc, Cpu: 40, Mem: 104857600
Mar 29 13:09:19.970: INFO: Pod for on the node: alertmanager-main-1, Cpu: 8, Mem: 283115520
Mar 29 13:09:19.970: INFO: Pod for on the node: node-exporter-829pf, Cpu: 9, Mem: 220200960
Mar 29 13:09:19.970: INFO: Pod for on the node: prometheus-adapter-5c64f698-hgmd4, Cpu: 1, Mem: 26214400
Mar 29 13:09:19.970: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 76, Mem: 1262485504
Mar 29 13:09:19.970: INFO: Pod for on the node: thanos-querier-5d9585c755-dzzd2, Cpu: 9, Mem: 96468992
Mar 29 13:09:19.970: INFO: Pod for on the node: multus-bntg9, Cpu: 10, Mem: 157286400
Mar 29 13:09:19.970: INFO: Pod for on the node: network-metrics-daemon-s7vth, Cpu: 20, Mem: 125829120
Mar 29 13:09:19.970: INFO: Pod for on the node: network-check-target-lbgpj, Cpu: 10, Mem: 15728640
Mar 29 13:09:19.970: INFO: Pod for on the node: ovs-v97f6, Cpu: 15, Mem: 419430400
Mar 29 13:09:19.970: INFO: Pod for on the node: sdn-dgrgb, Cpu: 110, Mem: 230686720
Mar 29 13:09:19.970: INFO: Node: ip-10-0-206-201.us-east-2.compute.internal, totalRequestedCPUResource: 733, cpuAllocatableMil: 3500, cpuFraction: 0.20942857142857144
Mar 29 13:09:19.970: INFO: Node: ip-10-0-206-201.us-east-2.compute.internal, totalRequestedMemResource: 3962568704, memAllocatableVal: 15623766016, memFraction: 0.2536244270390384
Mar 29 13:09:20.026: INFO: Waiting for running...
[AfterEach] [sig-scheduling] Multi-AZ Clusters
  k8s.io/kubernetes@v1.20.0/test/e2e/framework/framework.go:175
STEP: Collecting events from namespace "e2e-multi-az-5321".
STEP: Found 6 events.
Mar 29 13:19:20.211: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for 63c4019c-2e78-4331-89a1-a475cb7fa8bf-0: { } Scheduled: Successfully assigned e2e-multi-az-5321/63c4019c-2e78-4331-89a1-a475cb7fa8bf-0 to ip-10-0-138-145.us-east-2.compute.internal
Mar 29 13:19:20.211: INFO: At 2021-03-29 13:08:18 +0000 UTC - event for e2e-multi-az-5321: {namespace-security-allocation-controller } CreatedSCCRanges: created SCC ranges
Mar 29 13:19:20.211: INFO: At 2021-03-29 13:09:21 +0000 UTC - event for 63c4019c-2e78-4331-89a1-a475cb7fa8bf-0: {multus } AddedInterface: Add eth0 [10.129.2.104/23]
Mar 29 13:19:20.211: INFO: At 2021-03-29 13:13:20 +0000 UTC - event for 63c4019c-2e78-4331-89a1-a475cb7fa8bf-0: {kubelet ip-10-0-138-145.us-east-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Mar 29 13:19:20.211: INFO: At 2021-03-29 13:17:21 +0000 UTC - event for 63c4019c-2e78-4331-89a1-a475cb7fa8bf-0: {kubelet ip-10-0-138-145.us-east-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: waited too long for request to timeout or sandbox k8s_63c4019c-2e78-4331-89a1-a475cb7fa8bf-0_e2e-multi-az-5321_5643709e-04ac-4925-afbc-45286928b801_0 to be created: error reserving pod name k8s_63c4019c-2e78-4331-89a1-a475cb7fa8bf-0_e2e-multi-az-5321_5643709e-04ac-4925-afbc-45286928b801_0 for id d97f32796badeff9502991850139abe64b2dcc82cf33c03e89af246aabeb86ba: name is reserved
Mar 29 13:19:20.211: INFO: At 2021-03-29 13:17:23 +0000 UTC - event for 63c4019c-2e78-4331-89a1-a475cb7fa8bf-0: {multus } AddedInterface: Add eth0 [10.129.2.105/23]
Mar 29 13:19:20.238: INFO: POD                                     NODE                                        PHASE    GRACE  CONDITIONS
Mar 29 13:19:20.238: INFO: 63c4019c-2e78-4331-89a1-a475cb7fa8bf-0  ip-10-0-138-145.us-east-2.compute.internal  Pending         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-03-29 13:09:20 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-03-29 13:09:20 +0000 UTC ContainersNotReady containers with unready status: [63c4019c-2e78-4331-89a1-a475cb7fa8bf-0]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-03-29 13:09:20 +0000 UTC ContainersNotReady containers with unready status: [63c4019c-2e78-4331-89a1-a475cb7fa8bf-0]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-03-29 13:09:20 +0000 UTC  }]
Mar 29 13:19:20.238: INFO: 
Mar 29 13:19:20.313: INFO: skipping dumping cluster info - cluster too large
STEP: Destroying namespace "e2e-multi-az-5321" for this suite.
[AfterEach] [sig-scheduling] Multi-AZ Clusters
  k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/ubernetes_lite.go:67
fail [k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/ubernetes_lite.go:65]: Unexpected error:
    <*errors.errorString | 0xc0013c61d0>: {
        s: "Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels \"startPodsID=bde8c4ac-8b6a-48c1-9cd7-e079f135607f\" to be running",
    }
    Error waiting for 1 pods to be running - probably a timeout: Timeout while waiting for pods with labels "startPodsID=bde8c4ac-8b6a-48c1-9cd7-e079f135607f" to be running
occurred

"Kubelet may be retrying requests that are timing out in CRI-O due to system load" is a new error for me

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 29, 2021
@damemi
Contributor Author

damemi commented Mar 29, 2021

error reserving pod name k8s_63c4019c-2e78-4331-89a1-a475cb7fa8bf-0_e2e-multi-az-5321_5643709e-04ac-4925-afbc-45286928b801_0 for id d97f32796badeff9502991850139abe64b2dcc82cf33c03e89af246aabeb86ba: name is reserved

maybe the k8s prefix is not allowed for pod names? @soltysh any idea about this?

@damemi
Contributor Author

damemi commented Mar 30, 2021

/retest

1 similar comment
@damemi
Contributor Author

damemi commented Apr 5, 2021

/retest

@damemi
Contributor Author

damemi commented Apr 6, 2021

/retest
now these are just failing with a timeout trying to create the balancing pods (not the same error as above). It's happening along with some other tests, but it's still clearly contained to these. However, I'm pretty stuck trying to figure out what's causing it now.

@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2021

@damemi: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-aws-serial
Commit: 40a8a54
Rerun command: /test e2e-aws-serial

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aojea

aojea commented Apr 7, 2021

/retest
now these are just failing with a timeout trying to create the balancing pods (not the same error as above). It's happening along with some other tests, but it's still clearly contained to these. However, I'm pretty stuck trying to figure out what's causing it now.

pods are being killed by OOM

Apr  6 21:01:27.750: INFO: At 2021-04-06 20:55:27 +0000 UTC - event for ca8a031a-75a7-4976-b5f8-e1bca8603fa6-0: {kubelet ip-10-0-242-81.us-west-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

journal from that node https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25915/pull-ci-openshift-origin-master-e2e-aws-serial/1379504898885816320/artifacts/e2e-aws-serial/gather-extra/artifacts/nodes/ip-10-0-242-81.us-west-2.compute.internal/journal

Apr 06 20:51:29.933286 ip-10-0-242-81 crio[1431]: I0406 20:51:29.916384  165907 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-multi-az-7158", Name:"ca8a031a-75a7-4976-b5f8-e1bca8603fa6-0", UID:"aa04fb0f-f81a-410e-8735-5647e7491600", APIVersion:"v1", ResourceVersion:"69293", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.131.0.60/23]
Apr 06 20:51:29.934048 ip-10-0-242-81 crio[1431]: time="2021-04-06 20:51:29.933985855Z" level=info msg="Got pod network &{Name:ca8a031a-75a7-4976-b5f8-e1bca8603fa6-0 Namespace:e2e-multi-az-7158 ID:38f1b3e5bea9509e0d7d4b2c9e7a9f495c88175a184e66702dbe886bb7b0a141 NetNS:/var/run/netns/14715a92-4ef2-49b1-83e1-04732d19c612 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Apr 06 20:51:29.934247 ip-10-0-242-81 crio[1431]: time="2021-04-06 20:51:29.934207528Z" level=info msg="About to check CNI network multus-cni-network (type=multus)"
Apr 06 20:51:29.935864 ip-10-0-242-81 hyperkube[1467]: I0406 20:51:29.935738    1467 kubelet.go:1927] SyncLoop (UPDATE, "api"): "ca8a031a-75a7-4976-b5f8-e1bca8603fa6-0_e2e-multi-az-7158(aa04fb0f-f81a-410e-8735-5647e7491600)"
Apr 06 20:51:29.950799 ip-10-0-242-81 systemd[1]: Started crio-conmon-38f1b3e5bea9509e0d7d4b2c9e7a9f495c88175a184e66702dbe886bb7b0a141.scope.
Apr 06 20:51:30.072034 ip-10-0-242-81 kernel: runc invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=-1000
Apr 06 20:51:30.072215 ip-10-0-242-81 kernel: CPU: 0 PID: 166022 Comm: runc Not tainted 4.18.0-240.15.1.el8_3.x86_64 #1
Apr 06 20:51:30.072278 ip-10-0-242-81 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
Apr 06 20:51:30.072343 ip-10-0-242-81 kernel: Call Trace:
Apr 06 20:51:30.072401 ip-10-0-242-81 kernel:  dump_stack+0x5c/0x80
Apr 06 20:51:30.072449 ip-10-0-242-81 kernel:  dump_header+0x51/0x308
Apr 06 20:51:30.072497 ip-10-0-242-81 kernel:  out_of_memory.cold.31+0x39/0x89
Apr 06 20:51:30.072543 ip-10-0-242-81 kernel:  mem_cgroup_out_of_memory+0xbe/0xd0
Apr 06 20:51:30.072588 ip-10-0-242-81 kernel:  try_charge+0x6f4/0x780
Apr 06 20:51:30.072640 ip-10-0-242-81 kernel:  ? __alloc_pages_nodemask+0xef/0x280
Apr 06 20:51:30.072688 ip-10-0-242-81 kernel:  mem_cgroup_try_charge+0x8b/0x190
Apr 06 20:51:30.072739 ip-10-0-242-81 kernel:  mem_cgroup_try_charge_delay+0x1c/0x40
Apr 06 20:51:30.072796 ip-10-0-242-81 kernel:  do_anonymous_page+0xb5/0x360
Apr 06 20:51:30.072890 ip-10-0-242-81 kernel:  __handle_mm_fault+0x77c/0x7c0
Apr 06 20:51:30.072943 ip-10-0-242-81 kernel:  ? ovl_show_options+0x12b/0x230 [overlay]
Apr 06 20:51:30.073000 ip-10-0-242-81 kernel:  handle_mm_fault+0xc2/0x1d0
Apr 06 20:51:30.073054 ip-10-0-242-81 kernel:  __do_page_fault+0x21b/0x4d0
Apr 06 20:51:30.073107 ip-10-0-242-81 kernel:  do_page_fault+0x32/0x110
Apr 06 20:51:30.073161 ip-10-0-242-81 kernel:  ? page_fault+0x8/0x30
Apr 06 20:51:30.073214 ip-10-0-242-81 kernel:  page_fault+0x1e/0x30
Apr 06 20:51:30.073266 ip-10-0-242-81 kernel: RIP: 0033:0x560fde70c232
Apr 06 20:51:30.073323 ip-10-0-242-81 kernel: Code: 0f 6f 06 f3 0f 6f 4e 10 f3 0f 6f 56 20 f3 0f 6f 5e 30 f3 0f 6f 64 1e c0 f3 0f 6f 6c 1e d0 f3 0f 6f 74 1e e0 f3 0f 6f 7c 1e f0 <f3> 0f 7f 07 f3 0f 7f 4f 10 f3 0f 7f 57 20 f3 0f 7f 5f 30 f3 0f 7f
Apr 06 20:51:30.073386 ip-10-0-242-81 kernel: RSP: 002b:000000c0000d8e60 EFLAGS: 00010287
Apr 06 20:51:30.073442 ip-10-0-242-81 kernel: RAX: 0000000000000073 RBX: 0000000000000073 RCX: 000000c00030965c
Apr 06 20:51:30.073498 ip-10-0-242-81 kernel: RDX: 000000000004186a RSI: 000000c00030965c RDI: 000000c000344000
Apr 06 20:51:30.073552 ip-10-0-242-81 kernel: RBP: 000000c0000d8e88 R08: 0000000000000000 R09: 0000000000000001
Apr 06 20:51:30.073612 ip-10-0-242-81 kernel: R10: 000000c000344000 R11: 00000000000001a2 R12: 0000000000000000
Apr 06 20:51:30.073669 ip-10-0-242-81 kernel: R13: 0000000000000040 R14: 0000560fdebbcd54 R15: 0000000000000038
Apr 06 20:51:30.073721 ip-10-0-242-81 kernel: memory: usage 12288kB, limit 12288kB, failcnt 18
Apr 06 20:51:30.080416 ip-10-0-242-81 kernel: memory+swap: usage 12288kB, limit 9007199254740988kB, failcnt 0
Apr 06 20:51:30.080503 ip-10-0-242-81 kernel: kmem: usage 976kB, limit 9007199254740988kB, failcnt 0
Apr 06 20:51:30.601276 ip-10-0-242-81 kernel: Memory cgroup stats for /kubepods.slice/kubepods-podaa04fb0f_f81a_410e_8735_5647e7491600.slice:
Apr 06 20:51:30.601448 ip-10-0-242-81 kernel: anon 11329536
                                              file 0

@aojea

aojea commented Apr 7, 2021

The ingress ones seem to fail because they assume there is no default IngressClass:

fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Apr  6 20:40:05.612: Expected IngressClassName to be nil, got openshift-default

@soltysh
Contributor

soltysh commented Apr 7, 2021

This is being picked up with the k8s bump in #26054
@damemi feel free to link the BZ to this PR
/close

@openshift-ci-robot

@soltysh: Closed this PR.


In response to this:

This is being picked up with the k8s bump in #26054
@damemi feel free to link the BZ to this PR
/close


@openshift-ci-robot

@damemi: This pull request references Bugzilla bug 1896558. The bug has been updated to no longer refer to the pull request using the external bug tracker.


In response to this:

Bug 1896558: bump(openshift/kubernetes): multi-az spreading e2e flakes


@damemi
Contributor Author

damemi commented Apr 7, 2021

@aojea thanks for finding that. OOM is interesting, since these are using just small test images. I wonder if the memory limits are being set too low (and whether we even need limits for the scheduler to see the nodes as balanced, or if balancing only considers requests).
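For reference, the balancing check in the test log above derives each node's utilization from summed pod requests against allocatable capacity, which is why limits may not factor into the balance at all. A minimal Go sketch of that arithmetic (illustrative only, not the upstream ComputeCPUMemFraction code):

```go
package main

import "fmt"

// cpuFraction sketches the arithmetic the e2e balancing helper logs above:
// total requested millicores divided by allocatable millicores. It sums pod
// *requests*, not limits, which is the point in question.
func cpuFraction(requestedMilli, allocatableMilli int64) float64 {
	return float64(requestedMilli) / float64(allocatableMilli)
}

func main() {
	// Values from the log for ip-10-0-138-145: 780m requested, 3500m allocatable.
	fmt.Printf("cpuFraction: %v\n", cpuFraction(780, 3500))
}
```

For ip-10-0-138-145 the log shows 780m requested of 3500m allocatable, i.e. a cpuFraction of about 0.223, matching the logged value.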

@soltysh if you hit this again in your PR we will have to debug further

@damemi
Contributor Author

damemi commented May 17, 2021

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 17, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 17, 2021

@damemi: This pull request references Bugzilla bug 1896558, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is MODIFIED instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

/bugzilla refresh



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/invalid-bug: Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.
  • bugzilla/severity-medium: Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm: Indicates that a PR is ready to be merged.
  • vendor-update: Touching vendor dir or related files.
