data/aws/variables-aws: Bump master volume to 500 GiB for I/O #844
We'd bumped this to 120 GiB in bca6a92 (aws: Increase default master disk size to 120GB for IO, 2018-11-27), but in a recent CI cluster [1] the master-0 etcd-member still cycled:

[core@ip-10-0-6-251 ~]$ sudo crictl ps -a | grep etcd-member
4612be4ebdd15 94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4 27 minutes ago Running etcd-member 1
81b927a307579 quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1 43 minutes ago Exited etcd-member 0
[core@ip-10-0-6-251 ~]$ sudo crictl inspect 81b927a307579
{
  "status": {
    "id": "81b927a307579bab5675851d1ffa3e37c288a252c125ffe0a9e7b4a772fae0a7",
    "metadata": {
      "attempt": 0,
      "name": "etcd-member"
    },
    "state": "CONTAINER_EXITED",
    "createdAt": "2018-12-08T05:36:05.119762774Z",
    "startedAt": "2018-12-08T05:36:05.143173509Z",
    "finishedAt": "2018-12-08T05:51:04.889449401Z",
    "exitCode": 255,
    "image": {
      "image": "quay.io/coreos/etcd:v3.2.14"
    },
    "imageRef": "quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1",
    "reason": "Error",
    "message": "",
    "labels": {
      "io.kubernetes.container.name": "etcd-member",
      "io.kubernetes.pod.name": "etcd-member-ip-10-0-6-251.ec2.internal",
      "io.kubernetes.pod.namespace": "kube-system",
      "io.kubernetes.pod.uid": "73a817bd724bb22e2bd6e4ff0ccee478"
    },
    "annotations": {
      "io.kubernetes.container.hash": "f3672bc7",
      "io.kubernetes.container.ports": "[{\"name\":\"peer\",\"hostPort\":2380,\"containerPort\":2380,\"protocol\":\"TCP\"},{\"name\":\"server\",\"hostPort\":2379,\"containerPort\":2379,\"protocol\":\"TCP\"}]",
      "io.kubernetes.container.restartCount": "0",
      "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
      "io.kubernetes.container.terminationMessagePolicy": "File",
      "io.kubernetes.pod.terminationGracePeriod": "30"
    },
    "mounts": [],
    "logPath": "/var/log/pods/73a817bd724bb22e2bd6e4ff0ccee478/etcd-member/0.log"
  }
}
[core@ip-10-0-6-251 ~]$ sudo crictl logs --tail=10 81b927a307579
2018-12-08 05:48:01.220052 W | wal: sync duration of 1.551958035s, expected less than 1s
2018-12-08 05:48:05.013263 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 11; CANCEL")
2018-12-08 05:48:12.491950 W | wal: sync duration of 1.756141212s, expected less than 1s
2018-12-08 05:49:04.229575 W | etcdserver: apply entries took too long [120.740107ms for 1 entries]
2018-12-08 05:49:04.229602 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 05:49:04.712619 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 05:49:06.346062 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 05:49:06.722589 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 05:49:11.048170 W | etcdserver: apply entries took too long [151.877306ms for 1 entries]
2018-12-08 05:49:11.048202 W | etcdserver: avoid queries with large range/delete range!

On gp2's sliding IOPS scale [2], 500 GiB gets us a lot more headroom (a 1.5k IOPS baseline), but the etcd docs recommend an m4.large with a max of 3.6k concurrent IOPS for a 50-node Kubernetes cluster [3]. So we'll see if it helps in CI.
[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/841/pull-ci-openshift-installer-master-e2e-aws/2064/build-log.txt
[2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#EBSVolumeTypes_gp2
[3]: https://github.com/etcd-io/etcd/blob/v3.3.10/Documentation/op-guide/hardware.md#small-cluster
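For context on the arithmetic: gp2 volumes earn a baseline of 3 IOPS per provisioned GiB (with a floor of 100 IOPS, and burst up to 3,000 IOPS for volumes under 1 TiB) [2]. A quick shell check of the old and new defaults:

$ echo "$((120 * 3)) baseline IOPS at the old 120 GiB default"
360 baseline IOPS at the old 120 GiB default
$ echo "$((500 * 3)) baseline IOPS at the new 500 GiB default"
1500 baseline IOPS at the new 500 GiB default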
But the underlying issue was master-instability. Here's when the initial round of pods died:

[core@ip-10-0-7-211 ~]$ sudo crictl ps -a
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT
6413013f1b203 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc 9 seconds ago Running operator 2
434fa05a795cb f6df05d6a36426dde85b9c778b435a7aaa12543d69d603f629b8f5273356ec7b 13 seconds ago Running cluster-dns-operator 1
9c3e0b96b8325 f06c190859935d127c2efee77beb689fbacb53ec93b88547c25392fc970289f7 14 seconds ago Running operator 1
ebc194bc9b966 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:6667ac4aecae183dfd4e6ae4277dd86ca977e0a3b9feefee653043105503c6d6 18 seconds ago Exited tuned 1
011545eb60059 05503aa686767edf45b70172c8975c8b9743bb6a6c1c182c181eb36cd610f6fc 18 seconds ago Running machine-config-server 1
d8aa7ead7c3dc registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:581af93fda80257651d621dade879e438f846a5bf39040dd0259006fc3b73820 18 seconds ago Running machine-approver-controller 1
413a0cca0629b bb84dbdafdfa61c29ca989c652d63debe03a043c2099f6ad7097ac28865bd406 20 seconds ago Running cluster-network-operator 1
c0ace414b0ce2 25dbbded706585310ae3ebc743bcc59659f826fff1bac24be4a56b83d37e3cc2 25 seconds ago Running machine-config-daemon 1
a2e1ef4a0c97b 03de8f11d9e07ee2b23be6d48dc849b9a5e24e4ab4c3ab758bdcd583b3b8fbd9 27 seconds ago Running sdn-controller 1
c08642e38da06 registry.svc.ci.openshift.org/ci-op-1mpypn4i/release@sha256:77dd81dbdb38c941fc288f551f39ddef1de251384cbfb8f6755ff7f072ab9a13 27 seconds ago Running cluster-version-operator 1
8f80023565506 1d2ec4ba1e697f9c0eb69c154888e6f09007a3d2aad4c34bb7868cec86b8f8f8 28 seconds ago Running sdn 1
fb31831e0665c 1d2ec4ba1e697f9c0eb69c154888e6f09007a3d2aad4c34bb7868cec86b8f8f8 28 seconds ago Running openvswitch 1
89155ae218212 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98 30 seconds ago Running openshift-kube-apiserver 1
edafa7d07d214 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315 30 seconds ago Running kube-controller-manager 1
abab48f8aaf0e registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315 30 seconds ago Running scheduler 1
b735c768b8baf 94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4 38 seconds ago Running etcd-member 1
b79135e346749 b02de22ff740f0bfa7e5dde5aa1a8169051375a5f0c69c28fafefc9408f72b06 39 seconds ago Exited certs 0
04b4489ae42e3 04a052dbf6cb5ac2afa57eb41c37a2964ee16c7ee62986900aceb38f369c8411 39 seconds ago Exited discovery 1
1e38cb58d5063 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:14b67d7a5d1ec05dd45e60663b6e4e0c460cf7f429397dd3a3ec2d1997e52096 2 minutes ago Exited machine-config-daemon 0
e8715191161af registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:08fbd1a5a90b59572a3b097389fc17f7ae9b9b1ef7e1f3d19103e280fc656518 2 minutes ago Exited console 0
9c19639ae072d registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:9ecf75c8384f5ec5431b4a3267466833c1caa212a183ce8b3e7f42bc3f7e1dcc 3 minutes ago Exited machine-config-server 0
439c340ab23ce 03de8f11d9e07ee2b23be6d48dc849b9a5e24e4ab4c3ab758bdcd583b3b8fbd9 3 minutes ago Exited controller-manager 0
a12b3df1bae1f registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98 4 minutes ago Exited openshift-kube-apiserver 0
74d8d492a43db registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98 4 minutes ago Exited openshift-apiserver 0
98c47bd0e9246 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc 4 minutes ago Exited installer 0
6c15d73ec3e14 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:c8c110b8733d0d352ddc5fe35ba9eeac913b7609c2c9c778586f2bb74f281681 4 minutes ago Exited registry 0
7796346af6b01 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:581af93fda80257651d621dade879e438f846a5bf39040dd0259006fc3b73820 5 minutes ago Exited machine-approver-controller 0
afb16e1bdfea8 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:40fa0e40cf625f314586d47a4199d5de985b7f980d7c4b650ab7f2c0f74f41b2 5 minutes ago Exited registry-ca-hostmapper 0
8aff3b343e0dc registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc 6 minutes ago Exited operator 1
158f097ba79cb registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315 7 minutes ago Exited kube-controller-manager 0
128c03f9fa9a8 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:38d43ca65fce090c19b092b1c0962781c146f9fc65f3228eb96f5aad684c9119 7 minutes ago Exited installer 0
bf6ac2a6570f2 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315 8 minutes ago Exited scheduler 0
b43c90d9cebb6 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:b532d9351b803b5a03cf6777f12d725b4973851957497ea3e2b37313aadd6750 8 minutes ago Exited operator 0
4f333ad437f34 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:a494ee2152687973d1987598cf82f7606b2c4fbb48c7f09f2e897cb417ab88f1 8 minutes ago Exited installer 0
28eed681186eb registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc 8 minutes ago Exited installer 0
c0d2a89890a29 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:ce86e514320b680f39735323288cfd19caee5a9480b086b4b275454aef94136e 8 minutes ago Exited dns-node-resolver 0
d244a6189ea7b registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:e4936a702d7d466a64a6a9359f35c7ad528bba7c35fe5c582a90e46f9051d8b8 8 minutes ago Exited dns 0
7e12941ef62cd registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:3287ba30af508652bda691490ad7dbd97febf4e90b72e23228f87c9038fc387e 9 minutes ago Exited operator 0
9f0bcda47b3bc registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f538d65377e0891e70555009f4778c0b1782360ea0f4309adec905cad752593d 9 minutes ago Exited cluster-dns-operator 0
dc6d7fba5de79 registry.svc.ci.openshift.org/ci-op-1mpypn4i/release@sha256:77dd81dbdb38c941fc288f551f39ddef1de251384cbfb8f6755ff7f072ab9a13 9 minutes ago Exited cluster-version-operator 0
b46b53bc15b68 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:0f51e8c6713cf23fac9b4b61d3e10e453936c139ee9a58171090b5ffe7cd37ae 9 minutes ago Exited openvswitch 0
98d49e2e52366 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:0f51e8c6713cf23fac9b4b61d3e10e453936c139ee9a58171090b5ffe7cd37ae 9 minutes ago Exited sdn 0
62fe7aeacb776 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98 10 minutes ago Exited sdn-controller 0
e2026a15792b4 registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:a8aa3e53cbaeae806210878f0c7b499b636a963b2a52f4d1eea6db3dfa2fdc98 11 minutes ago Exited cluster-network-operator 0
751acc223850b quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1 13 minutes ago Exited etcd-member 0

Timing on that first etcd-member pod:

[core@ip-10-0-7-211 ~]$ sudo crictl inspect 751acc223850b4
{
"status": {
"id": "751acc223850b440d265c509f54b4b845b532c5fcbe53f4e6107486aaf81a67f",
"metadata": {
"attempt": 0,
"name": "etcd-member"
},
"state": "CONTAINER_EXITED",
"createdAt": "2018-12-08T06:52:46.817382964Z",
"startedAt": "2018-12-08T06:52:46.839814886Z",
"finishedAt": "2018-12-08T07:07:43.572403945Z",
"exitCode": 255,
"image": {
"image": "quay.io/coreos/etcd:v3.2.14"
},
"imageRef": "quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1",
"reason": "Error",
"message": "",
"labels": {
"io.kubernetes.container.name": "etcd-member",
"io.kubernetes.pod.name": "etcd-member-ip-10-0-7-211.ec2.internal",
"io.kubernetes.pod.namespace": "kube-system",
"io.kubernetes.pod.uid": "b691fb2b768f9a08bdaac585b1e6cec4"
},
"annotations": {
"io.kubernetes.container.hash": "c47b4da2",
"io.kubernetes.container.ports": "[{\"name\":\"peer\",\"hostPort\":2380,\"containerPort\":2380,\"protocol\":\"TCP\"},{\"name\":\"server\",\"hostPort\":2379,\"containerPort\":2379,\"protocol\":\"TCP\"}]",
"io.kubernetes.container.restartCount": "0",
"io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
"io.kubernetes.container.terminationMessagePolicy": "File",
"io.kubernetes.pod.terminationGracePeriod": "30"
},
"mounts": [],
"logPath": "/var/log/pods/b691fb2b768f9a08bdaac585b1e6cec4/etcd-member/0.log"
}
}

Again, no luck with the termination log:

[core@ip-10-0-7-211 ~]$ sudo crictl exec 751acc223850b4 cat /dev/termination-log
FATA[0000] execing command in container failed: Internal error occurred: error executing command in container: container is not created or running

Maybe I can tail -f that file next time (see the sketch at the end of this comment). etcd-member logs:

[core@ip-10-0-7-211 ~]$ sudo crictl logs 751acc223850b4
2018-12-08 06:52:46.851656 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2018-12-08 06:52:46.851995 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd-member-ip-10-0-7-211.ec2.internal
2018-12-08 06:52:46.852129 I | etcdmain: etcd Version: 3.2.14
2018-12-08 06:52:46.852185 I | etcdmain: Git SHA: fb5cd6f1c
2018-12-08 06:52:46.852232 I | etcdmain: Go Version: go1.8.5
2018-12-08 06:52:46.852342 I | etcdmain: Go OS/Arch: linux/amd64
2018-12-08 06:52:46.852357 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-12-08 06:52:46.852526 I | embed: peerTLS: cert = /etc/ssl/etcd/system:etcd-peer:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.crt, key = /etc/ssl/etcd/system:etcd-peer:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
2018-12-08 06:52:46.853286 I | embed: listening for peers on https://0.0.0.0:2380
2018-12-08 06:52:46.853396 I | embed: listening for client requests on 0.0.0.0:2379
2018-12-08 06:52:46.862199 N | embed: got bootstrap from DNS for etcd-server at 0=https://ci-op-1mpypn4i-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.862232 N | embed: got bootstrap from DNS for etcd-server at etcd-member-ip-10-0-7-211.ec2.internal=https://ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.862261 N | embed: got bootstrap from DNS for etcd-server at 1=https://ci-op-1mpypn4i-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.865255 I | pkg/netutil: resolving ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380 to 10.0.7.211:2380
2018-12-08 06:52:46.886503 I | etcdserver: name = etcd-member-ip-10-0-7-211.ec2.internal
2018-12-08 06:52:46.886520 I | etcdserver: data dir = /var/lib/etcd
2018-12-08 06:52:46.886529 I | etcdserver: member dir = /var/lib/etcd/member
2018-12-08 06:52:46.886534 I | etcdserver: heartbeat = 100ms
2018-12-08 06:52:46.886538 I | etcdserver: election = 1000ms
2018-12-08 06:52:46.886543 I | etcdserver: snapshot count = 100000
2018-12-08 06:52:46.886554 I | etcdserver: advertise client URLs = https://10.0.7.211:2379
2018-12-08 06:52:46.886561 I | etcdserver: initial advertise peer URLs = https://10.0.7.211:2380
2018-12-08 06:52:46.886582 I | etcdserver: initial cluster = 0=https://ci-op-1mpypn4i-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com:2380,1=https://ci-op-1mpypn4i-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com:2380,etcd-member-ip-10-0-7-211.ec2.internal=https://ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.889270 I | etcdserver: starting member 2b82b9112d5c6d00 in cluster 3cb94e8b51630707
2018-12-08 06:52:46.889305 I | raft: 2b82b9112d5c6d00 became follower at term 0
2018-12-08 06:52:46.889322 I | raft: newRaft 2b82b9112d5c6d00 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-12-08 06:52:46.889329 I | raft: 2b82b9112d5c6d00 became follower at term 1
2018-12-08 06:52:46.894760 W | auth: simple token is not cryptographically signed
2018-12-08 06:52:46.898690 I | rafthttp: starting peer 36dea2cf9cf8b40f...
2018-12-08 06:52:46.898738 I | rafthttp: started HTTP pipelining with peer 36dea2cf9cf8b40f
2018-12-08 06:52:46.902994 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (writer)
2018-12-08 06:52:46.903066 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (writer)
2018-12-08 06:52:46.906014 I | rafthttp: started peer 36dea2cf9cf8b40f
2018-12-08 06:52:46.906107 I | rafthttp: added peer 36dea2cf9cf8b40f
2018-12-08 06:52:46.906202 I | rafthttp: starting peer 45ac11b837bb4860...
2018-12-08 06:52:46.906275 I | rafthttp: started HTTP pipelining with peer 45ac11b837bb4860
2018-12-08 06:52:46.911971 I | rafthttp: started peer 45ac11b837bb4860
2018-12-08 06:52:46.912055 I | rafthttp: added peer 45ac11b837bb4860
2018-12-08 06:52:46.912131 I | etcdserver: starting server... [version: 3.2.14, cluster version: to_be_decided]
2018-12-08 06:52:46.914329 I | embed: ClientTLS: cert = /etc/ssl/etcd/system:etcd-server:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.crt, key = /etc/ssl/etcd/system:etcd-server:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
2018-12-08 06:52:46.915682 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (stream MsgApp v2 reader)
2018-12-08 06:52:46.916041 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (stream Message reader)
2018-12-08 06:52:46.916369 I | rafthttp: started streaming with peer 45ac11b837bb4860 (writer)
2018-12-08 06:52:46.916446 I | rafthttp: started streaming with peer 45ac11b837bb4860 (writer)
2018-12-08 06:52:46.916534 I | rafthttp: started streaming with peer 45ac11b837bb4860 (stream MsgApp v2 reader)
2018-12-08 06:52:46.916920 I | rafthttp: started streaming with peer 45ac11b837bb4860 (stream Message reader)
2018-12-08 06:52:46.918644 I | rafthttp: peer 36dea2cf9cf8b40f became active
2018-12-08 06:52:46.918724 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message writer)
2018-12-08 06:52:46.919079 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 writer)
2018-12-08 06:52:46.919611 I | etcdserver/membership: added member 2b82b9112d5c6d00 [https://ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380] to cluster 3cb94e8b51630707
2018-12-08 06:52:46.919923 I | etcdserver/membership: added member 36dea2cf9cf8b40f [https://ci-op-1mpypn4i-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com:2380] to cluster 3cb94e8b51630707
2018-12-08 06:52:46.920190 I | etcdserver/membership: added member 45ac11b837bb4860 [https://ci-op-1mpypn4i-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com:2380] to cluster 3cb94e8b51630707
2018-12-08 06:52:46.940293 I | raft: 2b82b9112d5c6d00 [term: 1] received a MsgHeartbeat message with higher term from 45ac11b837bb4860 [term: 4]
2018-12-08 06:52:46.940544 I | raft: 2b82b9112d5c6d00 became follower at term 4
2018-12-08 06:52:46.940605 I | raft: raft.node: 2b82b9112d5c6d00 elected leader 45ac11b837bb4860 at term 4
2018-12-08 06:52:46.956645 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message reader)
2018-12-08 06:52:46.961710 I | rafthttp: peer 45ac11b837bb4860 became active
2018-12-08 06:52:46.961796 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 reader)
2018-12-08 06:52:46.962264 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 reader)
2018-12-08 06:52:46.979561 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream Message writer)
2018-12-08 06:52:46.980665 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream Message reader)
2018-12-08 06:52:46.981517 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 writer)
2018-12-08 06:52:46.983277 N | etcdserver/membership: set the initial cluster version to 3.0
2018-12-08 06:52:46.983377 I | etcdserver/api: enabled capabilities for version 3.0
2018-12-08 06:52:46.984458 I | etcdserver: published {Name:etcd-member-ip-10-0-7-211.ec2.internal ClientURLs:[https://10.0.7.211:2379]} to cluster 3cb94e8b51630707
2018-12-08 06:52:46.984611 I | embed: ready to serve client requests
2018-12-08 06:52:46.984920 I | embed: serving client requests on [::]:2379
WARNING: 2018/12/08 06:52:46 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2018-12-08 06:52:47.746760 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
2018-12-08 06:52:47.746875 I | etcdserver/api: enabled capabilities for version 3.2
2018-12-08 06:55:58.886844 W | wal: sync duration of 1.404630213s, expected less than 1s
2018-12-08 06:56:31.977480 W | etcdserver: apply entries took too long [579.627031ms for 1 entries]
2018-12-08 06:56:31.977630 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:56:39.778259 W | etcdserver: apply entries took too long [2.308713015s for 1 entries]
2018-12-08 06:56:39.778668 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:56:39.778807 W | wal: sync duration of 1.775918245s, expected less than 1s
2018-12-08 06:56:41.266582 W | wal: sync duration of 1.487652289s, expected less than 1s
2018-12-08 06:56:44.948229 W | etcdserver: apply entries took too long [3.681196669s for 4 entries]
2018-12-08 06:56:44.948254 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:56:45.075204 W | wal: sync duration of 3.808380532s, expected less than 1s
2018-12-08 06:56:59.277846 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1028, vote: 0] ignored MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 4: lease is not expired (remaining ticks: 10)
2018-12-08 06:57:00.232584 W | wal: sync duration of 3.045415715s, expected less than 1s
2018-12-08 06:57:00.898326 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] ignored MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 4: lease is not expired (remaining ticks: 10)
2018-12-08 06:57:01.425086 W | wal: sync duration of 1.192379521s, expected less than 1s
2018-12-08 06:57:02.198238 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] ignored MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 4: lease is not expired (remaining ticks: 9)
2018-12-08 06:57:02.617132 W | etcdserver: apply entries took too long [1.1915068s for 1 entries]
2018-12-08 06:57:02.617275 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:57:02.617319 W | wal: sync duration of 1.191904588s, expected less than 1s
2018-12-08 06:57:04.150989 I | raft: 2b82b9112d5c6d00 [term: 4] received a MsgVote message with higher term from 36dea2cf9cf8b40f [term: 8]
2018-12-08 06:57:04.151025 I | raft: 2b82b9112d5c6d00 became follower at term 8
2018-12-08 06:57:04.151039 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] rejected MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 8
2018-12-08 06:57:04.151049 I | raft: raft.node: 2b82b9112d5c6d00 lost leader 45ac11b837bb4860 at term 8
2018-12-08 06:57:04.725808 W | etcdserver: timed out waiting for read index response
2018-12-08 06:57:04.725861 I | raft: 2b82b9112d5c6d00 no leader at term 8; dropping index reading msg
2018-12-08 06:57:05.029466 I | raft: 2b82b9112d5c6d00 [term: 8] ignored a MsgVote message with lower term from 45ac11b837bb4860 [term: 6]
2018-12-08 06:57:05.939321 I | raft: 2b82b9112d5c6d00 [term: 8] received a MsgVote message with higher term from 36dea2cf9cf8b40f [term: 9]
2018-12-08 06:57:05.939378 I | raft: 2b82b9112d5c6d00 became follower at term 9
2018-12-08 06:57:05.939408 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] rejected MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 9
2018-12-08 06:57:07.421230 I | raft: 2b82b9112d5c6d00 [term: 9] received a MsgVote message with higher term from 45ac11b837bb4860 [term: 10]
2018-12-08 06:57:07.421286 I | raft: 2b82b9112d5c6d00 became follower at term 10
2018-12-08 06:57:07.421315 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] cast MsgVote for 45ac11b837bb4860 [logterm: 4, index: 1029] at term 10
2018-12-08 06:57:07.423315 I | raft: raft.node: 2b82b9112d5c6d00 elected leader 45ac11b837bb4860 at term 10
2018-12-08 06:57:11.726065 W | etcdserver: timed out waiting for read index response
2018-12-08 06:57:52.861971 W | etcdserver: apply entries took too long [287.709698ms for 2 entries]
2018-12-08 06:57:52.861997 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:58:40.003521 W | wal: sync duration of 1.07701685s, expected less than 1s
2018-12-08 06:59:03.397005 W | wal: sync duration of 3.478824796s, expected less than 1s
2018-12-08 07:00:20.198294 I | raft: 2b82b9112d5c6d00 [term: 10] received a MsgVote message with higher term from 36dea2cf9cf8b40f [term: 11]
2018-12-08 07:00:20.198335 I | raft: 2b82b9112d5c6d00 became follower at term 11
2018-12-08 07:00:20.198352 I | raft: 2b82b9112d5c6d00 [logterm: 10, index: 4759, vote: 0] cast MsgVote for 36dea2cf9cf8b40f [logterm: 10, index: 4759] at term 11
2018-12-08 07:00:20.198364 I | raft: raft.node: 2b82b9112d5c6d00 lost leader 45ac11b837bb4860 at term 11
2018-12-08 07:00:20.200279 I | raft: raft.node: 2b82b9112d5c6d00 elected leader 36dea2cf9cf8b40f at term 11
2018-12-08 07:00:25.745499 W | etcdserver: timed out waiting for read index response
2018-12-08 07:02:54.605683 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:02:54.606681 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:02:54.606800 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:02:54.606871 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:03:22.447814 W | etcdserver: apply entries took too long [104.730051ms for 1 entries]
2018-12-08 07:03:22.447840 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 07:04:16.093172 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:04:16.093373 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:04:16.093563 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:04:32.882512 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 reader)
2018-12-08 07:04:32.882638 E | rafthttp: failed to read 45ac11b837bb4860 on stream MsgApp v2 (unexpected EOF)
2018-12-08 07:04:32.882692 I | rafthttp: peer 45ac11b837bb4860 became inactive
2018-12-08 07:04:32.883003 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream Message reader)
2018-12-08 07:04:36.670428 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message reader)
2018-12-08 07:04:36.670452 E | rafthttp: failed to read 36dea2cf9cf8b40f on stream Message (unexpected EOF)
2018-12-08 07:04:36.674767 I | rafthttp: peer 36dea2cf9cf8b40f became inactive
2018-12-08 07:04:36.676627 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 reader)
2018-12-08 07:04:36.692527 I | raft: 2b82b9112d5c6d00 is starting a new election at term 11
2018-12-08 07:04:36.692557 I | raft: 2b82b9112d5c6d00 became candidate at term 12
2018-12-08 07:04:36.692569 I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00 at term 12
2018-12-08 07:04:36.692581 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term 12
2018-12-08 07:04:36.692592 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term 12
2018-12-08 07:04:36.692603 I | raft: raft.node: 2b82b9112d5c6d00 lost leader 36dea2cf9cf8b40f at term 12
2018-12-08 07:04:37.465281 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream Message writer)
2018-12-08 07:04:38.189677 I | raft: 2b82b9112d5c6d00 is starting a new election at term 12
2018-12-08 07:04:38.189807 I | raft: 2b82b9112d5c6d00 became candidate at term 13
2018-12-08 07:04:38.189862 I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00 at term 13
2018-12-08 07:04:38.189910 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term 13
2018-12-08 07:04:38.189955 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term 13
2018-12-08 07:04:38.193365 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message writer)
2018-12-08 07:04:38.584175 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 writer)
2018-12-08 07:04:39.789772 I | raft: 2b82b9112d5c6d00 is starting a new election at term 13
2018-12-08 07:04:39.790005 I | raft: 2b82b9112d5c6d00 became candidate at term 14
2018-12-08 07:04:39.790057 I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00 at term 14
2018-12-08 07:04:39.790105 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term 14
2018-12-08 07:04:39.790177 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term 14
2018-12-08 07:04:40.237211 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 writer)

So still a few slow-disk errors (like the wal sync-duration warnings above). Maybe the disk access is hitting CPU limitations? I dunno. But things seem to be moving along sustainably until around 07:03, after which we start to see lots of networking/connectivity issues. Next time I go through this, I'll try to grab etcd-member logs from the other masters as well. Also, in case the whole etcd-member restart is a red herring, I've started looking at the end of the

CC @smarterclayton, @crawford, @abhinavdahiya, @deads2k, @sttts, and anyone else who understands how this is supposed to be working and can separate the fatal stuff from the log noise. Thoughts on what is happening? Anything else I should grab next time?
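For the "anything else I should grab" question, one option for next time is to tail the container's log file directly from the host instead of going through crictl exec (which, as above, fails once the container has exited). A minimal sketch, assuming jq is available on the node; logPath comes straight out of the crictl inspect output shown earlier:

# Follow the etcd-member container log on the host, using the logPath from crictl inspect
[core@ip-10-0-7-211 ~]$ sudo tail -f "$(sudo crictl inspect 751acc223850b4 | jq -r .status.logPath)"

That should keep streaming right up until the member dies, without depending on the container still running.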
I think the nodes are rebooting because of openshift/machine-config-operator#224 (comment)
Also, I'd missed that timing earlier. That means that the later failures were after the reboot. Or maybe they were just noticed after the reboot? Checking on that: although I don't see 4c18a3e330ad7f208 in the
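One way to run that sort of check is to grep a run's artifacts for the ID; a hedged sketch using the build-log URL quoted later in this thread (build 2065 of this PR's e2e-aws job), with the ID from the comment above:

# Count occurrences of the ID in that run's build log (zero means it never appears there)
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/844/pull-ci-openshift-installer-master-e2e-aws/2065/build-log.txt | grep -c 4c18a3e330ad7f208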
Closing this in favor of openshift/cluster-dns-operator#63 |
It would be useful to straw-man three independent PRs:
1. One that bumps RAM only (2x)
2. One that bumps CPU only (2x)
3. One that bumps disk only (this)
to see whether any of the three reduces flakes when we test multiple times (a rough counting sketch follows below).
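One rough way to compare flake rates across repeated runs of each PR would be to count failed e2e tests per build log. This is only a sketch: the URL pattern is the one from [1] in the PR description, and the build numbers are placeholders to substitute per PR and per run.

# Count lines like 'failed: (1m22s) ... "[sig-scheduling] ..."' in each run's build log
base=https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/844/pull-ci-openshift-installer-master-e2e-aws
for build in 2064 2065; do   # placeholder build numbers
  echo -n "$build: "
  curl -s "$base/$build/build-log.txt" | grep -c '^failed:'
done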
On Dec 8, 2018, at 3:03 AM, W. Trevor King <notifications@github.com> wrote:
First e2e-aws failure <https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/844/pull-ci-openshift-installer-master-e2e-aws/2065/build-log.txt> was:

Dec 8 07:20:16.538: INFO: Couldn't delete ns: "e2e-tests-limitrange-q8j4c": Error while fetching pod metrics for selector in namespace "e2e-tests-limitrange-q8j4c": no pods to fetch metrics for (&errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:""}, Status:"Failure", Message:"Error while fetching pod metrics for selector in namespace \"e2e-tests-limitrange-q8j4c\": no pods to fetch metrics for", Reason:"", Details:(*v1.StatusDetails)(nil), Code:500}})
Dec 8 07:20:16.541: INFO: Running AfterSuite actions on all node
Dec 8 07:20:16.541: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:319]: Dec 8 07:20:16.538: Couldn't delete ns: "e2e-tests-limitrange-q8j4c": Error while fetching pod metrics for selector in namespace "e2e-tests-limitrange-q8j4c": no pods to fetch metrics for (&errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:""}, Status:"Failure", Message:"Error while fetching pod metrics for selector in namespace \"e2e-tests-limitrange-q8j4c\": no pods to fetch metrics for", Reason:"", Details:(*v1.StatusDetails)(nil), Code:500}})

Dec 08 07:19:51.571 I ns=openshift-monitoring deployment=prometheus-operator Scaled up replica set prometheus-operator-bb4d58c75 to 1
Dec 08 07:19:52.817 I ns=openshift-monitoring replicaset=prometheus-operator-bb4d58c75 Created pod: prometheus-operator-bb4d58c75-97fpg
Dec 08 07:19:52.826 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Successfully assigned openshift-monitoring/prometheus-operator-bb4d58c75-97fpg to ip-10-0-174-141.ec2.internal
Dec 08 07:19:55.676 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg pulling image "quay.io/coreos/prometheus-operator:v0.26.0"
Dec 08 07:19:58.683 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Successfully pulled image "quay.io/coreos/prometheus-operator:v0.26.0"
Dec 08 07:19:59.222 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Created container
Dec 08 07:19:59.367 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Started container
Dec 08 07:19:59.974 I ns=openshift-monitoring deployment=prometheus-operator Scaled down replica set prometheus-operator-8dd74c6f to 0
Dec 08 07:20:00.013 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal graceful deletion within 30s
Dec 08 07:20:00.079 I ns=openshift-monitoring replicaset=prometheus-operator-8dd74c6f Deleted pod: prometheus-operator-8dd74c6f-hll86
Dec 08 07:20:00.698 I ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 Killing container with id cri-o://prometheus-operator:Need to kill Pod
Dec 08 07:20:08.689 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal invariant violation (bug): pod should not transition Running->Pending even when terminated
Dec 08 07:20:08.689 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal container=prometheus-operator container stopped being ready
Dec 08 07:20:09.046 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal deleted
failed: (1m22s) 2018-12-08T07:20:16 "[sig-scheduling] LimitRange should create a LimitRange with defaults and ensure pod has those defaults applied. [Suite:openshift/conformance/parallel] [Suite:k8s] [Suite:openshift/smoke-4]"
connection with peer 36dea2cf9cf8b40f (stream Message
reader)2018-12-08 07:04:36.670452 E | rafthttp: failed to read
36dea2cf9cf8b40f on stream Message (unexpected EOF)2018-12-08
07:04:36.674767 I | rafthttp: peer 36dea2cf9cf8b40f became
inactive2018-12-08 07:04:36.676627 W | rafthttp: lost the TCP
streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2
reader)2018-12-08 07:04:36.692527 I | raft: 2b82b9112d5c6d00 is
starting a new election at term 112018-12-08 07:04:36.692557 I | raft:
2b82b9112d5c6d00 became candidate at term 122018-12-08 07:04:36.692569
I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00
at term 122018-12-08 07:04:36.692581 I | raft: 2b82b9112d5c6d00
[logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at
term 122018-12-08 07:04:36.692592 I | raft: 2b82b9112d5c6d00 [logterm:
11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term
122018-12-08 07:04:36.692603 I | raft: raft.node: 2b82b9112d5c6d00
lost leader 36dea2cf9cf8b40f at term 122018-12-08 07:04:37.465281 W |
rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860
(stream Message writer)2018-12-08 07:04:38.189677 I | raft:
2b82b9112d5c6d00 is starting a new election at term 122018-12-08
07:04:38.189807 I | raft: 2b82b9112d5c6d00 became candidate at term
132018-12-08 07:04:38.189862 I | raft: 2b82b9112d5c6d00 received
MsgVoteResp from 2b82b9112d5c6d00 at term 132018-12-08 07:04:38.189910
I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote
request to 45ac11b837bb4860 at term 132018-12-08 07:04:38.189955 I |
raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request
to 36dea2cf9cf8b40f at term 132018-12-08 07:04:38.193365 W | rafthttp:
lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream
Message writer)2018-12-08 07:04:38.584175 W | rafthttp: lost the TCP
streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2
writer)2018-12-08 07:04:39.789772 I | raft: 2b82b9112d5c6d00 is
starting a new election at term 132018-12-08 07:04:39.790005 I | raft:
2b82b9112d5c6d00 became candidate at term 142018-12-08 07:04:39.790057
I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00
at term 142018-12-08 07:04:39.790105 I | raft: 2b82b9112d5c6d00
[logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at
term 142018-12-08 07:04:39.790177 I | raft: 2b82b9112d5c6d00 [logterm:
11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term
142018-12-08 07:04:40.237211 W | rafthttp: lost the TCP streaming
connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 writer)
So there are still a few slow-disk errors like:
2018-12-08 06:55:58.886844 W | wal: sync duration of 1.404630213s, expected less than 1s
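If the disk really is the bottleneck, it should show up in fsync latency directly. A quick way to check (just a sketch, assuming fio 3.5+ is available on the master and that /var/lib/etcd lives on the root volume) is the usual etcd disk check with fio, writing small blocks with an fdatasync after each, the way etcd's WAL does:
$ sudo mkdir /var/lib/etcd/fio-test  # hypothetical scratch dir on the etcd volume
$ sudo fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-wal-check
The fdatasync latency percentiles in fio's output are the numbers to watch; etcd's tuning guidance wants the 99th percentile well under 10ms, let alone the 1s threshold that triggers the wal warnings above.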
Maybe the disk access is hitting CPU limitations? I dunno. But things seem
to be moving along sustainably until around 07:03, after which we start to
see lots of networking/connectivity issues. Next time I go through this,
I'll try to grab etcd-member logs from the other masters as well.
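Something like this could pull the log tails from all three masters in one pass (a sketch; only 10.0.6.251 is a real address from this cluster, the other IPs are made up, and it assumes SSH access as core to each master):
$ for ip in 10.0.6.251 10.0.1.2 10.0.2.3; do
    ssh "core@${ip}" \
      'sudo crictl ps -a --name etcd-member -q | xargs -r -n1 sudo crictl logs --tail=50' \
      >"etcd-member-${ip}.log" 2>&1
  done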
Also, in case the whole etcd-member restart is a red herring, I've started looking at the end of the Exited queue too. Some thoughts there over in openshift/machine-config-operator#224.
CC @smarterclayton, @crawford, @abhinavdahiya, @deads2k, @sttts, and anyone else who understands how this is supposed to be working and can separate the fatal stuff from the log noise. Thoughts on what is happening? Anything else I should grab next time?
We'd bumped this to 120 GiB in bca6a92 (#737), but in a recent CI cluster the master-0 etcd-member still complained about sync duration (see the wal: sync duration warnings in the logs above).
I couldn't find the termination log (it may have been reaped when the container went down, which wouldn't be very helpful). And I haven't been able to discover the significance, if any, of the 255 exit code. I'm also not clear on why the new pod shows a different image (94bc... vs. the original's quay.io/coreos/etcd@sha256:688e...).
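One way to chase the image mismatch (a sketch; crictl inspecti takes an image ID, and the jq filter assumes the usual inspecti output layout) is to resolve the bare ID from crictl ps back to its repo tags and digests:
$ image_id=...  # the IMAGE column value from 'sudo crictl ps -a' for the new pod
$ sudo crictl inspecti "${image_id}" | jq '.status | {repoTags, repoDigests}'
If the two strings resolve to the same digest, the mismatch is just display (image ID vs. pull ref) rather than an actual pull of a different image.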
On gp2's sliding IOPS scale, 500 GiB gets us a lot more headroom (a 1.5k IOPS baseline), but the etcd docs recommend m4.large with max 3.6k concurrent IOPS for a 50-node kubernetes cluster. So this is basically a shot in the dark, but we'll see if it helps in CI.
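For the arithmetic behind that baseline figure: gp2 provisions 3 IOPS per GiB (with a 100 IOPS floor), so the bump works out as:
$ echo "120 GiB -> $((120 * 3)) IOPS; 500 GiB -> $((500 * 3)) IOPS"
120 GiB -> 360 IOPS; 500 GiB -> 1500 IOPS
gp2 volumes under 1 TiB can also burst to 3k IOPS while they have credit, which may be why the smaller disk looks fine early in a cluster's life and degrades once the burst bucket drains.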