Conversation

wking commented Dec 8, 2018

We'd bumped this to 120 GiB in bca6a92 (#737), but in a recent CI cluster [1] the master-0 etcd-member still complained about sync duration:

[core@ip-10-0-6-251 ~]$ sudo crictl ps -a | grep etcd-member
4612be4ebdd15  94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4                             27 minutes ago  Running  etcd-member  1
81b927a307579  quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1  43 minutes ago  Exited   etcd-member  0
[core@ip-10-0-6-251 ~]$ sudo crictl inspect 81b927a307579
{
  "status": {
    "id": "81b927a307579bab5675851d1ffa3e37c288a252c125ffe0a9e7b4a772fae0a7",
    "metadata": {
      "attempt": 0,
      "name": "etcd-member"
    },
    "state": "CONTAINER_EXITED",
    "createdAt": "2018-12-08T05:36:05.119762774Z",
    "startedAt": "2018-12-08T05:36:05.143173509Z",
    "finishedAt": "2018-12-08T05:51:04.889449401Z",
    "exitCode": 255,
    "image": {
      "image": "quay.io/coreos/etcd:v3.2.14"
    },
    "imageRef": "quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1",
    "reason": "Error",
    "message": "",
    "labels": {
      "io.kubernetes.container.name": "etcd-member",
      "io.kubernetes.pod.name": "etcd-member-ip-10-0-6-251.ec2.internal",
      "io.kubernetes.pod.namespace": "kube-system",
      "io.kubernetes.pod.uid": "73a817bd724bb22e2bd6e4ff0ccee478"
    },
    "annotations": {
      "io.kubernetes.container.hash": "f3672bc7",
      "io.kubernetes.container.ports": "[{\"name\":\"peer\",\"hostPort\":2380,\"containerPort\":2380,\"protocol\":\"TCP\"},{\"name\":\"server\",\"hostPort\":2379,\"containerPort\":2379,\"protocol\":\"TCP\"}]",
      "io.kubernetes.container.restartCount": "0",
      "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
      "io.kubernetes.container.terminationMessagePolicy": "File",
      "io.kubernetes.pod.terminationGracePeriod": "30"
    },
    "mounts": [],
    "logPath": "/var/log/pods/73a817bd724bb22e2bd6e4ff0ccee478/etcd-member/0.log"
  }
}
[core@ip-10-0-6-251 ~]$ sudo crictl logs --tail=10 81b927a307579
2018-12-08 05:48:01.220052 W | wal: sync duration of 1.551958035s, expected less than 1s
2018-12-08 05:48:05.013263 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 11; CANCEL")
2018-12-08 05:48:12.491950 W | wal: sync duration of 1.756141212s, expected less than 1s
2018-12-08 05:49:04.229575 W | etcdserver: apply entries took too long [120.740107ms for 1 entries]
2018-12-08 05:49:04.229602 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 05:49:04.712619 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 05:49:06.346062 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 05:49:06.722589 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 05:49:11.048170 W | etcdserver: apply entries took too long [151.877306ms for 1 entries]
2018-12-08 05:49:11.048202 W | etcdserver: avoid queries with large range/delete range!

I couldn't find the termination log (maybe it got reaped when the container went down, which wouldn't be very helpful). And I haven't been able to discover the significance, if any, of the 255 exit code. I'm also not clear on why the new pod has a different image (94bc... vs. the original's quay.io/coreos/etcd@sha256:688e...).
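One check I could run next time (a guess, not something confirmed here) is whether that bare 94bc... ID is just the locally-stored image ID for the same quay.io/coreos/etcd digest:

  sudo crictl images
  sudo crictl inspecti 94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4 | grep -i -A2 repodigest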

On gp2's sliding IOPS scale [2], 500 GiB gets us a lot more headroom (a 1.5k IOPS baseline), but the etcd docs recommend m4.large with a maximum of 3.6k concurrent IOPS for a 50-node Kubernetes cluster [3]. So this is basically a shot in the dark, but we'll see if it helps in CI.
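The back-of-the-envelope math behind that 1.5k number, assuming gp2's published baseline of roughly 3 IOPS per GiB with a 100 IOPS floor [2] (my own sketch, not part of this change):

  # gp2 baseline IOPS scales with volume size: ~3 IOPS/GiB, floored at 100.
  for size_gib in 120 500; do
    iops=$((size_gib * 3))
    [ "$iops" -lt 100 ] && iops=100
    echo "${size_gib} GiB -> ${iops} baseline IOPS"
  done
  # 120 GiB -> 360 baseline IOPS
  # 500 GiB -> 1500 baseline IOPS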

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/841/pull-ci-openshift-installer-master-e2e-aws/2064/build-log.txt
[2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#EBSVolumeTypes_gp2
[3]: https://github.com/etcd-io/etcd/blob/v3.3.10/Documentation/op-guide/hardware.md#small-cluster
openshift-ci-robot commented Dec 8, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved and size/XS labels on Dec 8, 2018
openshift-ci-robot commented Dec 8, 2018

@wking: The following test failed, say /retest to rerun them all:

Test name        Commit   Details  Rerun command
ci/prow/e2e-aws  18cdb69  link     /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


wking commented Dec 8, 2018

The first e2e-aws failure was:

Dec  8 07:20:16.538: INFO: Couldn't delete ns: "e2e-tests-limitrange-q8j4c": Error while fetching pod metrics for selector  in namespace "e2e-tests-limitrange-q8j4c": no pods to fetch metrics for (&errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:""}, Status:"Failure", Message:"Error while fetching pod metrics for selector  in namespace \"e2e-tests-limitrange-q8j4c\": no pods to fetch metrics for", Reason:"", Details:(*v1.StatusDetails)(nil), Code:500}})
Dec  8 07:20:16.541: INFO: Running AfterSuite actions on all node
Dec  8 07:20:16.541: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:319]: Dec  8 07:20:16.538: Couldn't delete ns: "e2e-tests-limitrange-q8j4c": Error while fetching pod metrics for selector  in namespace "e2e-tests-limitrange-q8j4c": no pods to fetch metrics for (&errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:""}, Status:"Failure", Message:"Error while fetching pod metrics for selector  in namespace \"e2e-tests-limitrange-q8j4c\": no pods to fetch metrics for", Reason:"", Details:(*v1.StatusDetails)(nil), Code:500}})

Dec 08 07:19:51.571 I ns=openshift-monitoring deployment=prometheus-operator Scaled up replica set prometheus-operator-bb4d58c75 to 1
Dec 08 07:19:52.817 I ns=openshift-monitoring replicaset=prometheus-operator-bb4d58c75 Created pod: prometheus-operator-bb4d58c75-97fpg
Dec 08 07:19:52.826 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Successfully assigned openshift-monitoring/prometheus-operator-bb4d58c75-97fpg to ip-10-0-174-141.ec2.internal
Dec 08 07:19:55.676 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg pulling image "quay.io/coreos/prometheus-operator:v0.26.0"
Dec 08 07:19:58.683 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Successfully pulled image "quay.io/coreos/prometheus-operator:v0.26.0"
Dec 08 07:19:59.222 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Created container
Dec 08 07:19:59.367 I ns=openshift-monitoring pod=prometheus-operator-bb4d58c75-97fpg Started container
Dec 08 07:19:59.974 I ns=openshift-monitoring deployment=prometheus-operator Scaled down replica set prometheus-operator-8dd74c6f to 0
Dec 08 07:20:00.013 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal graceful deletion within 30s
Dec 08 07:20:00.079 I ns=openshift-monitoring replicaset=prometheus-operator-8dd74c6f Deleted pod: prometheus-operator-8dd74c6f-hll86
Dec 08 07:20:00.698 I ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 Killing container with id cri-o://prometheus-operator:Need to kill Pod
Dec 08 07:20:08.689 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal invariant violation (bug): pod should not transition Running->Pending even when terminated
Dec 08 07:20:08.689 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal container=prometheus-operator container stopped being ready
Dec 08 07:20:09.046 W ns=openshift-monitoring pod=prometheus-operator-8dd74c6f-hll86 node=ip-10-0-153-188.ec2.internal deleted

failed: (1m22s) 2018-12-08T07:20:16 "[sig-scheduling] LimitRange should create a LimitRange with defaults and ensure pod has those defaults applied. [Suite:openshift/conformance/parallel] [Suite:k8s] [Suite:openshift/smoke-4]"

But the underlying issue was master instability. Here's when the initial round of pods died:

[core@ip-10-0-7-211 ~]$ sudo crictl ps -a
CONTAINER ID        IMAGE                                                                                                                          CREATED             STATE               NAME                          ATTEMPT
6413013f1b203       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc    9 seconds ago       Running             operator                      2
434fa05a795cb       f6df05d6a36426dde85b9c778b435a7aaa12543d69d603f629b8f5273356ec7b                                                               13 seconds ago      Running             cluster-dns-operator          1
9c3e0b96b8325       f06c190859935d127c2efee77beb689fbacb53ec93b88547c25392fc970289f7                                                               14 seconds ago      Running             operator                      1
ebc194bc9b966       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:6667ac4aecae183dfd4e6ae4277dd86ca977e0a3b9feefee653043105503c6d6    18 seconds ago      Exited              tuned                         1
011545eb60059       05503aa686767edf45b70172c8975c8b9743bb6a6c1c182c181eb36cd610f6fc                                                               18 seconds ago      Running             machine-config-server         1
d8aa7ead7c3dc       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:581af93fda80257651d621dade879e438f846a5bf39040dd0259006fc3b73820    18 seconds ago      Running             machine-approver-controller   1
413a0cca0629b       bb84dbdafdfa61c29ca989c652d63debe03a043c2099f6ad7097ac28865bd406                                                               20 seconds ago      Running             cluster-network-operator      1
c0ace414b0ce2       25dbbded706585310ae3ebc743bcc59659f826fff1bac24be4a56b83d37e3cc2                                                               25 seconds ago      Running             machine-config-daemon         1
a2e1ef4a0c97b       03de8f11d9e07ee2b23be6d48dc849b9a5e24e4ab4c3ab758bdcd583b3b8fbd9                                                               27 seconds ago      Running             sdn-controller                1
c08642e38da06       registry.svc.ci.openshift.org/ci-op-1mpypn4i/release@sha256:77dd81dbdb38c941fc288f551f39ddef1de251384cbfb8f6755ff7f072ab9a13   27 seconds ago      Running             cluster-version-operator      1
8f80023565506       1d2ec4ba1e697f9c0eb69c154888e6f09007a3d2aad4c34bb7868cec86b8f8f8                                                               28 seconds ago      Running             sdn                           1
fb31831e0665c       1d2ec4ba1e697f9c0eb69c154888e6f09007a3d2aad4c34bb7868cec86b8f8f8                                                               28 seconds ago      Running             openvswitch                   1
89155ae218212       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98    30 seconds ago      Running             openshift-kube-apiserver      1
edafa7d07d214       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315    30 seconds ago      Running             kube-controller-manager       1
abab48f8aaf0e       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315    30 seconds ago      Running             scheduler                     1
b735c768b8baf       94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4                                                               38 seconds ago      Running             etcd-member                   1
b79135e346749       b02de22ff740f0bfa7e5dde5aa1a8169051375a5f0c69c28fafefc9408f72b06                                                               39 seconds ago      Exited              certs                         0
04b4489ae42e3       04a052dbf6cb5ac2afa57eb41c37a2964ee16c7ee62986900aceb38f369c8411                                                               39 seconds ago      Exited              discovery                     1
1e38cb58d5063       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:14b67d7a5d1ec05dd45e60663b6e4e0c460cf7f429397dd3a3ec2d1997e52096    2 minutes ago       Exited              machine-config-daemon         0
e8715191161af       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:08fbd1a5a90b59572a3b097389fc17f7ae9b9b1ef7e1f3d19103e280fc656518    2 minutes ago       Exited              console                       0
9c19639ae072d       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:9ecf75c8384f5ec5431b4a3267466833c1caa212a183ce8b3e7f42bc3f7e1dcc    3 minutes ago       Exited              machine-config-server         0
439c340ab23ce       03de8f11d9e07ee2b23be6d48dc849b9a5e24e4ab4c3ab758bdcd583b3b8fbd9                                                               3 minutes ago       Exited              controller-manager            0
a12b3df1bae1f       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98    4 minutes ago       Exited              openshift-kube-apiserver      0
74d8d492a43db       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98    4 minutes ago       Exited              openshift-apiserver           0
98c47bd0e9246       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc    4 minutes ago       Exited              installer                     0
6c15d73ec3e14       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:c8c110b8733d0d352ddc5fe35ba9eeac913b7609c2c9c778586f2bb74f281681    4 minutes ago       Exited              registry                      0
7796346af6b01       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:581af93fda80257651d621dade879e438f846a5bf39040dd0259006fc3b73820    5 minutes ago       Exited              machine-approver-controller   0
afb16e1bdfea8       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:40fa0e40cf625f314586d47a4199d5de985b7f980d7c4b650ab7f2c0f74f41b2    5 minutes ago       Exited              registry-ca-hostmapper        0
8aff3b343e0dc       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc    6 minutes ago       Exited              operator                      1
158f097ba79cb       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315    7 minutes ago       Exited              kube-controller-manager       0
128c03f9fa9a8       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:38d43ca65fce090c19b092b1c0962781c146f9fc65f3228eb96f5aad684c9119    7 minutes ago       Exited              installer                     0
bf6ac2a6570f2       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:4d0106d7428828c87ed905728742fbc11bd8b30d0c87165359699d0a475e2315    8 minutes ago       Exited              scheduler                     0
b43c90d9cebb6       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:b532d9351b803b5a03cf6777f12d725b4973851957497ea3e2b37313aadd6750    8 minutes ago       Exited              operator                      0
4f333ad437f34       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:a494ee2152687973d1987598cf82f7606b2c4fbb48c7f09f2e897cb417ab88f1    8 minutes ago       Exited              installer                     0
28eed681186eb       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:299840e1ef37549af1c6bb9b45ed1f4eb48ca51b0384772709ce88e3d9d60bfc    8 minutes ago       Exited              installer                     0
c0d2a89890a29       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:ce86e514320b680f39735323288cfd19caee5a9480b086b4b275454aef94136e    8 minutes ago       Exited              dns-node-resolver             0
d244a6189ea7b       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:e4936a702d7d466a64a6a9359f35c7ad528bba7c35fe5c582a90e46f9051d8b8    8 minutes ago       Exited              dns                           0
7e12941ef62cd       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:3287ba30af508652bda691490ad7dbd97febf4e90b72e23228f87c9038fc387e    9 minutes ago       Exited              operator                      0
9f0bcda47b3bc       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f538d65377e0891e70555009f4778c0b1782360ea0f4309adec905cad752593d    9 minutes ago       Exited              cluster-dns-operator          0
dc6d7fba5de79       registry.svc.ci.openshift.org/ci-op-1mpypn4i/release@sha256:77dd81dbdb38c941fc288f551f39ddef1de251384cbfb8f6755ff7f072ab9a13   9 minutes ago       Exited              cluster-version-operator      0
b46b53bc15b68       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:0f51e8c6713cf23fac9b4b61d3e10e453936c139ee9a58171090b5ffe7cd37ae    9 minutes ago       Exited              openvswitch                   0
98d49e2e52366       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:0f51e8c6713cf23fac9b4b61d3e10e453936c139ee9a58171090b5ffe7cd37ae    9 minutes ago       Exited              sdn                           0
62fe7aeacb776       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:f82a3c247a4c59538a3d40ad1a2257383420440e15c4675b2e11ad620601bf98    10 minutes ago      Exited              sdn-controller                0
e2026a15792b4       registry.svc.ci.openshift.org/ci-op-1mpypn4i/stable@sha256:a8aa3e53cbaeae806210878f0c7b499b636a963b2a52f4d1eea6db3dfa2fdc98    11 minutes ago      Exited              cluster-network-operator      0
751acc223850b       quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1                                    13 minutes ago      Exited              etcd-member                   0

Timing on that first etcd-member pod:

[core@ip-10-0-7-211 ~]$ sudo crictl inspect 751acc223850b4
{
  "status": {
    "id": "751acc223850b440d265c509f54b4b845b532c5fcbe53f4e6107486aaf81a67f",
    "metadata": {
      "attempt": 0,
      "name": "etcd-member"
    },
    "state": "CONTAINER_EXITED",
    "createdAt": "2018-12-08T06:52:46.817382964Z",
    "startedAt": "2018-12-08T06:52:46.839814886Z",
    "finishedAt": "2018-12-08T07:07:43.572403945Z",
    "exitCode": 255,
    "image": {
      "image": "quay.io/coreos/etcd:v3.2.14"
    },
    "imageRef": "quay.io/coreos/etcd@sha256:688e6c102955fe927c34db97e6352d0e0962554735b2db5f2f66f3f94cfe8fd1",
    "reason": "Error",
    "message": "",
    "labels": {
      "io.kubernetes.container.name": "etcd-member",
      "io.kubernetes.pod.name": "etcd-member-ip-10-0-7-211.ec2.internal",
      "io.kubernetes.pod.namespace": "kube-system",
      "io.kubernetes.pod.uid": "b691fb2b768f9a08bdaac585b1e6cec4"
    },
    "annotations": {
      "io.kubernetes.container.hash": "c47b4da2",
      "io.kubernetes.container.ports": "[{\"name\":\"peer\",\"hostPort\":2380,\"containerPort\":2380,\"protocol\":\"TCP\"},{\"name\":\"server\",\"hostPort\":2379,\"containerPort\":2379,\"protocol\":\"TCP\"}]",
      "io.kubernetes.container.restartCount": "0",
      "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
      "io.kubernetes.container.terminationMessagePolicy": "File",
      "io.kubernetes.pod.terminationGracePeriod": "30"
    },
    "mounts": [],
    "logPath": "/var/log/pods/b691fb2b768f9a08bdaac585b1e6cec4/etcd-member/0.log"
  }
}

Again, no luck with the termination log:

[core@ip-10-0-7-211 ~]$ sudo crictl exec 751acc223850b4 cat /dev/termination-log
FATA[0000] execing command in container failed: Internal error occurred: error executing command in container: container is not created or running

Maybe I can tail -f that file next time.
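Something like this, perhaps (untested; it assumes the etcd-member container is still running when I get onto the box):

  # Follow the termination log while the container is still up, so the contents
  # survive in my terminal even after the container exits and gets reaped.
  CID=$(sudo crictl ps | awk '/etcd-member/ {print $1}')
  sudo crictl exec -it "$CID" tail -f /dev/termination-log

The etcd-member logs: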

[core@ip-10-0-7-211 ~]$ sudo crictl logs 751acc223850b4
2018-12-08 06:52:46.851656 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2018-12-08 06:52:46.851995 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd-member-ip-10-0-7-211.ec2.internal
2018-12-08 06:52:46.852129 I | etcdmain: etcd Version: 3.2.14
2018-12-08 06:52:46.852185 I | etcdmain: Git SHA: fb5cd6f1c
2018-12-08 06:52:46.852232 I | etcdmain: Go Version: go1.8.5
2018-12-08 06:52:46.852342 I | etcdmain: Go OS/Arch: linux/amd64
2018-12-08 06:52:46.852357 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-12-08 06:52:46.852526 I | embed: peerTLS: cert = /etc/ssl/etcd/system:etcd-peer:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.crt, key = /etc/ssl/etcd/system:etcd-peer:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
2018-12-08 06:52:46.853286 I | embed: listening for peers on https://0.0.0.0:2380
2018-12-08 06:52:46.853396 I | embed: listening for client requests on 0.0.0.0:2379
2018-12-08 06:52:46.862199 N | embed: got bootstrap from DNS for etcd-server at 0=https://ci-op-1mpypn4i-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.862232 N | embed: got bootstrap from DNS for etcd-server at etcd-member-ip-10-0-7-211.ec2.internal=https://ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.862261 N | embed: got bootstrap from DNS for etcd-server at 1=https://ci-op-1mpypn4i-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.865255 I | pkg/netutil: resolving ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380 to 10.0.7.211:2380
2018-12-08 06:52:46.886503 I | etcdserver: name = etcd-member-ip-10-0-7-211.ec2.internal
2018-12-08 06:52:46.886520 I | etcdserver: data dir = /var/lib/etcd
2018-12-08 06:52:46.886529 I | etcdserver: member dir = /var/lib/etcd/member
2018-12-08 06:52:46.886534 I | etcdserver: heartbeat = 100ms
2018-12-08 06:52:46.886538 I | etcdserver: election = 1000ms
2018-12-08 06:52:46.886543 I | etcdserver: snapshot count = 100000
2018-12-08 06:52:46.886554 I | etcdserver: advertise client URLs = https://10.0.7.211:2379
2018-12-08 06:52:46.886561 I | etcdserver: initial advertise peer URLs = https://10.0.7.211:2380
2018-12-08 06:52:46.886582 I | etcdserver: initial cluster = 0=https://ci-op-1mpypn4i-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com:2380,1=https://ci-op-1mpypn4i-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com:2380,etcd-member-ip-10-0-7-211.ec2.internal=https://ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380
2018-12-08 06:52:46.889270 I | etcdserver: starting member 2b82b9112d5c6d00 in cluster 3cb94e8b51630707
2018-12-08 06:52:46.889305 I | raft: 2b82b9112d5c6d00 became follower at term 0
2018-12-08 06:52:46.889322 I | raft: newRaft 2b82b9112d5c6d00 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-12-08 06:52:46.889329 I | raft: 2b82b9112d5c6d00 became follower at term 1
2018-12-08 06:52:46.894760 W | auth: simple token is not cryptographically signed
2018-12-08 06:52:46.898690 I | rafthttp: starting peer 36dea2cf9cf8b40f...
2018-12-08 06:52:46.898738 I | rafthttp: started HTTP pipelining with peer 36dea2cf9cf8b40f
2018-12-08 06:52:46.902994 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (writer)
2018-12-08 06:52:46.903066 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (writer)
2018-12-08 06:52:46.906014 I | rafthttp: started peer 36dea2cf9cf8b40f
2018-12-08 06:52:46.906107 I | rafthttp: added peer 36dea2cf9cf8b40f
2018-12-08 06:52:46.906202 I | rafthttp: starting peer 45ac11b837bb4860...
2018-12-08 06:52:46.906275 I | rafthttp: started HTTP pipelining with peer 45ac11b837bb4860
2018-12-08 06:52:46.911971 I | rafthttp: started peer 45ac11b837bb4860
2018-12-08 06:52:46.912055 I | rafthttp: added peer 45ac11b837bb4860
2018-12-08 06:52:46.912131 I | etcdserver: starting server... [version: 3.2.14, cluster version: to_be_decided]
2018-12-08 06:52:46.914329 I | embed: ClientTLS: cert = /etc/ssl/etcd/system:etcd-server:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.crt, key = /etc/ssl/etcd/system:etcd-server:ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
2018-12-08 06:52:46.915682 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (stream MsgApp v2 reader)
2018-12-08 06:52:46.916041 I | rafthttp: started streaming with peer 36dea2cf9cf8b40f (stream Message reader)
2018-12-08 06:52:46.916369 I | rafthttp: started streaming with peer 45ac11b837bb4860 (writer)
2018-12-08 06:52:46.916446 I | rafthttp: started streaming with peer 45ac11b837bb4860 (writer)
2018-12-08 06:52:46.916534 I | rafthttp: started streaming with peer 45ac11b837bb4860 (stream MsgApp v2 reader)
2018-12-08 06:52:46.916920 I | rafthttp: started streaming with peer 45ac11b837bb4860 (stream Message reader)
2018-12-08 06:52:46.918644 I | rafthttp: peer 36dea2cf9cf8b40f became active
2018-12-08 06:52:46.918724 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message writer)
2018-12-08 06:52:46.919079 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 writer)
2018-12-08 06:52:46.919611 I | etcdserver/membership: added member 2b82b9112d5c6d00 [https://ci-op-1mpypn4i-1d3f3-etcd-0.origin-ci-int-aws.dev.rhcloud.com:2380] to cluster 3cb94e8b51630707
2018-12-08 06:52:46.919923 I | etcdserver/membership: added member 36dea2cf9cf8b40f [https://ci-op-1mpypn4i-1d3f3-etcd-1.origin-ci-int-aws.dev.rhcloud.com:2380] to cluster 3cb94e8b51630707
2018-12-08 06:52:46.920190 I | etcdserver/membership: added member 45ac11b837bb4860 [https://ci-op-1mpypn4i-1d3f3-etcd-2.origin-ci-int-aws.dev.rhcloud.com:2380] to cluster 3cb94e8b51630707
2018-12-08 06:52:46.940293 I | raft: 2b82b9112d5c6d00 [term: 1] received a MsgHeartbeat message with higher term from 45ac11b837bb4860 [term: 4]
2018-12-08 06:52:46.940544 I | raft: 2b82b9112d5c6d00 became follower at term 4
2018-12-08 06:52:46.940605 I | raft: raft.node: 2b82b9112d5c6d00 elected leader 45ac11b837bb4860 at term 4
2018-12-08 06:52:46.956645 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message reader)
2018-12-08 06:52:46.961710 I | rafthttp: peer 45ac11b837bb4860 became active
2018-12-08 06:52:46.961796 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 reader)
2018-12-08 06:52:46.962264 I | rafthttp: established a TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 reader)
2018-12-08 06:52:46.979561 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream Message writer)
2018-12-08 06:52:46.980665 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream Message reader)
2018-12-08 06:52:46.981517 I | rafthttp: established a TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 writer)
2018-12-08 06:52:46.983277 N | etcdserver/membership: set the initial cluster version to 3.0
2018-12-08 06:52:46.983377 I | etcdserver/api: enabled capabilities for version 3.0
2018-12-08 06:52:46.984458 I | etcdserver: published {Name:etcd-member-ip-10-0-7-211.ec2.internal ClientURLs:[https://10.0.7.211:2379]} to cluster 3cb94e8b51630707
2018-12-08 06:52:46.984611 I | embed: ready to serve client requests
2018-12-08 06:52:46.984920 I | embed: serving client requests on [::]:2379
WARNING: 2018/12/08 06:52:46 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
2018-12-08 06:52:47.746760 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
2018-12-08 06:52:47.746875 I | etcdserver/api: enabled capabilities for version 3.2
2018-12-08 06:55:58.886844 W | wal: sync duration of 1.404630213s, expected less than 1s
2018-12-08 06:56:31.977480 W | etcdserver: apply entries took too long [579.627031ms for 1 entries]
2018-12-08 06:56:31.977630 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:56:39.778259 W | etcdserver: apply entries took too long [2.308713015s for 1 entries]
2018-12-08 06:56:39.778668 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:56:39.778807 W | wal: sync duration of 1.775918245s, expected less than 1s
2018-12-08 06:56:41.266582 W | wal: sync duration of 1.487652289s, expected less than 1s
2018-12-08 06:56:44.948229 W | etcdserver: apply entries took too long [3.681196669s for 4 entries]
2018-12-08 06:56:44.948254 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:56:45.075204 W | wal: sync duration of 3.808380532s, expected less than 1s
2018-12-08 06:56:59.277846 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1028, vote: 0] ignored MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 4: lease is not expired (remaining ticks: 10)
2018-12-08 06:57:00.232584 W | wal: sync duration of 3.045415715s, expected less than 1s
2018-12-08 06:57:00.898326 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] ignored MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 4: lease is not expired (remaining ticks: 10)
2018-12-08 06:57:01.425086 W | wal: sync duration of 1.192379521s, expected less than 1s
2018-12-08 06:57:02.198238 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] ignored MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 4: lease is not expired (remaining ticks: 9)
2018-12-08 06:57:02.617132 W | etcdserver: apply entries took too long [1.1915068s for 1 entries]
2018-12-08 06:57:02.617275 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:57:02.617319 W | wal: sync duration of 1.191904588s, expected less than 1s
2018-12-08 06:57:04.150989 I | raft: 2b82b9112d5c6d00 [term: 4] received a MsgVote message with higher term from 36dea2cf9cf8b40f [term: 8]
2018-12-08 06:57:04.151025 I | raft: 2b82b9112d5c6d00 became follower at term 8
2018-12-08 06:57:04.151039 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] rejected MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 8
2018-12-08 06:57:04.151049 I | raft: raft.node: 2b82b9112d5c6d00 lost leader 45ac11b837bb4860 at term 8
2018-12-08 06:57:04.725808 W | etcdserver: timed out waiting for read index response
2018-12-08 06:57:04.725861 I | raft: 2b82b9112d5c6d00 no leader at term 8; dropping index reading msg
2018-12-08 06:57:05.029466 I | raft: 2b82b9112d5c6d00 [term: 8] ignored a MsgVote message with lower term from 45ac11b837bb4860 [term: 6]
2018-12-08 06:57:05.939321 I | raft: 2b82b9112d5c6d00 [term: 8] received a MsgVote message with higher term from 36dea2cf9cf8b40f [term: 9]
2018-12-08 06:57:05.939378 I | raft: 2b82b9112d5c6d00 became follower at term 9
2018-12-08 06:57:05.939408 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] rejected MsgVote from 36dea2cf9cf8b40f [logterm: 4, index: 1028] at term 9
2018-12-08 06:57:07.421230 I | raft: 2b82b9112d5c6d00 [term: 9] received a MsgVote message with higher term from 45ac11b837bb4860 [term: 10]
2018-12-08 06:57:07.421286 I | raft: 2b82b9112d5c6d00 became follower at term 10
2018-12-08 06:57:07.421315 I | raft: 2b82b9112d5c6d00 [logterm: 4, index: 1029, vote: 0] cast MsgVote for 45ac11b837bb4860 [logterm: 4, index: 1029] at term 10
2018-12-08 06:57:07.423315 I | raft: raft.node: 2b82b9112d5c6d00 elected leader 45ac11b837bb4860 at term 10
2018-12-08 06:57:11.726065 W | etcdserver: timed out waiting for read index response
2018-12-08 06:57:52.861971 W | etcdserver: apply entries took too long [287.709698ms for 2 entries]
2018-12-08 06:57:52.861997 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 06:58:40.003521 W | wal: sync duration of 1.07701685s, expected less than 1s
2018-12-08 06:59:03.397005 W | wal: sync duration of 3.478824796s, expected less than 1s
2018-12-08 07:00:20.198294 I | raft: 2b82b9112d5c6d00 [term: 10] received a MsgVote message with higher term from 36dea2cf9cf8b40f [term: 11]
2018-12-08 07:00:20.198335 I | raft: 2b82b9112d5c6d00 became follower at term 11
2018-12-08 07:00:20.198352 I | raft: 2b82b9112d5c6d00 [logterm: 10, index: 4759, vote: 0] cast MsgVote for 36dea2cf9cf8b40f [logterm: 10, index: 4759] at term 11
2018-12-08 07:00:20.198364 I | raft: raft.node: 2b82b9112d5c6d00 lost leader 45ac11b837bb4860 at term 11
2018-12-08 07:00:20.200279 I | raft: raft.node: 2b82b9112d5c6d00 elected leader 36dea2cf9cf8b40f at term 11
2018-12-08 07:00:25.745499 W | etcdserver: timed out waiting for read index response
2018-12-08 07:02:54.605683 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:02:54.606681 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:02:54.606800 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:02:54.606871 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:03:22.447814 W | etcdserver: apply entries took too long [104.730051ms for 1 entries]
2018-12-08 07:03:22.447840 W | etcdserver: avoid queries with large range/delete range!
2018-12-08 07:04:16.093172 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:04:16.093373 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:04:16.093563 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 3; CANCEL")
2018-12-08 07:04:32.882512 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 reader)
2018-12-08 07:04:32.882638 E | rafthttp: failed to read 45ac11b837bb4860 on stream MsgApp v2 (unexpected EOF)
2018-12-08 07:04:32.882692 I | rafthttp: peer 45ac11b837bb4860 became inactive
2018-12-08 07:04:32.883003 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream Message reader)
2018-12-08 07:04:36.670428 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message reader)
2018-12-08 07:04:36.670452 E | rafthttp: failed to read 36dea2cf9cf8b40f on stream Message (unexpected EOF)
2018-12-08 07:04:36.674767 I | rafthttp: peer 36dea2cf9cf8b40f became inactive
2018-12-08 07:04:36.676627 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 reader)
2018-12-08 07:04:36.692527 I | raft: 2b82b9112d5c6d00 is starting a new election at term 11
2018-12-08 07:04:36.692557 I | raft: 2b82b9112d5c6d00 became candidate at term 12
2018-12-08 07:04:36.692569 I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00 at term 12
2018-12-08 07:04:36.692581 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term 12
2018-12-08 07:04:36.692592 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term 12
2018-12-08 07:04:36.692603 I | raft: raft.node: 2b82b9112d5c6d00 lost leader 36dea2cf9cf8b40f at term 12
2018-12-08 07:04:37.465281 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream Message writer)
2018-12-08 07:04:38.189677 I | raft: 2b82b9112d5c6d00 is starting a new election at term 12
2018-12-08 07:04:38.189807 I | raft: 2b82b9112d5c6d00 became candidate at term 13
2018-12-08 07:04:38.189862 I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00 at term 13
2018-12-08 07:04:38.189910 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term 13
2018-12-08 07:04:38.189955 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term 13
2018-12-08 07:04:38.193365 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream Message writer)
2018-12-08 07:04:38.584175 W | rafthttp: lost the TCP streaming connection with peer 45ac11b837bb4860 (stream MsgApp v2 writer)
2018-12-08 07:04:39.789772 I | raft: 2b82b9112d5c6d00 is starting a new election at term 13
2018-12-08 07:04:39.790005 I | raft: 2b82b9112d5c6d00 became candidate at term 14
2018-12-08 07:04:39.790057 I | raft: 2b82b9112d5c6d00 received MsgVoteResp from 2b82b9112d5c6d00 at term 14
2018-12-08 07:04:39.790105 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 45ac11b837bb4860 at term 14
2018-12-08 07:04:39.790177 I | raft: 2b82b9112d5c6d00 [logterm: 11, index: 9956] sent MsgVote request to 36dea2cf9cf8b40f at term 14
2018-12-08 07:04:40.237211 W | rafthttp: lost the TCP streaming connection with peer 36dea2cf9cf8b40f (stream MsgApp v2 writer)

So there are still a few slow-disk errors like:

2018-12-08 06:55:58.886844 W | wal: sync duration of 1.404630213s, expected less than 1s

Maybe the disk access is hitting CPU limitations? I dunno. But things seem to be moving along sustainably until around 07:03, after which we start to see lots of networking/connectivity issues. Next time I go through this, I'll try to grab etcd-member logs from the other masters as well.
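A rough helper for that, sketched here with placeholder addresses (MASTER0_IP etc. would be the actual master IPs for whatever cluster I'm poking at):

  # Collect recent etcd-member logs from each master for later comparison.
  for master in MASTER0_IP MASTER1_IP MASTER2_IP; do
    ssh "core@${master}" 'for c in $(sudo crictl ps -a | grep etcd-member | cut -d" " -f1); do sudo crictl logs --tail=50 "$c"; done' \
      > "etcd-member-${master}.log" 2>&1
  done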

Also, in case the whole etcd-member restart is a red herring, I've started looking at the end of the Exited queue too. Some thoughts there over in openshift/machine-config-operator#224.

CC @smarterclayton, @crawford, @abhinavdahiya, @deads2k, @sttts, and anyone else who understands how this is supposed to be working and can separate the fatal stuff from the log noise. Thoughts on what is happening? Anything else I should grab next time?

abhinavdahiya commented Dec 8, 2018

@wking

I think the nodes are rebooting because of openshift/machine-config-operator#224 (comment)

journalctl --list-boot should show how many times our machines are rebooting.
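For example (each output line from --list-boot is one boot; nothing cluster-specific here):

  sudo journalctl --list-boot
  sudo journalctl --list-boot | wc -l   # number of boots the journal knows about
  sudo journalctl -b -1 -n 50           # tail of the previous boot, if there was one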


wking commented Dec 8, 2018

Also, here's journalctl output for the master discussed in the comment above. It looks like the reboot triggered at 07:04:28 is the likely source of etcd going down (the final etcd-member log entry from that pod is at 07:04:40):

Dec 08 07:04:15 ip-10-0-7-211 systemd[1]: Starting RPM-OSTree System Management Daemon...
Dec 08 07:04:15 ip-10-0-7-211 rpm-ostree[12945]: Reading config file '/etc/rpm-ostreed.conf'
Dec 08 07:04:15 ip-10-0-7-211 rpm-ostree[12945]: In idle state; will auto-exit in 62 seconds
Dec 08 07:04:15 ip-10-0-7-211 dbus[4090]: [system] Successfully activated service 'org.projectatomic.rpmostree1'
Dec 08 07:04:15 ip-10-0-7-211 systemd[1]: Started RPM-OSTree System Management Daemon.
Dec 08 07:04:15 ip-10-0-7-211 rpm-ostree[12945]: client(id:cli dbus:1.298 unit:crio-1e38cb58d5063c1b644050342ce2818b3ecfa0cc086863faa223c87b216c9b40.scope uid:0) added; new total=1
Dec 08 07:04:15 ip-10-0-7-211 rpm-ostree[12945]: client(id:cli dbus:1.298 unit:crio-1e38cb58d5063c1b644050342ce2818b3ecfa0cc086863faa223c87b216c9b40.scope uid:0) vanished; remaining=0
Dec 08 07:04:15 ip-10-0-7-211 rpm-ostree[12945]: In idle state; will auto-exit in 60 seconds
Dec 08 07:04:15 ip-10-0-7-211 hyperkube[4102]: E1208 07:04:15.894849    4102 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=253, ErrCode=NO_ERROR, debug=""
Dec 08 07:04:15 ip-10-0-7-211 hyperkube[4102]: E1208 07:04:15.896214    4102 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=253, ErrCode=NO_ERROR, debug=""
Dec 08 07:04:15 ip-10-0-7-211 hyperkube[4102]: E1208 07:04:15.896485    4102 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=253, ErrCode=NO_ERROR, debug=""
Dec 08 07:04:16 ip-10-0-7-211 kernel: type=1400 audit(1544252656.891:4): avc:  granted  { setsecparam } for  pid=12580 comm="tuned" scontext=system_u:system_r:spc_t:s0 tcontext=system_u:object_r:security_t:s0 tclass=security
Dec 08 07:04:24 ip-10-0-7-211 sudo[13012]:     core : TTY=pts/0 ; PWD=/var/home/core ; USER=root ; COMMAND=/bin/crictl logs 751acc223850b4
Dec 08 07:04:24 ip-10-0-7-211 sudo[13012]: pam_unix(sudo:session): session opened for user root by core(uid=0)
Dec 08 07:04:25 ip-10-0-7-211 sudo[13012]: pam_unix(sudo:session): session closed for user root
Dec 08 07:04:28 ip-10-0-7-211 hyperkube[4102]: E1208 07:04:28.884315    4102 file.go:109] Unable to process watch event: can't process config file "/etc/kubernetes/manifests/etcd-member.yaml": /etc/kubernetes/manifests/etcd-member.yaml: couldn't parse as pod(Object 'Kind' is missing in 'null'), please check config file.
Dec 08 07:04:28 ip-10-0-7-211 root[13036]: machine-config-daemon initiating reboot: Node will reboot into config ea8bbb8e8e084123f670a4bf90258ac8
Dec 08 07:04:28 ip-10-0-7-211 systemd-logind[4069]: System is rebooting.

I'd missed that timing earlier. That means the later failures were after the reboot, or maybe were just noticed after the reboot. Checking on that:

Dec 08 06:55:41 ip-10-0-7-211 systemd[1]: Started libcontainer container 4c18a3e330ad7f208c51c6cb1e0d3570d067beadfe40751553fe89df84e7dc8f.
...
Dec 08 07:05:38 ip-10-0-7-211 hyperkube[4056]: I1208 07:05:38.791983    4056 server.go:1023] Started kubelet
Dec 08 07:05:38 ip-10-0-7-211 systemd[1]: Started Kubernetes Kubelet.
Dec 08 07:05:38 ip-10-0-7-211 systemd[1]: Reached target Multi-User System.
Dec 08 07:05:38 ip-10-0-7-211 hyperkube[4056]: I1208 07:05:38.801367    4056 volume_manager.go:247] Starting Kubelet Volume Manager
Dec 08 07:05:38 ip-10-0-7-211 hyperkube[4056]: E1208 07:05:38.805910    4056 event.go:212] Unable to write event: 'Post https://ci-op-1mpypn4i-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/default/events: dial tcp 10.0.54.112:6443: connect: connection refused' (may retry after sleeping)
...
Dec 08 07:05:39 ip-10-0-7-211 hyperkube[4056]: E1208 07:05:39.213351    4056 eviction_manager.go:243] eviction manager: failed to get get summary stats: failed to get node info: node "ip-10-0-7-211.ec2.internal" not found
Dec 08 07:05:39 ip-10-0-7-211 hyperkube[4056]: W1208 07:05:39.507051    4056 pod_container_deletor.go:75] Container "a884c4b9ec99ea9d80b99ea1c4c78a205895ee69c613498f8ed1167b0d1ad4c2" not found in pod's containers
Dec 08 07:05:39 ip-10-0-7-211 hyperkube[4056]: W1208 07:05:39.507097    4056 pod_container_deletor.go:75] Container "4c18a3e330ad7f208c51c6cb1e0d3570d067beadfe40751553fe89df84e7dc8f" not found in pod's containers
...

Although I don't see 4c18a3e330ad7f208 in the crictl ps -a output.
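A quick follow-up I could try (a suggestion only, not something run here) is to see whether anything on the node ever logged that container ID:

  sudo journalctl --no-pager | grep -m5 4c18a3e330ad7f208
  sudo crictl inspect 4c18a3e330ad7f208 || true   # expected to fail if the container is already gone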


wking commented Dec 8, 2018

Closing this in favor of openshift/cluster-dns-operator#63

@wking wking closed this Dec 8, 2018

smarterclayton commented Dec 8, 2018 via email
