etcdserver: request timed out, possibly due to connection lost #1059

@wking

Description

Version

$ openshift-install version
openshift-install v0.9.1

Platform (aws|libvirt|openstack):

All.

What happened?

In an e2e-aws run mentioned here:

fail [k8s.io/kubernetes/test/e2e/storage/persistent_volumes-local.go:248]: Expected error:
    <*errors.errorString | 0xc4212bc710>: {
        s: "pod Create API error: etcdserver: request timed out, possibly due to connection lost",
    }
    pod Create API error: etcdserver: request timed out, possibly due to connection lost
not to have occurred

What you expected to happen?

No errors due to etcd delays.

How to reproduce it (as minimally and precisely as possible)?

There have been a lot of these in CI recently, although I'm not sure what would have changed. AWS has had a number of performance issues for us today though, including slow resource generation. Maybe our CI disks are just running slower than usual or something?
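
If slow disks are the culprit, one quick check (just a sketch; the log URL is a placeholder for the per-member artifacts like the one below, and the cert paths in the live-metrics variant are assumptions) is to look for slow WAL fsyncs:

# Slow fsyncs show up as "wal: sync duration ... expected less than 1s" warnings in the member log.
$ curl -s "$ETCD_MEMBER_LOG_URL" | gunzip | grep 'wal: sync duration'

# Or scrape a live member's /metrics endpoint and look at the fsync latency histogram.
$ curl -s --cacert ca.crt --cert client.crt --key client.key \
    https://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds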

Anything else we need to know?

Details of a similar issue in the etcd logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz | gunzip | grep -B2 -A3 'etcdserver: request timed out' | head -n 9
2019-01-12 01:35:58.954371 I | raft: raft.node: 1b29101e3d7dd22a lost leader bd31e70ef4e40f8b at term 13
2019-01-12 01:35:59.812308 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2019-01-12 01:35:59.812418 W | etcdserver: read-only range request "key:\"/openshift.io/podtemplates\" range_end:\"/openshift.io/podtemplatet\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.336055856s) to execute
2019-01-12 01:35:59.812518 W | etcdserver: read-only range request "key:\"/openshift.io/services/endpoints/kube-system/kube-scheduler\" " with result "error:etcdserver: request timed out" took too long (8.292539027s) to execute
2019-01-12 01:35:59.812576 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082020841s) to execute
2019-01-12 01:35:59.812635 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082939895s) to execute
2019-01-12 01:36:02.056897 I | raft: 1b29101e3d7dd22a [term: 13] ignored a MsgReadIndexResp message with lower term from bd31e70ef4e40f8b [term: 12]
2019-01-12 01:36:02.554105 W | wal: sync duration of 3.599668164s, expected less than 1s
2019-01-12 01:36:03.654282 I | raft: 1b29101e3d7dd22a is starting a new election at term 13

This seems similar to etcd-io/etcd#9464, which talks about election ticks and pre-voting as potential fixes, and about bumping to 3.4 to get them. Are there plans for bumping the elderly 3.1.14 we use for bootstrap health checks? Or the more respectable 3.3.10 that the machine-config operator suggests for the masters? I guess we'd have to bump to 3.4 for pre-voting, since 3.3.10 already contains the backported-to-3.3.x etcd-io/etcd@3282d9070 (which landed in 3.3.3). Or maybe the problem is something else entirely :p.
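
For reference, a sketch of the knobs that issue is talking about (the flag names are from upstream etcd; --pre-vote only exists once we're on 3.4, and the values below are just the upstream defaults, not anything tuned for our clusters):

# Illustrative etcd member flags, not what our manifests actually set.
# --heartbeat-interval: ms between leader heartbeats to followers.
# --election-timeout: ms a follower waits without a heartbeat before starting an election;
#   etcd's tuning docs suggest roughly 10x the heartbeat interval, and raising it makes
#   members more tolerant of blips like the one above at the cost of slower failover.
# --pre-vote (3.4+): poll peers before bumping the term, so a briefly-partitioned member
#   doesn't force a disruptive election when it rejoins.
etcd --heartbeat-interval=100 --election-timeout=1000 --pre-vote=true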

As a minor pivot, it seems safe enough for us to move up to 3.3.10 to catch up with openshift/machine-config-operator@59f809676.

/kind bug
