etcdserver: request timed out, possibly due to connection lost #1059

@wking

Description

Version

$ openshift-install version
openshift-install v0.9.1

Platform (aws|libvirt|openstack):

All.

What happened?

In an e2e-aws run mentioned here:

fail [k8s.io/kubernetes/test/e2e/storage/persistent_volumes-local.go:248]: Expected error:
    <*errors.errorString | 0xc4212bc710>: {
        s: "pod Create API error: etcdserver: request timed out, possibly due to connection lost",
    }
    pod Create API error: etcdserver: request timed out, possibly due to connection lost
not to have occurred

What you expected to happen?

No errors due to etcd delays.

How to reproduce it (as minimally and precisely as possible)?

There have been a lot of these in CI recently, although I'm not sure what would have changed. AWS has had a number of performance issues for us today though, including slow resource generation. Maybe our CI disks are just running slower than usual or something?
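
If slow disks are the culprit, one quick check (just a sketch; the log URL is a placeholder for the per-member artifacts like the one below, and the cert paths in the live-metrics variant are assumptions) is to look for slow WAL fsyncs:

# Slow fsyncs show up as "wal: sync duration ... expected less than 1s" warnings in the member log.
$ curl -s "$ETCD_MEMBER_LOG_URL" | gunzip | grep 'wal: sync duration'

# Or scrape a live member's /metrics endpoint and look at the fsync latency histogram.
$ curl -s --cacert ca.crt --cert client.crt --key client.key \
    https://localhost:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds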

Anything else we need to know?

Details of a similar issue in the etcd logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1054/pull-ci-openshift-installer-master-e2e-aws/2824/artifacts/e2e-aws/pods/kube-system_etcd-member-ip-10-0-13-114.ec2.internal_etcd-member.log.gz | gunzip | grep -B2 -A3 'etcdserver: request timed out' | head -n 9
2019-01-12 01:35:58.954371 I | raft: raft.node: 1b29101e3d7dd22a lost leader bd31e70ef4e40f8b at term 13
2019-01-12 01:35:59.812308 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2019-01-12 01:35:59.812418 W | etcdserver: read-only range request "key:\"/openshift.io/podtemplates\" range_end:\"/openshift.io/podtemplatet\" count_only:true " with result "error:etcdserver: request timed out" took too long (7.336055856s) to execute
2019-01-12 01:35:59.812518 W | etcdserver: read-only range request "key:\"/openshift.io/services/endpoints/kube-system/kube-scheduler\" " with result "error:etcdserver: request timed out" took too long (8.292539027s) to execute
2019-01-12 01:35:59.812576 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082020841s) to execute
2019-01-12 01:35:59.812635 W | etcdserver: read-only range request "key:\"/openshift.io/pods/openshift-cluster-kube-scheduler-operator/openshift-cluster-kube-scheduler-operator-56f567694-87qpg\" " with result "error:etcdserver: request timed out" took too long (9.082939895s) to execute
2019-01-12 01:36:02.056897 I | raft: 1b29101e3d7dd22a [term: 13] ignored a MsgReadIndexResp message with lower term from bd31e70ef4e40f8b [term: 12]
2019-01-12 01:36:02.554105 W | wal: sync duration of 3.599668164s, expected less than 1s
2019-01-12 01:36:03.654282 I | raft: 1b29101e3d7dd22a is starting a new election at term 13

This seems similar to etcd-io/etcd#9464, which talks about election ticks and pre-voting as potential fixes, and about bumping to 3.4 to get them. Are there plans for bumping the elderly 3.1.14 we use for bootstrap health checks? Or the more respectable 3.3.10 that the machine-config operator suggests for the masters? I guess we'd have to bump to 3.4 for pre-voting, since 3.3.10 already contains the backported-to-3.3.x etcd-io/etcd@3282d9070 (which landed in 3.3.3). Or maybe the problem is something else entirely :p.
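
For reference, a sketch of the knobs that issue is talking about (the flag names are from upstream etcd; --pre-vote only exists once we're on 3.4, and the values below are just the upstream defaults, not anything tuned for our clusters):

# Illustrative etcd member flags, not what our manifests actually set.
# --heartbeat-interval: ms between leader heartbeats to followers.
# --election-timeout: ms a follower waits without a heartbeat before starting an election;
#   etcd's tuning docs suggest roughly 10x the heartbeat interval, and raising it makes
#   members more tolerant of blips like the one above at the cost of slower failover.
# --pre-vote (3.4+): poll peers before bumping the term, so a briefly-partitioned member
#   doesn't force a disruptive election when it rejoins.
etcd --heartbeat-interval=100 --election-timeout=1000 --pre-vote=true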

As a minor pivot, it seems safe enough for us to move up to 3.3.10 to catch up with openshift/machine-config-operator@59f809676.

/kind bug
