
Conversation

@smarterclayton
Contributor

@smarterclayton smarterclayton commented Nov 27, 2018

With colocated masters we are seeing about ~120 IOP/s sustained, and
a 30GB gp2 drive is limited to 100 IOP/s. etcd is ~75% of the write
workload, but we are seeing some syncs take ~1s and occasional
heartbeat latency. Increase master disk by 4x to get slightly more
headroom - in practice this adds about $30-40/month of disk on top
of the ~$200/month the instances cost.

openshift/origin#21552 may be related
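
For context on the arithmetic (an editor's note, not part of the PR): gp2 baseline performance scales at 3 IOPS per GiB with a 100 IOPS floor, which is where the 100 IOP/s limit on a 30GB volume comes from and why the 4x size bump buys headroom. A minimal sketch, ignoring burst credits and the upper IOPS cap:

```go
package main

import "fmt"

// gp2BaselineIOPS models the published gp2 formula: 3 IOPS per GiB,
// floored at 100 IOPS (burst credits and the upper cap are ignored).
func gp2BaselineIOPS(sizeGiB int) int {
	if iops := 3 * sizeGiB; iops > 100 {
		return iops
	}
	return 100
}

func main() {
	fmt.Println(gp2BaselineIOPS(30))  // 100 -- saturated by ~120 IOP/s sustained
	fmt.Println(gp2BaselineIOPS(120)) // 360 -- roughly 3x the observed workload
}
```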

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 27, 2018
@smarterclayton
Contributor Author

/hold

look at the ec2 numbers to verify this has an impact
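
One way to "look at the ec2 numbers" is CloudWatch's per-volume AWS/EBS metrics. A sketch using aws-sdk-go; the region and volume ID are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	cw := cloudwatch.New(sess)

	// Sum of write operations per 5-minute period; dividing the sum by
	// the period length gives average write IOP/s for the volume.
	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/EBS"),
		MetricName: aws.String("VolumeWriteOps"),
		Dimensions: []*cloudwatch.Dimension{
			{Name: aws.String("VolumeId"), Value: aws.String("vol-EXAMPLE")}, // placeholder
		},
		StartTime:  aws.Time(time.Now().Add(-1 * time.Hour)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(300),
		Statistics: []*string{aws.String("Sum")},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, dp := range out.Datapoints {
		fmt.Printf("%s: %.0f write IOP/s\n", dp.Timestamp, *dp.Sum/300)
	}
}
```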

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2018
@smarterclayton
Contributor Author

/retest

2 similar comments
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

may be seeing higher CPU use due to io throttling

@smarterclayton
Contributor Author

/test e2e-aws

variable "tectonic_aws_master_root_volume_size" {
type = "string"
default = "30"
default = "120"
Member

We don't want to be driving these defaults from the Terraform config (because that doesn't work for folks who want to install our assets themselves without going through Terraform). Ideally we'd set it in this structure here and then push that into Terraform here. But the cluster API doesn't seem to support root volume configs at the moment (I didn't see any open issues about that, but maybe we have someone in sig-cluster-lifecycle who can ask about getting it added). In the meantime, we're pulling this straight from the install-config, although folks who don't set this via the install-config may not have it set in the Terraform variables at all (in which case your default here will matter). So this probably works as you have it, but only as long as the cluster-API operator doesn't have to get involved.
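
To make that flow concrete (a sketch with invented names, not the installer's actual types): the default would live on an install-config machine-pool struct and get copied into the Terraform variables during asset generation, so the Terraform-side default never has to fire:

```go
// Sketch only: invented names illustrating "set it in the install-config
// structure, then push it into the Terraform variables" rather than
// defaulting inside Terraform itself.
package tfvars

type AWSMachinePoolPlatform struct {
	RootVolumeSizeGiB int // hypothetical install-config field
}

type config struct {
	MasterRootVolumeSize int `json:"tectonic_aws_master_root_volume_size,string"`
}

func fromInstallConfig(pool AWSMachinePoolPlatform) config {
	size := pool.RootVolumeSizeGiB
	if size == 0 {
		size = 120 // default applied here, so Terraform's default never matters
	}
	return config{MasterRootVolumeSize: size}
}
```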

Contributor

We don't want to be driving these defaults from the Terraform config (because that doesn't work for folks who want to install our assets themselves without going through Terraform).

A good start would be to not have defaults in Terraform and to drive all options through the install-config or the cluster API, then slowly move most of them out of the install-config.
https://jira.coreos.com/browse/CORS-888

@crawford
Contributor

/lgtm

We are going to merge this as-is so we can hopefully cut down on CI flakes.

@enxebre the Machine API will need knobs for adjusting this (if they don't already exist).
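
For reference, the knob being asked for would presumably be a root-volume block on the AWS provider spec. The shape below is purely hypothetical and not confirmed against the actual cluster-api types:

```go
// Hypothetical shape of the requested knob; none of these field names
// are confirmed against the real cluster-api AWS provider spec.
package machineapi

type EBSRootVolume struct {
	SizeGiB int64  // e.g. 120
	Type    string // e.g. "gp2"
	IOPS    int64  // only meaningful for provisioned-IOPS volume types
}

type AWSMachineProviderSpec struct {
	// ...existing fields elided...
	RootVolume *EBSRootVolume
}
```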

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [crawford,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton
Contributor Author

/unhold

@smarterclayton smarterclayton removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2018
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking
Member

wking commented Nov 30, 2018

/retest

3 similar comments
@wking
Member

wking commented Nov 30, 2018

/retest

@wking
Member

wking commented Nov 30, 2018

/retest

@wking
Member

wking commented Nov 30, 2018

/retest

@wking
Member

wking commented Nov 30, 2018

e2e-aws:

fail [github.com/openshift/origin/test/extended/deployments/deployments.go:541]: Expected error:
    <*errors.errorString | 0xc4200c7560>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

failed: (6m18s) "[Feature:DeploymentConfig] deploymentconfigs with multiple image change triggers [Conformance] should run a successful deployment with a trigger used by different containers [Suite:openshift/conformance/parallel/minimal] [Suite:openshift/smoke-4]"

just the one failure. Big money 🎲

/retest

@wking
Member

wking commented Nov 30, 2018

e2e-aws:

E1130 17:45:05.004725     497 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.081429     497 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.105388     497 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.184884     497 memcache.go:147] couldn't get resource list for oauth.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.284384     497 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.303982     497 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.383532     497 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.398963     497 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource

which I think is the OpenShift API flake.

/retest

@wking
Member

wking commented Nov 30, 2018

e2e-aws had more of the "couldn't get resource list" errors.

/retest

@wking
Member

wking commented Nov 30, 2018

Maybe openshift/origin@5fa8ee7 has a fix...

/retest

@wking
Member

wking commented Dec 1, 2018

Hrm, maybe not. e2e-aws:

I1130 23:33:01.714285     438 metrics_grabber.go:81] Master node is not registered. Grabbing metrics from Scheduler, ControllerManager and ClusterAutoscaler is disabled.
Nov 30 23:33:01.816: INFO: 
Latency metrics for node ip-10-0-43-254.ec2.internal
Nov 30 23:33:01.816: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.5 Latency:2m3.028633s}
Nov 30 23:33:01.816: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m3.028633s}
Nov 30 23:33:01.816: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m3.028633s}
Nov 30 23:33:01.816: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:1m18.932963s}
Nov 30 23:33:01.816: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:26.949836s}
[AfterEach] [Top Level]
  /tmp/openshift/build-rpms/rpm/BUILD/origin-4.0.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:142
STEP: Dumping a list of prepulled images on each node...
Nov 30 23:33:01.847: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Nov 30 23:33:01.872: INFO: Running AfterSuite actions on all node
Nov 30 23:33:01.872: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/cli.go:668]: Nov 30 23:32:59.860: Unauthorized

failed: (14.3s) "[Feature:Platform][Smoke] Managed cluster should start all core operators [Suite:openshift/conformance/parallel] [Suite:openshift/smoke-4]"
...
fail [github.com/openshift/origin/test/extended/util/cli.go:668]: Nov 30 23:33:23.091: the server could not find the requested resource (post oauthclients.oauth.openshift.io)

failed: (24.7s) "[sig-apps] Deployment deployment should delete old replica sets [Suite:openshift/conformance/parallel] [Suite:k8s] [Suite:openshift/smoke-4]"

Let's try again with [edit, actually the origin commit hadn't changed]:

$ oc adm release info registry.svc.ci.openshift.org/openshift/origin-release:v4.0 --commits | grep origin | head -n1
  cli                                           https://github.com/openshift/origin                                        5fa8ee77312ed76b49e896122ef64208216d32e1

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 1, 2018

@smarterclayton: The following test failed, say /retest to rerun them all:

Test name            Commit   Details  Rerun command
ci/prow/e2e-libvirt  bca6a92  link     /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
