
Conversation

@smarterclayton
Contributor

@smarterclayton smarterclayton commented Nov 27, 2018

With colocated masters we are seeing about ~120 IOP/s sustained, and
a 30GB gp2 drive is limited to 100 IOP/s. etcd is ~75% of the write
workload, but we are seeing some syncs take ~1s and occasional
heartbeat latency. Increase master disk by 4x to get slightly more
headroom - in practice this adds about $30-40/month of disk on top
of the ~$200/month the instances cost.

openshift/origin#21552 may be related
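
For context on the arithmetic (an editor's note, not part of the PR): gp2 baseline performance scales at 3 IOPS per GiB with a 100 IOPS floor, which is where the 100 IOP/s limit on a 30GB volume comes from and why the 4x size bump buys headroom. A minimal sketch, ignoring burst credits and the upper IOPS cap:

```go
package main

import "fmt"

// gp2BaselineIOPS models the published gp2 formula: 3 IOPS per GiB,
// floored at 100 IOPS (burst credits and the upper cap are ignored).
func gp2BaselineIOPS(sizeGiB int) int {
	if iops := 3 * sizeGiB; iops > 100 {
		return iops
	}
	return 100
}

func main() {
	fmt.Println(gp2BaselineIOPS(30))  // 100 -- saturated by ~120 IOP/s sustained
	fmt.Println(gp2BaselineIOPS(120)) // 360 -- roughly 3x the observed workload
}
```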

@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 27, 2018
@smarterclayton
Contributor Author

/hold

look at the ec2 numbers to verify this has an impact
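
One way to "look at the ec2 numbers" is CloudWatch's per-volume AWS/EBS metrics. A sketch using aws-sdk-go; the region and volume ID are placeholders:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	cw := cloudwatch.New(sess)

	// Sum of write operations per 5-minute period; dividing the sum by
	// the period length gives average write IOP/s for the volume.
	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/EBS"),
		MetricName: aws.String("VolumeWriteOps"),
		Dimensions: []*cloudwatch.Dimension{
			{Name: aws.String("VolumeId"), Value: aws.String("vol-EXAMPLE")}, // placeholder
		},
		StartTime:  aws.Time(time.Now().Add(-1 * time.Hour)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(300),
		Statistics: []*string{aws.String("Sum")},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, dp := range out.Datapoints {
		fmt.Printf("%s: %.0f write IOP/s\n", dp.Timestamp, *dp.Sum/300)
	}
}
```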

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2018
@smarterclayton
Contributor Author

/retest

2 similar comments
@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

may be seeing higher CPU use due to io throttling

@smarterclayton
Contributor Author

/test e2e-aws

variable "tectonic_aws_master_root_volume_size" {
type = "string"
default = "30"
default = "120"
Member

We don't want to be driving these defaults from the Terraform config (because that doesn't work for folks who want to install our assets themselves without going through Terraform). Ideally we'd set it in this structure here and then push that into Terraform here. But the cluster API doesn't seem to support root volume configs at the moment (I didn't see any open issues about that, but maybe we have someone in sig-cluster-lifecycle who can ask about getting it added). In the meantime, we're pulling this straight from the install-config, although folks who don't set this via the install-config may not have it set in the Terraform variables at all (in which case your default here will matter). So this probably works as you have it, but only as long as the cluster-API operator doesn't have to get involved.
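
To make that flow concrete (a sketch with invented names, not the installer's actual types): the default would live on an install-config machine-pool struct and get copied into the Terraform variables during asset generation, so the Terraform-side default never has to fire:

```go
// Sketch only: invented names illustrating "set it in the install-config
// structure, then push it into the Terraform variables" rather than
// defaulting inside Terraform itself.
package tfvars

type AWSMachinePoolPlatform struct {
	RootVolumeSizeGiB int // hypothetical install-config field
}

type config struct {
	MasterRootVolumeSize int `json:"tectonic_aws_master_root_volume_size,string"`
}

func fromInstallConfig(pool AWSMachinePoolPlatform) config {
	size := pool.RootVolumeSizeGiB
	if size == 0 {
		size = 120 // default applied here, so Terraform's default never matters
	}
	return config{MasterRootVolumeSize: size}
}
```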

Contributor

We don't want to be driving these defaults from the Terraform config (because that doesn't work for folks who want to install our assets themselves without going through Terraform).

A good start would be to not have defaults in Terraform and to drive all options through the install-config or the cluster API, then slowly move most of them out of the install-config.
https://jira.coreos.com/browse/CORS-888

@crawford
Contributor

/lgtm

We are going to merge this as-is so we can hopefully cut down on CI flakes.

@enxebre the Machine API will need knobs for adjusting this (if they don't already exist).
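
For reference, the knob being asked for would presumably be a root-volume block on the AWS provider spec. The shape below is purely hypothetical and not confirmed against the actual cluster-api types:

```go
// Hypothetical shape of the requested knob; none of these field names
// are confirmed against the real cluster-api AWS provider spec.
package machineapi

type EBSRootVolume struct {
	SizeGiB int64  // e.g. 120
	Type    string // e.g. "gp2"
	IOPS    int64  // only meaningful for provisioned-IOPS volume types
}

type AWSMachineProviderSpec struct {
	// ...existing fields elided...
	RootVolume *EBSRootVolume
}
```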

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [crawford,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton
Contributor Author

/unhold

@smarterclayton smarterclayton removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2018
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking
Member

wking commented Nov 30, 2018

/retest

3 similar comments
@wking
Member

wking commented Nov 30, 2018

/retest

@wking
Member

wking commented Nov 30, 2018

/retest

@wking
Member

wking commented Nov 30, 2018

/retest

@wking
Member

wking commented Nov 30, 2018

e2e-aws:

fail [github.com/openshift/origin/test/extended/deployments/deployments.go:541]: Expected error:
    <*errors.errorString | 0xc4200c7560>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

failed: (6m18s) "[Feature:DeploymentConfig] deploymentconfigs with multiple image change triggers [Conformance] should run a successful deployment with a trigger used by different containers [Suite:openshift/conformance/parallel/minimal] [Suite:openshift/smoke-4]"

just the one failure. Big money 🎲

/retest

@wking
Member

wking commented Nov 30, 2018

e2e-aws:

E1130 17:45:05.004725     497 memcache.go:147] couldn't get resource list for authorization.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.081429     497 memcache.go:147] couldn't get resource list for build.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.105388     497 memcache.go:147] couldn't get resource list for image.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.184884     497 memcache.go:147] couldn't get resource list for oauth.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.284384     497 memcache.go:147] couldn't get resource list for quota.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.303982     497 memcache.go:147] couldn't get resource list for route.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.383532     497 memcache.go:147] couldn't get resource list for security.openshift.io/v1: the server could not find the requested resource
E1130 17:45:05.398963     497 memcache.go:147] couldn't get resource list for template.openshift.io/v1: the server could not find the requested resource

which I think is the OpenShift API flake.

/retest

@wking
Member

wking commented Nov 30, 2018

e2e-aws had more of the "couldn't get resource list" errors.

/retest

@wking
Member

wking commented Nov 30, 2018

Maybe openshift/origin@5fa8ee7 has a fix...

/retest

@wking
Member

wking commented Dec 1, 2018

Hrm, maybe not. e2e-aws:

I1130 23:33:01.714285     438 metrics_grabber.go:81] Master node is not registered. Grabbing metrics from Scheduler, ControllerManager and ClusterAutoscaler is disabled.
Nov 30 23:33:01.816: INFO: 
Latency metrics for node ip-10-0-43-254.ec2.internal
Nov 30 23:33:01.816: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.5 Latency:2m3.028633s}
Nov 30 23:33:01.816: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.9 Latency:2m3.028633s}
Nov 30 23:33:01.816: INFO: {Operation:create Method:pod_worker_latency_microseconds Quantile:0.99 Latency:2m3.028633s}
Nov 30 23:33:01.816: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.99 Latency:1m18.932963s}
Nov 30 23:33:01.816: INFO: {Operation: Method:pod_start_latency_microseconds Quantile:0.9 Latency:26.949836s}
[AfterEach] [Top Level]
  /tmp/openshift/build-rpms/rpm/BUILD/origin-4.0.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:142
STEP: Dumping a list of prepulled images on each node...
Nov 30 23:33:01.847: INFO: Waiting up to 3m0s for all (but 100) nodes to be ready
Nov 30 23:33:01.872: INFO: Running AfterSuite actions on all node
Nov 30 23:33:01.872: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/cli.go:668]: Nov 30 23:32:59.860: Unauthorized

failed: (14.3s) "[Feature:Platform][Smoke] Managed cluster should start all core operators [Suite:openshift/conformance/parallel] [Suite:openshift/smoke-4]"
...
fail [github.com/openshift/origin/test/extended/util/cli.go:668]: Nov 30 23:33:23.091: the server could not find the requested resource (post oauthclients.oauth.openshift.io)

failed: (24.7s) "[sig-apps] Deployment deployment should delete old replica sets [Suite:openshift/conformance/parallel] [Suite:k8s] [Suite:openshift/smoke-4]"

Let's try again with [edit, actually the origin commit hadn't changed]:

$ oc adm release info registry.svc.ci.openshift.org/openshift/origin-release:v4.0 --commits | grep origin | head -n1
  cli                                           https://github.com/openshift/origin                                        5fa8ee77312ed76b49e896122ef64208216d32e1

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Dec 1, 2018

@smarterclayton: The following test failed, say /retest to rerun them all:

Test name            Commit   Details  Rerun command
ci/prow/e2e-libvirt  bca6a92  link     /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
