Skip to content

machines: Set defaults for machine instance types#5841

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
rna-afk:fix_instance_types
May 11, 2022
Merged

machines: Set defaults for machine instance types#5841
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
rna-afk:fix_instance_types

Conversation

@rna-afk
Copy link
Copy Markdown
Contributor

@rna-afk rna-afk commented Apr 25, 2022

Setting all compute nodes to use the instance types that provide
4 VPUs and 16 GB RAM for AWS, GCP and VSphere platforms.

@openshift-ci openshift-ci Bot requested review from barbacbd and r4f4 April 25, 2022 17:29
Copy link
Copy Markdown
Contributor

@patrickdillon patrickdillon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. So it looks like you identified aws, gcp, nutanix, and vsphere. gcp I think is close enough so let's leave it alone.

For vsphere and nutanix let's get owner opinions: @thunderboltsid, @rvanderp3, @jcpowermac. I suspect resources might be tighter on these platforms.

For context, it has been pointed out to us that aws has fewer resources on compute nodes that gcp/azure. While we're fixing aws we're also checking other platforms.

Comment thread pkg/asset/machines/worker.go Outdated
@rna-afk rna-afk force-pushed the fix_instance_types branch 2 times, most recently from 500e9d3 to cf9e527 Compare April 25, 2022 22:20
@barbacbd
Copy link
Copy Markdown
Contributor

barbacbd commented May 2, 2022

@rna-afk I saw that the aws default instance type was set to gp3 but would other "generic" or general purpose instances make sense? I don't know if there are other conditions that need to be met.

@rna-afk
Copy link
Copy Markdown
Contributor Author

rna-afk commented May 2, 2022

We need a lot of throughput for all our operations on the cluster. The general guideline is to provide a premium disk type for the control-plane nodes and any for the compute nodes but we set the default to premium for better performance. I think gp3 for the root volume type of the VMs is fine.

Comment thread pkg/asset/machines/worker.go Outdated
Comment on lines 179 to 180
Copy link
Copy Markdown
Contributor

@thunderboltsid thunderboltsid May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies 16 CPU units (CPUs x Cores per CPU). Is that what we want? I actually have an open PR that was reducing these even further.. For 4 CPU units, I'd recommend setting NumCoresPerSocket to 1.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved the code from [1] and the minimum requirement is 2 and 2 but the vsphere folks set it to 4 and 4 as part of the performance recommendation for vsphere. I'll remove the nutanix change though to let you change it to the ideal value, thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I understand where I made a mistake. I'll remove those changes then. Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rna-afk I can also take care of bumping up the RAM to 16GB in my PR.

@rna-afk rna-afk force-pushed the fix_instance_types branch 3 times, most recently from 2be3979 to 70c6afe Compare May 2, 2022 18:07
@jcpowermac
Copy link
Copy Markdown
Contributor

For vsphere and nutanix let's get owner opinions: @thunderboltsid, @rvanderp3, @jcpowermac. I suspect resources might be tighter on these platforms.

In CI we have to up to this value anyway to pass jobs. So lgtm to me. We should update documentation right?

@patrickdillon
Copy link
Copy Markdown
Contributor

We should update documentation right?

Yeah this is a good point. This may need more docs than the release note I originally mentioned. Although when I'm looking at the docs:
AWS
[GCP]https://docs.openshift.com/container-platform/4.9/installing/installing_gcp/installing-gcp-account.html#installation-gcp-limits_installing-gcp-account
Hm.. not finding the equivalent for vSphere?

It looks like AWS/GCP don't mention memory in those sections, so it doesn't look like they need updates. Not sure if anywhere else would need to be updated for AWS/GCP. @rna-afk can you take a quick look?

@patrickdillon
Copy link
Copy Markdown
Contributor

/approve

Updating aws, gcp, & vsphere looks good to me.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 2, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2022
@rna-afk
Copy link
Copy Markdown
Contributor Author

rna-afk commented May 2, 2022

Almost all docs need changing specifically in the compute section where the minimum requirements are 2 VCPUs and 8 GB RAM.
AWS
Azure

@rna-afk
Copy link
Copy Markdown
Contributor Author

rna-afk commented May 2, 2022

Almost all docs need changing specifically in the compute section where the minimum requirements are 2 VCPUs and 8 GB RAM.

This actually depends on if we want to increase our minimum requirements spec. Other than this I don't see any other docs change.

@mjpytlak
Copy link
Copy Markdown

mjpytlak commented May 3, 2022

@patrickdillon @rna-afk Please let @bscott-rh and I know if doc updates beyond the release note are required and to which platforms. Based on comments, not sure which platforms this PR will ultimately land on.

@sadasu
Copy link
Copy Markdown
Contributor

sadasu commented May 3, 2022

@rna-afk it appears that we are changing the CPU and memory defaults just for AWS, GCP and vSphere. If that is the case, could you please update the commit message to reflect that (and not all platforms)? Also, #5841 (comment) needs to updated to reflect that too. I'll defer to the other reviewers with more knowledge of the platforms to decide if the defaults are the right ones.

@patrickdillon
Copy link
Copy Markdown
Contributor

Almost all docs need changing specifically in the compute section where the minimum requirements are 2 VCPUs and 8 GB RAM.

This actually depends on if we want to increase our minimum requirements spec. Other than this I don't see any other docs change.

No, this wouldn't change minimum requirements.

@patrickdillon
Copy link
Copy Markdown
Contributor

/lgtm

@rna-afk would you add a note in the jira card for the docs team that says what the new default instance type is for each platform updated in this pr is?

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2022
@patrickdillon
Copy link
Copy Markdown
Contributor

/hold
/lgtm cancel

@openshift-ci openshift-ci Bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm Indicates that a PR is ready to be merged. labels May 5, 2022
Comment thread pkg/asset/machines/worker.go Outdated
Comment on lines 159 to 160
Copy link
Copy Markdown
Contributor

@patrickdillon patrickdillon May 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to bump CPUs too?
/cc @jcpowermac

@openshift-ci openshift-ci Bot requested a review from jcpowermac May 5, 2022 00:07
@patrickdillon
Copy link
Copy Markdown
Contributor

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2022
@rna-afk rna-afk force-pushed the fix_instance_types branch from 40dbc3c to daeb04b Compare May 5, 2022 13:18
Comment thread pkg/asset/machines/worker.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above it is 4 NumCoresPerSocket and here it is 1. Which is right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 cpus and 1 cores per socket is right. I'll fix it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to set masters to 4 vcpus and 4 numcores and workers to 4 vcpus and 1 numcores

Setting compute nodes to use the instance types that provide
4 VPUs and 16 GB RAM for AWS, GCP and VSphere.
@rna-afk rna-afk force-pushed the fix_instance_types branch from daeb04b to 16f2827 Compare May 9, 2022 14:55
@patrickdillon
Copy link
Copy Markdown
Contributor

/lgtm

We could update https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/ipi-conf-vsphere-commands.sh to no longer explicitly set the parameters that are being updated here

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 10, 2022
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@rvanderp3
Copy link
Copy Markdown
Contributor

/lgtm

We could update https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/ipi-conf-vsphere-commands.sh to no longer explicitly set the parameters that are being updated here

@patrickdillon sure, good idea. I'll follow up on that.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 11, 2022

@rna-afk: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-azure 16f2827 link false /test e2e-azure
ci/prow/e2e-vsphere 16f2827 link false /test e2e-vsphere
ci/prow/e2e-ovirt 16f2827 link false /test e2e-ovirt
ci/prow/okd-e2e-aws 16f2827 link false /test okd-e2e-aws
ci/prow/e2e-metal-assisted 16f2827 link false /test e2e-metal-assisted
ci/prow/e2e-libvirt 16f2827 link false /test e2e-libvirt
ci/prow/e2e-metal-ipi 16f2827 link false /test e2e-metal-ipi
ci/prow/e2e-ibmcloud 16f2827 link false /test e2e-ibmcloud
ci/prow/e2e-gcp 16f2827 link false /test e2e-gcp
ci/prow/e2e-crc 16f2827 link false /test e2e-crc

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 8d15c81 into openshift:master May 11, 2022
jcpowermac added a commit to jcpowermac/installer that referenced this pull request May 16, 2022
Virtual machines running on
vSphere should be configured for cores
over sockets for NUMA. The change
in openshift#5841 missed that.

Changing vSphere back to using a single
socket and multiple cores.
@rvanderp3
Copy link
Copy Markdown
Contributor

/lgtm
We could update https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/vsphere/ipi-conf-vsphere-commands.sh to no longer explicitly set the parameters that are being updated here

@patrickdillon sure, good idea. I'll follow up on that.

started a PR to address this: openshift/release#29234

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants