Skip to content

Remove instanceType option m4 from AWS IPI default deployment#5162

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
mtulio:SPLAT-254e3
Oct 7, 2021
Merged

Remove instanceType option m4 from AWS IPI default deployment#5162
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
mtulio:SPLAT-254e3

Conversation

@mtulio
Copy link
Copy Markdown
Contributor

@mtulio mtulio commented Aug 20, 2021

The deployments in AWS region us-east-1 (biggest region) is being impacted with a cluster more expensive and without the benefits of Nitro Instances (faster CPU, network, EBS, etc). The evaluation process on that Region choose the oldest instanceType when the evaluation process of preferred zone - it expects that the instance is offered in all zones. The region us-east-1 has 6 zones, and one us-east-1e is not offering Nitro Instances. Looking the zones across all regions[1], this is the unique exception.

Initially I've studied two other options, there is:
A) Create a restricted list of zones for each region
B) Choose the preferred type from the Types when a type has available in >= 3 zones (N*Masters)

But when looking the availability of m5 across all zones[1], remove the m4 from the list seems to be a better option.

Further we could improve the capacity election when m5 instance is not available, but I guess it will be a more complex implementation, and this current change could be bring to stakeholders and short term benefit.

Please see the details of study in SPLAT-254

Stakeholders impacted:

  • all users running default AWS IPI

[1] Filters by region of describe_instance_type_offerings used by installer to choose the preferred instance type.

                count(m5.large)  count(m4.large)  count(m5VSm4)
region                                                         
af-south-1                    3                0              3
ap-east-1                     3                0              3
ap-northeast-1                3                3              0
ap-northeast-2                4                2              2
ap-south-1                    3                2              1
ap-southeast-1                3                3              0
ap-southeast-2                3                3              0
ca-central-1                  3                2              1
eu-central-1                  3                3              0
eu-north-1                    3                0              3
eu-south-1                    3                0              3
eu-west-1                     3                3              0
eu-west-2                     3                3              0
eu-west-3                     3                0              3
me-south-1                    3                0              3
sa-east-1                     3                2              1
us-east-1                     5                6             -1  <-- here
us-east-2                     3                3              0
us-west-1                     2                2              0
us-west-2                     4                3              1

@mtulio mtulio changed the title refact: remove instance option m4 from IPI refact: remove instance option m4 from AWS IPI default deployment Aug 20, 2021
@mtulio mtulio changed the title refact: remove instance option m4 from AWS IPI default deployment remove instance option m4 from AWS IPI default deployment Aug 20, 2021
@mtulio mtulio changed the title remove instance option m4 from AWS IPI default deployment Remove instanceType option m4 from AWS IPI default deployment Aug 20, 2021
The deployments in AWS region us-east-1 (biggest region) is being impacted with a cluster more expensive and without the benefits of Nitro Instances (faster CPU, network, EBS, etc). The evaluation process on that Region choose the oldest instanceType when the evaluation process of preferred zone - it expects that the instance is offered in all zones. The region us-east-1 has 6 zones, and one us-east-1e is not offering Nitro Instances. Looking the zones across all regions[1], this is the unique exception.

Regarding the performance and costs, the nitro is a better choice and it should not impact the HA on that specific zone that will use 5, instead of 6 zones.
@jstuever
Copy link
Copy Markdown
Contributor

jstuever commented Aug 24, 2021

There may be an issue here if m5 does not exist in all six of us-east-1 zones. The PreferredInstanceType function ensures that the selected type exists in all selected zones. For the control plane machines, the selected zones defaults to all zones in the region. Therefore, with the most basic install-config.yaml, I expect this change is always breaking for us-east-1 region. The current e2e-aws test succeeded in us-west-1, which is not the targeted region.

@jstuever
Copy link
Copy Markdown
Contributor

jstuever commented Aug 24, 2021

/hold
Need to test for this manually.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2021
@jstuever
Copy link
Copy Markdown
Contributor

/test e2e-aws

@jstuever
Copy link
Copy Markdown
Contributor

Also keeping in mind our CI sets the default to m5 and selects only the zones where this type exists. As these values are patched into the install-config.yaml, we will probably not see the issue I mentioned above in CI.

mtulio added a commit to mtulio/installer that referenced this pull request Aug 24, 2021
AWS recently launched the instance m6i, new generation of Nitro instance for general purpose (m5).
ATM the instance is available in a few regions, but the fallback will work well for m5 that was tested on PR openshift#5162
@jstuever
Copy link
Copy Markdown
Contributor

/hold cancel
I appear to have been wrong. I tested this manually using 6 masters. It picked up the m5.xlarge type without issue and simply ignored the problematic zone reusing the first one a second time.

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2021
@mtulio
Copy link
Copy Markdown
Contributor Author

mtulio commented Sep 3, 2021

I tested this manually using 6 masters. It picked up the m5.xlarge type without issue and simply ignored the problematic zone reusing the first one a second time.

@jstuever yes, the getInstanceTypeZoneInfo() will not return the "missing AZ" if the query set (types) has only m5.

@rvanderp3
Copy link
Copy Markdown
Contributor

/retest

Copy link
Copy Markdown
Contributor

@jstuever jstuever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Oct 6, 2021
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jstuever

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2021
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

5 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Oct 6, 2021

@mtulio: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-workers-rhel7 d767935 link false /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-single-node d767935 link false /test e2e-aws-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

8 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit dd37fc0 into openshift:master Oct 7, 2021
@mtulio mtulio deleted the SPLAT-254e3 branch October 7, 2021 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants