Bug 1945017: Increase system reserved memory from 1Gi to 1.8Gi to support single node clusters#2504

Closed
omertuc wants to merge 1 commit into openshift:master from omertuc:more_system_reserved_memory

Conversation

@omertuc
Contributor

@omertuc omertuc commented Mar 31, 2021

When running E2E tests on single node clusters, the 1Gi reserved for
system memory is insufficient.

While working on #2501, I had 3 e2e test runs on AWS single node. The peak
recorded system memory usage during those runs was 1.40, 1.31 and 1.19 GiB
respectively. In this PR I also saw a run that peaked at 1.56 GiB.

[2 images, not preserved]

The SystemMemoryExceedsReservation alert requires actual usage to stay
below 90% of the amount reserved, so the corresponding reservations
would have to be at least 1.56, 1.46, 1.32 and 1.74 GiB (each peak
divided by 0.9).

In short, the reserved memory should be increased to 1.8 GiB to
support single node (with some hopefully sufficient padding).
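
The arithmetic behind these thresholds can be checked with a short sketch. It assumes only the 90% figure and the four peaks quoted above; the function and variable names are illustrative and are not taken from the machine-config-operator code.

```python
# Sketch only: names are hypothetical, the numbers come from the description above.
ALERT_FRACTION = 0.9  # usage must stay below 90% of the reservation


def min_reservation_gib(peak_gib: float, fraction: float = ALERT_FRACTION) -> float:
    """Reservation (GiB) at which peak_gib equals `fraction` of the reservation."""
    return peak_gib / fraction


peaks_gib = [1.40, 1.31, 1.19, 1.56]  # observed peaks from the e2e runs
required = [min_reservation_gib(p) for p in peaks_gib]
worst = max(required)

for peak, need in zip(peaks_gib, required):
    print(f"peak {peak:.2f} GiB -> reserve at least {need:.3f} GiB")
print(f"headroom with 1.8 GiB reserved: {1.8 - worst:.3f} GiB")
```

The largest requirement, about 1.73 GiB, comes from the 1.56 GiB peak, which is why 1.8 GiB leaves only a small margin.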

Possible future improvements -

  1. Different threshold depending on whether the cluster is a single node
    cluster or not

  2. Find a way to lower single node system memory usage

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: omertuc
To complete the pull request process, please assign kikisdeliveryservice after the PR has been reviewed.
You can assign the PR to them by writing /assign @kikisdeliveryservice in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@omertuc omertuc changed the title Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters Bug 1945017 - Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters Mar 31, 2021
@omertuc omertuc changed the title Bug 1945017 - Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters Bug 1945017: Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters Mar 31, 2021
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Mar 31, 2021
@openshift-ci-robot
Contributor

@omertuc: This pull request references Bugzilla bug 1945017, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1945017: Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@omertuc
Contributor Author

omertuc commented Mar 31, 2021

/test e2e-aws-workers-rhel7
/test okd-e2e-aws

@kikisdeliveryservice
Contributor

@rphillips any issues with this?

@omertuc please update the BZ as bot requested

@omertuc
Contributor Author

omertuc commented Mar 31, 2021

/bugzilla refresh
/test e2e-aws-serial
/test e2e-vsphere-upgrade
/test e2e-aws-workers-rhel7
/test okd-e2e-aws
/test ?

@omertuc
Contributor Author

omertuc commented Mar 31, 2021

/bugzilla refresh

@omertuc
Contributor Author

omertuc commented Mar 31, 2021

/test ?

@openshift-ci-robot
Contributor

@omertuc: An error was encountered querying GitHub for users with public email (wabouham@redhat.com) for bug 1945017 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. Post "http://ghproxy/graphql": dial tcp 172.30.229.2:80: i/o timeout

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh
/test e2e-aws-serial
/test e2e-vsphere-upgrade
/test e2e-aws-workers-rhel7
/test okd-e2e-aws
/test ?

@omertuc
Contributor Author

omertuc commented Mar 31, 2021

/test e2e-aws-serial
/test e2e-vsphere-upgrade
/test e2e-aws-workers-rhel7
/test okd-e2e-aws

@openshift-ci-robot
Contributor

@omertuc: An error was encountered querying GitHub for users with public email (wabouham@redhat.com) for bug 1945017 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. Post "http://ghproxy/graphql": dial tcp 172.30.229.2:80: i/o timeout

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh

@openshift-ci-robot
Contributor

@omertuc: The following commands are available to trigger jobs:

  • /test cluster-bootimages
  • /test e2e-agnostic-upgrade
  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-workers-rhel7
  • /test e2e-azure
  • /test e2e-gcp-op
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-openstack
  • /test e2e-ovirt
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test images
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test unit
  • /test verify

Use /test all to run the following jobs:

  • pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-serial
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
  • pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
  • pull-ci-openshift-machine-config-operator-master-e2e-vsphere-upgrade
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test ?

@omertuc
Contributor Author

omertuc commented Mar 31, 2021

/test e2e-aws-single-node

…ode clusters

When running E2E tests on single node clusters, the 1Gi reserved for
system memory is insufficient.

While working on #2501, I had 3 e2e test runs on AWS single node. The peak
recorded system memory usage during those runs was 1.40, 1.31 and 1.19 GiB
respectively. In this PR I also saw a run that peaked at 1.56 GiB.

The SystemMemoryExceedsReservation alert requires actual usage to stay
below 90% of the amount reserved, so the corresponding reservations
would have to be at least 1.56, 1.46, 1.32 and 1.74 GiB (each peak
divided by 0.9).

In short, the reserved memory should be increased to 1.8 GiB to
support single node (with some hopefully sufficient padding).

Possible future improvements -

1) Different threshold depending on whether the cluster is a single node
cluster or not

2) Find a way to lower single node system memory usage
@openshift-ci-robot
Contributor

@omertuc: This pull request references Bugzilla bug 1945017, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1945017: Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters

@omertuc omertuc changed the title Bug 1945017: Increase system reserved memory from 1Gi to 1.6Gi to support single node clusters Bug 1945017: Increase system reserved memory from 1Gi to 1.8Gi to support single node clusters Apr 1, 2021
@omertuc
Contributor Author

omertuc commented Apr 1, 2021

@omertuc please update the BZ as bot requested

@kikisdeliveryservice It has to be triaged before I can do that, see the comment in the BZ

@omertuc
Contributor Author

omertuc commented Apr 1, 2021

/test e2e-aws-workers-rhel7

@omertuc
Contributor Author

omertuc commented Apr 1, 2021

/test okd-e2e-aws

@omertuc
Contributor Author

omertuc commented Apr 1, 2021

/test e2e-aws
/test e2e-aws-serial
/test e2e-ovn-step-registry

@openshift-ci
Contributor

openshift-ci Bot commented Apr 1, 2021

@omertuc: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-single-node | 1cdb92755e860ce5142595896b5b509f8e16ed97 | link | /test e2e-aws-single-node |
| ci/prow/e2e-agnostic-upgrade | 6948ae0 | link | /test e2e-agnostic-upgrade |
| ci/prow/e2e-vsphere-upgrade | 6948ae0 | link | /test e2e-vsphere-upgrade |
| ci/prow/e2e-aws-workers-rhel7 | 6948ae0 | link | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws-serial | 6948ae0 | link | /test e2e-aws-serial |
| ci/prow/okd-e2e-aws | 6948ae0 | link | /test okd-e2e-aws |

Full PR test history. Your PR dashboard.

@rphillips
Contributor

The PR will somehow need to detect if SNO is enabled to set the reserve to 1.8GB
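
A rough sketch of what such a check could look like, not something this thread settles on: it assumes the cluster topology is available as a string, such as the `controlPlaneTopology` value reported in the OpenShift `Infrastructure` status (`SingleReplica` for single node). The function name and constants are hypothetical and are not part of this PR.

```python
# Hypothetical helper: picks a system-reserved memory value from the topology.
DEFAULT_RESERVED_MEMORY = "1Gi"        # current value, per the PR title
SINGLE_NODE_RESERVED_MEMORY = "1.8Gi"  # value proposed in this PR

# Assumption: single node clusters report this control plane topology.
SINGLE_REPLICA_TOPOLOGY = "SingleReplica"


def system_reserved_memory(control_plane_topology: str) -> str:
    """Return the larger reservation only for single node clusters."""
    if control_plane_topology == SINGLE_REPLICA_TOPOLOGY:
        return SINGLE_NODE_RESERVED_MEMORY
    return DEFAULT_RESERVED_MEMORY


for topology in ("HighlyAvailable", "SingleReplica"):
    print(topology, "->", system_reserved_memory(topology))
```

Keeping the default untouched for every other topology means multi-node clusters would see no change in allocatable memory.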

@mrunalp
Member

mrunalp commented Apr 1, 2021

/hold
The preferred approach would be to bump up the system reserved just for the single node CI jobs.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2021
@omertuc
Contributor Author

omertuc commented Apr 1, 2021

Closed in favor of openshift/release#17403

@omertuc omertuc closed this Apr 1, 2021
@openshift-ci-robot
Contributor

@omertuc: This pull request references Bugzilla bug 1945017. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Bug 1945017: Increase system reserved memory from 1Gi to 1.8Gi to support single node clusters

Labels

  • bugzilla/invalid-bug: Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.
  • bugzilla/severity-unspecified: Referenced Bugzilla bug's severity is unspecified for the PR.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.

5 participants