Skip to content

OCPBUGS-24012: Add templates for gc_thresh sysctls#4048

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:masterfrom
jmencak:4.15-gc-thresh
Dec 7, 2023
Merged

OCPBUGS-24012: Add templates for gc_thresh sysctls#4048
openshift-merge-bot[bot] merged 1 commit into
openshift:masterfrom
jmencak:4.15-gc-thresh

Conversation

@jmencak
Copy link
Copy Markdown
Contributor

@jmencak jmencak commented Nov 29, 2023

- What I did

The net.ipv[46].neigh.default.gc_thresh[1-3] have traditionally been set by the Node Tuning Operator. With OpenShift 4.13, NTO is now an optional operator. These sysctls tune kernel's ARP cache. For large scale clusters, the default kernel parameters are too low and result in issues such as rhbz#1384746. In the absence of NTO and customer clusters getting larger, these sysctls need to be set elsewhere.

Setting these values early on cluster boot (prior to container startups) will also prevent issues such as OCPBUGS-24012.

Fixes: OCPBUGS-24012

- How to verify it

Check the file /etc/sysctl.d/gc-thresh.conf exists in a new cluster with:

net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536

- Description for the changelog
Add /etc/sysctl.d/gc-thresh.conf file to bump ARP cache limits.

The net.ipv[46].neigh.default.gc_thresh[1-3] have traditionally been set
by the Node Tuning Operator.  With OpenShift 4.13, NTO is now an
optional operator.  These sysctls tune kernel's ARP cache.  For large
scale clusters, the default kernel parameters are too low and result in
issues such as rhbz#1384746.  In the absence of NTO and
customer clusters getting larger, these sysctls need to be set
elsewhere.

Setting these values early on cluster boot (prior to container startups)
will also prevent issues such as OCPBUGS-24012.

Fixes: OCPBUGS-24012
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 29, 2023
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jmencak: This pull request references Jira Issue OCPBUGS-24012, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

- What I did

The net.ipv[46].neigh.default.gc_thresh[1-3] have traditionally been set by the Node Tuning Operator. With OpenShift 4.13, NTO is now an optional operator. These sysctls tune kernel's ARP cache. For large scale clusters, the default kernel parameters are too low and result in issues such as rhbz#1384746. In the absence of NTO and customer clusters getting larger, these sysctls need to be set elsewhere.

Setting these values early on cluster boot (prior to container startups) will also prevent issues such as OCPBUGS-24012.

Fixes: OCPBUGS-24012

- How to verify it

Check the file /etc/sysctl.d/gc-thresh.conf exists in a new cluster with:

net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536

- Description for the changelog
Add /etc/sysctl.d/gc-thresh.conf file to bump ARP cache limits.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Nov 29, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Nov 29, 2023
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jmencak: This pull request references Jira Issue OCPBUGS-24012, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (liqcui@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 29, 2023
@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Nov 29, 2023

/retest

Copy link
Copy Markdown
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in understanding that for most clusters, we don't hit these thresholds, so it's mostly a safe change?

Can approve from the MCO code standpoint but it would be best if someone could test whether this is working properly

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 30, 2023
@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Nov 30, 2023

Am I correct in understanding that for most clusters, we don't hit these thresholds, so it's mostly a safe change?

That's correct. These sysctls are not needed on regular-sized clusters. The issues arise with 100s of nodes. With OpenShiftSDN, I also remember large number of routes were also causing the same issues, but I cannot reproduce the "large route number" issue with OVNKubernetes, so it seems only large number of nodes issue.

@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Dec 4, 2023

@dagrayvid , you're most familiar with this given the work on optional NTO and #3440

Can I have a review of this, please? Thank you.

@dagrayvid
Copy link
Copy Markdown
Contributor

It looks good to me.

I wonder if we want to standardize on a file name scheme for the sysctl .conf files in this directory? I see one that is named <name>-conf.yaml, two that are named <name>.yaml, and three that are named <name>.conf.yaml.

It's not urgent to fix this in this PR.

@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Dec 4, 2023

It looks good to me.

I wonder if we want to standardize on a file name scheme for the sysctl .conf files in this directory? I see one that is named <name>-conf.yaml, two that are named <name>.yaml, and three that are named <name>.conf.yaml.

It's not urgent to fix this in this PR.

Good point, @dagrayvid. Not sure on what would the MCO team want to standardize on. At this point, I'm aiming to get the fix in for 4.15.0 and we're running out of time. The cleanup should be a part of another PR in my view.

@dagrayvid
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2023
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Dec 4, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dagrayvid, jmencak, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD cc572a0 and 2 for PR HEAD 5588797 in total

@yuqi-zhang
Copy link
Copy Markdown
Contributor

The naming schema fortunately doesn't affect actual processing at all, so that's a minor point we can clean up in a later PR.

There's a known hypershift failure, once that's fixed this should be able to merge.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD f691dcd and 1 for PR HEAD 5588797 in total

@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Dec 5, 2023

/retest

@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Dec 5, 2023

/test e2e-hypershift

@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Dec 6, 2023

/retest

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 57c1b4e and 0 for PR HEAD 5588797 in total

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

/hold

Revision 5588797 was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2023
@jmencak
Copy link
Copy Markdown
Contributor Author

jmencak commented Dec 7, 2023

/hold cancel
/retest

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2023
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Dec 7, 2023

@jmencak: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit db97b28 into openshift:master Dec 7, 2023
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@jmencak: Jira Issue OCPBUGS-24012: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-24012 has been moved to the MODIFIED state.

Details

In response to this:

- What I did

The net.ipv[46].neigh.default.gc_thresh[1-3] have traditionally been set by the Node Tuning Operator. With OpenShift 4.13, NTO is now an optional operator. These sysctls tune kernel's ARP cache. For large scale clusters, the default kernel parameters are too low and result in issues such as rhbz#1384746. In the absence of NTO and customer clusters getting larger, these sysctls need to be set elsewhere.

Setting these values early on cluster boot (prior to container startups) will also prevent issues such as OCPBUGS-24012.

Fixes: OCPBUGS-24012

- How to verify it

Check the file /etc/sysctl.d/gc-thresh.conf exists in a new cluster with:

net.ipv4.neigh.default.gc_thresh1=8192
net.ipv4.neigh.default.gc_thresh2=32768
net.ipv4.neigh.default.gc_thresh3=65536
net.ipv6.neigh.default.gc_thresh1=8192
net.ipv6.neigh.default.gc_thresh2=32768
net.ipv6.neigh.default.gc_thresh3=65536

- Description for the changelog
Add /etc/sysctl.d/gc-thresh.conf file to bump ARP cache limits.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jmencak jmencak deleted the 4.15-gc-thresh branch December 7, 2023 09:36
@openshift-bot
Copy link
Copy Markdown
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-proxy-pull-test-container-v4.16.0-202312071150.p0.gdb97b28.assembly.stream for distgit openshift-proxy-pull-test.
All builds following this will include this PR.

@openshift-merge-robot
Copy link
Copy Markdown
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-12-07-225558

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants