add infra to cpou upgrade #58965
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
1cf926b to f6a277b
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
The first 24-node job failed: the chainupgrade-toimage step timed out. The TIMEOUT env in the cucushift-chainupgrade-toimage ref defaults to 120, and the upgrade was 97% complete when the timeout hit.
* Increased TIMEOUT to 150 and retried: the test also failed.
* Increased TIMEOUT to 240 and retried: the test also failed. |
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@qiliRedHat, If the problem persists, please contact Test Platform. |
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
435d37f to cf212e3
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
The failure is not related to TIMEOUT. The machine-config cluster operator is degraded because 'MachineConfigPool infra has not progressed to latest configuration'. |
|
Opened a bug https://issues.redhat.com/browse/OCPBUGS-45045 |
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-24nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
fb078b7 to d4c3876
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-120nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
ITERATION_MULTIPLIER_ENV: "6"
MAX_UNAVAILABLE_WORKER: "3"
MCO_CONF_DAY2_CUSTOM_MCP: '[{"mcp_name": "infra"}]'
PAUSED_MCP_NAME: worker
I see that only the worker nodes are paused, so this doesn't look like a control-plane-only update, right?
@jiajliu Yes, the job runs a control-plane-only update.
So should both the worker and infra MCPs be paused?
@jiajliu I only paused the worker MCP (PAUSED_MCP_NAME: worker).
MCO_CONF_DAY2_CUSTOM_MCP: '[{"mcp_name": "infra"}]' is there to override expected_mcp, which by default only allows the master and worker MCPs: https://github.com/openshift/release/blob/master/ci-operator/step-registry/cucushift/upgrade/cpou/pause-worker-mcp/cucushift-upgrade-cpou-pause-worker-mcp-commands.sh#L45-L47. That step will 'check all actual mcp, if any of them unknown then break the job.'
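For readers following the thread, the check described above can be sketched roughly as below. This is a Python illustration only, not the shell logic in the linked step: the function names `expected_mcps` and `unknown_mcps` are hypothetical, and the only things taken from this thread are the env value and the default of allowing just master and worker.

```python
import json

# Default pools the check accepts, per the comment above.
DEFAULT_EXPECTED_MCPS = {"master", "worker"}


def expected_mcps(custom_mcp_json=""):
    """Return the allowed pool names: the defaults plus any custom pools.

    custom_mcp_json mirrors the MCO_CONF_DAY2_CUSTOM_MCP value,
    a JSON list of objects with an "mcp_name" key.
    """
    expected = set(DEFAULT_EXPECTED_MCPS)
    if custom_mcp_json:
        for entry in json.loads(custom_mcp_json):
            expected.add(entry["mcp_name"])
    return expected


def unknown_mcps(actual, custom_mcp_json=""):
    """Return pools present on the cluster but not in the expected set."""
    return sorted(set(actual) - expected_mcps(custom_mcp_json))


actual = ["master", "worker", "infra"]
print(unknown_mcps(actual))                             # ['infra']
print(unknown_mcps(actual, '[{"mcp_name": "infra"}]'))  # []
```

With the defaults, an infra pool is reported as unknown, which is the case where the job would break; declaring it through the env value empties the list.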
I only paused worker mcp PAUSED_MCP_NAME: worker
Hmm, for a control-plane-only update, all non-master MCPs should be paused before the upgrade. If your test requirement is a CPOU update, I would expect both infra and worker to be paused.
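The rule stated above (every non-master pool paused before a control-plane-only update) can be expressed as a small pre-flight check. This is a hedged sketch: `unpaused_non_master` is a hypothetical helper, and passing the paused pools as a list is an assumption made for illustration, not the format the CI step uses.

```python
def unpaused_non_master(all_mcps, paused_mcps):
    """Return the non-master pools that are still unpaused, sorted by name."""
    return sorted(set(all_mcps) - {"master"} - set(paused_mcps))


pools = ["master", "worker", "infra"]
print(unpaused_non_master(pools, ["worker"]))           # ['infra']
print(unpaused_non_master(pools, ["worker", "infra"]))  # []
```

An empty result means the cluster matches the control-plane-only expectation; with only worker paused, infra is flagged.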
@jiajliu From the perfscale test team's point of view, either is OK for us. It would be good to get some guidance on what is officially recommended.
My current approach (infra MCP not paused) is based on PR #21175 and the related Jira OTA-448: Add upgrade tests for a cluster with infra nodes. From the description of PR #21175, my understanding is that the infra MCP is expected to be unpaused.
I will start a Slack thread between the PR owner and us to clarify it.
this part lgtm |
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-120nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
In the new test job, with the infra MCP paused as well, the 2h timeout is not enough. I'll extend it to 2h30m. |
extend the timeout of cucushift-upgrade-cpou-unpause-worker-mcp from 1h10m to 2h to support 120 worker nodes
fa3a741 to 4e3ba99
|
/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-cpou-loaded-upgrade-from-4.16-loaded-upgrade-416to418-120nodes |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
[REHEARSALNOTIFIER]
A total of 529 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs. |
|
/pj-rehearse ack |
|
@qiliRedHat: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
lgtm |
|
/lgtm |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jiajliu, liqcui, qiliRedHat |
|
@qiliRedHat: all tests passed! |
* add infra to cpou upgrade
  extend the timeout of cucushift-upgrade-cpou-unpause-worker-mcp from 1h10m to 2h to support 120 worker nodes
* remove required-for-upgrade and pause infra mcp
* update the timeout to 2h30m for 120 workers and 3 infra nodes cpou upgrade
https://issues.redhat.com/browse/OCPQE-27104