@hasbro17 hasbro17 commented Sep 26, 2022

Adds the etcd-scaling workflow for aws/azure/gcp/vsphere which runs the openshift/etcd/scaling test suite.

The following presubmits are added to master/4.13/4.12 branches for openshift/cluster-etcd-operator CI:

  • e2e-aws-ovn-etcd-scaling
    • mandatory
  • e2e-azure-ovn-etcd-scaling
    • optional
  • e2e-gcp-ovn-etcd-scaling
    • optional
  • e2e-vsphere-ovn-etcd-scaling
    • optional

Along with nightly jobs for aws/azure/gcp/vsphere for 4.12/4.13

This is required since the etcd vertical scaling test is being moved out of openshift/conformance/serial into openshift/etcd/scaling in order to reduce disruptions in the serial suite.
See openshift/origin#27444

@hasbro17
Contributor Author

/cc @tjungblu @dgoodwin

@openshift-ci openshift-ci bot requested review from dgoodwin and tjungblu September 26, 2022 21:31
@stbenjam
Member

I think it probably makes sense to add a periodic to openshift/release running on some interval - once a day? With the intention of graduating them to payload informers. You could open it as a separate PR

@hasbro17
Contributor Author

Yeah good idea. I was also thinking of periodics to cover all the required platforms. I'll do a follow up once I know these steps are correctly running the new openshift/etcd/scaling suite.

@stbenjam
Member

You may also want to move the 4.11 jobs to another PR as well - we'll have to backport the origin PR first before those can work.

Contributor

also add @wking, I'm sure he also will have to go and plumb some stuff there once in a while.

Contributor

Feel free to add the update in your follow-up PR; let's get this through first ;)

@tjungblu
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2022
Contributor

@Elbehery Elbehery left a comment

/lgtm

nice work @hasbro17

Contributor

Do we need a newline here, as in the other files?

@stbenjam
Member

/approve
/hold

Pending merge of the origin side, feel free to remove hold when ready

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 27, 2022
@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from 6f19d5d to b0df5ab Compare September 27, 2022 21:25
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2022
@hasbro17
Contributor Author

Since we're still waiting for CI to go through on the origin PR, I've updated this one to remove the 4.11 presubmits until we get that backported in origin.

@tjungblu if you could retag again that would be great.

@tjungblu
Contributor

yes sir!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 28, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2022
@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from b0df5ab to dc6b4e4 Compare October 4, 2022 21:32
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2022
@hasbro17
Contributor Author

hasbro17 commented Oct 4, 2022

@stbenjam I've also lumped in the changes for nightly periodics for the scaling job but let me know if you'd prefer that to be a follow up.
a51a985

Also, since these aren't informing jobs, I'm not exactly sure where they would show up.
Not in the release status or testgrid right?
e.g: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.nightly/release/4.12.0-0.nightly-2022-10-04-081353

Just wondering how we can monitor and get alerted on these jobs, as other repos like the CPMSO would likely break these again in the future.

@stbenjam
Member

stbenjam commented Oct 5, 2022

I think it's fine to keep it all here. The periodics on the openshift/release repo will show up in Sippy, TRT can also configure monitoring for you, maybe on the 7 day pass rate. Once the job is proven stable, you can graduate to be a release informer.

Drop a card in https://issues.redhat.com/secure/CreateIssue.jspa?pid=12323832&issuetype=17 with 1) names of the periodic jobs, 2) channel to send the alert to, 3) reasonable 7 day threshold (maybe 60% to start?)
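
The alerting suggestion above boils down to a pass-rate check over a trailing 7-day window. A minimal sketch of that arithmetic, assuming job results are available as (finish time, passed) pairs; the `JobRun` type and both function names are hypothetical, and this is not a description of how Sippy or TRT's monitoring is actually implemented:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    # Hypothetical record for one periodic job run.
    finished: datetime
    passed: bool


def pass_rate(runs, now, window_days=7):
    """Fraction of runs that passed in the trailing window, or None if there were none."""
    cutoff = now - timedelta(days=window_days)
    recent = [r for r in runs if cutoff <= r.finished <= now]
    if not recent:
        return None
    return sum(r.passed for r in recent) / len(recent)


def should_alert(runs, now, threshold=0.60):
    """True when the 7-day pass rate is known and below the threshold (60% as suggested above)."""
    rate = pass_rate(runs, now)
    return rate is not None and rate < threshold
```

With a nightly job this window holds roughly seven runs, so a 60% threshold tolerates about two failures per week before alerting.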

@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from a51a985 to 1038a91 Compare October 5, 2022 21:13
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 10, 2022
@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from 0918eff to 0b6cf97 Compare November 14, 2022 00:44
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 14, 2022
@hasbro17
Contributor Author

/test pj-rehearse

@hasbro17
Contributor Author

/hold

I'm expecting that some early/late tests might fail for the vertical scaling test.
In that case I will likely add an additional test suite and workflow that doesn't run those tests, so we can use that as a mandatory presubmit on the cluster-etcd-operator repo PRs while we debug those early/late test failures in the periodics.

Comment on lines 818 to 963
Contributor Author

Also not sure why make jobs is now marking all jobs as always run. It wasn't doing that the last time I updated these.
Possibly an update in ci-operator, but I need to confirm what to tweak in the presubmits so that these are not always run. I don't want to just edit these back manually.

Comment on lines 1011 to 1020
Contributor Author

Also not sure why make jobs is pruning these now.

@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from 0b6cf97 to 91165ac Compare November 17, 2022 20:35
@hasbro17
Contributor Author

Fixed the extraneous make job changes. Now to rehearse the new presubmits again:

/test pj-rehearse

@stbenjam
Member

/lgtm

The kubePodNotReady failures look like something TRT is currently investigating, but I think the rehearsals look good otherwise?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 18, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 21, 2022
@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from 91165ac to 830ff97 Compare November 21, 2022 22:05
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 21, 2022
@hasbro17
Contributor Author

hasbro17 commented Nov 21, 2022

Taking a closer look at some of the rehearse jobs:
periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-etcd-scaling
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/32623/rehearse-32623-periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-etcd-scaling/1577769995419521024

error: suite "openshift/etcd/scaling" does not exist
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-10-05T22:05:49Z"}
error: failed to execute wrapped command: exit status 1 
INFO[2022-10-05T22:05:55Z] Step e2e-aws-etcd-scaling-openshift-e2e-test failed after 1m30s.

But the test suite exists for other jobs, e.g:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/32623/rehearse-32623-pull-ci-openshift-cluster-etcd-operator-master-e2e-azure-etcd-scaling/1577769993800519680

Oct 05 22:41:33.950 I e2e-test/"[sig-etcd][Feature:EtcdVerticalScaling] etcd [apigroup:config.openshift.io] is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io] [Suite:openshift/conformance/parallel]" started
Oct 05 22:41:33.950 - 2769s I e2e-test/"[sig-etcd][Feature:EtcdVerticalScaling] etcd [apigroup:config.openshift.io] is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io] [Suite:openshift/conformance/parallel]" e2e test finished As "Passed"

And more confusingly, the test is showing the Suite:openshift/conformance/parallel tag even though the openshift/etcd/scaling suite is definitely there on 4.12:
https://github.com/openshift/origin/blob/release-4.12/cmd/openshift-tests/e2e.go#L431

Perhaps because it's the nightly that's outdated?

INFO[2022-10-05T21:17:43Z] Resolved release initial to registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-04-081353
INFO[2022-10-05T21:17:43Z] Resolved release latest to registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-05-053337
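
One quick way to confirm which suite a given test binary assigned to the test is to pull the bracketed tags out of the test name in the job log. A minimal sketch, assuming the tag format seen in the log lines above; the `suite_tags` helper is hypothetical and not part of openshift-tests:

```python
import re

# Matches one bracketed tag such as [Suite:openshift/conformance/parallel].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def suite_tags(test_name):
    """Return the values of all [Suite:...] tags found in an e2e test name."""
    tags = TAG_RE.findall(test_name)
    return [t.split(":", 1)[1] for t in tags if t.startswith("Suite:")]


# Test name copied from the azure rehearsal log quoted above.
name = (
    "[sig-etcd][Feature:EtcdVerticalScaling] etcd [apigroup:config.openshift.io] "
    "is able to vertically scale up and down with a single node "
    "[Timeout:60m][apigroup:machine.openshift.io] [Suite:openshift/conformance/parallel]"
)
print(suite_tags(name))
```

If the tag reported here does not mention the new suite, that would be consistent with the payload's test binary predating the suite change, which is the hypothesis raised above.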

This commit adds new workflows to run the etcd-scaling test suite
on aws/gcp/azure/vsphere platforms.
This also adds the presubmits to run these jobs as CI for the cluster-etcd-operator repo
as well as periodic jobs.
@hasbro17 hasbro17 force-pushed the add-etcd-scaling-job-step branch from 830ff97 to 39b2540 Compare November 22, 2022 22:31
@hasbro17
Contributor Author

@stbenjam I think we can merge this now if you want to tag again.

/test pj-rehearse
/unhold

The rehearse jobs look good to me. The test itself is passing on all platforms for both pull-ci-* and periodic-ci-* jobs, although certain platforms like gcp and vsphere can fail due to the issue of pathological events repeating or some other invariant failures that require further looking into.

Only puzzling thing is the absence of rehearse runs for the 4.13 nightly periodic gcp and azure jobs i.e:

  • rehearse-32623-periodic-ci-openshift-release-master-nightly-4.13-e2e-azure-ovn-etcd-scaling
  • rehearse-32623-periodic-ci-openshift-release-master-nightly-4.13-e2e-gcp-ovn-etcd-scaling

https://prow.ci.openshift.org/pr-history/?org=openshift&repo=release&pr=32623
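
Spotting absent rehearsals like the two above can be done by comparing the expected job matrix against the names seen on the PR history page. A minimal sketch, assuming the name pattern of the two jobs listed above and an `observed` list collected by hand; both function names are hypothetical and no Prow API is used:

```python
from itertools import product


def expected_rehearsals(pr, versions, platforms):
    """Build the expected periodic rehearsal job names for every version/platform pair."""
    return {
        f"rehearse-{pr}-periodic-ci-openshift-release-master-nightly-{v}-e2e-{p}-ovn-etcd-scaling"
        for v, p in product(versions, platforms)
    }


def missing_rehearsals(expected, observed):
    """Return the expected job names that never appeared, sorted for stable output."""
    return sorted(set(expected) - set(observed))


expected = expected_rehearsals(32623, ["4.12", "4.13"], ["aws", "azure", "gcp", "vsphere"])
# Hypothetical observation: every job ran except the 4.13 azure and gcp periodics.
observed = [n for n in expected if not ("4.13" in n and ("azure" in n or "gcp" in n))]
for job in missing_rehearsals(expected, observed):
    print(job)
```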

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 22, 2022
@stbenjam
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented Nov 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery, hasbro17, stbenjam, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam
Member

/pj-rehearse ack

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Nov 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented Nov 23, 2022

@hasbro17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/openshift/cluster-etcd-operator/release-4.11/e2e-vsphere-etcd-scaling 6f19d5d823367ab35b3966889fdfa548572e9853 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.11/e2e-gcp-etcd-scaling 6f19d5d823367ab35b3966889fdfa548572e9853 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.11/e2e-azure-etcd-scaling 6f19d5d823367ab35b3966889fdfa548572e9853 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.11/e2e-aws-etcd-scaling 6f19d5d823367ab35b3966889fdfa548572e9853 link unknown /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.12-e2e-gcp-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-azure-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-vsphere-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.12-e2e-azure-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-aws-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.12/e2e-azure-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-gcp-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.12/e2e-aws-etcd-scaling 1038a911de29a706af40e03658154e1ccfd9fe72 link unknown /test pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/release-4.12/e2e-azure-ovn-etcd-scaling 0918eff6c2c3ca239779f4ce6ec98f09ae003d9d link unknown /test pj-rehearse
ci/prow/multi-arch-gen-valid 0918eff6c2c3ca239779f4ce6ec98f09ae003d9d link true /test multi-arch-gen-valid
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-agnostic-ovn-upgrade 0b6cf97aa02b533685de7b70d1940a1dd00b4b5c link unknown /pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-aws-disruptive 0b6cf97aa02b533685de7b70d1940a1dd00b4b5c link unknown /pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-aws-disruptive-ovn 0b6cf97aa02b533685de7b70d1940a1dd00b4b5c link unknown /pj-rehearse
ci/prow/pr-reminder-config 91165acbaf2614ea7cbeac336aaeea3b32bf9cce link true /test pr-reminder-config
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-ovn-etcd-scaling 39b2540 link unknown /pj-rehearse
ci/rehearse/openshift/cluster-etcd-operator/master/e2e-vsphere-ovn-etcd-scaling 39b2540 link unknown /pj-rehearse
ci/rehearse/periodic-ci-openshift-release-master-nightly-4.13-e2e-vsphere-ovn-etcd-scaling 39b2540 link unknown /pj-rehearse


@openshift-merge-robot openshift-merge-robot merged commit 8dc18b8 into openshift:master Nov 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented Nov 23, 2022

@hasbro17: Updated the following 7 configmaps:

  • ci-operator-4.13-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-etcd-operator-release-4.13.yaml using file ci-operator/config/openshift/cluster-etcd-operator/openshift-cluster-etcd-operator-release-4.13.yaml
  • job-config-master-presubmits configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-etcd-operator-master-presubmits.yaml using file ci-operator/jobs/openshift/cluster-etcd-operator/openshift-cluster-etcd-operator-master-presubmits.yaml
  • job-config-4.12 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-etcd-operator-release-4.12-presubmits.yaml using file ci-operator/jobs/openshift/cluster-etcd-operator/openshift-cluster-etcd-operator-release-4.12-presubmits.yaml
  • job-config-4.13 configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-etcd-operator-release-4.13-presubmits.yaml using file ci-operator/jobs/openshift/cluster-etcd-operator/openshift-cluster-etcd-operator-release-4.13-presubmits.yaml
  • job-config-master-periodics configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-release-master-periodics.yaml using file ci-operator/jobs/openshift/release/openshift-release-master-periodics.yaml
  • ci-operator-master-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-etcd-operator-master.yaml using file ci-operator/config/openshift/cluster-etcd-operator/openshift-cluster-etcd-operator-master.yaml
    • key openshift-release-master__nightly-4.12.yaml using file ci-operator/config/openshift/release/openshift-release-master__nightly-4.12.yaml
    • key openshift-release-master__nightly-4.13.yaml using file ci-operator/config/openshift/release/openshift-release-master__nightly-4.13.yaml
  • ci-operator-4.12-configs configmap in namespace ci at cluster app.ci using the following files:
    • key openshift-cluster-etcd-operator-release-4.12.yaml using file ci-operator/config/openshift/cluster-etcd-operator/openshift-cluster-etcd-operator-release-4.12.yaml


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged

6 participants