OTA-740: ci-operator/step-registry/ipi/conf/telemetry: Restore Telemetry #32249
Effectively neutralizing 3c1da8e (OTA-740: ci-operator/step-registry/ipi/conf/telemetry: Disable Telemetry (openshift#32153), 2022-09-13), until we teach origin's test case to understand this disabling:

    $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=6h&type=junit&name=^periodic-&search=Prometheus+when+installed+on+the+cluster+should+report+telemetry+if+a+cloud.openshift.com+token+is+present' | grep 'failures match' | sort
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-ovn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-ovn-serial-aws-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-ppc64le-powervs (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-serial-aws-arm64 (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-image-ecosystem-aws-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-upgrade-from-nightly-4.10-ocp-e2e-aws-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.11-upgrade-from-stable-4.10-ocp-e2e-aws-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-e2e-aws-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-azure-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-azure-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.11-e2e-aws-cgroupsv2 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-cgroupsv2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-crun (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 10 runs, 100% failed, 20% of failures match = 20% impact
    periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-rt-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.7-e2e-aws-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.7-e2e-azure-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-ci-4.7-e2e-gcp-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-alibaba (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-fips (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-proxy (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-single-node-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-azure (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-gcp-rt (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-e2e-vsphere-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-single-node-workers (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-e2e-vsphere-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-openshift-e2e-aws-single-node-workers-upgrade-conformance (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-cgroupsv2 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-crun (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
    periodic-ci-openshift-release-master-nightly-4.12-e2e-azure-sdn-fips-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-ovn-techpreview-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-sdn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-aws (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-fips (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-proxy (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-serial (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-azure (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp-rt (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.7-e2e-vsphere-serial (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
    periodic-ci-openshift-release-master-okd-4.12-e2e-aws-ovn (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

With failures like [1]:

    : [sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
    Run #0: 1m1s
    {  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:465]: Unexpected error:
        <errors.aggregate | len:2, cap:2>: [
            {
                s: "promQL query returned unexpected results:\nmetricsclient_request_send{client=\"federate_to\",job=\"telemeter-client\",status_code=\"200\"} >= 1\n[]",
            },
            {
                s: "promQL query returned unexpected results:\nfederate_samples{job=\"telemeter-client\"} >= 10\n[]",
            },
        ]
        [promQL query returned unexpected results:
        metricsclient_request_send{client="federate_to",job="telemeter-client",status_code="200"} >= 1
        [], promQL query returned unexpected results:
        federate_samples{job="telemeter-client"} >= 10
        []]
    occurred

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-crun/1569617920785387520
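As a rough aid for reading the search listing above, the per-job summary lines can be tallied by impact. This is only a sketch and not part of the original change: the function name `tally_impact` is hypothetical, and it assumes the `grep 'failures match'` lines arrive on stdin in the format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original PR: count summary lines
# reporting "= 100% impact" separately from lines reporting a lower impact.
# Input: the 'failures match' lines from the search listing, on stdin.
tally_impact() {
  awk '
    /failures match = 100% impact/    { full++; next }
    /failures match = [0-9]+% impact/ { partial++ }
    END { printf "full=%d partial=%d\n", full, partial }
  '
}

# Usage example with two lines copied from the listing above.
printf '%s\n' \
  'periodic-ci-openshift-release-master-nightly-4.10-e2e-aws (all) - 3 runs, 100% failed, 100% of failures match = 100% impact' \
  'periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-crun (all) - 2 runs, 100% failed, 50% of failures match = 50% impact' \
  | tally_impact
# prints: full=1 partial=1
```

Piping the full `w3m ... | grep 'failures match'` output into the same function would give the fleet-wide split between jobs where every failure matched and jobs where only some did.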
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stbenjam, wking
/override ci/prow/build-clusters

This PR is critical and we cannot wait for the fix of the build-cluster job.
@hongkailiu: Overrode contexts on behalf of hongkailiu: ci/prow/build-clusters
…Telemetry for e2e-aws (#32252)

I'd disabled Telemetry for the bulk of the CI fleet in 3c1da8e (OTA-740: ci-operator/step-registry/ipi/conf/telemetry: Disable Telemetry (#32153), 2022-09-13). But that led to many failures for:

    Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present

so I'd flipped the default to keeping Telemetry enabled in d61129c (ci-operator/step-registry/ipi/conf/telemetry: Restore Telemetry (#32249), 2022-09-13).

Now I'm looking to teach the origin test-case skip about the mechanism I used to disable Telemetry, and I want an origin master presubmit with Telemetry disabled. The only run_if_changed origin master presubmits are e2e-gcp-builds, e2e-aws-jenkins, e2e-gcp-image-ecosystem, and e2e-aws-image-registry, and none of those sound like jobs that will run the test-case I'm interested in (although maybe they do; I haven't dug in to confirm). But e2e-aws is optional, so having the presubmit temporarily failing for other origin master pull requests won't block changes from landing.

We'll revert this change and return the job to the CI-wide Telemetry default once we've confirmed that the test-case skips are smart enough.
* ci-operator/step-registry/ipi/conf/telemetry: Disable by default (again)

We'd tried this previously in 3c1da8e (OTA-740: ci-operator/step-registry/ipi/conf/telemetry: Disable Telemetry (#32153), 2022-09-13), but had to roll it back with d61129c (ci-operator/step-registry/ipi/conf/telemetry: Restore Telemetry (#32249), 2022-09-13), to avoid failing:

    Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present

Subsequently, openshift/origin@76652fa4fa (test/extended/prometheus: Consider telemeterClient.enabled, 2022-09-15, openshift/origin#27422) taught that test-case about the config knob this step uses to disable Telemetry. Those test-case changes are present in origin test suites starting in 4.12:

    $ for Y in $(seq 11 15); do git --no-pager grep 'should report telemetry' "origin/release-4.${Y}" -- test/extended/prometheus/prometheus.go; done
    origin/release-4.11:test/extended/prometheus/prometheus.go: g.It("should report telemetry if a cloud.openshift.com token is present [Late]", func() {
    origin/release-4.12:test/extended/prometheus/prometheus.go: g.It("should report telemetry [Late]", func() {
    origin/release-4.13:test/extended/prometheus/prometheus.go: g.It("should report telemetry [Late]", func() {
    origin/release-4.14:test/extended/prometheus/prometheus.go: g.It("should report telemetry [Serial] [Late]", func() {
    origin/release-4.15:test/extended/prometheus/prometheus.go: g.It("should report telemetry [Serial] [Late]", func() {

and 4.10 has been end-of-life since 2023-09-10 [1]. That leaves tests using 4.11 versions of the origin suite, and I'm addressing those via the JOB_NAME checks [2,3]. The checks are brittle, leaving out 4.9 and earlier, and possibly not matching some 4.11 jobs, but they will hopefully be sufficient to get us through until 4.11 goes end-of-life on 2024-02-10 [1]. And when the defaulting logic breaks down, jobs that have an opinion can set TELEMETRY_ENABLED explicitly to match their needs.

[1]: https://access.redhat.com/support/policy/updates/openshift/#dates
[2]: https://docs.ci.openshift.org/docs/architecture/step-registry/#available-environment-variables
[3]: https://docs.prow.k8s.io/docs/jobs/#job-environment-variables

* Make 4.11 regex pass shellcheck

---------

Co-authored-by: W. Trevor King <wking@tremily.us>
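The defaulting described in the commit message above (an explicit TELEMETRY_ENABLED wins, 4.11 jobs keep Telemetry, everything else is disabled) might look roughly like the sketch below. This is not the step's actual script: the function names and the 4.11 regex are illustrative assumptions, and placing the `telemeterClient.enabled` knob in the `cluster-monitoring-config` ConfigMap in `openshift-monitoring` is likewise an assumption about where that setting is read from.

```shell
#!/usr/bin/env bash
# Sketch only. The function names and the 4.11 regex are illustrative
# assumptions, not the step's actual implementation.

# Print the effective Telemetry setting ("true" or "false") for a job.
#   $1: job name (for example, the JOB_NAME that Prow exports)
#   $2: explicit TELEMETRY_ENABLED value, empty when the job has no opinion
telemetry_enabled_for() {
  local job_name="${1:-}" explicit="${2:-}"
  # Assumed pattern for job names that mention 4.11.
  local re_411='(^|[^0-9])4\.11([^0-9]|$)'

  if [[ -n "${explicit}" ]]; then
    printf '%s\n' "${explicit}"   # an explicit setting always wins
  elif [[ "${job_name}" =~ ${re_411} ]]; then
    printf 'true\n'               # 4.11 suites still expect Telemetry
  else
    printf 'false\n'              # everything else: disabled by default
  fi
}

# Print a monitoring config that turns the telemeter client off.
# Assumption: the knob lives in the cluster-monitoring-config ConfigMap.
telemetry_disabled_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    telemeterClient:
      enabled: false
EOF
}

# Usage: only emit the manifest when Telemetry ends up disabled.
if [[ "$(telemetry_enabled_for "${JOB_NAME:-}" "${TELEMETRY_ENABLED:-}")" == "false" ]]; then
  telemetry_disabled_manifest
fi
```

A name-based check like this shares the brittleness the commit message admits to: a job named `...-4.12-upgrade-from-stable-4.11-...` also matches the assumed pattern, whichever origin suite it actually runs, which is why an explicit TELEMETRY_ENABLED remains the escape hatch.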