Skip to content

Conversation

@derekhiggins
Copy link
Contributor

@derekhiggins derekhiggins commented Nov 28, 2022

Test for allowedDisruption duration early in test and skip
if not present.

-- old
Returning nil (when no historical data is present) causes a nil pointer dereference in "Image registry" disruption tests expecting a value.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Nov 28, 2022
@openshift-ci-robot
Copy link

@derekhiggins: This pull request references Jira Issue OCPBUGS-4190, which is invalid:

  • expected the bug to target the "4.13.0" version, but it targets "4.13" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Returning nil (when no historical data is present) causes a nil pointer dereference in "Image registry" disruption tests expecting a value.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from bparees and deads2k November 28, 2022 15:59
@derekhiggins
Copy link
Contributor Author

cc @soltysh @dgoodwin

@derekhiggins
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 28, 2022
@openshift-ci-robot
Copy link

@derekhiggins: This pull request references Jira Issue OCPBUGS-4190, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dgoodwin
Copy link
Contributor

/hold

I don't think this will work out, absence of disruption data usually means we lack the runs to make a determination, it's an explicit state very different from just assuming one second. We'll surpass one second almost all the time, failing jobs widely.

Whatever test is using the nil is probably where you need to update so it knows how to understand it.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2022
@dgoodwin
Copy link
Contributor

The jira unfortunately does not link to any job runs or include any stack traces, guessing it was around line 186? It looks like that is where the fix is needed.

@dgoodwin
Copy link
Contributor

dgoodwin commented Nov 28, 2022

It looks like I missed this code path in: 9d2719f but you can see the approach taken in the synthetic tests, and we should be able to do similar around line 186.

ExpectNoDisruptionForDuration(
		f,
		*allowedDisruption,
		end.Sub(start),
		events,
		fmt.Sprintf("%s was unreachable during disruption: %v", t.backend.GetLocator(), disruptionDetails),
	)

Thanks for noticing and going after a fix Derek.

@openshift-ci-robot
Copy link

@derekhiggins: This pull request references Jira Issue OCPBUGS-4190, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Test for allowedDisruption duration early in test and skip
if not present.

-- old
Returning nil (when no historical data is present) causes a nil pointer dereference in "Image registry" disruption tests expecting a value.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


allowedDisruption, disruptionDetails, err := t.getAllowedDisruption(f)
if allowedDisruption == nil{
framework.Logf(fmt.Sprintf("Skipping: %s: No historical data", t.testName))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried calling skipper.Skipf(.... here but this resulted in the entire upgrade disruption test being skipped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the e2eskipper won't work here. You would have to write a FrameworkSkipf in test/extended/util/disruption/disruption.go to generate a skipped test, which probably requires adding that functionality to TestSummaries.

@derekhiggins
Copy link
Contributor Author

The jira unfortunately does not link to any job runs or include any stack traces, guessing it was around line 186? It looks like that is where the fix is needed.

Yup, it was line 186,
*allowedDisruption,

It looks like I missed this code path in: 9d2719f but you can see the approach taken in the synthetic tests, and we should be able to do similar around line 186.

I couldn't tell how to skip the test when it had already started without skipping the entire set of tests, so I ended up
returning from the goroutine early, happy to try something else if you can point me in the right direction.

Thanks for noticing and going after a fix Derek.

no prob

@derekhiggins
Copy link
Contributor Author

/hold
handling of err needs to be fixed.

Also the seems to trigger on aws jobs (from pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade)
Nov 29 11:48:56.019: INFO: Skipping: [sig-imageregistry] Image registry remains available using reused connections: No historical data

stopCh := make(chan struct{})
defer close(stopCh)

allowedDisruption, disruptionDetails, err := t.getAllowedDisruption(f)
Copy link
Contributor Author

@derekhiggins derekhiggins Nov 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to fix this "err" isn't being checked in the correct place.

@stbenjam
Copy link
Member

The jira unfortunately does not link to any job runs or include any stack traces, guessing it was around line 186? It looks like that is where the fix is needed.

Here's an example I stumbled across https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-upgrade-ovn-ipv6/1597863031415508992

I attempted a partial fix of the problem of alwaysAllowOneSecond returning nil #27574, but after looking at this I think something like this approach is correct, and if we get nil that means no data.

@dgoodwin
Copy link
Contributor

dgoodwin commented Dec 2, 2022

/lgtm
/approve
/hold

Held just so someone can check the optional presubmits, if they look healthy feel free to clear. I will try to remember.

Thanks again for fixing this Derek.

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 2, 2022
@derekhiggins
Copy link
Contributor Author

Held just so someone can check the optional presubmits, if they look healthy feel free to clear. I will try to remember.

I'm still trying to figure out why my skip triggered on e2e-gcp-ovn-upgrade, any ideas if this would be expected?
see:

Nov 29 11:59:02.732: INFO: Skipping: [sig-imageregistry] Image registry remains available using new connections: No historical data
Nov 29 11:59:03.277: INFO: Skipping: [sig-imageregistry] Image registry remains available using reused connections: No historical data

in
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27574/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1597547482412224512/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/build-log.txt

I'm deploying a gcp cluster at the moment to see if I can find out

@dgoodwin
Copy link
Contributor

dgoodwin commented Dec 2, 2022

cat ./pkg/synthetictests/allowedbackenddisruption/query_results.json | jq '.[] | select(.Platform == "gcp") | select(.FromRelease == "4.13") | select(.BackendName == "image-registry-new-connections") | select(.Network == "ovn")'

This implies that we do not have the required 100 runs per 3 weeks for that particular combo. We do have minor upgrades though, from 4.12 -> 4.13, but not micro.

Because it's not in the datafile, it is expected this would get skipped with your PR.

Test for allowedDisruption duration early in test and skip
if not present.
@derekhiggins
Copy link
Contributor Author

cat ./pkg/synthetictests/allowedbackenddisruption/query_results.json | jq '.[] | select(.Platform == "gcp") | select(.FromRelease == "4.13") | select(.BackendName == "image-registry-new-connections") | select(.Network == "ovn")'

This implies that we do not have the required 100 runs per 3 weeks for that particular combo. We do have minor upgrades though, from 4.12 -> 4.13, but not micro.

Because it's not in the datafile, it is expected this would get skipped with your PR.

that makes sense but not sure why these weren't failing with the same nil dereference so,

anyways, I've fixed the whitespace in the PR

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2022
@derekhiggins
Copy link
Contributor Author

/retest-required
CI results never got reported

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 5, 2022

@derekhiggins: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn-cmd d7b7134 link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-gcp-csi d7b7134 link false /test e2e-gcp-csi
ci/prow/e2e-aws-ovn-single-node-serial d7b7134 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-openstack-ovn d7b7134 link false /test e2e-openstack-ovn
ci/prow/e2e-metal-ipi-sdn d7b7134 link false /test e2e-metal-ipi-sdn
ci/prow/e2e-metal-ipi-ovn-ipv6 d7b7134 link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-gcp-ovn-rt-upgrade d7b7134 link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-single-node-upgrade d7b7134 link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dgoodwin
Copy link
Contributor

dgoodwin commented Dec 5, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 5, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 5, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekhiggins, dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dgoodwin
Copy link
Contributor

dgoodwin commented Dec 5, 2022

/retest-required

@derekhiggins
Copy link
Contributor Author

/unhold
optional presubmit failures look unrelated

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 6, 2022
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 6fea59e and 2 for PR HEAD 4bc6210 in total

@openshift-merge-robot openshift-merge-robot merged commit 46a1914 into openshift:master Dec 6, 2022
@openshift-ci-robot
Copy link

@derekhiggins: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-4190 has been moved to the MODIFIED state.

Details

In response to this:

Test for allowedDisruption duration early in test and skip
if not present.

-- old
Returning nil (when no historical data is present) causes a nil pointer dereference in "Image registry" disruption tests expecting a value.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@derekhiggins
Copy link
Contributor Author

/cherry-pick release-4.12

@openshift-cherrypick-robot

@derekhiggins: new pull request created: #27610

Details

In response to this:

/cherry-pick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dgoodwin added a commit to dgoodwin/origin that referenced this pull request Jan 6, 2023
Discovered that we are getting no data for image registry since Dec 7th
due to PR openshift#27574 which shut off the entire monitor if we have less than
the required 100 runs to appear in the file of allowances to enforce. I
guess at this time, we did not have enough runs with image registry for
4.13. (which is unusual)

Skip the test, but let the monitor run.
dgoodwin added a commit to dgoodwin/origin that referenced this pull request Feb 28, 2023
Discovered that we are getting no data for image registry since Dec 7th
due to PR openshift#27574 which shut off the entire monitor if we have less than
the required 100 runs to appear in the file of allowances to enforce. I
guess at this time, we did not have enough runs with image registry for
4.13. (which is unusual)

Skip the test, but let the monitor run.
tjungblu pushed a commit to tjungblu/origin that referenced this pull request Apr 11, 2023
Discovered that we are getting no data for image registry since Dec 7th
due to PR openshift#27574 which shut off the entire monitor if we have less than
the required 100 runs to appear in the file of allowances to enforce. I
guess at this time, we did not have enough runs with image registry for
4.13. (which is unusual)

Skip the test, but let the monitor run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants