
Enhance RetryingRouteInconsistency #7390

Closed

mgencur wants to merge 2 commits into knative:master from mgencur:retrying_route_inconsistency

Conversation

@mgencur
Contributor

@mgencur mgencur commented Mar 26, 2020

Fixes a downstream problem with routing traffic to the application. There can be multiple routers that start routing properly at different times. This PR works around that by requiring several consecutive successful responses, which might go through different routers. All of them must work before proceeding.

Fixes #

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Mar 26, 2020
@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Mar 26, 2020
Contributor

@knative-prow-robot knative-prow-robot left a comment


@mgencur: 0 warnings.

Details

In response to this:

  • also update TestSingleConcurrency to provide more logging on failure

Fixes #

Proposed Changes

Release Note


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment thread test/v1/route.go Outdated

// RetryingRouteInconsistency retries common requests seen when creating a new route
func RetryingRouteInconsistency(innerCheck spoof.ResponseChecker) spoof.ResponseChecker {
const neededSuccesses = 5
Contributor


Same as for the thing below, basically: we need this downstream because our routers are inconsistent among themselves too 😂. So we're adding a retry that checks that we've seen enough succeeding requests in a row to assume all routers are good.

Not sure we want this upstream though, as this problem currently doesn't exist here AFAIK.

On a broader note: We might want to discuss having helpers like the ones we need here upstream and configurable. I could imagine other systems having similar issues if they really want to use the Knative E2E tests on a public service for example, with domains etc all setup.

Wondering if we might want to add flags to control this, i.e. --ingress-retriable-status-code="404,503" --ingress-consecutive-requests="5" and Knative upstream's CI just runs with both of these turned off?

Just a thought from the top of my head, but definitely something we might want to discuss and explore I'd think.
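The "enough succeeding requests in a row" idea discussed above can be sketched as a stateful checker wrapper. This is a minimal stand-alone sketch, not the PR's actual code: `Response`, `ResponseChecker`, and `ConsecutiveSuccesses` are hypothetical stand-ins for the `spoof.Response`/`spoof.ResponseChecker` types from `knative.dev/pkg/test/spoof`.

```go
package main

import "fmt"

// Response is a minimal stand-in for spoof.Response.
type Response struct {
	Code int
}

// ResponseChecker mirrors the shape of spoof.ResponseChecker.
type ResponseChecker func(resp *Response) (bool, error)

// ConsecutiveSuccesses wraps an inner checker so it only reports done once the
// inner checker has passed `needed` times in a row; any failure resets the
// streak. Returning (false, nil) tells the polling loop to retry.
func ConsecutiveSuccesses(inner ResponseChecker, needed int) ResponseChecker {
	seen := 0
	return func(resp *Response) (bool, error) {
		ok, err := inner(resp)
		if err != nil || !ok {
			seen = 0 // a failure breaks the streak
			return false, err
		}
		seen++
		return seen >= needed, nil
	}
}

func main() {
	check := ConsecutiveSuccesses(func(r *Response) (bool, error) {
		return r.Code == 200, nil
	}, 3)

	// The 503 in the middle resets the streak, so the check only reports
	// done on the third success *after* the failure.
	for _, code := range []int{200, 503, 200, 200, 200} {
		done, _ := check(&Response{Code: code})
		fmt.Println(done)
	}
}
```

Because the counter resets on any failure, different backend routers answering in turn must all succeed before the check passes, which is the behavior the downstream environment needs.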

@mgencur mgencur force-pushed the retrying_route_inconsistency branch from 8126a6d to 9f3292c Compare March 26, 2020 13:55
@mgencur
Contributor Author

mgencur commented Mar 26, 2020

I've extracted the changes to TestSingleConcurrency into #7392

@knative-test-reporter-robot

The following jobs failed:

Test name Triggers Retries
pull-knative-serving-integration-tests pull-knative-serving-integration-tests 1/3

Automatically retrying due to test flakiness...
/test pull-knative-serving-integration-tests

@markusthoemmes
Contributor

Please add a description as to why this change is needed. I think we should bring this up with the test/productivity working group and see if we can find common ground with these helpers. I completely agree with the sentiment of being willing to run these tests in virtually any environment, but I think we need to be able to relax/tighten requirements here.

@vagababov
Contributor

Martin, are you still pursuing this?

@mgencur
Contributor Author

mgencur commented May 7, 2020

@vagababov Yes. I'd like this to be merged. The function solves a downstream problem, same as before; it is just more robust. So, I don't see a reason not to merge it.
@markusthoemmes does that make sense? I'm not convinced about adding new flags to make it more configurable, so I'd defer next steps to you.

@markusthoemmes
Contributor

We'll need to send this via the respective working group as mentioned above. I don't think we want to generally water down our assertions here because of downstream projects. At least not unconditionally.

@tcnghia
Contributor

tcnghia commented May 15, 2020

@markusthoemmes I am fine with this change
/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 15, 2020
@mattmoor
Member

mattmoor commented Jun 3, 2020

Historically we needed this because of real issues, and I think we stamped out most of them. I'm hesitant to relax our upstream checks because of this, while I understand the downstream pain here. Should we put some additional retry logic under a flag that's off by default?

If anything, I think I'd rather see this adopt an SLO-based approach similar to the probers we run, because this check lets through 100 failures followed by 5 successes.

I think that if we had an SLO we were trying to achieve that was configurable via a flag (default to 100%, no minimum population) then we could simply check until that's satisfied. This could be allowed to tolerate some early failures, so long as over time it rises above some minimum acceptable level.

WDYT?
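The SLO idea floated above can be sketched as a running success-ratio check with a minimum sample population. This is a hypothetical illustration of the suggestion, not the probers Knative actually runs; `sloChecker`, `target`, and `minPopulation` are made-up names.

```go
package main

import "fmt"

// sloChecker tracks a running success ratio and passes once the ratio meets
// the target, provided at least minPopulation samples have been observed.
// Early failures are tolerated as long as the ratio recovers over time.
type sloChecker struct {
	successes, total int
	target           float64 // e.g. 1.0 for "100% of requests must succeed"
	minPopulation    int     // minimum samples before the check can pass
}

// observe records one result and reports whether the SLO is currently met.
func (s *sloChecker) observe(ok bool) bool {
	s.total++
	if ok {
		s.successes++
	}
	if s.total < s.minPopulation {
		return false
	}
	return float64(s.successes)/float64(s.total) >= s.target
}

func main() {
	// Tolerate one early failure as long as the ratio climbs back to 90%.
	c := &sloChecker{target: 0.9, minPopulation: 10}
	done := c.observe(false)
	for i := 0; i < 9 && !done; i++ {
		done = c.observe(true)
	}
	fmt.Println(done) // 9/10 = 0.9 meets the 90% target
}
```

Unlike the consecutive-success approach, this can pass even when a (rare) failure occurs near the end of the run, which is exactly the trade-off debated in the comments that follow.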

@markusthoemmes
Contributor

That matches my sentiment too, yeah. It should be configurable via a flag. Not sure if we need SLO support though. In this case it's actually okay to fail 100 times to then pass 5 consecutive times.

@mattmoor
Member

mattmoor commented Jun 4, 2020

In this case it's actually okay to fail 100 times to then pass 5 consecutive times

Do you mean in the case you folks see downstream or in general?

@markusthoemmes
Contributor

Definitely behind a flag.

@mgencur
Contributor Author

mgencur commented Jun 4, 2020

OK. It seems the consensus is to have it behind a flag. I can see a little problem with the SLO approach in this case because it could pass even if the failures (though infrequent) occur at the end of the checks. At that point we really expect that everything works and is ready.
IMO, a flag such as --ingress-retries="5" would be a good choice. The additional logic would be triggered only if it's defined and positive. This would also ensure that at the very end of our checks everything is ready.

@markusthoemmes
Contributor

Yep, I agree. --ingress-retries is a little confusing though, as it's not the number of retries but the number of consecutive successes.

@mgencur
Contributor Author

mgencur commented Jun 4, 2020

I'm afraid any self-explanatory name will be too long. A comment on the flag will help to clarify the exact meaning. E.g. --ingress-consecutive-successful-requests. Too long?

@mgencur mgencur force-pushed the retrying_route_inconsistency branch from 9f3292c to 1d8c080 Compare June 4, 2020 17:05
@knative-prow-robot
Contributor

New changes are detected. LGTM label has been removed.

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2020
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mgencur
To complete the pull request process, please assign mattmoor
You can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread test/e2e_flags.go Outdated
Comment thread test/v1/route.go Outdated
@mgencur
Contributor Author

mgencur commented Jun 4, 2020

The first impl is above. Another option would be to move the RetryingRouteInconsistency to knative.dev/pkg and put it inside the default response checkers. For example:

func EventuallyMatchesBody(expected string) spoof.ResponseChecker {
	return RetryingRouteInconsistency(
		func(resp *spoof.Response) (bool, error) {
			if !strings.Contains(string(resp.Body), expected) {
				// Returning (false, nil) causes SpoofingClient.Poll to retry.
				return false, nil
			}

			return true, nil
		})
}

This would allow us to delete all occurrences of RetryingRouteInconsistency from all tests in Knative Serving. But we'd need to move the flag too :-/

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2020
@knative-prow-robot
Contributor

@mgencur: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot
Contributor

@mgencur: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-knative-serving-contour-tls 69f5ec0 link /test pull-knative-serving-contour-tls

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@tcnghia
Contributor

tcnghia commented Sep 2, 2020

@mgencur is this PR still needed? can you please rebase? thanks

@mgencur
Contributor Author

mgencur commented Sep 2, 2020

@tcnghia my impression was that this is not desired, so I haven't rebased lately. Any comment on that, @markusthoemmes?

@tcnghia
Contributor

tcnghia commented Sep 3, 2020

@mgencur in that case let's close the PR? thanks

@mgencur mgencur closed this Sep 3, 2020

Labels

area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants