Bug 1886620: deflake e2e test "Application behind service load balancer with PDB is not disrupted " #25606
Conversation

/test help
@aojea: The specified target(s) for
Use

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/test e2e-aws-upgrade
e2e-gcp-upgrade failed with

/test e2e-aws-upgrade
e2e-aws-upgrade

    Oct 13 23:04:15.140: INFO: Service service-test is unreachable on new connections: Get "http://a0347bf8648bf4a5083eb042e4497fc2-900604380.us-west-2.elb.amazonaws.com:80/echo?msg=Hello": read tcp 10.128.22.98:55682->34.223.185.90:80: read: connection reset by peer
    Oct 13 23:06:08.960: INFO: Service service-test is unreachable on new connections: Get "http://a0347bf8648bf4a5083eb042e4497fc2-900604380.us-west-2.elb.amazonaws.com:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

e2e-gcp-upgrade

    Oct 13 23:06:35.389: INFO: Service service-test is unreachable on new connections: Get "http://35.227.22.146:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I think that with 1-second granularity we are getting into race scenarios; let's see if 3-second granularity makes a difference. Both tests went through on the first try with minimal disruption.

/test e2e-aws-upgrade

ci/prow/e2e-aws-upgrade
ci/prow/e2e-gcp-upgrade

/test e2e-aws-upgrade

ci/prow/e2e-aws-upgrade OK 😄

/test e2e-aws-upgrade

2/2 perfect 😄

/test e2e-aws-upgrade

3/3 without errors

/test e2e-aws-upgrade

4/4 without errors

/test e2e-aws-upgrade

e2e-aws-upgrade failed the installation
e2e-gcp-upgrade failed the installation

/test e2e-aws-upgrade

5/5 without errors. That seems like enough evidence for a test that was totally flaky; we can always revisit.
Instead of using client-go for the GET probes against the PDB service, use the net/http package directly, to avoid the dependency and unneeded overhead.

Signed-off-by: Antonio Ojea <aojea@redhat.com>

@aojea: This pull request references Bugzilla bug 1886620, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.

Review comment on the sampling code:

```go
m.AddSampler(
	monitor.StartSampling(ctx, m, time.Second, func(previous bool) (condition *monitor.Condition, next bool) {
		data, err := continuousClient.Get().AbsPath("echo").Param("msg", "Hello").DoRaw(ctx)
```

does this close the resp.Body()?
@aojea: This pull request references Bugzilla bug 1886620, which is valid. 3 validation(s) were run on this bug.

/assign @knobunc

/retest

@knobunc seems my intuition was right: client-go is rate limiting.

/approve

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aojea, knobunc. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing

/retest Please review the full test history for this PR and help us cut down flakes.

(2 similar comments)
@aojea: All pull requests linked via external trackers have merged: Bugzilla bug 1886620 has been moved to the MODIFIED state.

/cherrypick release-4.6

@aojea: new pull request created: #25634

/cherrypick release-4.5

@aojea: #25606 failed to apply on top of branch "release-4.5":
The test "Application behind service load balancer with PDB is not disrupted" basically does the following:

To monitor the application there are 2 "threads":

Previously the test was using the Service with externalTrafficPolicy=Cluster, which means that the cloud provider LB can forward the traffic to any node, possibly requiring double hops and NAT inside the cluster. After switching to externalTrafficPolicy=Local, a big improvement was observed.

The test "service load balancer with PDB is not disrupted during upgrade" was using client-go's UnversionedRESTClientFor() for testing the Service under test. Since the application is very simple, we don't need all the overhead (it adds headers and more bytes to the requests) or this dependency, which may affect the test without us noticing (e.g. it has a rate limiter; I think it is disabled by default, but if that changes 🤷). The result is that the test went through 5/5 times #25606 (comment).
However, there is still a problem: I observed that there is always some disruption for NEW connections. I've tried increasing the sampling interval to 3 seconds, but it still shows errors; this does not show up in the monitor that reuses the connection.
Since this improves the test, let's merge and iterate.
Signed-off-by: Antonio Ojea <aojea@redhat.com>