aojea commented Oct 13, 2020

The test "Application behind service load balancer with PDB is not disrupted" basically does the following:

  • Creates an application with 2 pods, protected by a PodDisruptionBudget (PDB), that echoes the request
  • Exposes the application using a Service of type LoadBalancer (with externalTrafficPolicy=Local)
  • Monitors the application during the upgrade by sending GET requests to the exposed endpoint and checking that the response is correct

To monitor the application there are 2 "threads":

  • one that reuses the TCP connection to perform the checks
  • another that creates a new TCP connection for every check

Previously the test used the Service with externalTrafficPolicy=Cluster, which means that the cloud provider LB can forward traffic to any node, so a request may need an extra hop and NAT inside the cluster. After switching to externalTrafficPolicy=Local, a big improvement was noticed.

The test "service load balancer with PDB is not disrupted during upgrade" uses client-go's UnversionedRESTClientFor() to probe the Service under test. Since the application is very simple, we don't need that overhead (it adds headers and more bytes to the requests) or the dependency, which may affect the test without us noticing (e.g. it has a rate limiter; I think it is disabled by default, but if this changes 🤷).

The result is that the test passed 5/5 times: #25606 (comment)

However, there is still a problem: I observed that there is always some disruption for NEW connections.

Oct 15 14:27:26.639 I ns/e2e-k8s-service-lb-available-9540 svc/service-test Service started responding to GET requests over new connections
Oct 15 14:30:10.431 E ns/e2e-k8s-service-lb-available-9540 svc/service-test Service stopped responding to GET requests over new connections

I tried increasing the sampling interval to every 3 seconds, but it still shows errors; however, they do not show up in the monitor that reuses the connection.

Since this improves the test, let's merge and iterate.

Signed-off-by: Antonio Ojea <aojea@redhat.com>

@aojea changed the title from "use golang net/http directly for probes" to "[WIP] use golang net/http directly for probes" on Oct 13, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 13, 2020
@aojea
Author

aojea commented Oct 13, 2020

/test help

@openshift-ci-robot

@aojea: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-disruptive
  • /test e2e-aws-fips
  • /test e2e-aws-image-registry
  • /test e2e-aws-jenkins
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-upgrade
  • /test e2e-azure
  • /test e2e-cmd
  • /test e2e-gcp
  • /test e2e-gcp-builds
  • /test e2e-gcp-image-ecosystem
  • /test e2e-gcp-upgrade
  • /test e2e-openstack
  • /test e2e-vsphere
  • /test images
  • /test verify
  • /test verify-deps
  • /test extended_gssapi
  • /test extended_ldap_groups
  • /test extended_networking

Use /test all to run the following jobs:

  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-fips
  • pull-ci-openshift-origin-master-e2e-aws-serial
  • pull-ci-openshift-origin-master-e2e-cmd
  • pull-ci-openshift-origin-master-e2e-gcp
  • pull-ci-openshift-origin-master-e2e-gcp-upgrade
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps
Details

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

aojea commented Oct 13, 2020

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 13, 2020

e2e-gcp-upgrade failed with

msg="Error: Request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-lxbpm7d0-db044-hw69q-w@openshift-gce-devel-ci.iam.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" returned error: Batch request and retried single request "Create IAM Members roles/compute.viewer serviceAccount:ci-op-lxbpm7d0-db044-hw69q-w@openshift-gce-devel-ci.iam.gserviceaccount.com for \"project \\\"openshift-gce-devel-ci\\\"\"" both failed. Final error: Error applying IAM policy for project "openshift-gce-devel-ci": Error setting IAM policy for project "openshift-gce-devel-ci": googleapi: Error 400: The number of members in the policy (1,501) is larger than the maximum allowed size 1,500., badReque

aojea commented Oct 13, 2020

/test e2e-aws-upgrade
/test e2e-gcp-upgrade
jobs timed out

aojea commented Oct 14, 2020

e2e-aws-upgrade

Service was unreachable during disruption for at least 3s of 47m56s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Oct 13 23:04:15.140: INFO: Service service-test is unreachable on new connections: Get "http://a0347bf8648bf4a5083eb042e4497fc2-900604380.us-west-2.elb.amazonaws.com:80/echo?msg=Hello": read tcp 10.128.22.98:55682->34.223.185.90:80: read: connection reset by peer

Oct 13 23:06:08.960: INFO: Service service-test is unreachable on new connections: Get "http://a0347bf8648bf4a5083eb042e4497fc2-900604380.us-west-2.elb.amazonaws.com:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

e2e-gcp-upgrade

Service was unreachable during disruption for at least 13s of 45m40s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Oct 13 23:06:35.389: INFO: Service service-test is unreachable on new connections: Get "http://35.227.22.146:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Oct 13 23:06:47.389: INFO: Service service-test is unreachable on new connections: Get "http://35.227.22.146:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Oct 13 23:11:44.389: INFO: Service service-test is unreachable on new connections: Get "http://35.227.22.146:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Oct 13 23:11:54.389: INFO: Service service-test is unreachable on new connections: Get "http://35.227.22.146:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Oct 13 23:14:23.389: INFO: Service service-test is unreachable on new connections: Get "http://35.227.22.146:80/echo?msg=Hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I think that with 1-second granularity we are running into race scenarios; let's see if 3-second granularity makes a difference. Both tests passed on the first attempt with minimal disruption.

aojea commented Oct 14, 2020

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 14, 2020

ci/prow/e2e-aws-upgrade

PDB passed without any issue

ci/prow/e2e-gcp-upgrade

Service was unreachable during disruption for at least 3s of 49m0s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 14, 2020

ci/prow/e2e-aws-upgrade OK
ci/prow/e2e-gcp-upgrade OK

😄

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

@aojea changed the title from "[WIP] use golang net/http directly for probes" to "[WIP] use golang net/http directly for PDB upgrade test probes" on Oct 14, 2020

aojea commented Oct 14, 2020

2/2 perfect
ci/prow/e2e-aws-upgrade OK
ci/prow/e2e-gcp-upgrade OK

😄

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 15, 2020

3/3 without errors
ci/prow/e2e-aws-upgrade OK
ci/prow/e2e-gcp-upgrade OK

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 15, 2020

4/4 without errors
ci/prow/e2e-aws-upgrade OK
ci/prow/e2e-gcp-upgrade OK

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 15, 2020

e2e-aws-upgrade failed the installation

  • 44x kubelet: Back-off pulling image "registry.build01.ci.openshift.org/ci-op-tirif2mn/stable:tests"
  • 2x kubelet: Error: ImagePullBackOff
  • 1x kubelet: Back-off pulling image "registry.build01.ci.openshift.org/ci-op-tirif2mn/stable:tests"
  • 1x kubelet: Error: ImagePullBackOff
  • 3x kubelet: Failed to pull image "registry.build01.ci.openshift.org/ci-op-tirif2mn/stable:tests": rpc error: code = Unknown desc = Error reading signatures: Error downloading signatures for sha256:427def1b5ac3989e13586865276e6dd0f37ec13c99d70e36eb49ff91d47ce74a in registry.build01.ci.openshift.org/ci-op-tirif2mn/stable: received unexpected HTTP status: 504 Gateway Time-out

e2e-gcp-upgrade failed the installation

level=fatal msg="Bootstrap failed to complete: failed waiting for Kubernetes API: Get "https://api.ci-op-tirif2mn-db044.origin-ci-int-gce.dev.openshift.com:6443/version?timeout=32s\": dial tcp 35.231.89.161:6443: connect: connection refused"

/test e2e-aws-upgrade
/test e2e-gcp-upgrade

aojea commented Oct 15, 2020

5/5 without errors
ci/prow/e2e-aws-upgrade OK
ci/prow/e2e-gcp-upgrade OK

That seems like enough evidence for a test that was totally flaky; we can always revisit.

Instead of using client-go for the GET probes against the PDB service,
use the net/http package directly to avoid dependencies and unneeded
overhead.

Signed-off-by: Antonio Ojea <aojea@redhat.com>
@aojea changed the title from "[WIP] use golang net/http directly for PDB upgrade test probes" to "Bug 1886620: deflake e2e test "service load balancer with PDB is not disrupted during upgrade"" on Oct 15, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 15, 2020
@openshift-ci-robot

@aojea: This pull request references Bugzilla bug 1886620, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1886620: deflake e2e test "service load balancer with PDB is not disrupted during upgrade"

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Oct 15, 2020

m.AddSampler(
monitor.StartSampling(ctx, m, time.Second, func(previous bool) (condition *monitor.Condition, next bool) {
data, err := continuousClient.Get().AbsPath("echo").Param("msg", "Hello").DoRaw(ctx)
Author

Does this close resp.Body?

@openshift-ci-robot

@aojea: This pull request references Bugzilla bug 1886620, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1886620: deflake e2e test "service load balancer with PDB is not disrupted during upgrade"

@aojea changed the title from "Bug 1886620: deflake e2e test "service load balancer with PDB is not disrupted during upgrade"" to "Bug 1886620: deflake e2e test "Application behind service load balancer with PDB is not disrupted "" on Oct 15, 2020

aojea commented Oct 15, 2020

/assign @knobunc

#25606 (comment)

aojea commented Oct 15, 2020

/retest

aojea commented Oct 21, 2020

@knobunc @deads2k can you PTAL

aojea commented Oct 23, 2020

@knobunc it seems my intuition was right: client-go is rate limiting

that may affect the test without us noticing (i.e it has a rate limiter, I think that is disabled by default but if this changes 🤷 )

kubernetes/kubernetes#95825

@knobunc
Contributor

knobunc commented Oct 26, 2020

/approve

knobunc commented Oct 26, 2020

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, knobunc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Oct 26, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-merge-robot openshift-merge-robot merged commit 26a7be9 into openshift:master Oct 27, 2020
@openshift-ci-robot

@aojea: All pull requests linked via external trackers have merged:

Bugzilla bug 1886620 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1886620: deflake e2e test "Application behind service load balancer with PDB is not disrupted "

aojea commented Oct 27, 2020

/cherrypick release-4.6

@openshift-cherrypick-robot

@aojea: new pull request created: #25634

Details

In response to this:

/cherrypick release-4.6

aojea commented Oct 27, 2020

/cherrypick release-4.5

@openshift-cherrypick-robot

@aojea: #25606 failed to apply on top of branch "release-4.5":

Applying: use net/http instead of client-go for e2e PDB test
Using index info to reconstruct a base tree...
M	test/e2e/upgrade/service/service.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/upgrade/service/service.go
CONFLICT (content): Merge conflict in test/e2e/upgrade/service/service.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 use net/http instead of client-go for e2e PDB test
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick release-4.5

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

6 participants