Skip to content

Verify ready replicas in e2e test TestMinScale#6783

Merged
knative-prow-robot merged 1 commit intoknative:masterfrom
MIBc:e2e
Feb 12, 2020
Merged

Verify ready replicas in e2e test TestMinScale#6783
knative-prow-robot merged 1 commit intoknative:masterfrom
MIBc:e2e

Conversation

@MIBc
Copy link
Copy Markdown
Contributor

@MIBc MIBc commented Feb 8, 2020

Fixex #6716

@knative-prow-robot knative-prow-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 8, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Feb 8, 2020
@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 8, 2020
@knative-prow-robot
Copy link
Copy Markdown
Contributor

Hi @MIBc. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Feb 8, 2020
@vagababov
Copy link
Copy Markdown
Contributor

/ok-to-test

@knative-prow-robot knative-prow-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 8, 2020
@markusthoemmes
Copy link
Copy Markdown
Contributor

@MIBc thanks for doing this. The original issue contains a note that says:

we also need to verify that before the revision is marked ready the number of ready endpoints is also at least N.

Have you taken that into account? Seems like we're only checking ReadyReplicas from the deployment but all of our components actually key-off of the Ready addresses of the revision's private service.

@markusthoemmes
Copy link
Copy Markdown
Contributor

/assign @vagababov @tanzeeb

@tanzeeb
Copy link
Copy Markdown
Contributor

tanzeeb commented Feb 10, 2020

/lgtm

The test may have occasional flakes (#5255), but they'll hopefully be resolved by the fix for #6733.

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 10, 2020
@MIBc
Copy link
Copy Markdown
Contributor Author

MIBc commented Feb 10, 2020

I have read the note carefully. May be we can check the endponit number of revision before it is ready. @markusthoemmes

@vagababov
Copy link
Copy Markdown
Contributor

Yes, please check endpoints, since that what is drives internal logic, rather than deployment state.

@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 11, 2020
Copy link
Copy Markdown
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think this is almost there! Some small touchups from my end.

Comment thread test/e2e/minscale_readiness_test.go Outdated
}

// Revision becomes ready
t.Log("Waiting for revision becomes ready")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Log("Waiting for revision becomes ready")
t.Log("Waiting for revision to become ready")

Comment thread test/e2e/minscale_readiness_test.go Outdated
}

// Route becomes ready
t.Log("Waiting for route becomes ready")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Log("Waiting for route becomes ready")
t.Log("Waiting for route to become ready")

Comment thread test/e2e/minscale_readiness_test.go Outdated

// Before becoming ready, observe revision endpoints
endpointName := revName + "-private"
t.Log("Waiting for ready endponits to scale to minScale before revison becoming ready")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Log("Waiting for ready endponits to scale to minScale before revison becoming ready")
t.Log("Waiting for ready endpoints to scale to minScale before the revision becomes ready")

Comment thread test/e2e/minscale_readiness_test.go Outdated
})
}

func waitForDesiredEndpoints(t *testing.T, clients *test.Clients, endpointName string, cond func(int32) bool) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can even go as far as using this as waitForDesiredScale instead of adding these calls in addition.

Comment thread test/e2e/minscale_readiness_test.go Outdated
func waitForDesiredEndpoints(t *testing.T, clients *test.Clients, endpointName string, cond func(int32) bool) error {
endpoints := clients.KubeClient.Kube.CoreV1().Endpoints(test.ServingNamespace)

return wait.PollImmediate(time.Second, 10*time.Minute, func() (bool, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return wait.PollImmediate(time.Second, 10*time.Minute, func() (bool, error) {
return wait.PollImmediate(time.Second, 1*time.Minute, func() (bool, error) {

To stay the same as above?

Comment thread test/e2e/minscale_readiness_test.go Outdated
}

// Before becoming ready, observe revision endpoints
endpointName := revName + "-private"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fetch this via API? See

func numberOfReadyPods(ctx *testContext) (float64, error) {
// SKS name matches that of revision.
n := ctx.resources.Revision.Name
sks, err := ctx.clients.NetworkingClient.ServerlessServices.Get(n, metav1.GetOptions{})
if err != nil {
ctx.t.Logf("Error getting SKS %q: %v", n, err)
return 0, fmt.Errorf("error retrieving sks %q: %w", n, err)
}
if sks.Status.PrivateServiceName == "" {
ctx.t.Logf("SKS %s has not yet reconciled", n)
// Not an error, but no pods either.
return 0, nil
}
eps, err := ctx.clients.KubeClient.Kube.CoreV1().Endpoints(test.ServingNamespace).Get(
sks.Status.PrivateServiceName, metav1.GetOptions{})
if err != nil {
return 0, fmt.Errorf("failed to get endpoints %s: %w", sks.Status.PrivateServiceName, err)
}
return float64(resources.ReadyAddressCount(eps)), nil
}
for an example of how to do this.

Comment thread test/e2e/minscale_readiness_test.go Outdated
Comment on lines +171 to +174
if len(endpoint.Subsets) == 0 || len(endpoint.Subsets[0].Addresses) == 0 {
return false, nil
}
return cond(int32(len(endpoint.Subsets[0].Addresses))), nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should use the resources.ReadyAddressCount helper. See

func numberOfReadyPods(ctx *testContext) (float64, error) {
// SKS name matches that of revision.
n := ctx.resources.Revision.Name
sks, err := ctx.clients.NetworkingClient.ServerlessServices.Get(n, metav1.GetOptions{})
if err != nil {
ctx.t.Logf("Error getting SKS %q: %v", n, err)
return 0, fmt.Errorf("error retrieving sks %q: %w", n, err)
}
if sks.Status.PrivateServiceName == "" {
ctx.t.Logf("SKS %s has not yet reconciled", n)
// Not an error, but no pods either.
return 0, nil
}
eps, err := ctx.clients.KubeClient.Kube.CoreV1().Endpoints(test.ServingNamespace).Get(
sks.Status.PrivateServiceName, metav1.GetOptions{})
if err != nil {
return 0, fmt.Errorf("failed to get endpoints %s: %w", sks.Status.PrivateServiceName, err)
}
return float64(resources.ReadyAddressCount(eps)), nil
}
for an example.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. I will improve it.

@MIBc
Copy link
Copy Markdown
Contributor Author

MIBc commented Feb 12, 2020

/retest

Comment thread test/e2e/minscale_readiness_test.go Outdated

sks, err := clients.NetworkingClient.ServerlessServices.Get(revName, metav1.GetOptions{})
if err != nil {
t.Fatalf("Error retrieving sks %q: %w", revName, err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%w works only with fmt.Errorf.

Suggested change
t.Fatalf("Error retrieving sks %q: %w", revName, err)
t.Fatalf("Error retrieving sks %q: %v", revName, err)

Copy link
Copy Markdown
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Thanks for the changes!

/hold

for the outstanding nits.

Comment thread test/e2e/minscale_readiness_test.go Outdated
endpoints := clients.KubeClient.Kube.CoreV1().Endpoints(test.ServingNamespace)

return wait.PollImmediate(time.Second, 1*time.Minute, func() (bool, error) {
deployment, err := deployments.Get(deploymentName, metav1.GetOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think fetching the deployment is actually required anymore. Shall we remove all mentions of it and redo the error messages above to not include it as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If endpoints satisfy the scale, the deploy satisfies too.

@knative-prow-robot knative-prow-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 12, 2020
@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2020
Comment thread test/e2e/minscale_readiness_test.go Outdated
revName := latestRevisionName(t, clients, names.Config)
deploymentName := revName + "-deployment"

sks, err := clients.NetworkingClient.ServerlessServices.Get(revName, metav1.GetOptions{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, per the test failure this likely needs a loop to wait for the SKS to actually contain a PrivateServiceName and be there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I have find that.

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 12, 2020
@knative-test-reporter-robot
Copy link
Copy Markdown

The following jobs failed:

Test name Triggers Retries
pull-knative-serving-integration-tests pull-knative-serving-integration-tests 1/3

Automatically retrying due to test flakiness...
/test pull-knative-serving-integration-tests

@MIBc
Copy link
Copy Markdown
Contributor Author

MIBc commented Feb 12, 2020

/retest

Copy link
Copy Markdown
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to let the int64 nit pass and merge, but the SKS poll runs at a danger of returning a wrong result so... sorry but here are my last two comments hopefully 😂 . Thanks for being patient!

Comment thread test/e2e/minscale_readiness_test.go Outdated
}

return cond(*deployment.Spec.Replicas), nil
return cond(int32(resources.ReadyAddressCount(endpoint))), nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Let's change the cond funcs to just take int64 to avoid unnecessary casts (doesn't matter performance wise, but will pollute the code eventually)

Comment thread test/e2e/minscale_readiness_test.go Outdated
if err != nil {
t.Fatalf("Failed to get sks after it was seen to be live: %v", err)
}
return sks.Status.PrivateServiceName
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to be set? 🤔 AFAIK we need to go through at least one reconcilation cycle to set this. I'd propose to cache it from the above like so:

func serverlessServicesName(t *testing.T, clients *test.Clients, revisionName string) string {
	var privateServiceName string
	if err := wait.PollImmediate(time.Second, 1*time.Minute, func() (bool, error) {
		_, err := clients.NetworkingClient.ServerlessServices.Get(revisionName, metav1.GetOptions{})
		if err != nil {
			return false, nil
		}
		
		privateServiceName = sks.Status.PrivateServiceName
		if privateServiceName == "" {
			return false, nil
		} 
		return true, nil
	}); err != nil {
		t.Fatalf("Error retrieving sks %q: %v", revisionName, err)
	}
	return privateServiceName
}

That should make this safe against potential inconsistencies. I'd also bump the poll timeout to 1 minute here probably.

Fixex knative#6716
* Waiting for ready endponits to scale to minScale
* Move `teardown` function in front of the `Fatalf`. Otherwise the k8s
resources may not delete.
Copy link
Copy Markdown
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Noice! 🎉

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 12, 2020
@knative-prow-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markusthoemmes, MIBc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@markusthoemmes
Copy link
Copy Markdown
Contributor

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 12, 2020
@knative-prow-robot knative-prow-robot merged commit d466565 into knative:master Feb 12, 2020
@MIBc MIBc deleted the e2e branch February 13, 2020 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants