Skip to content

Add retries and fallback to actual API calls to status updates.#6162

Merged
knative-prow-robot merged 8 commits intoknative:masterfrom
markusthoemmes:reduce-update-conflicts
Dec 11, 2019
Merged

Add retries and fallback to actual API calls to status updates.#6162
knative-prow-robot merged 8 commits intoknative:masterfrom
markusthoemmes:reduce-update-conflicts

Conversation

@markusthoemmes
Copy link
Copy Markdown
Contributor

Proposed Changes

See title.

Release Note

Reduced update conflicts.

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Dec 5, 2019
@knative-prow-robot knative-prow-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 5, 2019
Copy link
Copy Markdown
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markusthoemmes: 0 warnings.

Details

In response to this:

Proposed Changes

See title.

Release Note

Reduced update conflicts.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Copy link
Copy Markdown
Member

dprotaso commented Dec 5, 2019

if we're reconciling an 'out of date' object should we be updating the status of the latest version of the object?

@markusthoemmes
Copy link
Copy Markdown
Contributor Author

@dprotaso I thought along the same lines, this is mostly asking Dr. Empirical if this does anything. It indeed cuts down the conflicts from ~300 to

      3 fulfilled on services.serving.knative.dev
     26 fulfilled on ingresses.networking.internal.knative.dev
     27 fulfilled on podautoscalers.autoscaling.internal.knative.dev
     46 fulfilled on configurations.serving.knative.dev

I'll work on adding more debugging next week to try to find out if there's a pattern of what gets in the way here, especially as this seems to be particularly bad downstream on Openshift.

@mattmoor
Copy link
Copy Markdown
Member

mattmoor commented Dec 5, 2019

@dprotaso I think it's fine so long at the .status.observedGeneration reflects the version of the spec that we actually reconciled.

@vagababov
Copy link
Copy Markdown
Contributor

So should we at this point compare generation as well, since it might have changed?

@dprotaso
Copy link
Copy Markdown
Member

dprotaso commented Dec 6, 2019

Poking at some K8s controllers it seems like they attempt to update the status and if it fails they fetch and retry one time (jobs controller retries 3 times 🤷‍♀ )

@mattmoor
Copy link
Copy Markdown
Member

mattmoor commented Dec 6, 2019

@vagababov the point is that if the generation changes, that's fine. The status we are posting is only stating that we've reconciled a previous generation.

@dprotaso I assume the fetch is via the client?

@dprotaso
Copy link
Copy Markdown
Member

dprotaso commented Dec 6, 2019

@dprotaso I assume the fetch is via the client?

yup

@vagababov
Copy link
Copy Markdown
Contributor

Well makes sense, since it's double-checking-lock pattern in essence :)

@markusthoemmes markusthoemmes force-pushed the reduce-update-conflicts branch from 1b5ab78 to fdf8519 Compare December 9, 2019 10:30
@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 9, 2019
@markusthoemmes
Copy link
Copy Markdown
Contributor Author

At the latest state, this PR now produces significantly less churn

      6 fulfilled on ingresses.networking.internal.knative.dev
     12 fulfilled on podautoscalers.autoscaling.internal.knative.dev
     20 fulfilled on configurations.serving.knative.dev

@markusthoemmes markusthoemmes changed the title Fetch objects via API before updating status to reduce conflicts. Add retries and fallback to actual API calls to status updates. Dec 9, 2019
@dprotaso
Copy link
Copy Markdown
Member

dprotaso commented Dec 9, 2019

/test pull-knative-serving-go-coverage

existing := pa.DeepCopy()
existing.Status = desired.Status
func (c *Base) UpdateStatus(desired *pav1alpha1.PodAutoscaler) error {
return reconciler.RetryUpdateConflicts(func(i int) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can i be something more descriptive like attempt

}

// If there's nothing to update, just return.
if reflect.DeepEqual(existing.Status, desired.Status) {
Copy link
Copy Markdown
Member

@dprotaso dprotaso Dec 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equality.Semantic.DeepEqual

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this wasn't in the prior code - feel free to leave this as a separate PR or I can do it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's leave it separate. Was wondering the same while doing it though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kmp.Diff ? :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, won't fix in this PR. I don't want to inadvertedly and subtly break things in this PR which is already kinda invasive. Let's make that a separate change please.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to doing separately.

return nil
}

becomesReady := desired.Status.IsReady() && !existing.Status.IsReady()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me wonder - what's causing the conflicts? (ie. is it the labeller or gc'er)

@knative-metrics-robot
Copy link
Copy Markdown

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/certificate/certificate.go 88.9% 86.7% -2.2
pkg/reconciler/configuration/configuration.go 87.0% 85.9% -1.1
pkg/reconciler/ingress/ingress.go 84.9% 84.0% -0.9
pkg/reconciler/metric/metric.go 92.3% 87.8% -4.5
pkg/reconciler/retry.go Do not exist 100.0%
pkg/reconciler/revision/revision.go 86.7% 84.6% -2.1
pkg/reconciler/route/reconcile_resources.go 84.2% 83.1% -1.1
pkg/reconciler/serverlessservice/serverlessservice.go 95.2% 94.0% -1.1
pkg/reconciler/service/service.go 89.4% 88.3% -1.1

@knative-test-reporter-robot
Copy link
Copy Markdown

The following jobs failed:

Test name Triggers Retries
pull-knative-serving-unit-tests pull-knative-serving-unit-tests 1/3

Automatically retrying due to test flakiness...
/test pull-knative-serving-unit-tests

@markusthoemmes
Copy link
Copy Markdown
Contributor Author

/test pull-knative-serving-unit-tests

@markusthoemmes
Copy link
Copy Markdown
Contributor Author

Values from the last run

      2 fulfilled on certificates.certmanager.k8s.io
      4 fulfilled on ingresses.networking.internal.knative.dev
      4 fulfilled on podautoscalers.autoscaling.internal.knative.dev
     18 fulfilled on configurations.serving.knative.dev

Configurations failures are all from the labeler. I'd like to attack that in a separate PR though to keep this only about status updates.

@mattmoor
Copy link
Copy Markdown
Member

/hold
In case there are other comments.

/lgtm

We should make a similar change to knative/sample-* and knative/eventing

cc @n3wscott @vaikas

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2019
@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2019
Copy link
Copy Markdown
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@vaikas
Copy link
Copy Markdown
Contributor

vaikas commented Dec 10, 2019

Should we move the pkg/reconciler stuff into knative.dev/pkg/reconciler? I can do that in a follow up, but since we're doing this in all the repos, seems like it could / should be hoisted?

@knative-prow-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, markusthoemmes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vaikas vaikas removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2019
@vaikas
Copy link
Copy Markdown
Contributor

vaikas commented Dec 10, 2019

removing the hold label since everybody that participated has lgtm'd it.

@markusthoemmes
Copy link
Copy Markdown
Contributor Author

/test pull-knative-serving-upgrade-tests

@markusthoemmes
Copy link
Copy Markdown
Contributor Author

/test pull-knative-serving-upgrade-tests

@nlopezgi
Copy link
Copy Markdown
Contributor

Should we move the pkg/reconciler stuff into knative.dev/pkg/reconciler? I can do that in a follow up, but since we're doing this in all the repos, seems like it could / should be hoisted?

Did this in knative/pkg#1008 (as part of google/knative-gcp#508). Also created #6657 to use the retry.go from knative.dev/pkg/reconciler and add more tests.

We should make a similar change to knative/sample-* and knative/eventing

cc @n3wscott @vaikas

I'm planning on taking on doing this change in knative/sample-* and knative/eventing if no one else has plans on doing this atm.

@nlopezgi
Copy link
Copy Markdown
Contributor

We should make a similar change to knative/sample-* and knative/eventing

cc @n3wscott @vaikas

Created knative/eventing#2465 for knative/eventing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale area/networking cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants