Skip to content

oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing#315

Merged
openshift-merge-bot[bot] merged 5 commits into
openshift:oadp-1.3from
kaovilai:requeue&fail-oadp-1.3
Jun 18, 2024
Merged

oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing#315
openshift-merge-bot[bot] merged 5 commits into
openshift:oadp-1.3from
kaovilai:requeue&fail-oadp-1.3

Conversation

@kaovilai
Copy link
Copy Markdown
Member

@kaovilai kaovilai commented Jun 11, 2024

Signed-off-by: Tiger Kaovilai tkaovila@redhat.com

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@kaovilai kaovilai changed the title Mark InProgress backup/restore as failed upon requeuing oadp-1.3: Mark InProgress backup/restore as failed upon requeuing Jun 11, 2024
@kaovilai
Copy link
Copy Markdown
Member Author

kaovilai commented Jun 11, 2024

from prior feedback need to do finalizer (operations in 1.3) controller also

@kaovilai
Copy link
Copy Markdown
Member Author

from prior feedback need to do finalizer (operations in 1.3) controller also

For restore operations controller we're covered by this returning error:

return ctrl.Result{}, errors.Wrap(err, "error updating Restore")

backup operations controller:

return ctrl.Result{}, errors.Wrap(err, "error updating Backup")

@sseago
Copy link
Copy Markdown

sseago commented Jun 11, 2024

Yes, I think we're fine leaving the backup/restore_operations_controller returns alone, since those already requeue. I don't think we have restore finalizer in 1.3, so we just need backup, restore, and backup finalizer.

@kaovilai
Copy link
Copy Markdown
Member Author

If patch fails on finalizer controller we wanna retry patch as fail? It's kinda difficult there since it's currently using defer func.

I would have to break the patch call out of defer to return reconciler err on patch fail. Is that ok with you?

Copy link
Copy Markdown

@sseago sseago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor comments on error messages and the change of variable name. The bigger issue is we need the change for the backup finalizer controller as well (no restore finalizer controller in Velero 1.12, so that's not a concern here).

Comment thread pkg/controller/backup_controller.go Outdated
log.Debug("Backup has in progress status from prior reconcile, marking it as failed")
failedCopy := original.DeepCopy()
failedCopy.Status.Phase = velerov1api.BackupPhaseFailed
failedCopy.Status.FailureReason = "Backup from previous reconcile still in progress"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to suggest an APIServer failure here.
"Backup from previous reconcile still in progress. The API Server may have been down."

request.Status.Phase = velerov1api.BackupPhaseInProgress
request.Status.StartTimestamp = &metav1.Time{Time: b.clock.Now()}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To minimize change from upstream since this is in our fork, lets not include whitespace changes like this.

Comment thread pkg/controller/restore_controller.go Outdated
log.Debug("Restore has in progress status from prior reconcile, marking it as failed")
failedCopy := original.DeepCopy()
failedCopy.Status.Phase = api.RestorePhaseFailed
failedCopy.Status.FailureReason = "Restore from previous reconcile still in progress"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to suggest an APIServer failure here.
"Restore from previous reconcile still in progress. The API Server may have been down."

Comment thread pkg/controller/restore_controller.go Outdated

restore := &api.Restore{}
err := r.kbClient.Get(ctx, client.ObjectKey{Namespace: req.Namespace, Name: req.Name}, restore)
original := &api.Restore{}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the variable name change necessary here? This makes the diff larger and increases the possibility of rebase conflicts, since we're carrying this commit in our fork.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this so restore_controller and backup_controller has the same var name pattern that's all.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary I agree, just make the logic more pastable across both.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaovilai I figured that's why you did that. I reverted that part here in a later commit since we're carrying the commit in our fork right now, and it removed a lot of lines from the diff, making rebase conflicts less likely. Does that seem reasonable to you?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. That's reasonable

@sseago
Copy link
Copy Markdown

sseago commented Jun 11, 2024

If patch fails on finalizer controller we wanna retry patch as fail? It's kinda difficult there since it's currently using defer func.

I would have to break the patch call out of defer to return reconciler err on patch fail. Is that ok with you?

Hmm. I think we need to do this in some way. If we can't do it with defer, then we'll need to eliminate the defer call and include this with all return statements.

sseago added 3 commits June 12, 2024 11:52
Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.
@weshayutin weshayutin changed the title oadp-1.3: Mark InProgress backup/restore as failed upon requeuing oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing Jun 12, 2024
Signed-off-by: Scott Seago <sseago@redhat.com>
Comment thread pkg/controller/backup_finalizer_controller.go
Comment thread pkg/controller/backup_finalizer_controller.go
@kaovilai
Copy link
Copy Markdown
Member Author

lgtm

@weshayutin
Copy link
Copy Markdown

ok.. testing update:

w/o the patch, the backup stayed in progress. While updating the dpa the velero server was restarted.

backup: westest-vsphere-apidown-1

status:
  completionTimestamp: "2024-06-17T17:25:58Z"
  expiration: "2024-07-17T17:12:21Z"
  failureReason: found a backup with status "InProgress" during the server starting,
    mark it as "Failed"

1.3.0 have to change the csv to get the test image on:

oc get csv oadp-operator.v1.3.0 -o yaml | grep sseago
                  value: quay.io/sseago/velero:1.3-requeue

Once patched, I initiated a second backup and then took down the api server for roughly 1 minute:

backup: westest-vsphere-apidown-2

status:
  expiration: "2024-07-17T17:38:01Z"
  failureReason: Backup from previous reconcile still in progress. The API Server
    may have been down.
  formatVersion: 1.1.0
  phase: Failed
  startTimestamp: "2024-06-17T17:38:01Z"
  version: 1

Copy link
Copy Markdown

@weshayutin weshayutin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2024
Copy link
Copy Markdown

@sseago sseago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing to remove my prior "changes requested", but don't count it as an ack, since my own changes are in here as well.

@mateusoliveira43

This comment was marked as resolved.

@kaovilai
Copy link
Copy Markdown
Member Author

Just noting that this may never upstream based on comments at velero-io#7863 (comment)

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2024
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Jun 17, 2024

New changes are detected. LGTM label has been removed.

@kaovilai kaovilai force-pushed the requeue&fail-oadp-1.3 branch from 07bab34 to dd74100 Compare June 17, 2024 21:32
Comment thread Makefile.prow Outdated
@kaovilai kaovilai force-pushed the requeue&fail-oadp-1.3 branch from dd74100 to a51ef63 Compare June 17, 2024 23:24
@kaovilai
Copy link
Copy Markdown
Member Author

retest after #320

@kaovilai
Copy link
Copy Markdown
Member Author

/retest

@weshayutin weshayutin added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Jun 18, 2024
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Jun 18, 2024

@kaovilai: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Jun 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: kaovilai, sseago, weshayutin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kaovilai
Copy link
Copy Markdown
Member Author

follow on bugfix: #324

openshift-merge-bot Bot pushed a commit that referenced this pull request Jul 30, 2024
…330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (#315)

* Mark InProgress backup/restore as failed upon requeuing

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* cleanup to minimize diff from upstream

Signed-off-by: Scott Seago <sseago@redhat.com>

* error message update

Signed-off-by: Scott Seago <sseago@redhat.com>

* requeue on finalize status update.

Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.

* Only run patch once for all backup finalizer return scenarios

Signed-off-by: Scott Seago <sseago@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>

* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (#324)

* OADP-4265: Reconcile To Fail: Add backup/restore trackers

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* Apply suggestions from code review: backupTracker

* Address restoreTracker feedback

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* s/delete from/add to/ in the comment

* unit test fix

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* backup_controller unit test

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* restore_controller unit test

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* `make update`

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* mock patch to fail failure due to connection refused

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* regenerate mocks

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>
sseago added a commit to sseago/velero that referenced this pull request Aug 1, 2024
…penshift#330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (openshift#315)

* Mark InProgress backup/restore as failed upon requeuing

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* cleanup to minimize diff from upstream

Signed-off-by: Scott Seago <sseago@redhat.com>

* error message update

Signed-off-by: Scott Seago <sseago@redhat.com>

* requeue on finalize status update.

Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.

* Only run patch once for all backup finalizer return scenarios

Signed-off-by: Scott Seago <sseago@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>

* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (openshift#324)

* OADP-4265: Reconcile To Fail: Add backup/restore trackers

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* Apply suggestions from code review: backupTracker

* Address restoreTracker feedback

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* s/delete from/add to/ in the comment

* unit test fix

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* backup_controller unit test

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* restore_controller unit test

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* `make update`

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* mock patch to fail failure due to connection refused

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* regenerate mocks

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>
sseago added a commit to sseago/velero that referenced this pull request Aug 22, 2024
…penshift#330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (openshift#315)

* Mark InProgress backup/restore as failed upon requeuing

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* cleanup to minimize diff from upstream

Signed-off-by: Scott Seago <sseago@redhat.com>

* error message update

Signed-off-by: Scott Seago <sseago@redhat.com>

* requeue on finalize status update.

Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.

* Only run patch once for all backup finalizer return scenarios

Signed-off-by: Scott Seago <sseago@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>

* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (openshift#324)

* OADP-4265: Reconcile To Fail: Add backup/restore trackers

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* Apply suggestions from code review: backupTracker

* Address restoreTracker feedback

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* s/delete from/add to/ in the comment

* unit test fix

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* backup_controller unit test

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* restore_controller unit test

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* `make update`

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* mock patch to fail failure due to connection refused

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

* regenerate mocks

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

---------

Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: Scott Seago <sseago@redhat.com>
@kaovilai kaovilai deleted the requeue&fail-oadp-1.3 branch April 8, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants