Skip to content

Conversation

@xBis7
Copy link
Contributor

@xBis7 xBis7 commented Jun 2, 2023

What changes were proposed in this pull request?

This PR adds an option to the ozone sh snapshot snapshotDiff command, to cancel an IN_PROGRESS snapshotDiff job. If the option is used and the job is IN_PROGRESS, then the status is updated to CANCELED.

The part of the code that might take up the most resources and cause the delay, has been refactored so that we can keep checking if the JobStatus is CANCELED before every method call.

If the job is canceled, then the method doing the calculations returns and the job remains CANCELED until the SnapshotDiffCleanupService deletes it from the snapDiffJobTable and the user can resubmit it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8490

How was this patch tested?

This patch was tested with new unit and integration tests. It was also tested manually using the docker dev environment.

@xBis7
Copy link
Contributor Author

xBis7 commented Jun 2, 2023

@hemantk-12 @smengcl Can you take a look at this PR?

@adoroszlai adoroszlai added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jun 2, 2023
Copy link
Contributor

@hemantk-12 hemantk-12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @xBis7.

Overall looks good to me. Left some comments.

@xBis7
Copy link
Contributor Author

xBis7 commented Jun 7, 2023

@hemantk-12 Thanks for reviewing this, I've addressed all your comments.

@smengcl smengcl added the gr label Jun 8, 2023
@xBis7
Copy link
Contributor Author

xBis7 commented Jun 14, 2023

@hemantk-12 I've addressed all your comments and updated the patch to send different responses back to the client in case job cancelling fails. I've also updated the existing tests and added new ones as well. Let me know how it looks.

Comment on lines 656 to 667
SnapshotDiffResponse response = store.snapshotDiff(
volumeName, bucketName, fromSnapName, toSnapName,
null, 0, false, false);

assertEquals(IN_PROGRESS, response.getJobStatus());

response = store.snapshotDiff(volumeName,
bucketName, fromSnapName, toSnapName,
null, 0, false, true);

// Job status should be updated to CANCELED.
assertEquals(CANCELED, response.getJobStatus());
Copy link
Contributor

@smengcl smengcl Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a source of flakiness? There is no guarantee that the diff job isn't DONE when calling the second snapshotDiff with cancel = true right?

Can fix this if there is a way to suspend the SnapshotDiff worker thread before the first call.

Copy link
Contributor Author

@xBis7 xBis7 Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can fix this if there is a way to suspend the SnapshotDiff worker thread before the first call.

That's a good idea, but the first call is submitting the job and we would be suspending the thread for a job that doesn't even exist in the snap diff table. If we do it after the submitting then there is no difference with cancelling the job.

The only way I can think of is calling the methods that perform all these operations and do the suspension between the calls instead of using the ObjectStore api. But that is done in TestSnapshotDiffManager.

So far this test hasn't proved to be flaky. Usually, the workflow takes too long to run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm think if we could intercept the snapDiffExecutor:

this.snapDiffExecutor = new ThreadPoolExecutor(threadPoolSize,
threadPoolSize,
0,
TimeUnit.MILLISECONDS,
new ArrayBlockingQueue<>(threadPoolSize),
new ThreadFactoryBuilder()
.setNameFormat("snapshot-diff-job-thread-id-%d")
.build()
);

Or just add a spin lock just for the tests inside generateSnapshotDiffReport().

Comment on lines 481 to 486
case CANCELED:
return new SnapshotDiffResponse(
new SnapshotDiffReportOzone(snapshotRoot.toString(), volumeName,
bucketName, fromSnapshotName, toSnapshotName, new ArrayList<>(),
null),
CANCELED, 0L, CancelStatus.CANCEL_SUCCESS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR but I think SnapshotDiffReportOzone could use a Builder subclass. cc @hemantk-12

@xBis7
Copy link
Contributor Author

xBis7 commented Jun 19, 2023

@smengcl Thanks for review, I've addressed all your comments and made the changes. Pending the comment about the test and its flakiness.

Copy link
Contributor

@smengcl smengcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pending CI

@smengcl smengcl merged commit f21940c into apache:master Jun 21, 2023
@smengcl
Copy link
Contributor

smengcl commented Jun 21, 2023

Thanks @xBis7 for the PR. Thanks @hemantk-12 for reviewing this.

We can file a follow-up jira to deal with the potential flakiness if needed.

@xBis7
Copy link
Contributor Author

xBis7 commented Jun 21, 2023

@smengcl @hemantk-12 Thanks for the reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants