HDDS-8490. [Snapshot] Ability to cancel an in-progress snapdiff job #4819
@hemantk-12 @smengcl Can you take a look at this PR?
hemantk-12
left a comment
Thanks for the patch @xBis7.
Overall looks good to me. Left some comments.
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOmSnapshot.java
...one/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java
...ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java
@hemantk-12 Thanks for reviewing this, I've addressed all your comments.
...p-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffJob.java
...one/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java
@hemantk-12 I've addressed all your comments and updated the patch to send different responses back to the client in case job cancelling fails. I've also updated the existing tests and added new ones as well. Let me know how it looks.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/ObjectStore.java
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/protocol/ClientProtocol.java
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java
...op-ozone/tools/src/main/java/org/apache/hadoop/ozone/shell/snapshot/SnapshotDiffHandler.java
hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/snapshot/SnapshotDiffResponse.java
...one-manager/src/main/java/org/apache/hadoop/ozone/om/service/SnapshotDiffCleanupService.java
    SnapshotDiffResponse response = store.snapshotDiff(
        volumeName, bucketName, fromSnapName, toSnapName,
        null, 0, false, false);

    assertEquals(IN_PROGRESS, response.getJobStatus());

    response = store.snapshotDiff(volumeName,
        bucketName, fromSnapName, toSnapName,
        null, 0, false, true);

    // Job status should be updated to CANCELED.
    assertEquals(CANCELED, response.getJobStatus());
Could this be a source of flakiness? There is no guarantee that the diff job isn't already DONE when the second snapshotDiff call is made with cancel = true, right?
Can fix this if there is a way to suspend the SnapshotDiff worker thread before the first call.
> Can fix this if there is a way to suspend the SnapshotDiff worker thread before the first call.
That's a good idea, but the first call submits the job, so we would be suspending the thread for a job that doesn't yet exist in the snap diff table. If we do it after submitting, then it is no different from cancelling the job.
The only way I can think of is to call the methods that perform all these operations directly and do the suspension between the calls, instead of using the ObjectStore API. But that is already done in TestSnapshotDiffManager.
So far this test hasn't proved to be flaky. Usually, the workflow takes too long to run.
Hmm. I'm thinking we could intercept the snapDiffExecutor:
Lines 216 to 224 in 9c6cd4b
    this.snapDiffExecutor = new ThreadPoolExecutor(threadPoolSize,
        threadPoolSize,
        0,
        TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(threadPoolSize),
        new ThreadFactoryBuilder()
            .setNameFormat("snapshot-diff-job-thread-id-%d")
            .build()
    );
Or just add a spin lock just for the tests inside generateSnapshotDiffReport().
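One way to picture the suggestion above is a test that parks the worker thread behind a latch, so the job is guaranteed to still be IN_PROGRESS when the cancel request arrives. The following is a minimal, self-contained sketch and not code from this patch: the class name, the JobStatus enum, and the runScenario method are hypothetical stand-ins, and the AtomicReference only imitates the job status stored in the snap diff table.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class GatedDiffJobSketch {
  // Hypothetical stand-in for the real job status values.
  enum JobStatus { QUEUED, IN_PROGRESS, CANCELED, DONE }

  /** Runs one fake diff job and returns its final status. */
  static JobStatus runScenario(boolean cancelWhileBlocked) {
    ThreadPoolExecutor executor = new ThreadPoolExecutor(1, 1, 0,
        TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
    AtomicReference<JobStatus> status =
        new AtomicReference<>(JobStatus.QUEUED);
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch gate = new CountDownLatch(1);

    executor.execute(() -> {
      status.set(JobStatus.IN_PROGRESS);
      started.countDown();
      try {
        // Test-only hook: park the worker until the test opens the gate.
        gate.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // Finish only if nobody canceled the job while it was parked.
      status.compareAndSet(JobStatus.IN_PROGRESS, JobStatus.DONE);
    });

    try {
      // After this returns, the job is guaranteed to be IN_PROGRESS.
      started.await();
      if (cancelWhileBlocked) {
        status.compareAndSet(JobStatus.IN_PROGRESS, JobStatus.CANCELED);
      }
      gate.countDown();
      executor.shutdown();
      executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return status.get();
  }

  public static void main(String[] args) {
    System.out.println(runScenario(true));   // prints CANCELED
    System.out.println(runScenario(false));  // prints DONE
  }
}
```

Compared with a spin lock, a latch does not burn CPU while waiting, but either approach needs some test-only hook inside the worker, which is the trade-off discussed in this thread.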
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java
    case CANCELED:
      return new SnapshotDiffResponse(
          new SnapshotDiffReportOzone(snapshotRoot.toString(), volumeName,
              bucketName, fromSnapshotName, toSnapshotName, new ArrayList<>(),
              null),
          CANCELED, 0L, CancelStatus.CANCEL_SUCCESS);
Unrelated to this PR but I think SnapshotDiffReportOzone could use a Builder subclass. cc @hemantk-12
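To illustrate what the Builder suggestion could look like, here is a rough sketch on a hypothetical Report class. It is not the SnapshotDiffReportOzone API: the class, the field names (borrowed from the arguments in the snippet above), and the setter names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DiffReportBuilderSketch {
  /** Hypothetical stand-in for a report with many positional arguments. */
  static final class Report {
    final String snapshotRoot;
    final String volumeName;
    final String bucketName;
    final String fromSnapshotName;
    final String toSnapshotName;
    final List<String> diffList;

    private Report(Builder b) {
      this.snapshotRoot = b.snapshotRoot;
      this.volumeName = b.volumeName;
      this.bucketName = b.bucketName;
      this.fromSnapshotName = b.fromSnapshotName;
      this.toSnapshotName = b.toSnapshotName;
      // Defensive copy so later changes to the builder don't leak in.
      this.diffList =
          Collections.unmodifiableList(new ArrayList<>(b.diffList));
    }
  }

  /** Named setters replace a long positional constructor call. */
  static final class Builder {
    private String snapshotRoot;
    private String volumeName;
    private String bucketName;
    private String fromSnapshotName;
    private String toSnapshotName;
    private List<String> diffList = new ArrayList<>();

    Builder setSnapshotRoot(String v) { snapshotRoot = v; return this; }
    Builder setVolumeName(String v) { volumeName = v; return this; }
    Builder setBucketName(String v) { bucketName = v; return this; }
    Builder setFromSnapshotName(String v) { fromSnapshotName = v; return this; }
    Builder setToSnapshotName(String v) { toSnapshotName = v; return this; }
    Builder setDiffList(List<String> v) { diffList = v; return this; }

    Report build() { return new Report(this); }
  }

  public static void main(String[] args) {
    // An empty report, as in the CANCELED case: diffList stays empty.
    Report r = new Builder()
        .setVolumeName("vol1")
        .setBucketName("bucket1")
        .setFromSnapshotName("snap1")
        .setToSnapshotName("snap2")
        .build();
    System.out.println(r.volumeName + "/" + r.bucketName
        + " diffs=" + r.diffList.size());
  }
}
```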
...one/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java
@smengcl Thanks for the review, I've addressed all your comments and made the changes. The comment about the test and its flakiness is still pending.
smengcl
left a comment
lgtm pending CI
Thanks @xBis7 for the PR. Thanks @hemantk-12 for reviewing this. We can file a follow-up jira to deal with the potential flakiness if needed.
@smengcl @hemantk-12 Thanks for the reviews.
What changes were proposed in this pull request?
This PR adds an option to the ozone sh snapshot snapshotDiff command to cancel an IN_PROGRESS snapshotDiff job. If the option is used and the job is IN_PROGRESS, then the status is updated to CANCELED.
The part of the code that might take up the most resources and cause the delay has been refactored so that we can keep checking if the JobStatus is CANCELED before every method call.
If the job is canceled, then the method doing the calculations returns and the job remains CANCELED until the SnapshotDiffCleanupService deletes it from the snapDiffJobTable and the user can resubmit it.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8490
How was this patch tested?
This patch was tested with new unit and integration tests. It was also tested manually using the docker dev environment.
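The cancellation check described under "What changes were proposed" can be pictured with a small standalone sketch. This is an illustration under assumptions, not the code in SnapshotDiffManager: the step names, the JobStatus enum, and the cancelAfterStep parameter (which simulates a user cancel request arriving mid-job) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CancelCheckSketch {
  // Hypothetical stand-in for the real job status values.
  enum JobStatus { IN_PROGRESS, CANCELED, DONE }

  /**
   * Runs the fake steps in order, checking for cancellation before each
   * one. Returns the names of the steps that actually ran.
   */
  static List<String> runDiff(AtomicReference<JobStatus> status,
      int cancelAfterStep) {
    // Hypothetical step names standing in for the expensive method calls.
    String[] steps = {"loadSnapshots", "scanKeys", "buildReport"};
    List<String> completed = new ArrayList<>();
    for (int i = 0; i < steps.length; i++) {
      if (status.get() == JobStatus.CANCELED) {
        // Return early; the status stays CANCELED for a later cleanup.
        return completed;
      }
      completed.add(steps[i]);
      if (i + 1 == cancelAfterStep) {
        // Simulate a cancel request arriving after this step.
        status.compareAndSet(JobStatus.IN_PROGRESS, JobStatus.CANCELED);
      }
    }
    // Mark DONE only if the job was not canceled in the meantime.
    status.compareAndSet(JobStatus.IN_PROGRESS, JobStatus.DONE);
    return completed;
  }

  public static void main(String[] args) {
    AtomicReference<JobStatus> canceled =
        new AtomicReference<>(JobStatus.IN_PROGRESS);
    System.out.println(runDiff(canceled, 1) + " " + canceled.get());

    AtomicReference<JobStatus> finished =
        new AtomicReference<>(JobStatus.IN_PROGRESS);
    System.out.println(runDiff(finished, 0) + " " + finished.get());
  }
}
```

Checking between steps means a cancel takes effect only at step boundaries, so a single long-running step still runs to completion before the job stops.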