-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-13320. BackgroundService should wait for the completion of run before triggering next run #8677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…efore triggering next run Change-Id: I2569ad4a0ab27e900c4f6d3cae98b0e4735503ca
Change-Id: I5f4e826ca73cc43d5e81b4df7289382bf31cf510
| }, 100, 3000); | ||
| Thread.sleep(3000); | ||
| assertEquals(values, IntStream.range(0, 10).boxed().map(map::get).collect(Collectors.toList())); | ||
| lockList.forEach(Lock::unlock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit :
check that the even indices are also processed after unlock and maybe check for size of values (5)before and after unlock (10)
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @swamirishi for the patch.
| } catch (Throwable e) { | ||
| LOG.error("Background service execution failed", e); | ||
| } finally { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-throw Error.
| } finally { | ||
| long endTime = System.nanoTime(); | ||
| if (endTime - serviceStartTime > serviceTimeoutInNanos) { | ||
| LOG.warn("{} Background service execution failed which took {}ns > {}ns(timeout)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we get here only if execution failed?
| } | ||
| return true; | ||
| }, 100, 3000); | ||
| Thread.sleep(3000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use fixed sleep.
| lockList.forEach(Lock::unlock); | ||
| backgroundService.shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release/shutdown in finally.
| if (i % 2 == 1 && map.get(i) != 1) { | ||
| return false; | ||
| } | ||
| if (i % 2 == 0 && map.get(i) != 0) { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i % 2 and map.get(i) can be compared directly.
| Map<Integer, Integer> map = new HashMap<>(); | ||
| Map<Integer, Lock> locks = new HashMap<>(); | ||
| for (int i = 0; i < 10; i++) { | ||
| int index = i; | ||
| locks.put(index, new ReentrantLock()); | ||
| map.put(index, 0); | ||
| queue.add(() -> { | ||
| locks.get(index).lock(); | ||
| try { | ||
| map.compute(index, (k, v) -> v == null ? 1 : (v + 1)); | ||
| } finally { | ||
| locks.get(index).unlock(); | ||
| } | ||
| return new BackgroundTaskResult.EmptyTaskResult(); | ||
| }); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify the test case, please extract an inner class for the task, which should keep track of its invocations, and support locking.
|
Thanks for the review @sadanand48 and @adoroszlai. Could you review the updated patch? |
|
@jojochuang and @smengcl could you also review this patch? |
|
Please do take a look at CI results in fork. https://github.com/swamirishi/ozone/actions/runs/16352154299/job/46201433990 https://github.com/swamirishi/ozone/actions/runs/16352154299/job/46201434007#step:16:17 |
jojochuang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
… finish before launching next set and exiting.
|
Updated the patch. Instead of the runner waiting for all the tasks to complete at the end,
|
|
findbugs: M M IS: Inconsistent synchronization of org.apache.hadoop.hdds.utils.BackgroundService.exec; locked 75% of time Unsynchronized access at BackgroundService.java:[line 92] @SaketaChalamchala please run ./hadoop-ozone/dev-support/checks/findbugs.sh locally to confirm the next fix. Thanks! |
Change-Id: I1dd2167205b3ba95a871882eb29b312387aa70d3
swamirishi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM excepting for a few nitpicky comments.
|
|
||
| public abstract BackgroundTaskQueue getTasks(); | ||
|
|
||
| protected void execTaskCompletion() { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is required
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this so that it might be useful to perform post task completion steps like updating last run metrics or incrementing run count or processing the result of the background tasks if necessary
| } catch (RuntimeException e) { | ||
| LOG.error("Background service execution failed.", e); | ||
| } finally { | ||
| execTaskCompletion(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simply addAndIncreament while submitting
|
Looks good to me. I'll merge it to unblock #8655 |
When merging, please make sure to credit contributors. @SaketaChalamchala had 7 of 10 commits in this PR. |
…efore triggering next run (apache#8677) Co-authored-by: Wei-Chiu Chuang <weichiu@apache.org>
| List<? extends DatanodeDetails> dnList = cluster().getStorageContainerManager() | ||
| .getScmNodeManager() | ||
| .getAllNodes(); | ||
| NodeManager nodeManager = cluster().getStorageContainerManager().getScmNodeManager(); | ||
| List<? extends DatanodeDetails> dnList = nodeManager.getAllNodes(); | ||
|
|
||
| for (DatanodeDetails dn : dnList) { | ||
| final int expected = cluster().getStorageContainerManager().getScmNodeManager().getContainers(dn).size(); | ||
|
|
||
| List<HddsProtos.DatanodeUsageInfoProto> usageInfoList = | ||
| storageClient.getDatanodeUsageInfo( | ||
| dn.getIpAddress(), dn.getUuidString()); | ||
|
|
||
| assertEquals(1, usageInfoList.size()); | ||
| assertEquals(expected, usageInfoList.get(0).getContainerCount()); | ||
| assertEquals(nodeManager.getContainers(dn).size(), usageInfoList.get(0).getContainerCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change in TestContainerOperations seems completely unrelated to the rest of the PR.
…efore triggering next run (apache#8677) Co-authored-by: Wei-Chiu Chuang <weichiu@apache.org>
…ion of run before triggering next run (apache#8677) Co-authored-by: Wei-Chiu Chuang <weichiu@apache.org> (cherry picked from commit ce5676a) Conflicts: hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/BackgroundService.java hadoop-hdds/common/src/main/java/org/apache/hadoop/util/TestBackgroundService.java hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestContainerOperations.java Change-Id: Ie41fb7a56755364a31aefe6d8936a5849bf03b42
What changes were proposed in this pull request?
Currently even though the java doc of the base class BackgroundService says it would wait for the entire run to finish before triggering the next run, it doesn't wait as it just submits the tasks to the queue and doesn't track if those tasks have finished.
Added a wait at the end of CompletableFuture to fix it.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13320
How was this patch tested?
No unit tests added unit test for the same.