Conversation

@symious
Contributor

@symious symious commented Sep 19, 2022

What changes were proposed in this pull request?

This ticket adds a DiskBalancerService on the Datanode to do the actual balancing work.

The points of this PR are as follows:

  1. A new background service, "DiskBalancerService", is added to the datanode.
  2. The service has 4 parameters: shouldRun, threshold, bandwidthInMB, parallelThread. These parameters are updated by requests from SCM. When an update is received, the latest parameters are persisted to a YAML file, and this file is also loaded when the datanode starts.
  3. As a background service, its main procedure is invoked at an interval and performs the following steps:
    1. Check "shouldRun": if it's false, skip this loop, i.e. do no disk-balancing work.
    2. Check "bandwidth": a counter records the bytes balanced within a window, and it is used to calculate the next available time to run a balance job based on the "bandwidthInMB" limit.
    3. Check the available thread count, and try to find the same number of volume pairs to start balance jobs on.
  4. A balance job first copies the container to the destination volume's tmp directory; then a new container is loaded and replaces the original one.
  5. During balancing, a set of inProgressContainers is maintained. For deleteBlock and move-container commands from SCM, the service checks whether the related container is being balanced; if so, the command is requeued, up to a maximum requeue limit. Since the bandwidth limit produces a "quick balance, long delay" pattern, this requeueing won't last too long.
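The gating in steps 3.1–3.3 can be sketched roughly as below. This is a minimal illustration with hypothetical names (`DiskBalancerLoopSketch`, `tasksToStart`, `bytesMovedInWindow`), not code from this PR:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the service's per-interval gating: skip when
// disabled, back off while the bandwidth window is exhausted, otherwise
// start as many tasks as there are free threads.
public class DiskBalancerLoopSketch {
  private volatile boolean shouldRun = true;
  private final long bandwidthInMB = 10;        // per-window budget
  private final AtomicLong bytesMovedInWindow = new AtomicLong();

  /** Number of balance tasks this iteration may start (0 means skip). */
  public int tasksToStart(int parallelThread, int busyThreads) {
    if (!shouldRun) {
      return 0;                                 // balancer disabled via SCM
    }
    long budgetBytes = bandwidthInMB * 1024 * 1024;
    if (bytesMovedInWindow.get() >= budgetBytes) {
      return 0;                                 // wait for the next window
    }
    return Math.max(0, parallelThread - busyThreads);
  }

  public void setShouldRun(boolean b) { shouldRun = b; }
  public void addBytesMoved(long n) { bytesMovedInWindow.addAndGet(n); }
}
```

The "quick balance, long delay" behavior falls out of the second check: a large move exhausts the window budget quickly, and subsequent iterations return 0 until the window rolls over.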

For easier review of this feature, this ticket will be split into some smaller tickets.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7233

How was this patch tested?

Unit tests.

@symious
Contributor Author

symious commented Sep 19, 2022

@ChenSammi @ferhui @lokeshj1703 @siddhantsangwan @sodonnel @neils-dev @JacksonYao287 Could you help to review this PR?

Contributor

@ferhui ferhui left a comment


Great! Minor comments here

@siddhantsangwan
Contributor

This is a pretty big PR. @symious would it be possible for you to give a short summary of the changes so it's easier to review?

@symious
Contributor Author

symious commented Sep 23, 2022

@siddhantsangwan Sure, added a summary in the PR's description, please have a look.

@ferhui
Contributor

ferhui commented Sep 26, 2022

@xichen01 could you please review this PR?

Contributor

@lokeshj1703 lokeshj1703 left a comment


@symious Thanks for working on this! This is a huge PR; I was only able to review it partially and will need to do multiple passes on it. I have a few comments inline.

Comment on lines 115 to 118
if (shouldRequeue(command, context, container)) {
context.requeueCommand(command);
return;
}
Contributor

For DeleteBlocks we retry the command from SCM. I am not sure if we need a requeue here.
I think we can probably send failure to SCM in this case and add a failure message in a later PR?

Contributor Author

Currently the requeue applies to the whole request. If a requeue is not needed, I think we can simply skip the containers that are being balanced and let SCM resend requests for them.
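The skip-and-let-SCM-retry idea could look roughly like this. All names here (`AbandonFilterSketch`, `filterTransactions`, `inProgressContainers`) are hypothetical, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: instead of requeueing the whole request, drop only
// the transactions whose container is currently being balanced and rely on
// SCM's own retry mechanism to resend them later.
public class AbandonFilterSketch {
  public static List<Long> filterTransactions(List<Long> containerIds,
      Set<Long> inProgressContainers) {
    List<Long> kept = new ArrayList<>();
    for (long id : containerIds) {
      if (!inProgressContainers.contains(id)) {
        kept.add(id);           // container not being balanced: process now
      }                         // else: skip it, SCM will resend
    }
    return kept;
  }
}
```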

Comment on lines 70 to 72
if (shouldRequeue(command, context, ozoneContainer)) {
context.requeueCommand(command);
}
Contributor

Can we check if DeleteContainer request is retried from SCM? I think it is retried.

Contributor Author

If the delete request is abandoned by the datanode, SCM might need to wait for the next container report before resending the request.

Contributor

I think in that case we do not need the requeue functionality. Requeue itself is a complex mechanism because it will keep occurring in a loop.
Why aren't we blocking the operations on the container lock instead of using the requeue mechanism? Is it because the handler thread would get blocked in that case?

Contributor Author

Yes, blocking was the original idea, but it might cause the handler's thread pool to be blocked.

Contributor

@lokeshj1703 lokeshj1703 Oct 11, 2022

What is the approach we are going with here then? Can we also update the docs once the PR is merged?
If we fail the operation in DN, then SCM would at least know that disk balancer is running. Later we can add a change to avoid block deletions for that specific container in SCM (Since SCM has a retry count for every transaction, we do not want transaction retry count to be exhausted).

Contributor Author

I will change the PR later to abandon the requests.

Can we also update the docs once the PR is merged?

Sure.

Comment on lines 310 to 311
Pair<HddsVolume, HddsVolume> pair =
DiskBalancerUtils.getVolumePair(volumeSet, threshold, deltaSizes);
Contributor

How are we making sure the same volume is not part of different tasks?

Contributor

Also, can we add a policy for choosing the source and target volumes, and a policy for choosing which container to move between those volumes? We can probably do it in a separate PR.

Contributor

I think new tasks should not be added until the older tasks finish for better consistency.

Contributor Author

Here we have deltaSizes to maintain consistency; it helps produce a correct calculation when deciding the source and target volumes.
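As a rough illustration of how a deltaSizes adjustment could keep the selection consistent (hypothetical names; the real `DiskBalancerUtils.getVolumePair` may work differently):

```java
import java.util.Map;

// Hypothetical sketch: effective usage = reported usage + pending delta
// from in-flight moves, so concurrent tasks compute against a view that
// already accounts for data currently being shifted between volumes.
public class VolumePairSketch {
  /** Pick the most-used volume (after delta adjustment) as the source. */
  public static String pickSource(Map<String, Long> usedBytes,
      Map<String, Long> deltaSizes) {
    String source = null;
    long max = Long.MIN_VALUE;
    for (Map.Entry<String, Long> e : usedBytes.entrySet()) {
      long effective = e.getValue() + deltaSizes.getOrDefault(e.getKey(), 0L);
      if (effective > max) {
        max = effective;
        source = e.getKey();
      }
    }
    return source;
  }
}
```

For example, a volume that reports 100 GB used but has a pending move of 60 GB away would count as 40 GB, so it would not be chosen as the source again while that move is in flight.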

Contributor

  1. I think we can still have multiple tasks trying to move containers from the same volume. I feel we should have only one task per volume at a time. Since this is a disk-heavy operation, multiple tasks will not help.
  2. Also, it would be good to have policies for choosing the volumes to balance. It can be a simple policy for now, but it would avoid refactoring in the future.
  3. We should have policies for choosing the container to move. Some containers could be undergoing replication to other DNs; we could avoid them.
  4. We shouldn't allow new tasks to be added until the older tasks finish.

I would suggest breaking this PR into smaller PRs to reduce the scope of a single PR.

Contributor Author

I think we can still have multiple tasks trying to move containers from the same volume. I feel we should have only one task per volume at a time.

Sorry, I didn't get the idea from this line; it seems a little contradictory.

Also it would be good to have policies for choosing the volumes to balance. It can be a simple policy for now but it would avoid refactoring in future.

Sure, I will add a policy class here.

We should have policies for choosing the container to move. Some containers could be undergoing replication to other DNs. We could avoid them.

Sure, I will also add a policy class for containers.

  1. We shouldn't allow new tasks to be added until older tasks finish.

I think this is currently controlled by parallelThread. There will only be new tasks when a thread is available.

I would suggest breaking this PR into smaller PRs to reduce the scope of a single PR.

Yes, it's a big PR. Could you give some advice on splitting it? I find the service's pieces quite tightly coupled.

Contributor Author

@lokeshj1703 Updated the PR, please have a look.

Contributor

I mean we should have only one task involving a particular volume at a time. Multiple tasks could also lead to consistency issues, since deltaSizes would be updated by two tasks. Also, the deltaSizes parameter would not be needed in that case.
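One way to enforce "one task per volume" would be to reserve both endpoints atomically before a task starts. A sketch with hypothetical names, not part of this PR:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a task may start only if neither its source nor
// its destination volume is already involved in another balance task.
public class VolumeReservationSketch {
  private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

  /** Reserve both volumes atomically; returns false if either is busy. */
  public synchronized boolean tryReserve(String source, String dest) {
    if (inFlight.contains(source) || inFlight.contains(dest)) {
      return false;
    }
    inFlight.add(source);
    inFlight.add(dest);
    return true;
  }

  public synchronized void release(String source, String dest) {
    inFlight.remove(source);
    inFlight.remove(dest);
  }
}
```

With a reservation like this, each volume appears in at most one running task, so per-task usage corrections such as deltaSizes would no longer be needed.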

Regarding splitting the PR, I think we could have multiple PRs:

  1. Adding policies on containers and volumes
  2. Supporting cross volume move in Datanode
  3. Configuration and start/stop balancer
  4. Algorithm for balancer

It would be easier to review and also to add tests for every component this way.

Contributor Author

@lokeshj1703 Raised a new ticket HDDS-7383 for better review.
Please have a look. The other PRs will be on the way.

HddsVolume sourceVolume = pair.getLeft(), destVolume = pair.getRight();
Iterator<Container<?>> itr = ozoneContainer.getController()
.getContainers(sourceVolume);
while (itr.hasNext()) {
Contributor

There should be a limit on how many containers / how much data we want to move per volume in a single interval.
Also, we do not want parallel tasks trying to move containers between the same pair of volumes.

Contributor

It would be better to have one task per pair of volumes.

Contributor Author

Each task balances only one container; it breaks out of the while loop once a suitable container is found.

Contributor

Could we define the algorithm in the document around how the balancing would happen?
I think the limits in the algorithm are defined by the bandwidth and parallel threads.

Comment on lines +407 to +412
ContainerData originalContainerData = ContainerDataYaml
.readContainerFile(containerFile);
Container newContainer = ozoneContainer.getController()
.importContainer(originalContainerData, diskBalancerDestDir);
newContainer.getContainerData().getVolume()
.incrementUsedSpace(containerSize);
Contributor

Do we need this? If we are copying the entire container contents, these should be copied as well?

Contributor Author

The content is a little different; especially for V3 containers, the db content needs to be imported again.

}

@Override
public BackgroundTaskResult call() {
Contributor

Can we simplify this function and probably move it to utils or make it part of KeyValueContainer.. class itself? I think we should just have an API call here.

Contributor Author

Could you elaborate on this idea?

Since this operation sits at a higher level than KeyValueContainer and needs the help of ContainerSet, I find it hard to generalize this function into a util class.

Contributor

I would try looking for an appropriate class for this. I think the move logic should not be present in the disk balancer and there should be separate implementation for the different container versions.

Contributor

Can you please take a look at KeyValueContainerUtil?

Comment on lines +41 to +44
// The path where datanode diskBalancer's conf is to be written to.
public static final String HDDS_DATANODE_DISK_BALANCER_INFO_DIR =
"hdds.datanode.disk.balancer.info.dir";

Contributor

Can we use the same format as configs below?

Contributor Author

It will be a path like "ozone.metadata.dirs"; the default path will not be a common value.

@ferhui
Contributor

ferhui commented Oct 20, 2022

@lokeshj1703 Thanks for your detailed review! Do you have other comments?

@DaveTeng0
Contributor

@ChenSammi @lokeshj1703 @siddhantsangwan @sodonnel @neils-dev @JacksonYao287 Could you help to review this PR?

Hey guys, let's start gradually digesting this PR together!

@xBis7
Contributor

xBis7 commented Jan 30, 2023

Raised a new ticket HDDS-7383 for better review.
Please have a look. The other PRs will be on the way.

@symious Can you add to the PR's description or as a comment, the new tickets that were raised from this PR?

@symious
Contributor Author

symious commented Jan 31, 2023

Can you add to the PR's description or as a comment, the new tickets that were raised from this PR?

@xBis7 Sure, updated in the description.

@kerneltime
Contributor

@symious Can you please rebase? Your branch is 400+ commits behind master.

destVolume.getHddsRootDir().toString(), idDir,
containerData.getContainerID()));
Path destDirParent = diskBalancerDestDir.getParent();
if (destDirParent != null) {
Contributor

should this be '== null' ?

@symious
Contributor Author

symious commented Feb 7, 2023

can you please rebase, your branch is 400+ commits behind master.

Sure, will try to do the rebase soon.

DeletedBlocksTransaction tx = containerBlocks.get(i);
// Container is being balanced, ignore this transaction
if (shouldAbandon(tx.getContainerID(), cmd.getContainer())) {
continue;
Contributor

Shouldn't we be throwing some kind of exception here to tell the user that the operation can't be performed right now, including a detailed reason message? (The same note applies to DeleteContainerCommandHandler.)

Contributor Author

@vtutrinov Thank you for the review. This PR has been separated into smaller PRs; could you help to review #4887?

I will close this PR.

@symious symious closed this Jul 3, 2023
@symious
Contributor Author

symious commented Jul 3, 2023

This PR has been split into #4887 and #3874, so I'm closing this one.
