Conversation

@symious
Contributor

@symious symious commented Sep 19, 2022

What changes were proposed in this pull request?

This ticket adds a DiskBalancerService on the Datanode to do the actual balancing work.

The points of this PR are as follows:

  1. A new background service, "DiskBalancerService", is added to the datanode.
  2. The service has 4 parameters: shouldRun, threshold, bandwidthInMB, parallelThread. These parameters are updated by requests from SCM. When an update is received, the latest parameters are persisted to a YAML file, and this file is also loaded when the datanode starts.
  3. As a background service, its main procedure is invoked at an interval and performs the following steps:
    1. Check "shouldRun": if it's false, skip this loop, i.e. do no disk-balancing work.
    2. Check "bandwidth": a counter records the bytes balanced within a window, and it is used to calculate the next available time to run a balance job based on the "bandwidthInMB" limit.
    3. Check the available thread count, and try to find the same number of volume pairs to start balance jobs on.
  4. A balance job first copies the container to the destination volume's tmp directory; then a new container is loaded and replaces the original one.
  5. During balancing, a set of inProgressContainers is maintained. For deleteBlock and move-container commands from SCM, the service checks whether the related container is being balanced; if so, the command is requeued, up to a maximum requeue limit. Since the bandwidth limit produces a "quick balance, long delay" pattern, this requeueing won't last too long.
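The gating in steps 3.1–3.3 can be sketched roughly as below. This is a minimal illustration with hypothetical names (`DiskBalancerLoopSketch`, `tasksToStart`, `bytesMovedInWindow`), not code from this PR:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the service's per-interval gating: skip when
// disabled, back off while the bandwidth window is exhausted, otherwise
// start as many tasks as there are free threads.
public class DiskBalancerLoopSketch {
  private volatile boolean shouldRun = true;
  private final long bandwidthInMB = 10;        // per-window budget
  private final AtomicLong bytesMovedInWindow = new AtomicLong();

  /** Number of balance tasks this iteration may start (0 means skip). */
  public int tasksToStart(int parallelThread, int busyThreads) {
    if (!shouldRun) {
      return 0;                                 // balancer disabled via SCM
    }
    long budgetBytes = bandwidthInMB * 1024 * 1024;
    if (bytesMovedInWindow.get() >= budgetBytes) {
      return 0;                                 // wait for the next window
    }
    return Math.max(0, parallelThread - busyThreads);
  }

  public void setShouldRun(boolean b) { shouldRun = b; }
  public void addBytesMoved(long n) { bytesMovedInWindow.addAndGet(n); }
}
```

The "quick balance, long delay" behavior falls out of the second check: a large move exhausts the window budget quickly, and subsequent iterations return 0 until the window rolls over.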

For easier review of this feature, this ticket will be split into some smaller tickets.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-7233

How was this patch tested?

Unit tests.

@symious
Contributor Author

symious commented Sep 19, 2022

@ChenSammi @ferhui @lokeshj1703 @siddhantsangwan @sodonnel @neils-dev @JacksonYao287 Could you help to review this PR?

Contributor

@ferhui ferhui left a comment


Great! Minor comments here

@siddhantsangwan
Contributor

This is a pretty big PR. @symious would it be possible for you to give a short summary of the changes so it's easier to review?

@symious
Contributor Author

symious commented Sep 23, 2022

@siddhantsangwan Sure, added a summary in the PR's description, please have a look.

@ferhui
Contributor

ferhui commented Sep 26, 2022

@xichen01 could you please review this PR?

Contributor

@lokeshj1703 lokeshj1703 left a comment


@symious Thanks for working on this! This is a huge PR; I was only able to review it partially and will need to do multiple passes on it. I have a few comments inline.

Comment on lines 115 to 118
if (shouldRequeue(command, context, container)) {
context.requeueCommand(command);
return;
}
Contributor

For DeleteBlocks we retry the command from SCM. I am not sure if we need a requeue here.
I think we can probably send failure to SCM in this case and add a failure message in a later PR?

Contributor Author

Currently the requeue applies to the whole request. If a requeue is not needed, I think we can simply skip the containers that are being balanced and let SCM resend requests for them.
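The skip-and-let-SCM-retry idea could look roughly like this. All names here (`AbandonFilterSketch`, `filterTransactions`, `inProgressContainers`) are hypothetical, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: instead of requeueing the whole request, drop only
// the transactions whose container is currently being balanced and rely on
// SCM's own retry mechanism to resend them later.
public class AbandonFilterSketch {
  public static List<Long> filterTransactions(List<Long> containerIds,
      Set<Long> inProgressContainers) {
    List<Long> kept = new ArrayList<>();
    for (long id : containerIds) {
      if (!inProgressContainers.contains(id)) {
        kept.add(id);           // container not being balanced: process now
      }                         // else: skip it, SCM will resend
    }
    return kept;
  }
}
```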

Comment on lines 70 to 72
if (shouldRequeue(command, context, ozoneContainer)) {
context.requeueCommand(command);
}
Contributor

Can we check if DeleteContainer request is retried from SCM? I think it is retried.

Contributor Author

If the delete request is abandoned by the datanode, SCM might need to wait for the next container report before resending the request.

Contributor

I think in that case we do not need the requeue functionality. Requeue itself is a complex mechanism because it will keep occurring in a loop.
Why aren't we blocking the operations on the container lock instead of using the requeue mechanism? Is it because the handler thread would get blocked in that case?

Contributor Author

Yes, blocking was the original idea, but it might cause the handler's thread pool to be blocked.

Contributor

@lokeshj1703 lokeshj1703 Oct 11, 2022

What is the approach we are going with here then? Can we also update the docs once the PR is merged?
If we fail the operation in DN, then SCM would at least know that disk balancer is running. Later we can add a change to avoid block deletions for that specific container in SCM (Since SCM has a retry count for every transaction, we do not want transaction retry count to be exhausted).

Contributor Author

I will change the PR later to abandon the requests.

Can we also update the docs once the PR is merged?

Sure.

Comment on lines 310 to 311
Pair<HddsVolume, HddsVolume> pair =
DiskBalancerUtils.getVolumePair(volumeSet, threshold, deltaSizes);
Contributor

How are we making sure the same volume is not part of different tasks?

Contributor

Also, can we add a policy for choosing the source and target volumes, and a policy for choosing which container to move between those volumes? We can probably do it in a separate PR.

Contributor

I think new tasks should not be added until the older tasks finish for better consistency.

Contributor Author

Here we have deltaSizes to maintain consistency; it helps produce a correct calculation when deciding the source and target volumes.
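As a rough illustration of how a deltaSizes adjustment could keep the selection consistent (hypothetical names; the real `DiskBalancerUtils.getVolumePair` may work differently):

```java
import java.util.Map;

// Hypothetical sketch: effective usage = reported usage + pending delta
// from in-flight moves, so concurrent tasks compute against a view that
// already accounts for data currently being shifted between volumes.
public class VolumePairSketch {
  /** Pick the most-used volume (after delta adjustment) as the source. */
  public static String pickSource(Map<String, Long> usedBytes,
      Map<String, Long> deltaSizes) {
    String source = null;
    long max = Long.MIN_VALUE;
    for (Map.Entry<String, Long> e : usedBytes.entrySet()) {
      long effective = e.getValue() + deltaSizes.getOrDefault(e.getKey(), 0L);
      if (effective > max) {
        max = effective;
        source = e.getKey();
      }
    }
    return source;
  }
}
```

For example, a volume that reports 100 GB used but has a pending move of 60 GB away would count as 40 GB, so it would not be chosen as the source again while that move is in flight.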

Contributor

  1. I think we can still have multiple tasks trying to move containers from the same volume. I feel we should have only one task per volume at a time. Since this is a disk-heavy operation, multiple tasks will not help.
  2. Also, it would be good to have policies for choosing the volumes to balance. It can be a simple policy for now, but it would avoid refactoring in the future.
  3. We should have policies for choosing the container to move. Some containers could be undergoing replication to other DNs; we could avoid them.
  4. We shouldn't allow new tasks to be added until the older tasks finish.

I would suggest breaking this PR into smaller PRs to reduce the scope of a single PR.

Contributor Author

I think we can still have multiple tasks trying to move containers from the same volume. I feel we should have only one task per volume at a time.

Sorry, I didn't get the idea from this line; it seems a little contradictory.

Also it would be good to have policies for choosing the volumes to balance. It can be a simple policy for now but it would avoid refactoring in future.

Sure, I will add a policy class here.

We should have policies for choosing the container to move. Some containers could be undergoing replication to other DNs. We could avoid them.

Sure, I will also add a policy class for containers.

  1. We shouldn't allow new tasks to be added until older tasks finish.

I think this is currently controlled by parallelThread. There will only be new tasks when a thread is available.

I would suggest breaking this PR into smaller PRs to reduce the scope of a single PR.

Yes, it's a big PR. Could you give some advice on splitting it? I find the service's pieces quite tightly coupled.

Contributor Author

@lokeshj1703 Updated the PR, please have a look.

Contributor

I mean we should have only one task involving a particular volume at a time. Multiple tasks could also lead to consistency issues, since deltaSizes would be updated by two tasks. Also, the deltaSizes parameter would not be needed in that case.
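One way to enforce "one task per volume" would be to reserve both endpoints atomically before a task starts. A sketch with hypothetical names, not part of this PR:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a task may start only if neither its source nor
// its destination volume is already involved in another balance task.
public class VolumeReservationSketch {
  private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

  /** Reserve both volumes atomically; returns false if either is busy. */
  public synchronized boolean tryReserve(String source, String dest) {
    if (inFlight.contains(source) || inFlight.contains(dest)) {
      return false;
    }
    inFlight.add(source);
    inFlight.add(dest);
    return true;
  }

  public synchronized void release(String source, String dest) {
    inFlight.remove(source);
    inFlight.remove(dest);
  }
}
```

With a reservation like this, each volume appears in at most one running task, so per-task usage corrections such as deltaSizes would no longer be needed.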

Regarding splitting the PR, I think we could have multiple PRs:

  1. Adding policies on containers and volumes
  2. Supporting cross volume move in Datanode
  3. Configuration and start/stop balancer
  4. Algorithm for balancer

It would be easier to review and also to add tests for every component this way.

Contributor Author

@lokeshj1703 Raised a new ticket HDDS-7383 for better review.
Please have a look. The other PRs will be on the way.

HddsVolume sourceVolume = pair.getLeft(), destVolume = pair.getRight();
Iterator<Container<?>> itr = ozoneContainer.getController()
.getContainers(sourceVolume);
while (itr.hasNext()) {
Contributor

There should be a limit on how many containers / how much data we want to move per volume in a single interval.
Also, we do not want parallel tasks trying to move containers between the same pair of volumes.

Contributor

It would be better to have one task per pair of volumes.

Contributor Author

Each task balances only one container; it breaks out of the while loop once a suitable container is found.

Contributor

Could we define the algorithm in the document around how the balancing would happen?
I think the limits in the algorithm are defined by the bandwidth and parallel threads.

Comment on lines +407 to +412
ContainerData originalContainerData = ContainerDataYaml
.readContainerFile(containerFile);
Container newContainer = ozoneContainer.getController()
.importContainer(originalContainerData, diskBalancerDestDir);
newContainer.getContainerData().getVolume()
.incrementUsedSpace(containerSize);
Contributor

Do we need this? If we are copying the entire container contents, these should be copied as well?

Contributor Author

The content is a little different; especially for V3 containers, the db content needs to be imported again.

}

@Override
public BackgroundTaskResult call() {
Contributor

Can we simplify this function and probably move it to utils or make it part of KeyValueContainer.. class itself? I think we should just have an API call here.

Contributor Author

Could you elaborate on this idea?

Since this operation sits at a higher level than KeyValueContainer and needs the help of ContainerSet, I find it hard to generalize this function into a util class.

Contributor

I would try looking for an appropriate class for this. I think the move logic should not be present in the disk balancer and there should be separate implementation for the different container versions.

Contributor

Can you please take a look at KeyValueContainerUtil?

Comment on lines +41 to +44
// The path where datanode diskBalancer's conf is to be written to.
public static final String HDDS_DATANODE_DISK_BALANCER_INFO_DIR =
"hdds.datanode.disk.balancer.info.dir";

Contributor

Can we use the same format as configs below?

Contributor Author

It will be a path like "ozone.metadata.dirs"; the default path will not be a common value.

@ferhui
Contributor

ferhui commented Oct 20, 2022

@lokeshj1703 Thanks for your detailed review! Do you have other comments?

@DaveTeng0
Contributor

@ChenSammi @lokeshj1703 @siddhantsangwan @sodonnel @neils-dev @JacksonYao287 Could you help to review this PR?

Hey guys, let's start gradually digesting this PR together!

@xBis7
Contributor

xBis7 commented Jan 30, 2023

Raised a new ticket HDDS-7383 for better review.
Please have a look. The other PRs will be on the way.

@symious Can you add to the PR's description or as a comment, the new tickets that were raised from this PR?

@symious
Contributor Author

symious commented Jan 31, 2023

Can you add to the PR's description or as a comment, the new tickets that were raised from this PR?

@xBis7 Sure, updated in the description.

@kerneltime
Contributor

@symious Can you please rebase? Your branch is 400+ commits behind master.

destVolume.getHddsRootDir().toString(), idDir,
containerData.getContainerID()));
Path destDirParent = diskBalancerDestDir.getParent();
if (destDirParent != null) {
Contributor

should this be '== null' ?

@symious
Contributor Author

symious commented Feb 7, 2023

can you please rebase, your branch is 400+ commits behind master.

Sure, will try to do the rebase soon.

DeletedBlocksTransaction tx = containerBlocks.get(i);
// Container is being balanced, ignore this transaction
if (shouldAbandon(tx.getContainerID(), cmd.getContainer())) {
continue;
Contributor

Shouldn't we be throwing some kind of exception here to tell the user that the operation can't be performed right now, including a detailed reason message? (The same note applies to DeleteContainerCommandHandler.)

Contributor Author

@vtutrinov Thank you for the review. This PR has been separated into smaller PRs; could you help to review #4887?

I will close this PR.

@symious symious closed this Jul 3, 2023
@symious
Contributor Author

symious commented Jul 3, 2023

This PR has been split into #4887 and #3874, so I'm closing this one.
