HDDS-7233. Add DiskBalancerService on Datanode #3760
Conversation
@ChenSammi @ferhui @lokeshj1703 @siddhantsangwan @sodonnel @neils-dev @JacksonYao287 Could you help to review this PR?
ferhui left a comment:
Great! Minor comments here
Resolved review threads:
...tainer-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java
...iner-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
...ce/src/test/java/org/apache/hadoop/ozone/container/diskbalancer/TestDiskBalancerService.java
This is a pretty big PR. @symious would it be possible for you to give a short summary of the changes so it's easier to review?
@siddhantsangwan Sure, added a summary in the PR description, please have a look.
@xichen01 could you please review this PR?
Resolved review threads (outdated):
...ervice/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java
lokeshj1703 left a comment:
@symious Thanks for working on this! This is a huge PR. I was able to review part of it and will need to do multiple passes; I have a few comments inline.
Resolved review threads:
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsConfigKeys.java
...tainer-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java
```java
if (shouldRequeue(command, context, container)) {
  context.requeueCommand(command);
  return;
}
```
For DeleteBlocks we retry the command from SCM. I am not sure if we need a requeue here.
I think we can probably send failure to SCM in this case and add a failure message in a later PR?
Currently the requeue is for the whole request. If a requeue is not needed, I think we can omit only the containers that are being balanced, and let SCM resend requests for them.
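A rough sketch of that option (the helper names here, getBalancingContainers and processTransaction, are illustrative, not the final API):

```java
// Sketch: handle the transactions whose container is not currently being balanced,
// skip the rest, and rely on SCM resending the skipped transactions later.
Set<Long> balancing = diskBalancerService.getBalancingContainers(); // hypothetical accessor
for (DeletedBlocksTransaction tx : containerBlocks) {
  if (balancing.contains(tx.getContainerID())) {
    continue; // abandoned here; SCM retries this transaction
  }
  processTransaction(tx); // existing per-transaction handling
}
```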
```java
if (shouldRequeue(command, context, ozoneContainer)) {
  context.requeueCommand(command);
}
```
Can we check if DeleteContainer request is retried from SCM? I think it is retried.
If the delete request is abandoned by the datanode, SCM might need to wait for the next container report to resend the request.
I think in that case we do not need the requeue functionality. Requeue itself is a complex mechanism because it will keep occurring in a loop.
Why aren't we blocking the operations on the container lock instead of using the requeue mechanism? Is it because the handler thread would get blocked in that case?
Yes, blocking was the original idea, but it might cause the handler's thread pool to become blocked.
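To illustrate the trade-off (a sketch only, not the actual handler code; getWriteLock is an assumed accessor): a blocking acquisition ties up a handler thread for the whole balancing run, while a failed tryLock lets the handler return immediately and requeue or abandon the command.

```java
// Sketch: non-blocking attempt on the container lock instead of waiting for it.
java.util.concurrent.locks.ReentrantReadWriteLock.WriteLock lock =
    container.getWriteLock(); // assumed accessor
if (lock.tryLock()) {
  try {
    handle(command);               // container is not being balanced, proceed normally
  } finally {
    lock.unlock();
  }
} else {
  context.requeueCommand(command); // or abandon and let SCM retry
}
```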
What is the approach we are going with here then? Can we also update the docs once the PR is merged?
If we fail the operation in the DN, then SCM would at least know that the disk balancer is running. Later we can add a change in SCM to avoid block deletions for that specific container (since SCM has a retry count for every transaction, we do not want the retry count to be exhausted).
I will change the PR later to abandon the requests.

> Can we also update the docs once the PR is merged?

Sure.
Resolved review thread:
...ner-service/src/main/java/org/apache/hadoop/ozone/container/common/utils/HddsVolumeUtil.java
```java
Pair<HddsVolume, HddsVolume> pair =
    DiskBalancerUtils.getVolumePair(volumeSet, threshold, deltaSizes);
```
How are we making sure the same volume is not part of different tasks?
Also, can we add a policy for choosing the source and target volumes, and a policy for choosing which container to move between those volumes? We can probably do it in a separate PR.
I think new tasks should not be added until the older tasks finish for better consistency.
Here we have deltaSizes to keep things consistent; it helps produce a correct calculation when deciding the source and target volumes.
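Roughly, the pending bytes from in-flight moves are counted in before comparing volumes, something like this sketch (the getter names are assumptions):

```java
// deltaSizes tracks the pending size change per volume from moves that are still running,
// so utilization is computed as if those moves had already completed.
double effectiveUtilization(HddsVolume volume, Map<HddsVolume, Long> deltaSizes) {
  long pending = deltaSizes.getOrDefault(volume, 0L);
  return (volume.getUsedSpace() + pending) / (double) volume.getCapacity(); // assumed getters
}
```

The pair would then presumably be the most- and least-utilized volumes whose utilization gap exceeds the threshold.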
- I think we can still have multiple tasks trying to move containers from the same volume. I feel we should have only one task per volume at a time. Since this is a disk-heavy operation, multiple tasks will not help.
- Also it would be good to have policies for choosing the volumes to balance. It can be a simple policy for now, but it would avoid refactoring in the future.
- We should have policies for choosing the container to move. Some containers could be undergoing replication to other DNs; we could avoid them.
- We shouldn't allow new tasks to be added until older tasks finish.
I would suggest breaking this PR into smaller PRs to reduce the scope of a single PR.
> I think we can still have multiple tasks trying to move the containers from the same volume. I feel we should have only one task per volume at a time.

Sorry, I didn't get the idea from this line; it seems a little contradictory.

> Also it would be good to have policies for choosing the volumes to balance. It can be a simple policy for now but it would avoid refactoring in future.

Sure, I will add a policy class here.

> We should have policies for choosing the container to move. Some containers could be undergoing replication to other DNs. We could avoid them.

Sure, I will also add a policy class for containers.

> We shouldn't allow new tasks to be added until older tasks finish.

I think this is currently controlled by parallelThread. There will only be new tasks when a thread is available.

> I would suggest breaking this PR into smaller PRs to reduce the scope of a single PR.

Yes, it's a big PR. Could you give some advice on splitting it? I find the parts of this service quite tightly coupled.
@lokeshj1703 Updated the PR, please have a look.
I mean we should have only one task involving a particular volume at a time. Multiple tasks could also lead to consistency issues, since deltaSizes would be updated by two tasks. Also, the deltaSizes parameter would not be needed in that case.
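One way to sketch that guarantee (volumesInBalancing and the reserve/release helpers are illustrative names): reserve both volumes before a task is scheduled and release them when it finishes, so a volume never appears in two tasks at once.

```java
// Volumes currently owned by a running balance task.
private final Set<HddsVolume> volumesInBalancing = ConcurrentHashMap.newKeySet();

boolean tryReserve(HddsVolume source, HddsVolume dest) {
  if (!volumesInBalancing.add(source)) {
    return false;                      // source already belongs to another task
  }
  if (!volumesInBalancing.add(dest)) {
    volumesInBalancing.remove(source); // roll back the source reservation
    return false;
  }
  return true;
}

void release(HddsVolume source, HddsVolume dest) {
  volumesInBalancing.remove(source);
  volumesInBalancing.remove(dest);
}
```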
Regarding splitting the PR, I think we could have multiple PRs:
- Adding policies on containers and volumes
- Supporting cross volume move in Datanode
- Configuration and start/stop balancer
- Algorithm for balancer
It would be easier to review and also to add tests for every component this way.
@lokeshj1703 Raised a new ticket, HDDS-7383, for easier review.
Please have a look. The other PRs will be on the way.
```java
HddsVolume sourceVolume = pair.getLeft(), destVolume = pair.getRight();
Iterator<Container<?>> itr = ozoneContainer.getController()
    .getContainers(sourceVolume);
while (itr.hasNext()) {
```
There should be a limit on how many containers / how much data we want to move per volume in a single interval.
Also, we do not want parallel tasks trying to move containers between the same pair of volumes.
It would be better to have one task per pair of volumes.
Each task only balances one container; it breaks out of the while loop after a suitable container is found.
Could we define in the documentation the algorithm for how the balancing would happen?
I think the limits in the algorithm are defined by the bandwidth and parallel threads.
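For example (a sketch only; bandwidthInMB, intervalSeconds and the size/move helpers are assumed names, not the final configuration), the per-interval budget could be derived directly from the configured bandwidth:

```java
// Cap how many bytes one interval may move, based on the configured bandwidth.
long bytesPerInterval = (long) bandwidthInMB * 1024 * 1024 * intervalSeconds;
long movedSoFar = 0;
while (itr.hasNext()) {
  Container<?> candidate = itr.next();
  long size = candidate.getContainerData().getBytesUsed(); // assumed size accessor
  if (movedSoFar + size > bytesPerInterval) {
    break;                                            // budget exhausted for this interval
  }
  moveContainer(candidate, sourceVolume, destVolume); // assumed helper
  movedSoFar += size;
}
```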
```java
ContainerData originalContainerData = ContainerDataYaml
    .readContainerFile(containerFile);
Container newContainer = ozoneContainer.getController()
    .importContainer(originalContainerData, diskBalancerDestDir);
newContainer.getContainerData().getVolume()
    .incrementUsedSpace(containerSize);
```
Do we need this? If we are copying the entire container contents, these should be copied as well?
The content is a little different; especially for V3 containers, the DB content needs to be imported again.
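In other words (a sketch mirroring the quoted diff rather than a verified API): the raw copy only moves the files, the import step rebuilds the container's metadata on the destination volume (for schema V3 that means writing into that volume's RocksDB), and the volume usage counters have to be adjusted explicitly because nothing else updates them.

```java
// Re-read the descriptor from the copied files and import it on the destination volume.
ContainerData originalContainerData =
    ContainerDataYaml.readContainerFile(containerFile);
Container<?> newContainer = ozoneContainer.getController()
    .importContainer(originalContainerData, diskBalancerDestDir);

// Space accounting: the file copy itself does not touch the volumes' usage counters.
newContainer.getContainerData().getVolume().incrementUsedSpace(containerSize);
sourceVolume.decrementUsedSpace(containerSize); // assumed counterpart call on the source
```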
```java
}

@Override
public BackgroundTaskResult call() {
```
Can we simplify this function and move it to a utils class, or make it part of the KeyValueContainer class itself? I think we should just have an API call here.
Could you elaborate on this idea?
Since this operation sits at a higher level than KeyValueContainer and needs the help of ContainerSet, it is not simple to generalize this function into a util class.
I would try looking for an appropriate class for this. I think the move logic should not be present in the disk balancer, and there should be separate implementations for the different container versions.
Can you please take a look at KeyValueContainerUtil?
```java
// The path where datanode diskBalancer's conf is to be written to.
public static final String HDDS_DATANODE_DISK_BALANCER_INFO_DIR =
    "hdds.datanode.disk.balancer.info.dir";
```
Can we use the same format as configs below?
It will be a path like "ozone.metadata.dirs"; the default path will not be a common value.
@lokeshj1703 Thanks for your detailed review! Do you have other comments?
hey guys~ let's start to gradually digest this PR together ~ lol
@xBis7 Sure, updated in the description.
@symious can you please rebase? Your branch is 400+ commits behind master.
```java
    destVolume.getHddsRootDir().toString(), idDir,
    containerData.getContainerID()));
Path destDirParent = diskBalancerDestDir.getParent();
if (destDirParent != null) {
```
Should this be '== null'?
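Whether '!= null' or '== null' is right depends on what the guarded block does. If the intent is simply to make sure the parent directory exists before the container files are moved, the usual java.nio pattern would be (getParent() can return null for a root path, so a non-null guard is still needed):

```java
Path destDirParent = diskBalancerDestDir.getParent();
if (destDirParent != null) {
  Files.createDirectories(destDirParent); // no-op if the directory already exists
}
```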
Sure, will try to do the rebase later.
```java
DeletedBlocksTransaction tx = containerBlocks.get(i);
// Container is being balanced, ignore this transaction
if (shouldAbandon(tx.getContainerID(), cmd.getContainer())) {
  continue;
```
Shouldn't we be throwing some kind of exception here to tell the user that the operation can't be performed right now, including a detailed reason message? (The same note applies to DeleteContainerCommandHandler.)
@vtutrinov Thank you for the review. This PR has been split into smaller PRs; could you help review #4887?
I will close this PR.
What changes were proposed in this pull request?
This ticket adds the DiskBalancerService on the Datanode to do the actual balancing work.
The points of this PR are as follows:
For easier review of this feature, this ticket will be split into smaller tickets.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7233
How was this patch tested?
Unit tests.