
Conversation

@symious (Contributor) commented Jun 14, 2023

What changes were proposed in this pull request?

This PR adds the internal move logic for the DiskBalancer.

The code is mainly from HDDS-7233.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8844

How was this patch tested?

Used the unit tests from HDDS-7233.

@symious (Contributor, Author) commented Jun 14, 2023

@ChenSammi Could you help review? I will provide more test cases as the review progresses.

if (toBalanceContainer != null) {
queue.add(new DiskBalancerTask(toBalanceContainer, sourceVolume,
destVolume));
inProgressContainers.add(toBalanceContainer.getContainerID());
Contributor commented:

The 'isBalancingContainer' method is never used. What will happen if a command to delete the container or a block is received while the container is being balanced? Or do the locks in the container solve the synchronization problem?

symious (Author) replied:

I was thinking we could drop the commands and let SCM resend them.
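
A minimal sketch of what that could look like, assuming a hypothetical guard in the datanode's delete-command handler that consults the inProgressContainers set from the snippet above (the class and method names here are illustrative, not the PR's actual API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard: delete commands for containers with an in-flight
// DiskBalancer move are dropped; SCM is expected to resend them later.
class DeleteCommandGuard {
  private final Set<Long> inProgressContainers = ConcurrentHashMap.newKeySet();

  void startBalancing(long containerId) {
    inProgressContainers.add(containerId);
  }

  void finishBalancing(long containerId) {
    inProgressContainers.remove(containerId);
  }

  // Returns false while the container is being balanced, so the caller
  // skips the delete and relies on SCM's retry.
  boolean shouldHandleDelete(long containerId) {
    return !inProgressContainers.contains(containerId);
  }
}
```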

@sodonnel (Contributor) commented

@symious What is the current status of this PR? It has been open for a while. Are you interested in moving it forward?

@symious (Contributor, Author) commented Jan 26, 2024

@sodonnel Yes, this PR has been open for a while. I'll fix the checks first.

.resolve(DISK_BALANCER_DIR).resolve(String.valueOf(containerId));

// Copy container to new Volume's tmp Dir
ozoneContainer.getController().copyContainer(containerData,
Contributor commented:

Is there somewhere else in the code that blocks operations against the container while the export/import is happening? For example, suppose we run copyContainer, and after it completes a delete-block command is processed, deleting the block from the original copy but not the new copy. When the move finishes, we are left with a block in the moved container that should have been removed. I think each of these operations on the container (copy, export, import, etc.) locks the container instance, but it feels like we need something that locks across the entire move cycle while still allowing reads to continue.

symious (Author) replied:

I don't think we currently have such locks. Since export and import can take a long time, holding a lock for their duration could stall other operations.

Also, it's already possible for some replicas to differ from the others, due to DN restarts or exports, so it would be good to have a mechanism to sync all the replicas.

Contributor replied:

We have a design in progress that will deal with mismatches between the containers in terms of deleted blocks, but I feel we should try to avoid causing such problems if we can. I am not sure how hard that would be with the current code layout.

E.g. if a read lock were held for the duration of the container move, block reads etc. could still come through, but block deletes would be stopped. However, at the moment I don't think readChunk takes any locks, so I think this would require a lot of difficult changes.
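
A rough sketch of that idea, assuming a hypothetical per-container ReentrantReadWriteLock (not something the current code has): the mover and chunk readers share the read lock, while block deletes take the write lock and therefore wait for the move to finish.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical per-container lock spanning the whole move cycle; the
// class and method names are illustrative, not the PR's actual API.
class ContainerMoveLock {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // The mover holds the read lock across copy/export/import, so concurrent
  // reads are still allowed but deletes are blocked until it finishes.
  void runMove(Runnable copyExportImport) {
    lock.readLock().lock();
    try {
      copyExportImport.run();
    } finally {
      lock.readLock().unlock();
    }
  }

  // readChunk would also need to take the read lock, which it does not
  // do today.
  void readChunk(Runnable read) {
    lock.readLock().lock();
    try {
      read.run();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Block deletes take the write lock, so they wait for in-flight moves
  // and reads to complete.
  void deleteBlock(Runnable delete) {
    lock.writeLock().lock();
    try {
      delete.run();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```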

symious (Author) replied:

> We have a design in progress that will deal with mismatches between the containers in terms of deleted blocks

Looking forward to this feature.

Currently I think reads won't be affected; clients can still read data while the move is ongoing.

Contributor replied:

I think it will be OK if the block deletes get missed. This isn't really any different from replication between nodes: after tarring a container, a block could be deleted before the new replica is available on the target node.

Preconditions.checkNotNull(container, "container cannot be null");

long containerId = container.getContainerData().getContainerID();
if (!containerMap.containsKey(containerId)) {
Contributor commented:

We should probably use computeIfPresent() here, as it makes the operation atomic. There is a small chance the mapping could be removed between the containsKey check and the put.
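
A small sketch of the suggestion, assuming containerMap is a ConcurrentHashMap (the value type here is a placeholder): computeIfPresent runs the update atomically with the presence check, so the entry cannot be removed in between.

```java
import java.util.concurrent.ConcurrentHashMap;

class ContainerMapUpdate {
  private final ConcurrentHashMap<Long, Object> containerMap =
      new ConcurrentHashMap<>();

  // Atomically update the mapping only if it still exists; unlike a
  // separate containsKey check followed by put, no concurrent remove can
  // slip in between. Returns true if the entry was present and updated.
  boolean updateIfPresent(long containerId, Object newContainer) {
    return containerMap.computeIfPresent(containerId,
        (id, existing) -> newContainer) != null;
  }
}
```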

@sodonnel (Contributor) commented

Is this the last PR needed to get the disk balancer working?

Also, have you been running a version of this on your own clusters already? I understand you needed this feature, but it has been under development for a long time now.

@symious (Contributor, Author) commented Feb 1, 2024

> Is this the last PR needed to get the disk balancer working?

Yes.

> have you been running a version of this on your own clusters already?

Yes, but the version on our cluster is different from this PR, and our cluster's Ozone version is quite different from master.

@sodonnel (Contributor) left a comment

I think this change is good to commit. +1 from me.

sodonnel merged commit 57889e7 into apache:HDDS-5713 on Feb 1, 2024
sodonnel pushed a commit to sodonnel/hadoop-ozone that referenced this pull request Feb 1, 2024
sadanand48 pushed a commit to sadanand48/hadoop-ozone that referenced this pull request Apr 17, 2024
ChenSammi pushed a commit to ChenSammi/ozone that referenced this pull request Feb 19, 2025
ChenSammi pushed a commit to ChenSammi/ozone that referenced this pull request Feb 20, 2025