Conversation

Contributor

@symious symious commented Jun 9, 2022

What changes were proposed in this pull request?

If a datanode crashes, the containers on that datanode are added to inflightReplication so that the replication factor can be restored.

When the datanode is restarted, those containers no longer need to be replicated, but the replication commands have already been sent to other datanodes. If the crashed datanode held many containers, the large number of replication commands will jam the datanodes' replication queues, and the heavy replication traffic will also cause problems for the whole cluster.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6848

How was this patch tested?

unit test

Contributor Author

symious commented Jun 9, 2022

@adoroszlai @sodonnel @ferhui Could you help review this PR?

Contributor

sodonnel commented Jun 9, 2022

There is some work going on to fix this sort of issue in different ways. #3482 is going to limit the number of replications that can be inflight at any given time, and hence should avoid deep queues on the DNs.

As part of EC, we are working toward replacing the replication manager completely, and when we do that, we will hold back the replication work in SCM and feed it to the datanodes as they have capacity to take it. The queues on the DNs should be very small when that happens.

While the approach here looks OK, I feel we should hold off on committing it until we see how well #3482 works and the larger RM refactor lands. Once those are in, I don't think this change will be needed.

Contributor Author

symious commented Jun 9, 2022

@sodonnel Thanks for the review.
Since it's causing severe issues in our cluster, I think we might need to use this version of the fix internally first.

Regarding #3482, I'm not sure they solve the same problem. If a task is marked as timed out in RM but is still processed by the datanode, the work done on the datanode side is completely wasted. I think this problem can only be mitigated if RM handles timed-out tasks better.

Contributor

sodonnel commented Dec 6, 2022

@symious I was discussing some replication-related topics today and remembered this change. I think we should move ahead with it, and also make a similar change for deleteContainer commands, but we can do that in a new PR once this one is done. I will give this a quick review, and then perhaps you could fix the conflicts if you still have time to work on this?

requestCounter.incrementAndGet();
if (task.getTimeoutMs() != 0) {
  long msInQueue =
      Duration.between(task.getQueued(), Instant.now()).toMillis();
Contributor


There are 2 queues on the DN: the main command queue and then the replication queue. The time set by getQueued() is the time the command is placed onto the replication queue. However, there are some delays in the system before that, eg:

  1. The command is created in SCM and queued on SCM. It could be a full DN heartbeat interval before it gets picked up.

  2. The command lands on the DN command queue. Something may hold it up before it gets onto the replication queue.

I think this means the pending operation in SCM could expire before the DN command does, which gives SCM a chance to schedule another command while this one has still not expired on the DN.

I think it might be better to set the expiry time on the command (in ms since epoch) when it is created in SCM, matched to the SCM timeout, so that none of these delays matter. In fact, it may make sense to make the DN command expiry a little less than the SCM timeout, so the DN command tends to expire slightly earlier than the SCM timeout.
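For illustration, a minimal sketch of this deadline idea follows; the command class and the names deadlineEpochMs and dnMarginMs are assumptions for the sketch, not the actual Ozone command API.

```java
import java.time.Clock;

// Hypothetical sketch of the deadline suggestion above; class, field and
// method names are illustrative, not the real Ozone code.
public class DeadlineSketch {

  // Stand-in for a replication command carrying a wall-clock deadline.
  static class Command {
    final long containerId;
    long deadlineEpochMs;
    Command(long containerId) {
      this.containerId = containerId;
    }
  }

  // SCM side: stamp a deadline slightly shorter than SCM's own timeout,
  // so the DN copy of the command expires a little earlier than the SCM one.
  static Command createCommand(long containerId, Clock clock,
      long scmTimeoutMs, long dnMarginMs) {
    Command cmd = new Command(containerId);
    cmd.deadlineEpochMs = clock.millis() + scmTimeoutMs - dnMarginMs;
    return cmd;
  }

  // DN side: skip the task once the deadline has passed, regardless of how
  // long it waited on the command queue or the replication queue.
  static boolean isExpired(Command cmd, Clock clock) {
    return cmd.deadlineEpochMs != 0 && clock.millis() > cmd.deadlineEpochMs;
  }
}
```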

try {
  requestCounter.incrementAndGet();
  if (task.getTimeoutMs() != 0) {
    long msInQueue =
Contributor


Rather than calling Instant.now here, can we inject an instance of MonotonicClock and use it to get the time? This will make tests easier and avoid sleep calls in the tests. This is something we have used in other places in the code before.
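As a rough sketch of the clock-injection idea, using java.time.Clock as a stand-in; the class and method names below are assumptions, not the actual ReplicationSupervisor code.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch: inject a Clock so time is controllable in tests.
public class SupervisorClockSketch {
  private final Clock clock;

  public SupervisorClockSketch(Clock clock) {
    // Production wiring would pass a monotonic/system clock;
    // tests can pass a fake clock instead.
    this.clock = clock;
  }

  long millisInQueue(Instant queuedAt) {
    // Instant.now(clock) rather than Instant.now(), so tests need no sleeps.
    return Duration.between(queuedAt, Instant.now(clock)).toMillis();
  }
}
```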


private final long containerID;
private final List<DatanodeDetails> sourceDatanodes;
private final long timeoutMs;
Contributor


If we go with my earlier suggestion of setting the "run by time" in SCM on the command, perhaps rename "timeout" to something like "deadlineEpochMs" or "expiryEpochMs" throughout this method.

Assert.assertEquals(2, supervisor.getQueueSize());
// Sleep 2s, wait all tasks processed
try {
  Thread.sleep(2000);
Contributor


If we inject the MonotonicClock into the Supervisor, then we can create it with an instance of TestClock that we can use to avoid the sleep calls and make the tests faster.
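A minimal sketch of such a controllable clock; the FakeClock name and fastForward method are illustrative rather than the exact Ozone TestClock API.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;

// Minimal fake clock in the spirit of the TestClock suggestion.
public class FakeClock extends Clock {
  private Instant now;
  private final ZoneId zone;

  public FakeClock(Instant start, ZoneId zone) {
    this.now = start;
    this.zone = zone;
  }

  // Advance time explicitly instead of calling Thread.sleep in the test.
  public void fastForward(Duration d) {
    now = now.plus(d);
  }

  @Override
  public ZoneId getZone() {
    return zone;
  }

  @Override
  public Clock withZone(ZoneId z) {
    return new FakeClock(now, z);
  }

  @Override
  public Instant instant() {
    return now;
  }
}
```

A test could then construct the supervisor with this clock, queue the tasks, call fastForward(Duration.ofSeconds(2)), and immediately assert that the timed-out tasks were dropped, with no real sleeping.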

required int64 containerID = 1;
repeated DatanodeDetailsProto sources = 2;
required int64 cmdId = 3;
optional int64 timeoutMs = 4;
Contributor


If we rename the timeout to expiry or deadline, then change it here too.

Contributor

sodonnel commented Dec 7, 2022

I think I have a way to do this somewhat generically for all commands, so please hold off on working on this. I will try to post a PR tomorrow.

Contributor Author

symious commented Dec 9, 2022

> I think I have a way to do this somewhat generically for all commands, so please hold off on working on this. I will try to post a PR tomorrow.

Sure.

@sodonnel
Contributor

Committed #4069, which implements this and takes it a bit further. Note that we did not add deadlines to commands sent from the legacy RM; if you want that, you could add it in. Only commands from the new RM have deadlines added.

sodonnel closed this Dec 13, 2022