Conversation

Contributor

@symious symious commented Jun 9, 2022

What changes were proposed in this pull request?

If a datanode crashes, the containers on that datanode are added to inflightReplication so that the replication factor can be restored.

When the datanode is restarted, those containers no longer need to be replicated, but the replication commands have already been sent to other datanodes. If the crashed datanode held many containers, the large number of replication commands will jam the datanodes' replication queues, and the heavy replication traffic will also cause problems for the whole cluster.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-6848

How was this patch tested?

unit test

Contributor Author

symious commented Jun 9, 2022

@adoroszlai @sodonnel @ferhui Could you help review this PR?

Contributor

sodonnel commented Jun 9, 2022

There is some work going on to fix this sort of issue in different ways. #3482 is going to limit the number of replications that can be inflight at any given time, and hence should avoid deep queues on the DNs.

As part of EC, we are working toward replacing the replication manager completely, and when we do that, we will hold back the replication work in SCM and feed it to the datanodes as they have capacity to take it. The queues on the DNs should be very small when that happens.

While the approach here looks OK, I feel we should hold off on committing it until we see how well #3482 works and the larger RM refactor lands. Once those are in, I don't think this change will be needed.

Contributor Author

symious commented Jun 9, 2022

@sodonnel Thanks for the review.
Since it's causing severe issues in our cluster, I think we might need to use this version of the fix internally first.

Regarding #3482, I'm not sure they solve the same problem. If a task is marked as timed out in RM but is still processed by the datanode, the work done on the datanode side is completely wasted. I think this problem can only be mitigated if RM handles timed-out tasks better.

Contributor

sodonnel commented Dec 6, 2022

@symious I was discussing some replication-related topics today and remembered this change. I think we should move ahead with it, and also make a similar change for deleteContainer commands, but we can do that in a new PR once this one is done. I will give this a quick review, and then perhaps you could fix the conflicts if you still have time to work on this?

requestCounter.incrementAndGet();
if (task.getTimeoutMs() != 0) {
  long msInQueue =
      Duration.between(task.getQueued(), Instant.now()).toMillis();
Contributor


There are 2 queues on the DN: the main command queue and then the replication queue. The time set by getQueued() is the time the command is placed onto the replication queue. However, there are some delays in the system before that, eg:

  1. The command is created in SCM and queued on SCM. It could be a full DN heartbeat interval before it gets picked up.

  2. The command lands on the DN command queue. Something may hold it up before it gets onto the replication queue.

I think this means the pending operation in SCM could expire before the DN command does, which gives SCM a chance to schedule another command while this one has still not expired on the DN.

I think it might be better to set the expiry time on the command (in ms since epoch) when it is created in SCM, matched to the SCM timeout, so that none of these delays matter. In fact, it may make sense to make the DN command expiry a little less than the SCM timeout, so the DN command tends to expire slightly earlier than the SCM timeout.
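For illustration, a minimal sketch of this deadline idea follows; the command class and the names deadlineEpochMs and dnMarginMs are assumptions for the sketch, not the actual Ozone command API.

```java
import java.time.Clock;

// Hypothetical sketch of the deadline suggestion above; class, field and
// method names are illustrative, not the real Ozone code.
public class DeadlineSketch {

  // Stand-in for a replication command carrying a wall-clock deadline.
  static class Command {
    final long containerId;
    long deadlineEpochMs;
    Command(long containerId) {
      this.containerId = containerId;
    }
  }

  // SCM side: stamp a deadline slightly shorter than SCM's own timeout,
  // so the DN copy of the command expires a little earlier than the SCM one.
  static Command createCommand(long containerId, Clock clock,
      long scmTimeoutMs, long dnMarginMs) {
    Command cmd = new Command(containerId);
    cmd.deadlineEpochMs = clock.millis() + scmTimeoutMs - dnMarginMs;
    return cmd;
  }

  // DN side: skip the task once the deadline has passed, regardless of how
  // long it waited on the command queue or the replication queue.
  static boolean isExpired(Command cmd, Clock clock) {
    return cmd.deadlineEpochMs != 0 && clock.millis() > cmd.deadlineEpochMs;
  }
}
```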

try {
  requestCounter.incrementAndGet();
  if (task.getTimeoutMs() != 0) {
    long msInQueue =
Contributor


Rather than calling Instant.now here, can we inject an instance of MonotonicClock and use it to get the time? This will make tests easier and avoid sleep calls in the tests. This is something we have used in other places in the code before.
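As a rough sketch of the clock-injection idea, using java.time.Clock as a stand-in; the class and method names below are assumptions, not the actual ReplicationSupervisor code.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch: inject a Clock so time is controllable in tests.
public class SupervisorClockSketch {
  private final Clock clock;

  public SupervisorClockSketch(Clock clock) {
    // Production wiring would pass a monotonic/system clock;
    // tests can pass a fake clock instead.
    this.clock = clock;
  }

  long millisInQueue(Instant queuedAt) {
    // Instant.now(clock) rather than Instant.now(), so tests need no sleeps.
    return Duration.between(queuedAt, Instant.now(clock)).toMillis();
  }
}
```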


private final long containerID;
private final List<DatanodeDetails> sourceDatanodes;
private final long timeoutMs;
Contributor


If we go with my earlier suggestion of setting the "run by time" in SCM on the command, perhaps rename "timeout" to something like "deadlineEpochMs" or "expiryEpochMs" throughout this method.

Assert.assertEquals(2, supervisor.getQueueSize());
// Sleep 2s, wait all tasks processed
try {
  Thread.sleep(2000);
Contributor


If we inject the MonotonicClock into the Supervisor, then we can create it with an instance of TestClock that we can use to avoid the sleep calls and make the tests faster.
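A minimal sketch of such a controllable clock; the FakeClock name and fastForward method are illustrative rather than the exact Ozone TestClock API.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;

// Minimal fake clock in the spirit of the TestClock suggestion.
public class FakeClock extends Clock {
  private Instant now;
  private final ZoneId zone;

  public FakeClock(Instant start, ZoneId zone) {
    this.now = start;
    this.zone = zone;
  }

  // Advance time explicitly instead of calling Thread.sleep in the test.
  public void fastForward(Duration d) {
    now = now.plus(d);
  }

  @Override
  public ZoneId getZone() {
    return zone;
  }

  @Override
  public Clock withZone(ZoneId z) {
    return new FakeClock(now, z);
  }

  @Override
  public Instant instant() {
    return now;
  }
}
```

A test could then construct the supervisor with this clock, queue the tasks, call fastForward(Duration.ofSeconds(2)), and immediately assert that the timed-out tasks were dropped, with no real sleeping.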

required int64 containerID = 1;
repeated DatanodeDetailsProto sources = 2;
required int64 cmdId = 3;
optional int64 timeoutMs = 4;
Contributor


If we rename the timeout to expiry or deadline, then change it here too.

Contributor

sodonnel commented Dec 7, 2022

I think I have a way to do this somewhat generically for all commands, so please hold off on working on this. I will try to post a PR tomorrow.

Contributor Author

symious commented Dec 9, 2022

> I think I have a way to do this somewhat generically for all commands, so please hold off on working on this. I will try to post a PR tomorrow.

Sure.

@sodonnel
Contributor

Committed #4069, which implements this and takes it a bit further. Note that we did not add deadlines to commands sent from the legacy RM; if you want that, you could add it in. Only commands from the new RM have deadlines added.

sodonnel closed this Dec 13, 2022