Conversation

@guihecheng
Contributor

What changes were proposed in this pull request?

Fixes a race condition upon datanode restart during pre-finalization.
More details are in the JIRA below.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5513

How was this patch tested?

CI

@errose28 errose28 self-requested a review July 29, 2021 15:44
@errose28
Contributor

Thanks for taking a look at this issue, @guihecheng. I am not sure the interrupt reported in the Jira came from DatanodeStateMachine#triggerHeartbeat. Below is my current understanding; let me know what you think:

  • DatanodeStateMachine#triggerHeartbeat is only called after the container services have started.
  • Container services are not started until OzoneContainer#start is called by VersionEndpointTask#call
  • VersionEndpointTask#call will not occur until the datanode endpoint tasks are registered by context#execute inside the main loop of DatanodeStateMachine#start.
  • At this point, pre-finalize upgrade actions must have finished.

If this is correct, then the interrupt could not have come from DatanodeStateMachine#triggerHeartbeat, and we need a different fix once we determine where the interrupt actually came from.

Otherwise, we need to figure out why DatanodeStateMachine#triggerHeartbeat, and possibly all container services, are being called/started while the pre-finalize upgrade actions are running. If this is really happening, it is an error, and the fix would be to have all pre-finalize actions run before container services are started.
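For clarity, here is the ordering I am assuming, as a toy sequence. The component names follow the bullets above, but the wiring is an assumption on my part, not the actual Ozone code:

```java
// Toy model of the expected startup ordering; stubs stand in for the
// real components. This is not actual Ozone code.
public class StartupOrderSketch {
  static void runPreFinalizeUpgradeActions() { /* upgrade actions */ }
  static void registerEndpointTasks()        { /* context#execute in the main loop */ }
  static void versionEndpointTaskCall()      { /* calls OzoneContainer#start */ }
  static void triggerHeartbeat()             { /* heartbeat wake-up */ }

  public static void main(String[] args) {
    runPreFinalizeUpgradeActions(); // 1. pre-finalize actions finish first
    registerEndpointTasks();        // 2. inside DatanodeStateMachine#start
    versionEndpointTaskCall();      // 3. container services start
    triggerHeartbeat();             // 4. only expected after step 3
  }
}
```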

Also, those Ratis log messages shared in the Jira that occur before the pre-finalize actions appear to come from normal raft server construction. I do not think they indicate that the raft server was actually started when they were printed, since I believe that should not happen until OzoneContainer#start is called.

Do you have any way to reproduce this issue or help verify where the interrupt came from?

@guihecheng
Contributor Author

> Thanks for taking a look at this issue, @guihecheng. I am not sure the interrupt reported in the Jira came from DatanodeStateMachine#triggerHeartbeat. Below is my current understanding; let me know what you think:
>
> • DatanodeStateMachine#triggerHeartbeat is only called after the container services have started.

If I read the Ratis code correctly, this is not quite true: a triggerHeartbeat call can potentially be made from ContainerStateMachine#initialize, which is invoked during RaftServer construction, before OzoneContainer#start.
This is why we have the 'stateMachineThread != null' check in triggerHeartbeat.
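For reference, a minimal sketch of that guard. The exact body is my assumption rather than the actual source, but the idea is that triggerHeartbeat wakes the state machine thread via interrupt() and must tolerate being called before that thread exists:

```java
// Sketch only, not the actual Ozone source: triggerHeartbeat wakes the
// state machine thread by interrupting it; the null check guards the
// window during RaftServer construction when the thread is not yet set.
public class HeartbeatGuardSketch {
  private volatile Thread stateMachineThread; // null until the main loop starts

  public void triggerHeartbeat() {
    if (stateMachineThread != null) {
      stateMachineThread.interrupt();
    }
  }
}
```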

> • Container services are not started until OzoneContainer#start is called by VersionEndpointTask#call
> • VersionEndpointTask#call will not occur until the datanode endpoint tasks are registered by context#execute inside the main loop of DatanodeStateMachine#start.
> • At this point, pre-finalize upgrade actions must have finished.
>
> If this is correct, then the interrupt could not have come from DatanodeStateMachine#triggerHeartbeat, and we need a different fix once we determine where the interrupt actually came from.
>
> Otherwise, we need to figure out why DatanodeStateMachine#triggerHeartbeat, and possibly all container services, are being called/started while the pre-finalize upgrade actions are running. If this is really happening, it is an error, and the fix would be to have all pre-finalize actions run before container services are started.
>
> Also, those Ratis log messages shared in the Jira that occur before the pre-finalize actions appear to come from normal raft server construction. I do not think they indicate that the raft server was actually started when they were printed, since I believe that should not happen until OzoneContainer#start is called.

Indeed, the raft server was not started, but there is a potential triggerHeartbeat call in the construction code, as noted above.
There is also a relevant entry to check in the logs:

2021-07-27 20:11:59,507 [pool-92215-thread-1] INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-9034B3A2567B: The snapshot info is null. Setting the last applied index to:(t:0, i:~)

This comes from ContainerStateMachine#loadSnapshot, which is only called from ContainerStateMachine#initialize, where the triggerHeartbeat call can potentially happen.

Since the log entry above was printed in the same second that the exception was thrown, I suspect that the interrupt() comes from the place addressed by this patch.

> Do you have any way to reproduce this issue or help verify where the interrupt came from?

Thanks @errose28 for the detailed check. We only encountered this problem once, during a non-rolling upgrade, where about 4 out of 40 nodes reported it. I can't reproduce the problem in my test deployment, since it is hard for a thread to catch the interrupt() sent by another thread at exactly the right moment.
See my inline replies above, and the toy sketch below for the suspected race.
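For illustration, here is a self-contained toy of the suspected race; all names are stand-ins, not Ozone code, and it forces deterministically the timing that is hard to hit in a real deployment. One thread runs a blocking pre-finalize action while a second thread, playing the role of ContainerStateMachine#initialize calling triggerHeartbeat, interrupts it:

```java
// Toy reproduction of the suspected race; all names are stand-ins.
public class InterruptRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    Thread stateMachineThread = new Thread(() -> {
      try {
        // Stand-in for a blocking pre-finalize upgrade action.
        Thread.sleep(5_000);
        System.out.println("pre-finalize actions completed");
      } catch (InterruptedException e) {
        // This is the failure mode reported in the Jira.
        System.out.println("pre-finalize actions interrupted");
      }
    });
    stateMachineThread.start();

    // Stand-in for triggerHeartbeat firing during RaftServer construction.
    Thread.sleep(100);
    stateMachineThread.interrupt();
    stateMachineThread.join();
  }
}
```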

@errose28
Contributor

errose28 commented Jul 30, 2021

Ah, I understand now, thanks for the explanation. That makes sense as to how the triggerHeartbeat causes the interrupt. I did not realize that a snapshot install was happening in parallel with the pre-finalize actions. Should we wait for the snapshot install to finish before running pre-finalize actions? If we have a layout change in the future that involves state machine data, I am concerned this will be problematic as things stand.

@guihecheng
Contributor Author

> Ah, I understand now, thanks for the explanation. That makes sense as to how the triggerHeartbeat causes the interrupt. I did not realize that a snapshot install was happening in parallel with the pre-finalize actions. Should we wait for the snapshot install to finish before running pre-finalize actions? If we have a layout change in the future that involves state machine data, I am concerned this will be problematic as things stand.

Oh, that's a good question. I'm not quite sure, but it seems cleaner to have consistent state machine data before we perform an upgrade operation.
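One possible shape for that sequencing, as a sketch only; the latch and method names are hypothetical, not the actual Ozone code. The pre-finalize actions would block until snapshot loading signals completion, so the two never overlap:

```java
import java.util.concurrent.CountDownLatch;

// Sketch of the sequencing idea only; the latch and method names are
// hypothetical, and this is not the actual Ozone code.
public class UpgradeSequencingSketch {
  private final CountDownLatch snapshotLoaded = new CountDownLatch(1);

  // Would be invoked once ContainerStateMachine#loadSnapshot completes.
  public void onSnapshotLoaded() {
    snapshotLoaded.countDown();
  }

  // Pre-finalize actions block until state machine data is consistent.
  public void runPreFinalizeActions() throws InterruptedException {
    snapshotLoaded.await();
    // ... run pre-finalize upgrade actions here ...
  }
}
```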

@guihecheng
Contributor Author

cc @ChenSammi @bshashikant for comments.

@errose28 errose28 left a comment
Contributor

Thanks for investigating this, @guihecheng, and for the review, @bshashikant. Let's merge this to unblock upgrades from master, and investigate the snapshot install and pre-finalize action coordination issue in a follow-up Jira: HDDS-5525.

@errose28 errose28 merged commit 622275f into apache:master Aug 2, 2021
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Dec 10, 2021
Squash merge branch 'tencent-master' into 'tencent-master'

* Skip this test, because we do a force version update to VERSION file,

* skip tests for non-rolling upgrade

* Skip finalize for upgrade with on-disk layout version updated only.

* HDDS-5513. Race condition upon dn restart at prefinalization. (apache#2471)

* HDDS-5514. Skip check for UNHEALTHY containers for datanode finalize. (apache#2469)

PLEASE REVERT ME WHEN DO UPGRADE AT 2021-08.