Conversation

@guihecheng
Contributor

What changes were proposed in this pull request?

Fixes a race condition upon datanode restart during pre-finalization.
More details are in the JIRA below.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5513

How was this patch tested?

CI

@errose28 errose28 self-requested a review July 29, 2021 15:44
@errose28
Contributor

Thanks for taking a look at this issue, @guihecheng. I am not sure the interrupt reported in the Jira came from DatanodeStateMachine#triggerHeartbeat. Below is my current understanding; let me know what you think:

  • DatanodeStateMachine#triggerHeartbeat is only called after the container services have started.
  • Container services are not started until OzoneContainer#start is called by VersionEndpointTask#call
  • VersionEndpointTask#call will not occur until the datanode endpoint tasks are registered by context#execute inside the main loop of DatanodeStateMachine#start.
  • At this point, pre-finalize upgrade actions must have finished.

If this is correct, then the interrupt could not have come from DatanodeStateMachine#triggerHeartbeat, and we need a different fix once we determine where the interrupt actually came from.

Otherwise, we need to figure out why DatanodeStateMachine#triggerHeartbeat, and possibly all container services, are being called/started while the pre-finalize upgrade actions are running. If this is really happening, it is an error, and the fix would be to have all pre-finalize actions run before container services are started.
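For clarity, here is the ordering I am assuming, as a toy sequence. The component names follow the bullets above, but the wiring is an assumption on my part, not the actual Ozone code:

```java
// Toy model of the expected startup ordering; stubs stand in for the
// real components. This is not actual Ozone code.
public class StartupOrderSketch {
  static void runPreFinalizeUpgradeActions() { /* upgrade actions */ }
  static void registerEndpointTasks()        { /* context#execute in the main loop */ }
  static void versionEndpointTaskCall()      { /* calls OzoneContainer#start */ }
  static void triggerHeartbeat()             { /* heartbeat wake-up */ }

  public static void main(String[] args) {
    runPreFinalizeUpgradeActions(); // 1. pre-finalize actions finish first
    registerEndpointTasks();        // 2. inside DatanodeStateMachine#start
    versionEndpointTaskCall();      // 3. container services start
    triggerHeartbeat();             // 4. only expected after step 3
  }
}
```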

Also, those Ratis log messages shared in the Jira that occur before the pre-finalize actions appear to come from normal raft server construction. I do not think they indicate that the raft server was actually started when they were printed, since I believe that should not happen until OzoneContainer#start is called.

Do you have any way to reproduce this issue or help verify where the interrupt came from?

@guihecheng
Contributor Author

> Thanks for taking a look at this issue, @guihecheng. I am not sure the interrupt reported in the Jira came from DatanodeStateMachine#triggerHeartbeat. Below is my current understanding; let me know what you think:
>
> • DatanodeStateMachine#triggerHeartbeat is only called after the container services have started.

If I read the Ratis code correctly, this is not quite true: a triggerHeartbeat call can potentially be made from ContainerStateMachine#initialize, which is invoked during RaftServer construction, before OzoneContainer#start.
This is why we have the 'stateMachineThread != null' check in triggerHeartbeat.
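For reference, a minimal sketch of that guard. The exact body is my assumption rather than the actual source, but the idea is that triggerHeartbeat wakes the state machine thread via interrupt() and must tolerate being called before that thread exists:

```java
// Sketch only, not the actual Ozone source: triggerHeartbeat wakes the
// state machine thread by interrupting it; the null check guards the
// window during RaftServer construction when the thread is not yet set.
public class HeartbeatGuardSketch {
  private volatile Thread stateMachineThread; // null until the main loop starts

  public void triggerHeartbeat() {
    if (stateMachineThread != null) {
      stateMachineThread.interrupt();
    }
  }
}
```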

> • Container services are not started until OzoneContainer#start is called by VersionEndpointTask#call
> • VersionEndpointTask#call will not occur until the datanode endpoint tasks are registered by context#execute inside the main loop of DatanodeStateMachine#start.
> • At this point, pre-finalize upgrade actions must have finished.
>
> If this is correct, then the interrupt could not have come from DatanodeStateMachine#triggerHeartbeat, and we need a different fix once we determine where the interrupt actually came from.
>
> Otherwise, we need to figure out why DatanodeStateMachine#triggerHeartbeat, and possibly all container services, are being called/started while the pre-finalize upgrade actions are running. If this is really happening, it is an error, and the fix would be to have all pre-finalize actions run before container services are started.
>
> Also, those Ratis log messages shared in the Jira that occur before the pre-finalize actions appear to come from normal raft server construction. I do not think they indicate that the raft server was actually started when they were printed, since I believe that should not happen until OzoneContainer#start is called.

Indeed, the raft server was not started, but there is a potential triggerHeartbeat call in the construction code, as noted above.
There is also a relevant entry to check in the logs:

2021-07-27 20:11:59,507 [pool-92215-thread-1] INFO org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine: group-9034B3A2567B: The snapshot info is null. Setting the last applied index to:(t:0, i:~)

This comes from ContainerStateMachine#loadSnapshot, which is only called from ContainerStateMachine#initialize, where the triggerHeartbeat call can potentially happen.

Since the log entry above was printed in the same second that the exception was thrown, I suspect that the interrupt() comes from the place addressed by this patch.

> Do you have any way to reproduce this issue or help verify where the interrupt came from?

Thanks @errose28 for the detailed check. We only encountered this problem once, during a non-rolling upgrade, where about 4 out of 40 nodes reported it. I can't reproduce the problem in my test deployment, since it is hard for a thread to catch the interrupt() sent by another thread at exactly the right moment.
See my inline replies above, and the toy sketch below for the suspected race.
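For illustration, here is a self-contained toy of the suspected race; all names are stand-ins, not Ozone code, and it forces deterministically the timing that is hard to hit in a real deployment. One thread runs a blocking pre-finalize action while a second thread, playing the role of ContainerStateMachine#initialize calling triggerHeartbeat, interrupts it:

```java
// Toy reproduction of the suspected race; all names are stand-ins.
public class InterruptRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    Thread stateMachineThread = new Thread(() -> {
      try {
        // Stand-in for a blocking pre-finalize upgrade action.
        Thread.sleep(5_000);
        System.out.println("pre-finalize actions completed");
      } catch (InterruptedException e) {
        // This is the failure mode reported in the Jira.
        System.out.println("pre-finalize actions interrupted");
      }
    });
    stateMachineThread.start();

    // Stand-in for triggerHeartbeat firing during RaftServer construction.
    Thread.sleep(100);
    stateMachineThread.interrupt();
    stateMachineThread.join();
  }
}
```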

@errose28
Contributor

errose28 commented Jul 30, 2021

Ah, I understand now, thanks for the explanation. That makes sense as to how the triggerHeartbeat causes the interrupt. I did not realize that a snapshot install was happening in parallel with the pre-finalize actions. Should we wait for the snapshot install to finish before running pre-finalize actions? If we have a layout change in the future that involves state machine data, I am concerned this will be problematic as things stand.

@guihecheng
Contributor Author

> Ah, I understand now, thanks for the explanation. That makes sense as to how the triggerHeartbeat causes the interrupt. I did not realize that a snapshot install was happening in parallel with the pre-finalize actions. Should we wait for the snapshot install to finish before running pre-finalize actions? If we have a layout change in the future that involves state machine data, I am concerned this will be problematic as things stand.

Oh, that's a good question. I'm not quite sure, but it seems cleaner to have consistent state machine data before we perform an upgrade operation.
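One possible shape for that sequencing, as a sketch only; the latch and method names are hypothetical, not the actual Ozone code. The pre-finalize actions would block until snapshot loading signals completion, so the two never overlap:

```java
import java.util.concurrent.CountDownLatch;

// Sketch of the sequencing idea only; the latch and method names are
// hypothetical, and this is not the actual Ozone code.
public class UpgradeSequencingSketch {
  private final CountDownLatch snapshotLoaded = new CountDownLatch(1);

  // Would be invoked once ContainerStateMachine#loadSnapshot completes.
  public void onSnapshotLoaded() {
    snapshotLoaded.countDown();
  }

  // Pre-finalize actions block until state machine data is consistent.
  public void runPreFinalizeActions() throws InterruptedException {
    snapshotLoaded.await();
    // ... run pre-finalize upgrade actions here ...
  }
}
```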

@guihecheng
Contributor Author

cc @ChenSammi @bshashikant for comments.

@errose28 errose28 left a comment
Contributor

Thanks for investigating this, @guihecheng, and for the review, @bshashikant. Let's merge this to unblock upgrades from master, and investigate the snapshot install and pre-finalize action coordination issue in a follow-up Jira: HDDS-5525.

@errose28 errose28 merged commit 622275f into apache:master Aug 2, 2021
guihecheng pushed a commit to guihecheng/ozone that referenced this pull request Dec 10, 2021
Squash merge branch 'tencent-master' into 'tencent-master'

* Skip this test, because we do a force version update to VERSION file,

* skip tests for non-rolling upgrade

* Skip finalize for upgrade with on-disk layout version updated only.

* HDDS-5513. Race condition upon dn restart at prefinalization. (apache#2471)

* HDDS-5514. Skip check for UNHEALTHY containers for datanode finalize. (apache#2469)

PLEASE REVERT ME WHEN DO UPGRADE AT 2021-08.