HDDS-5513. Race condition upon dn restart at prefinalization. #2471
Thanks for taking a look at this issue @guihecheng. I am not sure the interrupt reported in the Jira came from
If this is correct then the interrupt could not have come from Else if this is not correct then we need to figure out why Also, those ratis log messages shared in the jira that occur before the pre-finalize actions appear to come from normal raft server construction. I do not think they indicate that the raft server was actually started when they were printed, since that should not happen until Do you have any way to reproduce this issue or help verify where the interrupt came from?
If I read the Ratis code correctly, this is not quite true: a triggerHeartbeat call can potentially be made at
The raft server was indeed not started, and there is a potential triggerHeartbeat call in the construction code, as said above. This is from the function call Since the log entry above was printed in the same second in which the exception was thrown, I suspect that the interrupt() comes from the place in the patch.
Thanks @errose28 for the detailed check. We only encountered this problem once, during a non-rolling upgrade, where about 4 of 40 nodes reported it. I can't reproduce it in my test deployment, since it is hard for a thread to catch the interrupt() sent by another thread at exactly the right moment.
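To illustrate why the timing matters, here is a minimal, self-contained Java sketch of how an interrupt sent by one thread surfaces in another. It is not Ozone or Ratis code; the class and method names are hypothetical. The interrupt only sets a flag on the target thread, and the flag turns into an InterruptedException at the next interruptible blocking call, which is why the failure depends on what the target thread happens to be doing next.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of Ozone or Ratis.
final class InterruptDemo {

  /** Returns true if the worker's blocking call was aborted by a pending interrupt. */
  static boolean blockingCallSeesPendingInterrupt() throws InterruptedException {
    AtomicBoolean sawInterrupt = new AtomicBoolean(false);

    Thread worker = new Thread(() -> {
      // Spin until another thread has set our interrupt flag.
      // isInterrupted() reads the flag without clearing it.
      while (!Thread.currentThread().isInterrupted()) {
        Thread.yield();
      }
      try {
        // The flag is still pending, so this blocking call throws at once.
        Thread.sleep(10);
      } catch (InterruptedException e) {
        sawInterrupt.set(true);
      }
    });

    worker.start();
    worker.interrupt(); // sent from a different thread, as in the reported race
    worker.join();
    return sawInterrupt.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("blocking call interrupted: " + blockingCallSeesPendingInterrupt());
  }
}
```

In the sketch the worker deliberately waits for the flag, so the outcome is deterministic. In a real deployment nothing forces that ordering, which would be consistent with only a few nodes hitting the problem.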
Ah, I understand now, thanks for the explanation. It makes sense how the triggerHeartbeat causes the interrupt. I did not realize that a snapshot install was happening in parallel with the pre-finalize actions. Should we wait for the snapshot install to finish before running pre-finalize actions? If a future layout change involves state machine data, I am concerned that the current behavior will be problematic.
Oh, that's a good question. I'm not quite sure, but it seems cleaner to have consistent state machine data before we do an upgrade operation.
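As a rough sketch of the ordering being discussed (waiting for a snapshot install to complete before running pre-finalize actions), the following Java example uses a CompletableFuture as a stand-in for the install. All names here are hypothetical and the tasks are placeholders; it is not how Ozone or Ratis coordinates these steps, only one possible shape for such a wait.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not Ozone or Ratis code.
final class StartupOrdering {

  /** Runs a placeholder snapshot install, waits for it, then runs placeholder pre-finalize actions. */
  static List<String> run() {
    List<String> events = Collections.synchronizedList(new ArrayList<>());

    // Placeholder for an asynchronous snapshot install.
    CompletableFuture<Void> snapshotInstall =
        CompletableFuture.runAsync(() -> events.add("snapshot-install-done"));

    // Block until the install has finished, so state machine data is
    // consistent before any upgrade-related work starts.
    snapshotInstall.join();

    // Placeholder for the pre-finalize actions.
    events.add("pre-finalize-actions");
    return events;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

A real implementation would presumably also need a timeout and a failure path for an install that never completes; the sketch leaves both out.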
cc @ChenSammi @bshashikant for comments.
Thanks for investigating this @guihecheng and for the review @bshashikant. Let's merge this to unblock upgrades from master, and investigate the coordination between snapshot install and pre-finalize actions in a follow-up Jira: HDDS-5525.
Squash merge branch 'tencent-master' into 'tencent-master'
* Skip this test, because we do a force version update to VERSION file,
* skip tests for non-rolling upgrade
* Skip finalize for upgrade with on-disk layout version updated only.
* HDDS-5513. Race condition upon dn restart at prefinalization. (apache#2471)
* HDDS-5514. Skip check for UNHEALTHY containers for datanode finalize. (apache#2469)
PLEASE REVERT ME WHEN DO UPGRADE AT 2021-08.
What changes were proposed in this pull request?
Race condition upon dn restart at prefinalization.
More details in the jira below.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5513
How was this patch tested?
CI