HDDS-12843. Intermittent NPE in TestDecommissionAndMaintenance #8643
peterxcli
left a comment
Thanks @chungen0126 for this fix!
...ds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/safemode/ECContainerSafeModeRule.java
        setNodeCmd.getStateExpiryEpochSeconds());
    try {
      persistDatanodeDetails(dni);
      persistUpdatedDatanodeDetails(dni, state, setNodeCmd.getStateExpiryEpochSeconds());
I think this behaviour should be more explicit.
-     persistUpdatedDatanodeDetails(dni, state, setNodeCmd.getStateExpiryEpochSeconds());
+     DatanodeDetails persistedDni = new DatanodeDetails(dni)
+         .setPersistedOpState(state)
+         .setPersistedOpStateExpiryEpochSec(stateExpiryEpochSeconds);
+     persistDatanodeDetails(persistedDni);
+     // Only update the in-memory state if datanodeDetails is persisted successfully
+     dni.setPersistedOpState(state);
+     dni.setPersistedOpStateExpiryEpochSec(stateExpiryEpochSeconds);
Do you mean we don't need persistUpdatedDatanodeDetails?
yes. what do you think?
jojochuang
left a comment
LGTM will merge it.
Thanks @chungen0126 for the patch, @jojochuang, @peterxcli for the review. We can inline
What changes were proposed in this pull request?
Two intermittent failures in TestDecommissionAndMaintenance
The root causes of the failures:
ecContainerDNsMap.clear().
SCM might think a datanode is already in IN_MAINTENANCE because of the heartbeat, but the actual state wasn’t persisted. If the user then shuts down the datanode, thinking it's safe, the datanode may fail to restart because it's missing persisted datanode details.
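The second root cause comes down to ordering: the in-memory state should change only after the write to disk succeeds. Below is a minimal, self-contained sketch of that ordering. It is not the code from this patch: `NodeDetails`, `DetailsStore`, `OpState` and `applyOpState` are hypothetical stand-ins for `DatanodeDetails`, the persistence call, and the command handler, and only the two setter names are borrowed from the review suggestion.

```java
import java.io.IOException;

public class PersistThenUpdateSketch {

  /** Hypothetical subset of the operational states discussed in this PR. */
  enum OpState { IN_SERVICE, IN_MAINTENANCE }

  /** Hypothetical stand-in for DatanodeDetails, holding only the two fields in question. */
  static final class NodeDetails {
    private OpState persistedOpState = OpState.IN_SERVICE;
    private long persistedOpStateExpiryEpochSec = 0L;

    NodeDetails() { }

    /** Copy constructor, so a candidate state can be built without touching the live object. */
    NodeDetails(NodeDetails other) {
      this.persistedOpState = other.persistedOpState;
      this.persistedOpStateExpiryEpochSec = other.persistedOpStateExpiryEpochSec;
    }

    OpState getPersistedOpState() { return persistedOpState; }
    long getPersistedOpStateExpiryEpochSec() { return persistedOpStateExpiryEpochSec; }
    void setPersistedOpState(OpState state) { this.persistedOpState = state; }
    void setPersistedOpStateExpiryEpochSec(long sec) { this.persistedOpStateExpiryEpochSec = sec; }
  }

  /** Hypothetical persistence hook; the real code writes the datanode details to disk. */
  interface DetailsStore {
    void persist(NodeDetails details) throws IOException;
  }

  /**
   * Persists a copy carrying the new state first, and updates the live
   * object only if that write succeeded. Returns false on failure, in
   * which case the live object is left unchanged.
   */
  static boolean applyOpState(NodeDetails live, OpState state, long expiryEpochSec,
      DetailsStore store) {
    NodeDetails candidate = new NodeDetails(live);
    candidate.setPersistedOpState(state);
    candidate.setPersistedOpStateExpiryEpochSec(expiryEpochSec);
    try {
      store.persist(candidate);
    } catch (IOException e) {
      return false;
    }
    // Only reached when the write succeeded.
    live.setPersistedOpState(state);
    live.setPersistedOpStateExpiryEpochSec(expiryEpochSec);
    return true;
  }

  public static void main(String[] args) {
    NodeDetails dn = new NodeDetails();
    // Simulated write failure: the in-memory state must not change.
    boolean ok = applyOpState(dn, OpState.IN_MAINTENANCE, 100L,
        d -> { throw new IOException("simulated write failure"); });
    System.out.println(ok + " " + dn.getPersistedOpState());
    // Successful write: the in-memory state follows.
    ok = applyOpState(dn, OpState.IN_MAINTENANCE, 100L, d -> { });
    System.out.println(ok + " " + dn.getPersistedOpState());
  }
}
```

With this ordering, a failed write leaves the node reporting its old state, so a heartbeat cannot advertise IN_MAINTENANCE before the details are on disk.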
What does this PR change?
ecContainerDNsMap.clear() in ECContainerSafeModeRule#initializeRule.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12843
How was this patch tested?
CI:
https://github.com/chungen0126/ozone/actions/runs/15698436934
Passed 20x50 after change:
https://github.com/chungen0126/ozone/actions/runs/15697148311