Skip to content

Conversation

@chungen0126
Copy link
Contributor

What changes were proposed in this pull request?

Two intermittent failures in TestDecommissionAndMaintenance

	at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:116)
	at org.apache.hadoop.ozone.MiniOzoneClusterImpl.waitForClusterToBeReady(MiniOzoneClusterImpl.java:166)
	at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartStorageContainerManager(MiniOzoneClusterImpl.java:291)
	at org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testDecommissioningNodesCompleteDecommissionOnSCMRestart(TestDecommissionAndMaintenance.java:279)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testSingleNodeWithOpenPipelineCanGotoMaintenance -- Time elapsed: 18.06 s <<< ERROR!
java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.ozone.container.common.helpers.DatanodeIdYaml$DatanodeDetailsYaml.getUuid()" because "datanodeDetailsYaml" is null
	at org.apache.hadoop.ozone.container.common.helpers.DatanodeIdYaml.readDatanodeIdFile(DatanodeIdYaml.java:91)
	at org.apache.hadoop.ozone.container.common.helpers.ContainerUtils.readDatanodeDetailsFrom(ContainerUtils.java:177)
	at org.apache.hadoop.ozone.HddsDatanodeService.initializeDatanodeDetails(HddsDatanodeService.java:428)
	at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:223)
	at org.apache.hadoop.ozone.MiniOzoneClusterImpl.startHddsDatanode(MiniOzoneClusterImpl.java:424)
	at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:362)
	at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:372)
	at org.apache.hadoop.hdds.scm.node.TestDecommissionAndMaintenance.testSingleNodeWithOpenPipelineCanGotoMaintenance(TestDecommissionAndMaintenance.java:462)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) 

The root causes of the failures:

  1. The first failure occurs after c7117dc, due to an accidental addition of ecContainerDNsMap.clear().
  2. The second failure is that we currently update DatanodeDetails in memory before persisting it to disk. If the persist step fails, the in-memory state is already changed and SCM may receive this incorrect state via heartbeat.
    SCM might think a datanode is already in IN_MAINTENANCE because of the heartbeat, but the actual state wasn’t persisted. If the user then shutdown the datanode, thinking it's safe, the datanode may fail to restart because it's missing persisted datanode details.

What does this pr change?

  1. Remove ecContainerDNsMap.clear() in ECContainerSafeModeRule#initializeRule.
  2. This improves state consistency in the datanode.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-12843

How was this patch tested?

CI:
https://github.com/chungen0126/ozone/actions/runs/15698436934

Passed 20x50 after change:
https://github.com/chungen0126/ozone/actions/runs/15697148311

@chungen0126 chungen0126 marked this pull request as ready for review June 17, 2025 09:09
@chungen0126 chungen0126 requested a review from peterxcli June 17, 2025 09:09
@ivandika3 ivandika3 added the test label Jun 17, 2025
Copy link
Member

@peterxcli peterxcli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chungen0126 for this fix!

setNodeCmd.getStateExpiryEpochSeconds());
try {
persistDatanodeDetails(dni);
persistUpdatedDatanodeDetails(dni, state, setNodeCmd.getStateExpiryEpochSeconds());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this behaviour should be more explicit.

Suggested change
persistUpdatedDatanodeDetails(dni, state, setNodeCmd.getStateExpiryEpochSeconds());
DatanodeDetails persistedDni = new DatanodeDetails(dni)
.setPersistedOpState(state)
.setPersistedOpStateExpiryEpochSec(stateExpiryEpochSeconds);
persistDatanodeDetails(persistedDni);
// Only update the inmem state if datanodeDetails is persisted successfully
dni.setPersistedOpState(state);
dni.setPersistedOpStateExpiryEpochSec(stateExpiryEpochSeconds);

Copy link
Contributor Author

@chungen0126 chungen0126 Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we don't need persistUpdatedDatanodeDetails?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. what do you think?

@peterxcli peterxcli self-requested a review June 19, 2025 02:07
@adoroszlai adoroszlai marked this pull request as draft June 20, 2025 10:13
@adoroszlai adoroszlai marked this pull request as ready for review June 21, 2025 19:31
Copy link
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM will merge it.

@adoroszlai adoroszlai merged commit 583730d into apache:master Jul 12, 2025
52 checks passed
@adoroszlai
Copy link
Contributor

Thanks @chungen0126 for the patch, @jojochuang, @peterxcli for the review.

We can inline persistUpdatedDatanodeDetails in a follow-up, if you prefer.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants