@Gargi-jais11 commented Nov 17, 2025

What changes were proposed in this pull request?

After direct client-to-DN communication for DiskBalancer was implemented in HDDS-13598, the SCM part is no longer needed and should be removed.
This JIRA covers:

  • Supporting stdin, so that a list of datanodes can easily be passed in from a file.
  • Supporting a JSON output option, so that output gathered from many nodes can be parsed to pick out the problematic ones.
  • Removing the -d / --datanodes flag; datanodes are now passed as a space-separated list of arguments, similar to ozone admin container info.
  • Removing the rest of the SCM-side connection with the datanode.
  • Cleaning up obsolete protobuf definitions in ScmServerDatanodeHeartbeatProtocol.proto and ScmAdminProtocol.proto.
  • Rewriting the integration tests (TestDiskBalancer, TestDiskBalancerDuringDecommissionAndMaintenance) for direct communication.
  • Rewriting the testdiskbalancer.robot Robot test.
  • Unit testing all new diskbalancer subcommands.
  • Updating the design doc and feature doc for DiskBalancer (DiskBalancer.md).
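
The stdin support in the first bullet could be driven from a script roughly as follows. This is a minimal sketch, not part of the patch: the helper names `build_status_invocation` and `run_status` are hypothetical, and the one-datanode-per-line input format is an assumption, since the PR does not show the contents of the input file. The only CLI syntax used is the `status -` form that appears in the manual test output.

```python
import subprocess
from typing import Iterable, List, Tuple


def build_status_invocation(hosts: Iterable[str]) -> Tuple[List[str], str]:
    """Return (argv, stdin_text) for a DiskBalancer status call fed from stdin."""
    # Drop blank entries and surrounding whitespace; keep the caller's order.
    cleaned = [h.strip() for h in hosts if h.strip()]
    # A lone "-" argument tells the CLI to read the datanode list from stdin.
    argv = ["ozone", "admin", "datanode", "diskbalancer", "status", "-"]
    # Assumption: the CLI expects one datanode per line.
    stdin_text = "\n".join(cleaned) + "\n"
    return argv, stdin_text


def run_status(hosts: Iterable[str]) -> subprocess.CompletedProcess:
    """Run the command; requires the ozone CLI on PATH (not exercised here)."""
    argv, stdin_text = build_status_invocation(hosts)
    return subprocess.run(argv, input=stdin_text, capture_output=True,
                          text=True, check=False)
```

Keeping the argument construction separate from the subprocess call makes the host-list handling checkable without a running cluster.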

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-13878

How was this patch tested?

Added a new unit test: TestDiskBalancerSubCommands
Updated integration tests: TestDiskBalancer.java and TestDiskBalancerDuringDecommissioningAndMaintenance.java
Updated acceptance test: testdiskbalancer.robot

Below is the manual test output from a docker cluster:

  • specific dn request:
bash-5.1$ ozone admin datanode diskbalancer status ozone-datanode-2 ozone-datanode-3 ozone-datanode-5
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)
ozone-datanode-2.ozone_default      STOPPED         10.0000         10              5            0            0            0               0               0              
ozone-datanode-3.ozone_default      STOPPED         10.0000         10              5            0            0            0               0               0              
ozone-datanode-5.ozone_default      STOPPED         10.0000         10              5            0            0            0               0               0              

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
bash-5.1$ ozone admin datanode diskbalancer report ozone-datanode-2 ozone-datanode-3 ozone-datanode-5
Report result:
Datanode                                           VolumeDensity
ozone-datanode-5.ozone_default                     5.052149552875473E-9
ozone-datanode-2.ozone_default                     5.0521495459365795E-9
ozone-datanode-3.ozone_default                     5.052149538997686E-9

bash-5.1$ ozone admin datanode diskbalancer start -t 0.001 -s false ozone-datanode-2 ozone-datanode-3 ozone-datanode-5
Started DiskBalancer on nodes: [ozone-datanode-2, ozone-datanode-3, ozone-datanode-5]

bash-5.1$ ozone admin datanode diskbalancer stop ozone-datanode-2 ozone-datanode-3 ozone-datanode-5
Stopped DiskBalancer on nodes: [ozone-datanode-2, ozone-datanode-3, ozone-datanode-5]
  • JSON output:
bash-5.1$ ozone admin datanode diskbalancer report --in-service-datanodes --json
[ {
  "datanode" : "ozone-datanode-5.ozone_default",
  "action" : "report",
  "status" : "success",
  "volumeDensity" : 0.0
}, {
  "datanode" : "ozone-datanode-1.ozone_default",
  "action" : "report",
  "status" : "success",
  "volumeDensity" : 0.0
}, {
  "datanode" : "ozone-datanode-3.ozone_default",
  "action" : "report",
  "status" : "success",
  "volumeDensity" : 0.0
}, {
  "datanode" : "ozone-datanode-4.ozone_default",
  "action" : "report",
  "status" : "success",
  "volumeDensity" : 0.0
}, {
  "datanode" : "ozone-datanode-2.ozone_default",
  "action" : "report",
  "status" : "success",
  "volumeDensity" : 0.0
} ]

bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes --json
[ {
  "datanode" : "ozone-datanode-5.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "STOPPED",
  "threshold" : 10.0,
  "bandwidthInMB" : 10,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
}, {
  "datanode" : "ozone-datanode-1.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "STOPPED",
  "threshold" : 10.0,
  "bandwidthInMB" : 10,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
}, {
  "datanode" : "ozone-datanode-3.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "STOPPED",
  "threshold" : 10.0,
  "bandwidthInMB" : 10,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
}, {
  "datanode" : "ozone-datanode-4.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "STOPPED",
  "threshold" : 10.0,
  "bandwidthInMB" : 10,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
}, {
  "datanode" : "ozone-datanode-2.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "STOPPED",
  "threshold" : 10.0,
  "bandwidthInMB" : 10,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
} ]

bash-5.1$ ozone admin datanode diskbalancer update -t 0.001 -b 20 --json ozone-datanode-1 ozone-datanode-4                 
[ {
  "datanode" : "ozone-datanode-1",
  "action" : "update",
  "status" : "success",
  "configuration" : {
    "threshold" : 0.001,
    "bandwidthInMB" : 20
  }
}, {
  "datanode" : "ozone-datanode-4",
  "action" : "update",
  "status" : "success",
  "configuration" : {
    "threshold" : 0.001,
    "bandwidthInMB" : 20
  }
} ]
bash-5.1$ ozone admin datanode diskbalancer start -s false --json ozone-datanode-1 ozone-datanode-4
[ {
  "datanode" : "ozone-datanode-1",
  "action" : "start",
  "status" : "success",
  "configuration" : {
    "stopAfterDiskEven" : false
  }
}, {
  "datanode" : "ozone-datanode-4",
  "action" : "start",
  "status" : "success",
  "configuration" : {
    "stopAfterDiskEven" : false
  }
} ]
bash-5.1$ ozone admin datanode diskbalancer stop --json ozone-datanode-1                 
[ {
  "datanode" : "ozone-datanode-1",
  "action" : "stop",
  "status" : "success"
} ]
  • Datanode list read from stdin:
bash-5.1$ cat datanode.txt | ozone admin datanode diskbalancer status -
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)
ozone-datanode-1.ozone_default      RUNNING         0.0010          20              20           0            0            0               0               0              
ozone-datanode-2.ozone_default      RUNNING         0.0010          10              20           0            0            0               0               0              

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
bash-5.1$ cat datanode.txt | ozone admin datanode diskbalancer report -
Report result:
Datanode                                           VolumeDensity
ozone-datanode-1.ozone_default                     0.0
ozone-datanode-2.ozone_default                     0.0

bash-5.1$ cat datanode.txt | ozone admin datanode diskbalancer start -p 20 -
Started DiskBalancer on nodes: [ozone-datanode-1, ozone-datanode-2]
bash-5.1$ cat datanode.txt | ozone admin datanode diskbalancer status -
Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)
ozone-datanode-1.ozone_default      RUNNING         0.0010          20              20           0            0            0               0               0              
ozone-datanode-2.ozone_default      RUNNING         0.0010          10              20           0            0            0               0               0              

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
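
The note about the estimated time left can be read as a simple division. The sketch below is one plausible reading, not the datanode's actual code: it assumes that BandwidthInMB means MB per second, that threads do not enter the estimate, and that no rounding is applied. The function and parameter names are hypothetical.

```python
def estimate_minutes_left(est_bytes_to_move_mb: float,
                          bandwidth_mb_per_sec: float) -> float:
    """Estimate the remaining balancing time in minutes.

    Assumption: bandwidth is expressed in MB per second, so
    MB / (MB/s) gives seconds, and dividing by 60 gives minutes.
    """
    if bandwidth_mb_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    seconds = est_bytes_to_move_mb / bandwidth_mb_per_sec
    return seconds / 60.0


if __name__ == "__main__":
    # Nothing left to move, as in the status rows above.
    print(estimate_minutes_left(0, 10))
    # Hypothetical workload: 12000 MB at 20 MB/s.
    print(estimate_minutes_left(12000, 20))
```

Under these assumptions, a node with EstBytesToMove(MB) of 0 always reports 0 minutes, which matches the status rows shown in the test output.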

@Gargi-jais11 marked this pull request as ready for review November 17, 2025 15:33
@ChenSammi
Contributor

@Gargi-jais11, could you add some examples of the CLI's failure-case output, with and without JSON enabled?

@Gargi-jais11
Contributor Author

> @Gargi-jais11, could you add some examples of the CLI's failure-case output, with and without JSON enabled?

Okay sure, I will try to add output as soon as possible.

@Gargi-jais11
Contributor Author

Failure Case Output of CLI without json enabled:

bash-5.1$ ozone admin datanode diskbalancer update -b 50 ozone-datanode-1 ozone-datanode-5 ozone-datanode-4              
Error on node [ozone-datanode-5]: Invalid host name: local host is: "289e51bc0b85/172.18.0.3"; destination host is: "ozone-datanode-5":19864; java.net.UnknownHostException: Invalid host name: local host is: "289e51bc0b85/172.18.0.3"; destination host is: "ozone-datanode-5":19864; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost; For more details see:  http://wiki.apache.org/hadoop/UnknownHost

Updated DiskBalancer configuration on nodes: [ozone-datanode-1, ozone-datanode-4]
Failed to update DiskBalancer configuration on nodes: [ozone-datanode-5]

Failure Case Output of CLI with json enabled:

bash-5.1$ ozone admin datanode diskbalancer update -b 50 ozone-datanode-1 ozone-datanode-5 ozone-datanode-4 --json
[ {
  "datanode" : "ozone-datanode-1",
  "action" : "update",
  "status" : "success",
  "configuration" : {
    "bandwidthInMB" : 50
  }
}, {
  "datanode" : "ozone-datanode-4",
  "action" : "update",
  "status" : "success",
  "configuration" : {
    "bandwidthInMB" : 50
  }
} ]
Error on node [ozone-datanode-5]: Invalid host name: local host is: "289e51bc0b85/172.18.0.3"; destination host is: "ozone-datanode-5":19864; java.net.UnknownHostException: Invalid host name: local host is: "289e51bc0b85/172.18.0.3"; destination host is: "ozone-datanode-5":19864; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost; For more details see:  http://wiki.apache.org/hadoop/UnknownHost

Thanks, @ChenSammi, for pointing this out; I had overlooked it. In the JSON case, the failure for DN-5 is not reported in JSON format. The expected output is an entry like this:

{
  "datanode" : "ozone-datanode-1",
  "action" : "start",
  "status" : "failure",
  "errorMsg" : "******* ",
  "configuration" : {
    "stopAfterDiskEven" : false
  }
}

I will rework this so that failures are reported in the required format.

@Gargi-jais11
Contributor Author

Gargi-jais11 commented Nov 20, 2025

New Failure Case Output of CLI without json enabled:

bash-5.1$ ozone admin datanode diskbalancer status ozone-datanode-2 ozone-datanode-3 ozone-datanode-5       
Error on node [ozone-datanode-3]: Invalid host name: local host is: "cc2acd5aa05c/172.18.0.2"; destination host is: "ozone-datanode-3":19864; java.net.UnknownHostException: Invalid host name: local host is: "cc2acd5aa05c/172.18.0.2"; destination host is: "ozone-datanode-3":19864; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
Failed to get DiskBalancer status from nodes: [ozone-datanode-3]

Status result:
Datanode                            Status          Threshold(%)    BandwidthInMB   Threads      SuccessMove  FailureMove  BytesMoved(MB)  EstBytesToMove(MB) EstTimeLeft(min)
ozone-datanode-2.ozone_default      STOPPED         0.0001          300             5            0            0            0               0               0              
ozone-datanode-5.ozone_default      STOPPED         0.0001          300             5            0            0            0               0               0              

Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
bash-5.1$ ozone admin datanode diskbalancer start -s false ozone-datanode-2 ozone-datanode-3 ozone-datanode-5
Error on node [ozone-datanode-3]: Invalid host name: local host is: "cc2acd5aa05c/172.18.0.2"; destination host is: "ozone-datanode-3":19864; java.net.UnknownHostException: Invalid host name: local host is: "cc2acd5aa05c/172.18.0.2"; destination host is: "ozone-datanode-3":19864; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost; For more details see:  http://wiki.apache.org/hadoop/UnknownHost

Started DiskBalancer on nodes: [ozone-datanode-2, ozone-datanode-5]
Failed to start DiskBalancer on nodes: [ozone-datanode-3]

New Failure Case Output of CLI with json enabled:

bash-5.1$ ozone admin datanode diskbalancer start -s false ozone-datanode-2 ozone-datanode-3 ozone-datanode-5 --json
[ {
  "datanode" : "ozone-datanode-2",
  "action" : "start",
  "status" : "success",
  "configuration" : {
    "stopAfterDiskEven" : false
  }
}, {
  "datanode" : "ozone-datanode-3",
  "action" : "start",
  "status" : "failure",
  "errorMsg" : "Invalid host name: local host is: \"cc2acd5aa05c/172.18.0.2\"; destination host is: \"ozone-datanode-3\":19864; java.net.UnknownHostException: Invalid host name: local host is: \"cc2acd5aa05c/172.18.0.2\"; destination host is: \"ozone-datanode-3\":19864; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost; For more details see:  http://wiki.apache.org/hadoop/UnknownHost",
  "configuration" : {
    "stopAfterDiskEven" : false
  }
}, {
  "datanode" : "ozone-datanode-5",
  "action" : "start",
  "status" : "success",
  "configuration" : {
    "stopAfterDiskEven" : false
  }
} ]
bash-5.1$ ozone admin datanode diskbalancer status ozone-datanode-2 ozone-datanode-4 ozone-datanode-3 --json
[ {
  "datanode" : "ozone-datanode-2.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "RUNNING",
  "threshold" : 0.1254,
  "bandwidthInMB" : 20,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
}, {
  "datanode" : "ozone-datanode-4.ozone_default",
  "action" : "status",
  "status" : "success",
  "serviceStatus" : "RUNNING",
  "threshold" : 0.1254,
  "bandwidthInMB" : 20,
  "threads" : 5,
  "successMove" : 0,
  "failureMove" : 0,
  "bytesMovedMB" : 0,
  "estBytesToMoveMB" : 0,
  "estTimeLeftMin" : 0
}, {
  "datanode" : "ozone-datanode-3",
  "action" : "status",
  "status" : "failure",
  "errorMsg" : "Invalid host name: local host is: \"e7543a7c1b29/172.18.0.7\"; destination host is: \"ozone-datanode-3\":19864; java.net.UnknownHostException: Invalid host name: local host is: \"e7543a7c1b29/172.18.0.7\"; destination host is: \"ozone-datanode-3\":19864; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost; For more details see:  http://wiki.apache.org/hadoop/UnknownHost"
} ]
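
With failures now reported inside the JSON array, picking out the problematic nodes becomes a short filter. The following is a minimal sketch: `failed_nodes` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the `start --json` output above with the error message shortened for readability.

```python
import json
from typing import Dict


def failed_nodes(cli_json: str) -> Dict[str, str]:
    """Map each datanode whose record has status "failure" to its errorMsg."""
    records = json.loads(cli_json)
    return {
        rec["datanode"]: rec.get("errorMsg", "")
        for rec in records
        if rec.get("status") == "failure"
    }


# Trimmed copy of the start --json output; the errorMsg is shortened here.
SAMPLE = """
[ {"datanode": "ozone-datanode-2", "action": "start", "status": "success"},
  {"datanode": "ozone-datanode-3", "action": "start", "status": "failure",
   "errorMsg": "java.net.UnknownHostException"},
  {"datanode": "ozone-datanode-5", "action": "start", "status": "success"} ]
"""

if __name__ == "__main__":
    for node, error in failed_nodes(SAMPLE).items():
        print(f"{node}: {error}")
```

The filter relies only on the "datanode", "status", and "errorMsg" fields, so it applies unchanged to the start, stop, update, status, and report outputs shown in this PR.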

@ChenSammi
Contributor

For the JSON failure case, since we add "status" : "failure" to the result, it is better, for symmetry, to also add "status" : "success" to the success output of all commands. For the status command specifically, "status" is currently used to represent the disk balancer service status, so we should choose another word for that, for example "serviceStatus".
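
The convention suggested here can be written down as a small check over one output record. This is a sketch of the convention as described in this thread, not code from the patch: `record_problems` is a hypothetical name, and the rule that every failure carries an "errorMsg" is inferred from the sample outputs rather than from a documented schema.

```python
from typing import Any, Dict, List


def record_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of convention violations for one CLI JSON record."""
    problems: List[str] = []
    status = record.get("status")
    # "status" is reserved for the outcome of the CLI request itself.
    if status not in ("success", "failure"):
        problems.append("status must be 'success' or 'failure'")
    # Inferred rule: failures explain themselves through errorMsg.
    if status == "failure" and not record.get("errorMsg"):
        problems.append("failure record lacks errorMsg")
    # The balancer's own state lives under "serviceStatus" for status calls.
    if (record.get("action") == "status" and status == "success"
            and "serviceStatus" not in record):
        problems.append("successful status record lacks serviceStatus")
    return problems


if __name__ == "__main__":
    old_style = {"datanode": "dn-1", "action": "status", "status": "STOPPED"}
    print(record_problems(old_style))
```

A record that still uses "status" for the service state, as in the old-style example, is flagged because its value is neither "success" nor "failure".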

@ChenSammi
Contributor

Thanks @Gargi-jais11, let's wait for CI to pass.

@ChenSammi merged commit c07a4a7 into apache:HDDS-5713 Nov 25, 2025
45 checks passed