@GlenGeng-awx GlenGeng-awx commented Mar 29, 2021

What changes were proposed in this pull request?

When SCM HA is enabled, SCM may not learn the full port list of a DN after that DN restarts.

The issue is:
SCMNodeManager records the DatanodeDetails only once, during registration. The DN, however, does not record the admin, server, and client ports in DatanodeDetails until its Ratis server is up. This creates a race: if the register request is sent before the Ratis server is up, SCM won't know the full port list of that DN.
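
The race can be illustrated with a small, single-threaded sketch. It is not Ozone code: `RegisterRaceSketch`, `Details`, `Scm`, and the port strings are hypothetical stand-ins, and the only assumption taken from the description above is that SCM keeps whatever the DN reported at register time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ordering problem; none of these names are Ozone APIs.
class RegisterRaceSketch {

  /** Stand-in for the port list carried by DatanodeDetails. */
  static final class Details {
    final List<String> ports = new ArrayList<>();
  }

  /** Stand-in for SCM: it copies the details once, at register time. */
  static final class Scm {
    List<String> recordedPorts = new ArrayList<>();

    void register(Details details) {
      recordedPorts = new ArrayList<>(details.ports);
    }
  }

  /** Stand-in for the DN's Ratis server coming up and publishing its ports. */
  static void startRatis(Details details) {
    details.ports.add("admin");
    details.ports.add("server");
    details.ports.add("client");
  }

  /** Returns what SCM ends up knowing for the given ordering. */
  static List<String> recordedPorts(boolean ratisUpBeforeRegister) {
    Details dn = new Details();
    Scm scm = new Scm();
    if (ratisUpBeforeRegister) {
      startRatis(dn);
      scm.register(dn);
    } else {
      scm.register(dn); // SCM snapshots an incomplete port list
      startRatis(dn);   // too late: SCM never sees these ports
    }
    return scm.recordedPorts;
  }
}
```

Calling `recordedPorts(false)` models the failing order, where the snapshot is taken before the ports exist.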

The solution is:
If SCM HA is enabled, OzoneContainer#start() will be called multiple times from VersionEndpointTask. The first call should do the initialization, and subsequent calls should wait until OzoneContainer is initialized.
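
The "first call initializes, later calls wait" behavior can be sketched roughly as below. This is a minimal illustration under assumptions, not the code in this patch: `StartOnceSketch`, its fields, and the `initRuns` counter are hypothetical, and the real OzoneContainer#start() takes a cluster ID and throws IOException.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names do not come from the Ozone code base.
class StartOnceSketch {
  // Set by the first caller, which then owns the initialization.
  private final AtomicBoolean claimed = new AtomicBoolean(false);
  // Released once the initializing caller has finished.
  private final CountDownLatch initialized = new CountDownLatch(1);
  // Counts how often the initialization body actually ran.
  private final AtomicInteger initRuns = new AtomicInteger();

  void start() {
    if (claimed.compareAndSet(false, true)) {
      try {
        initRuns.incrementAndGet(); // stand-in for the real start-up work
      } finally {
        // Release waiters even on failure so they cannot block forever.
        // A real implementation would also have to report the failure to them.
        initialized.countDown();
      }
      return;
    }
    try {
      // Later callers block here until the first caller is done.
      initialized.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for init", e);
    }
  }

  int getInitRuns() {
    return initRuns.get();
  }

  boolean isInitialized() {
    return initialized.getCount() == 0;
  }
}
```

With this shape, a caller that returns from `start()` can rely on initialization having completed, whichever caller performed it.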

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5033

How was this patch tested?

CI and integration tests inside Tencent.

@GlenGeng-awx GlenGeng-awx force-pushed the HDDS-5033 branch 2 times, most recently from c8127fb to af09298 Compare March 29, 2021 08:59
@bshashikant bshashikant changed the base branch from HDDS-2823 to master March 30, 2021 07:15
@GlenGeng-awx
Contributor Author

@bshashikant @nandakumar131 Please take a look at this bug fix for SCM HA. Thanks.

*/
public void start(String clusterId) throws IOException {
if (!isStarted.compareAndSet(false, true)) {
if (!initializingStatus.compareAndSet(

Contributor

Please add some documentation here.

Contributor Author

Done. Please take another look!

@bshashikant
Contributor

bshashikant commented Apr 1, 2021

@GlenGeng, can you explain why OzoneContainer#start() needs to be called multiple times if SCM HA is enabled for registering to the same SCM?

@GlenGeng-awx
Contributor Author

GlenGeng-awx commented Apr 1, 2021

@GlenGeng, can you explain why OzoneContainer#start() needs to be called multiple times if SCM HA is enabled for registering to the same SCM?

Please check

public class SCMConnectionManager
    implements Closeable, SCMConnectionManagerMXBean {
  private static final Logger LOG =
      LoggerFactory.getLogger(SCMConnectionManager.class);

  private final ReadWriteLock mapLock;
  private final Map<InetSocketAddress, EndpointStateMachine> scmMachines;

Each SCM connection goes through VersionEndpointTask, RegisterEndpointTask, and HeartbeatEndpointTask.

Previously, VersionEndpointTask#call() executed

          // Start the container services after getting the version information
          ozoneContainer.start(clusterId);

If SCM HA is enabled, three VersionEndpointTask instances are created, one for each SCM.

The DN calls VersionEndpointTask#call() for each of them. Yet we need to ensure that OzoneContainer is started only once.
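
As a rough illustration of that call pattern, the sketch below runs one task per SCM against a shared container and counts calls versus actual initializations. Everything here is hypothetical (`PerScmVersionTaskSketch`, `Container`, the counters, and the `"cluster-1"` ID); the guard is a plain compareAndSet that keeps initialization to a single run and does not make later callers wait for it to finish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; these names are not taken from the Ozone code base.
class PerScmVersionTaskSketch {

  /** Stand-in for the container shared by all per-SCM tasks. */
  static final class Container {
    final AtomicBoolean started = new AtomicBoolean(false);
    final AtomicInteger startCalls = new AtomicInteger();
    final AtomicInteger initRuns = new AtomicInteger();

    void start(String clusterId) {
      startCalls.incrementAndGet();
      // Only the caller that flips the flag performs the initialization.
      if (started.compareAndSet(false, true)) {
        initRuns.incrementAndGet();
      }
    }
  }

  /** Runs one task per SCM; returns {number of start calls, number of init runs}. */
  static int[] run(int scmCount) {
    Container container = new Container();
    ExecutorService pool = Executors.newFixedThreadPool(scmCount);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int i = 0; i < scmCount; i++) {
        // Each task stands in for one per-SCM version task.
        futures.add(pool.submit(() -> container.start("cluster-1")));
      }
      for (Future<?> future : futures) {
        future.get(); // wait for every task to finish
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return new int[] {container.startCalls.get(), container.initRuns.get()};
  }
}
```

With three SCMs, `run(3)` sees three `start` calls but a single initialization run.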

@GlenGeng-awx
Contributor Author

@GlenGeng, can you explain why OzoneContainer#start() needs to be called multiple times if SCM HA is enabled for registering to the same SCM?

Actually no, each call is for a different SCM.

@bshashikant bshashikant merged commit 1d4351e into apache:master Apr 1, 2021
errose28 added a commit to errose28/ozone that referenced this pull request Apr 7, 2021
* HDDS-3698-nonrolling-upgrade: (144 commits)
  fix project name in NOTICE.txt (apache#2112)
  HDDS-5066. Use fixed vesion from pnpm to build recon (apache#2115)
  HDDS-5014. Add non-rolling upgrade design docs.
  HDDS-5035. Use default config values to solve generated config file conflict (apache#2087)
  HDDS-5032. DN stopped to load containers on volume after a container load exception. (apache#2109)
  HDDS-4504. Datanode deletion config should be based on number of blocks (apache#1885)
  Fix ozone-ha acceptance test.
  HDDS-5058. Make getScmInfo retry for a duration.
  HDDS-4506. Support query parameter based v4 auth in S3g (apache#1628)
  HDDS-4553. ChunkInputStream should release buffer as soon as last byte in the buffer is read (apache#2062)
  HDDS-5022. SCM get roles command should provide Ratis Leader/Follower… (apache#2098)
  HDDS-5033. SCM may not be able to know full port list of Datanode after Datanode is started. (apache#2090)
  HDDS-3752. Fix o3fs list bucket contents issue when without tailing "/" (apache#2088)
  HDDS-4901. Remove OmOzoneAclMap from OmVolumeArgs to avoid OzoneAcl conversions (apache#1992)
  HDDS-4987. Import container should not delete container contents if container already exists (apache#2077)
  Checkstyle fix.
  Intialize DN layout version before security init.
  HDDS-4915. [SCM HA Security] Integrate CertClient. (apache#2000)
  HDDS-5049. Add timeout support for ratis requests in SCM HA. (apache#2099)
  trigger new CI check
  ...
errose28 added a commit to errose28/ozone that referenced this pull request Apr 9, 2021
* HDDS-3698-nonrolling-upgrade: (150 commits)
  HDDS-5056. Avoid false positiver error messages during pipeline creations (apache#2105)
  HDDS-5027. [SCM HA Security] Handle leader changes during bootstrap. (apache#2113)
  HDDS-5032. Fix findbugs (apache#2120)
  HDDS-5062. Add a config to bypass clusterId validation for bootstrapping SCM. (apache#2114)
  HDDS-5011. Introduce Java based ReplicationConfig implementation (apache#2089)
  HDDS-4925. Introduce ContainerBalancer in SCM with start/stop capabilities. (apache#2097)
  fix project name in NOTICE.txt (apache#2112)
  HDDS-5066. Use fixed vesion from pnpm to build recon (apache#2115)
  HDDS-5014. Add non-rolling upgrade design docs.
  HDDS-5035. Use default config values to solve generated config file conflict (apache#2087)
  HDDS-5032. DN stopped to load containers on volume after a container load exception. (apache#2109)
  HDDS-4504. Datanode deletion config should be based on number of blocks (apache#1885)
  Fix ozone-ha acceptance test.
  HDDS-5058. Make getScmInfo retry for a duration.
  HDDS-4506. Support query parameter based v4 auth in S3g (apache#1628)
  HDDS-4553. ChunkInputStream should release buffer as soon as last byte in the buffer is read (apache#2062)
  HDDS-5022. SCM get roles command should provide Ratis Leader/Follower… (apache#2098)
  HDDS-5033. SCM may not be able to know full port list of Datanode after Datanode is started. (apache#2090)
  HDDS-3752. Fix o3fs list bucket contents issue when without tailing "/" (apache#2088)
  HDDS-4901. Remove OmOzoneAclMap from OmVolumeArgs to avoid OzoneAcl conversions (apache#1992)
  ...