Skip to content

Conversation

@bshashikant
Copy link
Contributor

What changes were proposed in this pull request?

IN SCM HA, the primary node starts up the ratis server while other bootstrapping nodes will get added to the ratis group. Now, if all the bootstrapping SCM's get stopped, the primary node will now step down from leadership as it will loose majority. If the bootstrapping nodes are now bootstrapped again, the bootsrapping node will try to first validate the cluster id from the leader SCM with the persisted cluster id , but as there is no leader existing, bootstrapping wil keep on failing and retrying until it shuts down.

The issue can be very easily simulated in kubernetes deployments, where bootstrap and init cmds are run repeatedly on every restart.

The Jira aims to bypass the cluster id validation if a bootstrapping node already has a cluster id.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5062

How was this patch tested?

Added unit test

Copy link
Contributor

@bharatviswa504 bharatviswa504 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.
Minor comments/questions.

@GlenGeng-awx
Copy link
Contributor

LGTM. Wait for CI pass.

Copy link
Contributor

@bharatviswa504 bharatviswa504 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@bshashikant
Copy link
Contributor Author

The findbug issue reported will be fixed by #2120 . Merging it.

@bshashikant bshashikant merged commit 33ddcb3 into apache:master Apr 7, 2021
errose28 added a commit to errose28/ozone that referenced this pull request Apr 7, 2021
* HDDS-3698-nonrolling-upgrade:
  HDDS-5056. Avoid false positiver error messages during pipeline creations (apache#2105)
  HDDS-5027. [SCM HA Security] Handle leader changes during bootstrap. (apache#2113)
  HDDS-5032. Fix findbugs (apache#2120)
  HDDS-5062. Add a config to bypass clusterId validation for bootstrapping SCM. (apache#2114)
  HDDS-5011. Introduce Java based ReplicationConfig implementation (apache#2089)
  HDDS-4925. Introduce ContainerBalancer in SCM with start/stop capabilities. (apache#2097)
errose28 added a commit to errose28/ozone that referenced this pull request Apr 7, 2021
* HDDS-3698-nonrolling-upgrade:
  HDDS-5056. Avoid false positiver error messages during pipeline creations (apache#2105)
  HDDS-5027. [SCM HA Security] Handle leader changes during bootstrap. (apache#2113)
  HDDS-5032. Fix findbugs (apache#2120)
  HDDS-5062. Add a config to bypass clusterId validation for bootstrapping SCM. (apache#2114)
  HDDS-5011. Introduce Java based ReplicationConfig implementation (apache#2089)
  HDDS-4925. Introduce ContainerBalancer in SCM with start/stop capabilities. (apache#2097)
errose28 added a commit to errose28/ozone that referenced this pull request Apr 9, 2021
* HDDS-3698-nonrolling-upgrade: (150 commits)
  HDDS-5056. Avoid false positiver error messages during pipeline creations (apache#2105)
  HDDS-5027. [SCM HA Security] Handle leader changes during bootstrap. (apache#2113)
  HDDS-5032. Fix findbugs (apache#2120)
  HDDS-5062. Add a config to bypass clusterId validation for bootstrapping SCM. (apache#2114)
  HDDS-5011. Introduce Java based ReplicationConfig implementation (apache#2089)
  HDDS-4925. Introduce ContainerBalancer in SCM with start/stop capabilities. (apache#2097)
  fix project name in NOTICE.txt (apache#2112)
  HDDS-5066. Use fixed vesion from pnpm to build recon (apache#2115)
  HDDS-5014. Add non-rolling upgrade design docs.
  HDDS-5035. Use default config values to solve generated config file conflict (apache#2087)
  HDDS-5032. DN stopped to load containers on volume after a container load exception. (apache#2109)
  HDDS-4504. Datanode deletion config should be based on number of blocks (apache#1885)
  Fix ozone-ha acceptance test.
  HDDS-5058. Make getScmInfo retry for a duration.
  HDDS-4506. Support query parameter based v4 auth in S3g (apache#1628)
  HDDS-4553. ChunkInputStream should release buffer as soon as last byte in the buffer is read (apache#2062)
  HDDS-5022. SCM get roles command should provide Ratis Leader/Follower… (apache#2098)
  HDDS-5033. SCM may not be able to know full port list of Datanode after Datanode is started. (apache#2090)
  HDDS-3752. Fix o3fs list bucket contents issue when without tailing "/" (apache#2088)
  HDDS-4901. Remove OmOzoneAclMap from OmVolumeArgs to avoid OzoneAcl conversions (apache#1992)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants