HDDS-5033. SCM may not be able to know full port list of Datanode after Datanode is started. #2090
@bshashikant @nandakumar131 Please take a look at this SCM HA bug fix. Thanks.
   */
  public void start(String clusterId) throws IOException {
    if (!isStarted.compareAndSet(false, true)) {
    if (!initializingStatus.compareAndSet(
Please add some documentation here.
Done. Please take another look!
@GlenGeng, can you explain why OzoneContainer#start() needs to be called multiple times to register with the same SCM when SCM HA is enabled?
Please check The SCM Connection will go through Previously, the If SCM HA is enabled, there will be 3 VersionEndpointTask created, one for each SCM. DN will call |
Actually no, each call is for a different SCM.
What changes were proposed in this pull request?
When SCM HA is enabled, SCM may not know the full port list of a DN after that DN restarts.
The issue is:
SCMNodeManager records the DatanodeDetails only once, during registration. The DN, however, does not record its admin, server, and client ports into DatanodeDetails until its Ratis server is up. This creates a race: if the register request is sent before the Ratis server is up, SCM won't know the full port list of that DN.
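The ordering problem can be illustrated with a small, single-threaded sketch. It is not Ozone code: `NodeDetails` and `Registry` are hypothetical stand-ins for DatanodeDetails and SCMNodeManager, and the port names are just the three labels mentioned above. It only shows that a registry which keeps the first snapshot it sees will miss ports that are recorded afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class PortSnapshotRace {
  /** Hypothetical stand-in for DatanodeDetails: a mutable list of port names. */
  static final class NodeDetails {
    private final List<String> ports = new ArrayList<>();

    void addPort(String name) {
      ports.add(name);
    }

    List<String> snapshot() {
      return new ArrayList<>(ports);
    }
  }

  /** Hypothetical stand-in for the SCM side: records details once, at registration. */
  static final class Registry {
    private List<String> recorded;

    void register(NodeDetails details) {
      if (recorded == null) {
        recorded = details.snapshot();
      }
    }

    List<String> recordedPorts() {
      return recorded;
    }
  }

  /** Simulates the Ratis server coming up, which is when the ports get recorded. */
  private static void startRatis(NodeDetails dn) {
    dn.addPort("admin");
    dn.addPort("server");
    dn.addPort("client");
  }

  /** Bad ordering: the register request is sent before the ports are recorded. */
  static List<String> registerThenStartRatis() {
    NodeDetails dn = new NodeDetails();
    Registry scm = new Registry();
    scm.register(dn);
    startRatis(dn);
    return scm.recordedPorts();
  }

  /** Good ordering: the ports are recorded before the register request is sent. */
  static List<String> startRatisThenRegister() {
    NodeDetails dn = new NodeDetails();
    Registry scm = new Registry();
    startRatis(dn);
    scm.register(dn);
    return scm.recordedPorts();
  }
}
```

In the first ordering the registry ends up with an empty port list; in the second it sees all three ports.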
The solution is:
If SCM HA is enabled, OzoneContainer#start() is called multiple times from VersionEndpointTask. The first call should do the initialization; subsequent calls should wait until OzoneContainer is initialized.
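The "first caller initializes, later callers wait" pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this patch: the class `OneTimeStarter` and its members are hypothetical, and it uses a CountDownLatch to block later callers, which may differ from how the patch tracks its initializing status.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical component whose start() may be called by several tasks. */
class OneTimeStarter {
  // Set to true by whichever caller wins the right to initialize.
  private final AtomicBoolean claimed = new AtomicBoolean(false);
  // Released once initialization has finished.
  private final CountDownLatch initialized = new CountDownLatch(1);
  // Counts how many times the initialization body ran (for the demo only).
  private final AtomicInteger initCount = new AtomicInteger();

  void start() throws InterruptedException {
    if (claimed.compareAndSet(false, true)) {
      try {
        // First caller: do the one-time initialization work here.
        initCount.incrementAndGet();
      } finally {
        // Simplification: waiters are released even if initialization throws.
        initialized.countDown();
      }
    } else {
      // Later callers: block until the first caller has finished.
      initialized.await();
    }
  }

  int initCount() {
    return initCount.get();
  }

  /** Calls start() from several threads and returns how often init ran. */
  static int demo(int callers) {
    OneTimeStarter starter = new OneTimeStarter();
    Thread[] threads = new Thread[callers];
    for (int i = 0; i < callers; i++) {
      threads[i] = new Thread(() -> {
        try {
          starter.start();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return starter.initCount();
  }
}
```

With three callers, as in the three-SCM case, `OneTimeStarter.demo(3)` runs the initialization body once, and no call to start() returns before that body has finished.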
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-5033
How was this patch tested?
CI and integration tests inside Tencent.