Skip to content

Conversation

@bharatviswa504
Copy link
Contributor

@bharatviswa504 bharatviswa504 commented Mar 8, 2021

What changes were proposed in this pull request?

This PR is to implement

  1. For primary SCM during init, fetch sub -ca certificate and CA Certificate.
  2. For bootstrap nodes during bootstrap phase use SCM security client to fetch sub-ca certificate and CA Certificate.
  3. For both primary and bootstrap nodes persist primary SCM node Id to version file.
  4. Make changes in OzoneManager to use listCA and return all CA certificates list in getServiceList. This list will be used in RatisClient truststore when talking to DN.
  5. Make Changes in Datanode to use listCA and return all CA certificates and use it during TLSClientConfig.
  6. During SCM startup, if this is an upgraded cluster just behave like before with only CA and start with DefaultProfile.
    If freshly installed cluster where init is performed, start rootCA and subCA. (Not implemented changes required upgrade from the old cluster to the new cluster and Ratis is enabled after upgrade)
  7. Enable MTLS between SCM HA nodes.
  8. DefaultCertificateClient changes:
    a. Add API to store rootCA Certificate.
    b. Expose API's to getRootCA Certificate and updateCAList which fetches list of CA and update it with in-memory.
  9. When DN and OM are startup, wait for listCA output size to be matched with config SCM node list size. This is to support when DN and OM are started before all quorum of SCM is started.
  10. Update Ozone-secure-om-ha to add secure SCM HA setup and run existing test suites.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4915

How was this patch tested?

Added docker test

@bharatviswa504
Copy link
Contributor Author

PR based on #1988

@bharatviswa504 bharatviswa504 changed the title HDDS-4915. Integrate CertClient. HDDS-4915. [SCM HA Security] Integrate CertClient. Mar 8, 2021
@bharatviswa504 bharatviswa504 force-pushed the HDDS-4915 branch 2 times, most recently from aafeead to b4b9e8c Compare March 10, 2021 06:07
Copy link
Contributor

@xiaoyuyao xiaoyuyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bharatviswa504 thanks for working on this. Can you separate the patch into smaller ones? For example
separate the bootstrap of the non-primary from this patch and focus only on the init of primary SCM.
Otherwise, the review will take longer time.

@GlenGeng-awx
Copy link
Contributor

Please hold on the merge util we finish merging master back to 2823. Thanks!

@bharatviswa504
Copy link
Contributor Author

@bharatviswa504 thanks for working on this. Can you separate the patch into smaller ones? For example
separate the bootstrap of the non-primary from this patch and focus only on the init of primary SCM.
Otherwise, the review will take longer time.

@xiaoyuyao I understand it is a big change, sorry for the trouble. Bootstrap changes will be less than 100 LOC, but if it is still needed I can split it into a new Jira.

Actually, I have planned to implement CertClient and Integration in two parts, later when integrating found issues and closed the other one. Let me see if I can split again, but 2nd PR integration one will be dependent on this. Can we also have a call for a review?

@bharatviswa504
Copy link
Contributor Author

I have split the changes into 2 Jiras HDDS-4897 and HDDS-4952.

Integration to code will do in this Jira, once the above Jiras are merged.

@xiaoyuyao
Copy link
Contributor

I have split the changes into 2 Jiras HDDS-4897 and HDDS-4952.

Integration to code will do in this Jira, once the above Jiras are merged.

Thanks for the efforts. I will look into those dependency JIRAs.

@bharatviswa504
Copy link
Contributor Author

bharatviswa504 commented Mar 17, 2021

@xiaoyuyao Addressed/replied review comments these are addressed in HDDS-4897 under PR #2041
Can we continue the review in #2041

I will update this PR to use all the Jira's which went in and integrate it.

@bharatviswa504
Copy link
Contributor Author

Re-opened pull request based on HDDS-4897.

@bharatviswa504
Copy link
Contributor Author

Rebased with latest HDDS-2823.

Copy link
Contributor

@GlenGeng-awx GlenGeng-awx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just some general comments, most of them are typo.

this.scmVersion = RPC.getProtocolVersion(ScmBlockLocationProtocolPB.class);

try {
this.ugi = UserGroupInformation.getCurrentUser();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need change this file ? The logic is unchanged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make use of UGI during the creation of FailOverProxyProvider, otherwise, we will use UGI during create proxy time which might not be correct one.
Without this change, we shall see this error.

om1_1        | 2021-03-23 05:59:25,420 [IPC Server handler 7 on default port 9862] WARN ipc.Client: Exception encountered while connecting to the server
om1_1        | javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
om1_1        | 	at jdk.security.jgss/com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
om1_1        | 	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
om1_1        | 	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
om1_1        | 	at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
om1_1        | 	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
om1_1        | 	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
om1_1        | 	at java.base/java.security.AccessController.doPrivileged(Native Method)
om1_1        | 	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
om1_1        | 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
om1_1        | 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
om1_1        | 	at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
om1_1        | 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
om1_1        | 	at org.apache.hadoop.ipc.Client.call(Client.java:1452)
om1_1        | 	at org.apache.hadoop.ipc.Client.call(Client.java:1405)
om1_1        | 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
om1_1        | 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
om1_1        | 	at com.sun.proxy.$Proxy32.send(Unknown Source)
om1_1        | 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
om1_1        | 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
om1_1        | 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
om1_1        | 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
om1_1        | 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
om1_1        | 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
om1_1        | 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
om1_1        | 	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
om1_1        | 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
om1_1        | 	at com.sun.proxy.$Proxy32.send(Unknown Source)
om1_1        | 	at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.submitRequest(ScmBlockLocationProtocolClientSideTranslatorPB.java:118)
om1_1        | 	at org.apache.hadoop.hdds.scm.protocolPB.ScmBlockLocationProtocolClientSideTranslatorPB.allocateBlock(ScmBlockLocationProtocolClientSideTranslatorPB.java:172)
om1_1        | 	at org.apache.hadoop.ozone.om.request.key.OMKeyRequest.allocateBlock(OMKeyRequest.java:128)
om1_1        | 	at org.apache.hadoop.ozone.om.request.key.OMKeyCreateRequest.preExecute(OMKeyCreateRequest.java:151)
om1_1        | 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:139)
om1_1        | 	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
om1_1        | 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:122)
om1_1        | 	at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
om1_1        | 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
om1_1        | 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
om1_1        | 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
om1_1        | 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
om1_1        | 	at java.base/java.security.AccessController.doPrivileged(Native Method)
om1_1        | 	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
om1_1        | 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
om1_1        | 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
om1_1        | Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
om1_1        | 	at java.security.jgss/sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:162)
om1_1        | 	at java.security.jgss/sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:126)
om1_1        | 	at java.security.jgss/sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:193)
om1_1        | 	at java.security.jgss/sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:218)
om1_1        | 	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:230)
om1_1        | 	at java.security.jgss/sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:196)
om1_1        | 	at jdk.security.jgss/com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
om1_1        | 	... 42 more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you enable the Debug log of UGI class to see which UGI is used if we don't cache the UGI at the time of provider creation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we wrap line 623 and 641 into separate helper, this seems to be a long function already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log message can be improved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, Let me know if that looks okay to you?

Copy link
Contributor

@xiaoyuyao xiaoyuyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bharatviswa504 for the update. A few more comments. Let's discuss offline tomorrow.

@bharatviswa504
Copy link
Contributor Author

bharatviswa504 commented Mar 31, 2021

Thank You @xiaoyuyao for the review. I have addressed review comments.

Let's discuss offline tomorrow.

Sure.

Copy link
Contributor

@xiaoyuyao xiaoyuyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bharatviswa504 for the update. The latest change LGTM, +1 pending CI.

@bharatviswa504 bharatviswa504 merged commit 5bb8dda into apache:master Apr 1, 2021
@bharatviswa504
Copy link
Contributor Author

Thank You @xiaoyuyao and @GlenGeng for the review.

errose28 added a commit to errose28/ozone that referenced this pull request Apr 7, 2021
* HDDS-3698-nonrolling-upgrade: (144 commits)
  fix project name in NOTICE.txt (apache#2112)
  HDDS-5066. Use fixed vesion from pnpm to build recon (apache#2115)
  HDDS-5014. Add non-rolling upgrade design docs.
  HDDS-5035. Use default config values to solve generated config file conflict (apache#2087)
  HDDS-5032. DN stopped to load containers on volume after a container load exception. (apache#2109)
  HDDS-4504. Datanode deletion config should be based on number of blocks (apache#1885)
  Fix ozone-ha acceptance test.
  HDDS-5058. Make getScmInfo retry for a duration.
  HDDS-4506. Support query parameter based v4 auth in S3g (apache#1628)
  HDDS-4553. ChunkInputStream should release buffer as soon as last byte in the buffer is read (apache#2062)
  HDDS-5022. SCM get roles command should provide Ratis Leader/Follower… (apache#2098)
  HDDS-5033. SCM may not be able to know full port list of Datanode after Datanode is started. (apache#2090)
  HDDS-3752. Fix o3fs list bucket contents issue when without tailing "/" (apache#2088)
  HDDS-4901. Remove OmOzoneAclMap from OmVolumeArgs to avoid OzoneAcl conversions (apache#1992)
  HDDS-4987. Import container should not delete container contents if container already exists (apache#2077)
  Checkstyle fix.
  Intialize DN layout version before security init.
  HDDS-4915. [SCM HA Security] Integrate CertClient. (apache#2000)
  HDDS-5049. Add timeout support for ratis requests in SCM HA. (apache#2099)
  trigger new CI check
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants