Skip to content

Conversation

@maobaolong
Copy link
Member

What changes were proposed in this pull request?

Container deleted wrong replica cause mis-replicated.

What is the link to the Apache JIRA

HDDS-4244

How was this patch tested?

  • Related config file
    • ozone-site.xml
  <property>
    <name>ozone.scm.container.placement.impl</name>
    <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware</value>
  </property>
  <property>
    <name>net.topology.node.switch.mapping.impl</name>
    <value>org.apache.hadoop.net.TableMapping</value>
  </property>
  <property>
    <name>net.topology.table.file.name</name>
    <value>/data/ozoneadmin/ozoneenv/ozone/etc/hadoop/network-config</value>
  </property>
  • network-config
192.168.1.100 /racks1
192.168.1.101 /racks1
192.168.1.102 /racks1
192.168.1.103 /racks2
192.168.1.104 /racks2
192.168.1.106 /racks2

First you should make the 3 replicas of the tested container on the same racks, like racks1.
secondly, change the config file update the request config.

2 ReplicationManager interval, you can use the following command to verify the container replicas are in the 2 racks.

ozone admin container info #xxx

@ChenSammi
Copy link
Contributor

@maobaolong , can we add a policy unsatisfied UT too?

@maobaolong
Copy link
Member Author

@ChenSammi Thank you for remind me, I add a testOverReplicatedAndPolicyUnSatisfied test case. PTAL.

@ChenSammi
Copy link
Contributor

LGTM +1.

@ChenSammi ChenSammi merged commit 570d34c into apache:master Sep 16, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Sep 18, 2020
* master: (47 commits)
  HDDS-4104. Provide a way to get the default value and key of java-based-configuration easily (apache#1369)
  HDDS-4250. Fix wrong logger name (apache#1429)
  HDDS-4244. Container deleted wrong replica cause mis-replicated. (apache#1423)
  HDDS-4053. Volume space: add quotaUsageInBytes and update it when write and delete key. (apache#1296)
  HDDS-4210. ResolveBucket during checkAcls fails. (apache#1398)
  HDDS-4075. Retry request on different OM on AccessControlException (apache#1303)
  HDDS-4166. Documentation index page redirects to the wrong address (apache#1372)
  HDDS-4039. Reduce the number of fields in hdds.proto to improve performance (apache#1289)
  HDDS-4155. Directory and filename can end up with same name in a path. (apache#1361)
  HDDS-3927. Rename Ozone OM,DN,SCM runtime options to conform to naming conventions (apache#1401)
  HDDS-4119. Improve performance of the BufferPool management of Ozone client (apache#1336)
  HDDS-4217.Remove test TestOzoneContainerRatis (apache#1408)
  HDDS-4218.Remove test TestRatisManager (apache#1409)
  HDDS-4129. change MAX_QUOTA_IN_BYTES to Long.MAX_VALUE. (apache#1337)
  HDDS-4228: add field 'num' to ALLOCATE_BLOCK of scm audit log. (apache#1413)
  HDDS-4196. Add an endpoint in Recon to query Prometheus (apache#1390)
  HDDS-4211. [OFS] Better owner and group display for listing Ozone volumes and buckets (apache#1397)
  HDDS-4150. recon.api.TestEndpoints test is flaky (apache#1396)
  HDDS-4170 - Fix typo in method description. (apache#1406)
  HDDS-4064. Show container verbose info with verbose option (apache#1290)
  ...
errose28 added a commit to errose28/ozone that referenced this pull request Sep 18, 2020
…ponse

* HDDS-4122-quota-attempt2: (51 commits)
  Remove redundant check status calls in children of AbstractOMOpenKeyDeleteRequest
  Remove unused inports and fix super constructor calls
  Move common volume byte usage update code to AbstractOMKeyDeleteResponse
  Add volume quota update to open key delete response, and group duplicate code
  HDDS-4104. Provide a way to get the default value and key of java-based-configuration easily (apache#1369)
  HDDS-4250. Fix wrong logger name (apache#1429)
  HDDS-4244. Container deleted wrong replica cause mis-replicated. (apache#1423)
  HDDS-4053. Volume space: add quotaUsageInBytes and update it when write and delete key. (apache#1296)
  HDDS-4210. ResolveBucket during checkAcls fails. (apache#1398)
  HDDS-4075. Retry request on different OM on AccessControlException (apache#1303)
  HDDS-4166. Documentation index page redirects to the wrong address (apache#1372)
  HDDS-4039. Reduce the number of fields in hdds.proto to improve performance (apache#1289)
  HDDS-4155. Directory and filename can end up with same name in a path. (apache#1361)
  HDDS-3927. Rename Ozone OM,DN,SCM runtime options to conform to naming conventions (apache#1401)
  HDDS-4119. Improve performance of the BufferPool management of Ozone client (apache#1336)
  HDDS-4217.Remove test TestOzoneContainerRatis (apache#1408)
  HDDS-4218.Remove test TestRatisManager (apache#1409)
  HDDS-4129. change MAX_QUOTA_IN_BYTES to Long.MAX_VALUE. (apache#1337)
  HDDS-4228: add field 'num' to ALLOCATE_BLOCK of scm audit log. (apache#1413)
  HDDS-4196. Add an endpoint in Recon to query Prometheus (apache#1390)
  ...
Comment on lines -675 to +676
if (misReplicated || !nowMisRep) {
// Remove the replica if the container was already mis-replicated
// OR if losing this replica does not make it become mis-replicated
ContainerPlacementStatus nowPS =
getPlacementStatus(replicaSet, replicationFactor);
if ((!ps.isPolicySatisfied()
&& nowPS.actualPlacementCount() == ps.actualPlacementCount())
|| (ps.isPolicySatisfied() && nowPS.isPolicySatisfied())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need such a complex condition check?
Won't fixing the previous check solve the problem?
if (misReplicated || !nowMisRep) to if (!misReplicated || nowMisRep)

@nandakumar131
Copy link
Contributor

How exactly does the new test cases verify that the ReplicationManager is considering Rack Awareness and picks the correct replica for deletion? As far as I can understand, the new test cases are checking if the delete command is sent or not. The test cases don't verify any kind of Rack Placement. Am I missing something here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants