From 74c257b02b9b3bbb2ec907eb7b22482152ae480a Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Thu, 29 May 2025 22:11:07 -0700 Subject: [PATCH 1/7] HDDS-13138. [Docs] Update Topology Awareness user doc. Change-Id: I66c09d82a3eb89edff3538c59fd53e131065cab8 --- hadoop-hdds/docs/content/feature/Topology.md | 254 +++++++++++++------ 1 file changed, 179 insertions(+), 75 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index 8a053be8db00..14d1b9056d38 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -23,86 +23,190 @@ summary: Configuration for rack-awarness for improved read/write limitations under the License. --> -Ozone can use topology related information (for example rack placement) to optimize read and write pipelines. To get full rack-aware cluster, Ozone requires three different configuration. - - 1. The topology information should be configured by Ozone. - 2. Topology related information should be used when Ozone chooses 3 different datanodes for a specific pipeline/container. (WRITE) - 3. When Ozone reads a Key it should prefer to read from the closest node. - - - -## Topology hierarchy - -Topology hierarchy can be configured with using `net.topology.node.switch.mapping.impl` configuration key. This configuration should define an implementation of the `org.apache.hadoop.net.CachedDNSToSwitchMapping`. As this is a Hadoop class, the configuration is exactly the same as the Hadoop Configuration - -### Static list - -Static list can be configured with the help of ```TableMapping```: - -```XML +Apache Ozone uses topology information (e.g., rack placement) to optimize data access and improve resilience. A fully rack-aware cluster needs: + +1. Configured network topology. +2. Topology-aware DataNode selection for container replica placement (write path). +3. Prioritized reads from topologically closest DataNodes (read path). 
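The "closest node" preference in item 3 rests on a network-distance notion like Hadoop's `NetworkTopology`: zero for the same node, two within a rack, four across racks. A minimal sketch of that idea (illustrative only, not Ozone's actual implementation):

```python
def network_distance(loc_a, loc_b):
    """Hops up to the closest common ancestor and back down,
    counting the leaf (node) level, as in Hadoop's NetworkTopology."""
    a = loc_a.strip("/").split("/")
    b = loc_b.strip("/").split("/")
    # Length of the common prefix of the two location paths.
    common = 0
    while common < min(len(a), len(b)) and a[common] == b[common]:
        common += 1
    return (len(a) - common) + (len(b) - common)

# Same rack -> 2, different racks -> 4, same node -> 0.
```

With this metric a reader can see why a client on `/rack1` prefers a replica on `/rack1/node2` (distance 2) over one on `/rack2/node3` (distance 4).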
+ +## Applicability to Container Types + +Ozone's topology-aware placement strategies vary by container replication type and state: + +* **RATIS Replicated Containers:** Ozone uses RAFT replication for Open containers (write), and an async replication for closed, immutable containers (cold data). As RAFT requires low-latency network, topology awareness placement is available only for closed containers. See the [page about Containers](concept/Containers.md) about more information related to Open vs Closed containers. +* **Erasure Coded (EC) Containers:** EC demands topology awareness from the initial write. For an EC key, OM allocates a block group of `$d+p$` distinct DataNodes selected by SCM's `ECPipelineProvider` to ensure rack diversity and fault tolerance. This topology-aware selection is integral to the EC write path for new blocks. \[2] + +## Configuring Topology Hierarchy + +Ozone determines DataNode network locations (e.g., racks) using Hadoop's rack awareness, configured via `net.topology.node.switch.mapping.impl` in `core-site.xml`. This key specifies a `org.apache.hadoop.net.CachedDNSToSwitchMapping` implementation. \[1] + +Two primary methods exist: + +### 1. Static List: `TableMapping` + +Maps IPs/hostnames to racks using a predefined file. + +* **Configuration:** Set `net.topology.node.switch.mapping.impl` to `org.apache.hadoop.net.TableMapping` and `net.topology.table.file.name` to the mapping file's path. \[1] + ```xml + + net.topology.node.switch.mapping.impl + org.apache.hadoop.net.TableMapping + + + net.topology.table.file.name + /etc/ozone/topology.map + + ``` +* **File Format:** A two-column text file (IP/hostname, rack path per line). Unlisted nodes go to `/default-rack`. \[1] + Example `topology.map`: + ``` + 192.168.1.100 /rack1 + datanode101.example.com /rack1 + 192.168.1.102 /rack2 + datanode103.example.com /rack2 + ``` + +### 2. Dynamic List: `ScriptBasedMapping` + +Uses an external script to resolve rack locations for IPs. 
+ +* **Configuration:** Set `net.topology.node.switch.mapping.impl` to `org.apache.hadoop.net.ScriptBasedMapping` and `net.topology.script.file.name` to the script's path. \[1] + ```xml + + net.topology.node.switch.mapping.impl + org.apache.hadoop.net.ScriptBasedMapping + + + net.topology.script.file.name + /etc/ozone/determine_rack.sh + + ``` +* **Script:** Admin-provided, executable script. Ozone passes IPs (up to `net.topology.script.number.args`, default 100) as arguments; script outputs rack paths (one per line). + Example `determine_rack.sh`: + ```bash + #!/bin/bash + # This is a simplified example. A real script might query a CMDB or use other logic. + while [ $# -gt 0 ] ; do + nodeAddress=$1 + if [[ "$nodeAddress" == "192.168.1.100" || "$nodeAddress" == "datanode101.example.com" ]]; then + echo "/rack1" + elif [[ "$nodeAddress" == "192.168.1.102" || "$nodeAddress" == "datanode103.example.com" ]]; then + echo "/rack2" + else + echo "/default-rack" + fi + shift + done + ``` + Ensure the script is executable (`chmod +x /etc/ozone/determine_rack.sh`). + +**Topology Mapping Best Practices:** + +* **Accuracy:** Mappings must be accurate and current. +* **Static Mapping:** Simpler for small, stable clusters; requires manual updates. +* **Dynamic Mapping:** Flexible for large/dynamic clusters. Script performance, correctness, and reliability are vital; ensure it's idempotent and handles batch lookups efficiently. + +## Container Placement Policies for Replicated (RATIS) Containers + +SCM uses a pluggable policy for placing additional replicas of *closed* RATIS-replicated containers, configured by `ozone.scm.container.placement.impl` in `ozone-site.xml`. Policies are in `org.apache.hadoop.hdds.scm.container.placement.algorithms`. \[1, 3] + +### 1. `SCMContainerPlacementRackAware` (Default) + +* **Function:** Distributes replicas across racks for fault tolerance (e.g., for 3 replicas, aims for at least two racks). Similar to HDFS placement. 
\[1] +* **Use Cases:** Production clusters needing rack-level fault tolerance. +* **Configuration:** + ```xml + + ozone.scm.container.placement.impl + org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware + + ``` +* **Best Practices:** Requires accurate topology mapping. +* **Limitations:** Designed for single-layer rack topologies (e.g., `/rack/node`). Not recommended for multi-layer hierarchies (e.g., `/dc/row/rack/node`) as it may not interpret deeper levels correctly. \[1] + +### 2. `SCMContainerPlacementRandom` + +* **Function:** Randomly selects healthy, available DataNodes meeting basic criteria (space, no existing replica), ignoring rack topology. \[1, 4] +* **Use Cases:** Small/dev/test clusters, or if rack fault tolerance for closed replicas isn't critical. +* **Configuration:** (Default) + ```xml + + ozone.scm.container.placement.impl + org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRandom + + ``` +* **Best Practices:** Not for production needing rack failure resilience. + +### 3. `SCMContainerPlacementCapacity` + +* **Function:** Selects DataNodes by available capacity (favors lower disk utilization) to balance disk usage. \[5, 6] +* **Use Cases:** Heterogeneous storage clusters or where even disk utilization is key. +* **Configuration:** + ```xml + + ozone.scm.container.placement.impl + org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementCapacity + + ``` +* **Best Practices:** Prevents uneven node filling. +* **Interaction:** Typically respects topology constraints first (like `SCMContainerPlacementRackAware` in simple topologies), then chooses by capacity among valid nodes. Verify this interaction for specific needs. + +## Topology Awareness and Placement for Erasure Coded (EC) Containers + +Erasure Coding (EC) offers storage efficiency with fault tolerance, featuring distinct, intrinsic placement requirements in its write path. 
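The rack-spread constraint described below (no rack may hold more chunks of a stripe than parity can rebuild) can be sketched as a small calculation; this is an illustration of the recoverability rule, not the `ECPipelineProvider` code:

```python
def spread_chunks(d, p, racks):
    """Distribute the d+p chunks of one EC stripe as evenly as
    possible across `racks` racks; returns chunks-per-rack counts."""
    total = d + p
    base, extra = divmod(total, racks)
    return [base + 1] * extra + [base] * (racks - extra)

def survives_one_rack_loss(d, p, racks):
    # Recoverable after losing any single rack iff no rack holds
    # more chunks than parity (p) can reconstruct.
    return max(spread_chunks(d, p, racks)) <= p

# RS(6,3) on 3 racks -> [3, 3, 3]: losing one rack loses 3 chunks,
# which parity can rebuild; on 2 racks -> [5, 4]: not recoverable.
```

Running this for RS(6,3) reproduces the RS(6,3) example given later: 3 racks is the minimum for single-rack fault tolerance, and 5 racks yields a 2/2/2/2/1 spread.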
+ +### Unique EC Placement Demands + +* **Inherent Rack Fault-Tolerance:** EC aims to spread `$d+p$` (data + parity) chunks of a block group across different racks for reconstruction after rack failures (up to `$p$` racks). \[2] +* **Minimum Rack Requirements:** The EC schema (e.g., RS-6-3-1024k) dictates rack diversity. Ideally, `$d+p$` distinct racks are needed for one chunk per rack. With fewer racks, the system maximizes spread, ensuring no rack holds more chunks from a stripe than parity can recover. Aim for at least `ceil((d+p)/p)` racks, preferably `$p+1$` or more. \[2] +* **Node Selection:** For EC writes, OM and SCM's `ECPipelineProvider` allocate `$d+p$` distinct DataNodes, enforcing topology for rack diversity. \[2] + +### EC Container Placement Logic (Intrinsic) + +EC placement is an **intrinsic behavior** driven by the bucket's EC schema (e.g., `rs-6-3-1024k`), not a swappable policy like for replicated containers (see JIRA HDDS-3816). \[2, 7, 8] + +* **Rack Diversity:** `ECPipelineProvider` logic places each of the `$d+p$` chunks on unique racks if possible. Otherwise, it distributes chunks evenly, ensuring no rack holds more chunks from a stripe than parity (`$p$`) can recover. +* **Node Selection Priorities:** Maximize rack/node diversity, respect rack density limits, select healthy/capacious DataNodes, and exclude unsuitable nodes. +* **Base Topology Reliance:** EC placement depends on accurate topology from `net.topology.node.switch.mapping.impl`. \[1] + +### Example: RS(6,3) EC Placement (9 chunks) + +* **Ideal (9+ Racks):** 1 chunk per rack; tolerates 3 rack failures. +* **Constrained (3 Racks):** 3 chunks per rack; losing 1 rack (3 chunks) is recoverable. +* **Intermediate (5 Racks):** Chunks spread (e.g., 2 chunks on 4 racks, 1 on another); tolerates losing any `$p` (3) racks. + +### EC Topology Best Practices + +* **Sufficient Racks:** Provision racks for EC schema fault tolerance (at least `$p+1$`, ideally `$d+p$`). 
\[2] +* **Even DataNode Distribution:** Distribute DataNodes evenly across racks. \[2] +* **Network & CPU:** Ensure robust network (especially inter-rack) and DataNode CPU for EC operations. \[2] + +## Optimizing Read Paths + +Enable by setting `ozone.network.topology.aware.read` to `true` in `ozone-site.xml`. \[1] +```xml - net.topology.node.switch.mapping.impl - org.apache.hadoop.net.TableMapping - - - net.topology.table.file.name - /opt/hadoop/compose/ozone-topology/network-config + ozone.network.topology.aware.read + true ``` +This directs clients (replicated data) and the system (EC reconstruction) to read from topologically closest DataNodes, reducing latency and cross-rack traffic. Recommended with accurate topology. -The second configuration option should point to a text file. The file format is a two column text file, with columns separated by whitespace. The first column is IP address and the second column specifies the rack where the address maps. If no entry corresponding to a host in the cluster is found, then `/default-rack` is assumed. - -### Dynamic list - -Rack information can be identified with the help of an external script: +## Summary of Best Practices - -```XML - - net.topology.node.switch.mapping.impl - org.apache.hadoop.net.ScriptBasedMapping - - - net.topology.script.file.name - /usr/local/bin/rack.sh - -``` - -If implementing an external script, it will be specified with the `net.topology.script.file.name` parameter in the configuration files. Unlike the java class, the external topology script is not included with the Ozone distribution and is provided by the administrator. Ozone will send multiple IP addresses to ARGV when forking the topology script. The number of IP addresses sent to the topology script is controlled with `net.topology.script.number.args` and defaults to 100. If `net.topology.script.number.args` was changed to 1, a topology script would get forked for each IP submitted. 
- -## Write path - -Placement of the closed containers can be configured with `ozone.scm.container.placement.impl` configuration key. The available container placement policies can be found in the `org.apache.hdds.scm.container.placement` [package](https://github.com/apache/ozone/tree/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms). - -By default the `SCMContainerPlacementRandom` is used for topology-awareness the `SCMContainerPlacementRackAware` can be used: - -```XML - - ozone.scm.container.placement.impl - org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware - -``` - -This placement policy complies with the algorithm used in HDFS. With default 3 replica, two replicas will be on the same rack, the third one will on a different rack. - -This implementation applies to network topology like "/rack/node". Don't recommend to use this if the network topology has more layers. - -## Read path - -Finally the read path also should be configured to read the data from the closest pipeline. - -```XML - - ozone.network.topology.aware.read - true - -``` +* **Accurate Topology:** Maintain an accurate, up-to-date topology map (static or dynamic script); this is foundational. +* **Replicated (RATIS) Containers:** For production rack fault tolerance, use `SCMContainerPlacementRackAware` (mindful of its single-layer topology limitation) or `SCMContainerPlacementCapacity` (verify rack interaction) over `SCMContainerPlacementRandom`. +* **Erasure Coded (EC) Containers:** Provision sufficient racks for your EC schema. Ensure adequate network/CPU resources and use native EC libraries. +* **Read Operations:** Enable `ozone.network.topology.aware.read` with accurate topology. +* **Monitor & Validate:** Regularly monitor placement and balance; use tools like Recon to verify topology awareness. 
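The "accurate topology" practice above often comes down to the mapping script. A Python variant of the earlier bash example, handling batch lookups and falling back to `/default-rack` like `TableMapping` does (the table here is a stand-in for a real inventory source such as a CMDB or mapping file):

```python
#!/usr/bin/env python3
"""Illustrative topology script: batch-resolves rack locations for
the IPs/hostnames Ozone passes as arguments, one rack path per line."""
import sys

# Hypothetical inventory; replace with a lookup against your real source.
RACK_TABLE = {
    "192.168.1.100": "/rack1",
    "datanode101.example.com": "/rack1",
    "192.168.1.102": "/rack2",
    "datanode103.example.com": "/rack2",
}

def resolve(addresses):
    # Unknown hosts fall back to /default-rack, mirroring TableMapping.
    return [RACK_TABLE.get(addr, "/default-rack") for addr in addresses]

if __name__ == "__main__":
    print("\n".join(resolve(sys.argv[1:])))
```

As with the bash version, the script must be executable, must emit exactly one rack path per input argument, and should stay fast since it is invoked for batches of up to `net.topology.script.number.args` addresses.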
## References - * Hadoop documentation about `net.topology.node.switch.mapping.impl`: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/RackAwareness.html - * [Design doc]({{< ref "design/topology.md">}}) \ No newline at end of file +1. Hadoop Documentation: Rack Awareness. (Covers `net.topology.node.switch.mapping.impl`, `SCMContainerPlacementRandom`, `SCMContainerPlacementRackAware`, `ozone.network.topology.aware.read`) - Refer to the official Apache Hadoop documentation on Rack Awareness. +2. Ozone Erasure Coding Design Documentation (JIRA HDDS-3816). Refer to the Apache Ozone JIRA issue HDDS-3816 and its associated design documents for authoritative details on EC implementation and placement. +3. Ozone Source Code: `org.apache.hadoop.hdds.scm.container.placement.algorithms` package. (For implementations of pluggable placement policies). +4. Ozone Source Code: `SCMContainerPlacementRandom.java`. +5. Ozone Source Code: `SCMContainerPlacementCapacity.java`. +6. Relevant JIRAs or design discussions for `SCMContainerPlacementCapacity` if more specific details on its behavior are needed. +7. Apache Ozone JIRA: [HDDS-3816](https://issues.apache.org/jira/browse/HDDS-3816) - Erasure Coding in Ozone. +8. Erasure Coding in Apache Hadoop Ozone (PDF linked from HDDS-3816 or Ozone documentation site). From 57acc1c017c49684ebcabc1af6fb3acc67ced416 Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Thu, 5 Jun 2025 17:11:05 -0700 Subject: [PATCH 2/7] Address Sammi's comments. 
Change-Id: Ieee0d4dc2d7748d09dccef38966be89e133ef590 --- hadoop-hdds/docs/content/feature/Topology.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index 14d1b9056d38..be098e00dc00 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -38,7 +38,7 @@ Ozone's topology-aware placement strategies vary by container replication type a ## Configuring Topology Hierarchy -Ozone determines DataNode network locations (e.g., racks) using Hadoop's rack awareness, configured via `net.topology.node.switch.mapping.impl` in `core-site.xml`. This key specifies a `org.apache.hadoop.net.CachedDNSToSwitchMapping` implementation. \[1] +Ozone determines DataNode network locations (e.g., racks) using Hadoop's rack awareness, configured via `net.topology.node.switch.mapping.impl` in `ozone-site.xml`. This key specifies a `org.apache.hadoop.net.CachedDNSToSwitchMapping` implementation. \[1] Two primary methods exist: @@ -128,7 +128,7 @@ SCM uses a pluggable policy for placing additional replicas of *closed* RATIS-re * **Function:** Randomly selects healthy, available DataNodes meeting basic criteria (space, no existing replica), ignoring rack topology. \[1, 4] * **Use Cases:** Small/dev/test clusters, or if rack fault tolerance for closed replicas isn't critical. -* **Configuration:** (Default) +* **Configuration:** ```xml ozone.scm.container.placement.impl @@ -149,7 +149,7 @@ SCM uses a pluggable policy for placing additional replicas of *closed* RATIS-re ``` * **Best Practices:** Prevents uneven node filling. -* **Interaction:** Typically respects topology constraints first (like `SCMContainerPlacementRackAware` in simple topologies), then chooses by capacity among valid nodes. Verify this interaction for specific needs. 
+* **Interaction:** This container placement policy selects datanodes by randomly picking two nodes from a pool of healthy, available nodes and then choosing the one with lower utilization (more free space). This approach aims to distribute containers more evenly across the cluster over time, favoring less utilized nodes without overwhelming newly added nodes. ## Topology Awareness and Placement for Erasure Coded (EC) Containers @@ -207,6 +207,5 @@ This directs clients (replicated data) and the system (EC reconstruction) to rea 3. Ozone Source Code: `org.apache.hadoop.hdds.scm.container.placement.algorithms` package. (For implementations of pluggable placement policies). 4. Ozone Source Code: `SCMContainerPlacementRandom.java`. 5. Ozone Source Code: `SCMContainerPlacementCapacity.java`. -6. Relevant JIRAs or design discussions for `SCMContainerPlacementCapacity` if more specific details on its behavior are needed. -7. Apache Ozone JIRA: [HDDS-3816](https://issues.apache.org/jira/browse/HDDS-3816) - Erasure Coding in Ozone. -8. Erasure Coding in Apache Hadoop Ozone (PDF linked from HDDS-3816 or Ozone documentation site). +6. Apache Ozone JIRA: [HDDS-3816](https://issues.apache.org/jira/browse/HDDS-3816) - Erasure Coding in Ozone. +7. Erasure Coding in Apache Hadoop Ozone (PDF linked from HDDS-3816 or Ozone documentation site). From 428002b825f85513d31819eb3b0d25e8daef525d Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Thu, 5 Jun 2025 17:18:12 -0700 Subject: [PATCH 3/7] AI updated. Change-Id: I91a0d3f4b379a535245a3acbeef7f274d944cf6c --- hadoop-hdds/docs/content/feature/Topology.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index be098e00dc00..41b2aa60bf30 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -100,6 +100,8 @@ Uses an external script to resolve rack locations for IPs. 
``` Ensure the script is executable (`chmod +x /etc/ozone/determine_rack.sh`). + **Note:** For production environments, implement robust error handling and validation in your script. This should include handling network timeouts, invalid inputs, CMDB query failures, and logging errors appropriately. The example above is simplified for illustration purposes only. + **Topology Mapping Best Practices:** * **Accuracy:** Mappings must be accurate and current. @@ -108,7 +110,9 @@ Uses an external script to resolve rack locations for IPs. ## Container Placement Policies for Replicated (RATIS) Containers -SCM uses a pluggable policy for placing additional replicas of *closed* RATIS-replicated containers, configured by `ozone.scm.container.placement.impl` in `ozone-site.xml`. Policies are in `org.apache.hadoop.hdds.scm.container.placement.algorithms`. \[1, 3] +SCM uses a pluggable policy to place additional replicas of *closed* RATIS-replicated containers. This is configured using the `ozone.scm.container.placement.impl` property in `ozone-site.xml`. Available policies are found in the `org.apache.hadoop.hdds.scm.container.placement.algorithms` package \[1, 3\]. + +These policies are applied when SCM needs to re-replicate containers, such as during container balancing. ### 1. `SCMContainerPlacementRackAware` (Default) From d67699984b48e10d27c8084942c325e5bb22488d Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Thu, 5 Jun 2025 17:31:59 -0700 Subject: [PATCH 4/7] Add pipeline choosing policies. Change-Id: Ib39a35bb4bf4abd564790620ec7fbe22df84b239 --- hadoop-hdds/docs/content/feature/Topology.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index 41b2aa60bf30..3bb8f54b16f0 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -108,6 +108,22 @@ Uses an external script to resolve rack locations for IPs. 
* **Static Mapping:** Simpler for small, stable clusters; requires manual updates. * **Dynamic Mapping:** Flexible for large/dynamic clusters. Script performance, correctness, and reliability are vital; ensure it's idempotent and handles batch lookups efficiently. +## Pipeline Choosing Policies + +Ozone supports several policies for selecting a pipeline when placing containers. The policy for Ratis containers is configured by the property `hdds.scm.pipeline.choose.policy.impl` for SCM. The policy for EC (Erasure Coded) containers is configured by the property `hdds.scm.ec.pipeline.choose.policy.impl`. For both, the default value is `org.apache.hadoop.hdds.scm.pipeline.choose.algorithms.RandomPipelineChoosePolicy`. + +These policies help optimize for different goals such as load balancing, health, or simplicity: + +- **RandomPipelineChoosePolicy** (Default): Selects a pipeline at random from the available list, without considering utilization or health. This policy is simple and does not optimize for any particular metric. + +- **CapacityPipelineChoosePolicy**: Picks two random pipelines and selects the one with lower utilization, favoring pipelines with more available capacity and helping to balance the load across the cluster. + +- **RoundRobinPipelineChoosePolicy**: Selects pipelines in a round-robin order. This policy is mainly used for debugging and testing, ensuring even distribution but not considering health or capacity. + +- **HealthyPipelineChoosePolicy**: Randomly selects pipelines but only returns a healthy one. If no healthy pipeline is found, it returns the last tried pipeline as a fallback. + +These policies can be configured to suit different deployment needs and workloads. + ## Container Placement Policies for Replicated (RATIS) Containers SCM uses a pluggable policy to place additional replicas of *closed* RATIS-replicated containers. This is configured using the `ozone.scm.container.placement.impl` property in `ozone-site.xml`. 
Available policies are found in the `org.apache.hadoop.hdds.scm.container.placement.algorithms` package \[1, 3\]. From 82a3a64ad3985ef70f3d52f26075350067e3c3b6 Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Thu, 5 Jun 2025 17:48:17 -0700 Subject: [PATCH 5/7] Update Change-Id: I84e6ff3130d8a133e7011729264c82ba99f46579 --- hadoop-hdds/docs/content/feature/Topology.md | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index 3bb8f54b16f0..7f31a851ac9a 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -146,7 +146,7 @@ These policies are applied when SCM needs to re-replicate containers, such as du ### 2. `SCMContainerPlacementRandom` -* **Function:** Randomly selects healthy, available DataNodes meeting basic criteria (space, no existing replica), ignoring rack topology. \[1, 4] +* **Function:** Randomly selects healthy, available DataNodes meeting basic criteria (space, no existing replica), ignoring rack topology. \[1, 4\] * **Use Cases:** Small/dev/test clusters, or if rack fault tolerance for closed replicas isn't critical. * **Configuration:** ```xml @@ -159,7 +159,7 @@ These policies are applied when SCM needs to re-replicate containers, such as du ### 3. `SCMContainerPlacementCapacity` -* **Function:** Selects DataNodes by available capacity (favors lower disk utilization) to balance disk usage. \[5, 6] +* **Function:** Selects DataNodes by available capacity (favors lower disk utilization) to balance disk usage. \[5, 6\] * **Use Cases:** Heterogeneous storage clusters or where even disk utilization is key. 
* **Configuration:** ```xml @@ -183,8 +183,6 @@ Erasure Coding (EC) offers storage efficiency with fault tolerance, featuring di ### EC Container Placement Logic (Intrinsic) -EC placement is an **intrinsic behavior** driven by the bucket's EC schema (e.g., `rs-6-3-1024k`), not a swappable policy like for replicated containers (see JIRA HDDS-3816). \[2, 7, 8] - * **Rack Diversity:** `ECPipelineProvider` logic places each of the `$d+p$` chunks on unique racks if possible. Otherwise, it distributes chunks evenly, ensuring no rack holds more chunks from a stripe than parity (`$p$`) can recover. * **Node Selection Priorities:** Maximize rack/node diversity, respect rack density limits, select healthy/capacious DataNodes, and exclude unsuitable nodes. * **Base Topology Reliance:** EC placement depends on accurate topology from `net.topology.node.switch.mapping.impl`. \[1] @@ -210,7 +208,7 @@ Enable by setting `ozone.network.topology.aware.read` to `true` in `ozone-site.x true ``` -This directs clients (replicated data) and the system (EC reconstruction) to read from topologically closest DataNodes, reducing latency and cross-rack traffic. Recommended with accurate topology. +This directs clients (replicated data) to read from topologically closest DataNodes, reducing latency and cross-rack traffic. Recommended with accurate topology. ## Summary of Best Practices @@ -222,10 +220,10 @@ This directs clients (replicated data) and the system (EC reconstruction) to rea ## References -1. Hadoop Documentation: Rack Awareness. (Covers `net.topology.node.switch.mapping.impl`, `SCMContainerPlacementRandom`, `SCMContainerPlacementRackAware`, `ozone.network.topology.aware.read`) - Refer to the official Apache Hadoop documentation on Rack Awareness. -2. Ozone Erasure Coding Design Documentation (JIRA HDDS-3816). Refer to the Apache Ozone JIRA issue HDDS-3816 and its associated design documents for authoritative details on EC implementation and placement. -3. 
Ozone Source Code: `org.apache.hadoop.hdds.scm.container.placement.algorithms` package. (For implementations of pluggable placement policies). -4. Ozone Source Code: `SCMContainerPlacementRandom.java`. -5. Ozone Source Code: `SCMContainerPlacementCapacity.java`. +1. [Hadoop Documentation: Rack Awareness](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html). +2. [Ozone Erasure Coding Design Documentation (JIRA HDDS-3816)](https://issues.apache.org/jira/browse/HDDS-3816). Refer to the Apache Ozone JIRA issue HDDS-3816 and its associated design documents for authoritative details on EC implementation and placement. +3. [Ozone Source Code: container placement policies](https://github.com/apache/ozone/tree/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms). (For implementations of pluggable placement policies). +4. [Ozone Source Code: SCMContainerPlacementRandom.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRandom.java). +5. [Ozone Source Code: SCMContainerPlacementCapacity.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementCapacity.java). 6. Apache Ozone JIRA: [HDDS-3816](https://issues.apache.org/jira/browse/HDDS-3816) - Erasure Coding in Ozone. -7. Erasure Coding in Apache Hadoop Ozone (PDF linked from HDDS-3816 or Ozone documentation site). +7. 
[Erasure Coding in Apache Hadoop Ozone (Design Doc)](https://ozone.apache.org/docs/edge/design/ec.html) From 1dc6f1f70e43ff45c7624e3008998e46abc308f7 Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Sat, 7 Jun 2025 00:52:21 -0700 Subject: [PATCH 6/7] Fix Change-Id: I55128582234d59f889b1a918cbba1f87d7d4a551 --- hadoop-hdds/docs/content/feature/Topology.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index 7f31a851ac9a..a6142047f9e2 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -33,7 +33,7 @@ Apache Ozone uses topology information (e.g., rack placement) to optimize data a Ozone's topology-aware placement strategies vary by container replication type and state: -* **RATIS Replicated Containers:** Ozone uses RAFT replication for Open containers (write), and an async replication for closed, immutable containers (cold data). As RAFT requires low-latency network, topology awareness placement is available only for closed containers. See the [page about Containers](concept/Containers.md) about more information related to Open vs Closed containers. +* **RATIS Replicated Containers:** Ozone uses RAFT replication for Open containers (write), and an async replication for closed, immutable containers (cold data). Topology awareness placement is implemented for both open and closed RATIS containers, ensuring rack diversity and fault tolerance during both write and re-replication operations. See the [page about Containers](concept/Containers.md) for more information related to Open vs Closed containers. * **Erasure Coded (EC) Containers:** EC demands topology awareness from the initial write. For an EC key, OM allocates a block group of `$d+p$` distinct DataNodes selected by SCM's `ECPipelineProvider` to ensure rack diversity and fault tolerance. 
This topology-aware selection is integral to the EC write path for new blocks. \[2] ## Configuring Topology Hierarchy From 409214b18e74e49485d86b5cea0af43b1100ed68 Mon Sep 17 00:00:00 2001 From: Wei-Chiu Chuang Date: Wed, 13 Aug 2025 18:51:22 -0700 Subject: [PATCH 7/7] HDDS-13138. [DOCS] UPDATE TOPOLOGY AWARENESS USER DOC. (#8528) Change-Id: I854ff3efcfba508f4ecae632c57ca99d578dff3c --- hadoop-hdds/docs/content/feature/Topology.md | 39 +++----------------- 1 file changed, 5 insertions(+), 34 deletions(-) diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md index a6142047f9e2..48b5d718ecfd 100644 --- a/hadoop-hdds/docs/content/feature/Topology.md +++ b/hadoop-hdds/docs/content/feature/Topology.md @@ -34,7 +34,7 @@ Apache Ozone uses topology information (e.g., rack placement) to optimize data a Ozone's topology-aware placement strategies vary by container replication type and state: * **RATIS Replicated Containers:** Ozone uses RAFT replication for Open containers (write), and an async replication for closed, immutable containers (cold data). Topology awareness placement is implemented for both open and closed RATIS containers, ensuring rack diversity and fault tolerance during both write and re-replication operations. See the [page about Containers](concept/Containers.md) for more information related to Open vs Closed containers. -* **Erasure Coded (EC) Containers:** EC demands topology awareness from the initial write. For an EC key, OM allocates a block group of `$d+p$` distinct DataNodes selected by SCM's `ECPipelineProvider` to ensure rack diversity and fault tolerance. This topology-aware selection is integral to the EC write path for new blocks. \[2] + ## Configuring Topology Hierarchy @@ -171,33 +171,7 @@ These policies are applied when SCM needs to re-replicate containers, such as du * **Best Practices:** Prevents uneven node filling. 
 * **Interaction:** This container placement policy selects datanodes by randomly picking two nodes from a pool of healthy, available nodes and then choosing the one with lower utilization (more free space). This approach aims to distribute containers more evenly across the cluster over time, favoring less utilized nodes without overwhelming newly added nodes.
 
-## Topology Awareness and Placement for Erasure Coded (EC) Containers
-
-Erasure Coding (EC) offers storage efficiency with fault tolerance, featuring distinct, intrinsic placement requirements in its write path.
-
-### Unique EC Placement Demands
-
-* **Inherent Rack Fault-Tolerance:** EC aims to spread `$d+p$` (data + parity) chunks of a block group across different racks for reconstruction after rack failures (up to `$p$` racks). \[2]
-* **Minimum Rack Requirements:** The EC schema (e.g., RS-6-3-1024k) dictates rack diversity. Ideally, `$d+p$` distinct racks are needed for one chunk per rack. With fewer racks, the system maximizes spread, ensuring no rack holds more chunks from a stripe than parity can recover. Aim for at least `ceil((d+p)/p)` racks, preferably `$p+1$` or more. \[2]
-* **Node Selection:** For EC writes, OM and SCM's `ECPipelineProvider` allocate `$d+p$` distinct DataNodes, enforcing topology for rack diversity. \[2]
-
-### EC Container Placement Logic (Intrinsic)
-* **Rack Diversity:** `ECPipelineProvider` logic places each of the `$d+p$` chunks on unique racks if possible. Otherwise, it distributes chunks evenly, ensuring no rack holds more chunks from a stripe than parity (`$p$`) can recover.
-* **Node Selection Priorities:** Maximize rack/node diversity, respect rack density limits, select healthy/capacious DataNodes, and exclude unsuitable nodes.
-* **Base Topology Reliance:** EC placement depends on accurate topology from `net.topology.node.switch.mapping.impl`. \[1]
-
-### Example: RS(6,3) EC Placement (9 chunks)
-
-* **Ideal (9+ Racks):** 1 chunk per rack; tolerates 3 rack failures.
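[Editorial note for this hunk: the "pick two random healthy nodes, keep the less utilized one" behavior described in the **Interaction** bullet above is the classic "power of two choices" strategy. A minimal illustrative sketch follows — Python pseudocode with hypothetical names, not Ozone's actual Java `SCMContainerPlacementCapacity` implementation.]

```python
import random

def choose_datanode(healthy_nodes, used_space):
    """Pick two distinct random healthy nodes; keep the less utilized one.

    Illustrative 'power of two choices' sketch; `healthy_nodes` is a list of
    node ids and `used_space` maps node id -> bytes used. Both names are
    hypothetical, not part of Ozone's API.
    """
    a, b = random.sample(healthy_nodes, 2)
    # Favoring the node with more free space spreads containers out over
    # time without dogpiling onto a newly added (empty) node.
    return a if used_space[a] <= used_space[b] else b
```

With only two candidates the less utilized node always wins; across many placements the bias toward emptier nodes evens out utilization gradually rather than all at once.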
-* **Constrained (3 Racks):** 3 chunks per rack; losing 1 rack (3 chunks) is recoverable.
-* **Intermediate (5 Racks):** Chunks spread (e.g., 2 chunks on 4 racks, 1 on another); tolerates losing any `$p` (3) racks.
-
-### EC Topology Best Practices
-
-* **Sufficient Racks:** Provision racks for EC schema fault tolerance (at least `$p+1$`, ideally `$d+p$`). \[2]
-* **Even DataNode Distribution:** Distribute DataNodes evenly across racks. \[2]
-* **Network & CPU:** Ensure robust network (especially inter-rack) and DataNode CPU for EC operations. \[2]
 
 ## Optimizing Read Paths
 
@@ -214,16 +188,13 @@ This directs clients (replicated data) to read from topologically closest DataNo
 
 * **Accurate Topology:** Maintain an accurate, up-to-date topology map (static or dynamic script); this is foundational.
 * **Replicated (RATIS) Containers:** For production rack fault tolerance, use `SCMContainerPlacementRackAware` (mindful of its single-layer topology limitation) or `SCMContainerPlacementCapacity` (verify rack interaction) over `SCMContainerPlacementRandom`.
-* **Erasure Coded (EC) Containers:** Provision sufficient racks for your EC schema. Ensure adequate network/CPU resources and use native EC libraries.
+
 * **Read Operations:** Enable `ozone.network.topology.aware.read` with accurate topology.
 * **Monitor & Validate:** Regularly monitor placement and balance; use tools like Recon to verify topology awareness.
 
 ## References
 
 1. [Hadoop Documentation: Rack Awareness](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html).
-2. [Ozone Erasure Coding Design Documentation (JIRA HDDS-3816)](https://issues.apache.org/jira/browse/HDDS-3816). Refer to the Apache Ozone JIRA issue HDDS-3816 and its associated design documents for authoritative details on EC implementation and placement.
-3. [Ozone Source Code: container placement policies](https://github.com/apache/ozone/tree/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms). (For implementations of pluggable placement policies).
-4. [Ozone Source Code: SCMContainerPlacementRandom.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRandom.java).
-5. [Ozone Source Code: SCMContainerPlacementCapacity.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementCapacity.java).
-6. Apache Ozone JIRA: [HDDS-3816](https://issues.apache.org/jira/browse/HDDS-3816) - Erasure Coding in Ozone.
-7. [Erasure Coding in Apache Hadoop Ozone (Design Doc)](https://ozone.apache.org/docs/edge/design/ec.html)
+2. [Ozone Source Code: container placement policies](https://github.com/apache/ozone/tree/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms). (For implementations of pluggable placement policies).
+3. [Ozone Source Code: SCMContainerPlacementRandom.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRandom.java).
+4. [Ozone Source Code: SCMContainerPlacementCapacity.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementCapacity.java).
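[Editorial note for this patch: the rack-spreading arithmetic behind the removed RS(6,3) examples — d+p chunks spread as evenly as possible over r racks, with loss recoverable while the failed racks hold at most p chunks — can be checked with a short sketch. This is illustrative Python with hypothetical helper names, not Ozone's `ECPipelineProvider` logic; the tolerance computed here is the worst case, assuming the most loaded racks fail first.]

```python
import math

def chunks_per_rack(data, parity, racks):
    """Spread the d+p chunks of one EC block group as evenly as possible."""
    total = data + parity
    racks = min(racks, total)  # more racks than chunks leaves some racks empty
    base, extra = divmod(total, racks)
    # `extra` racks carry one more chunk than the rest.
    return [base + 1] * extra + [base] * (racks - extra)

def worst_case_rack_failures(distribution, parity):
    """Count racks that can fail (heaviest first) while lost chunks <= p."""
    lost, failures = 0, 0
    for chunks in sorted(distribution, reverse=True):
        if lost + chunks > parity:
            break
        lost += chunks
        failures += 1
    return failures

def min_racks(data, parity):
    """ceil((d+p)/p): fewest racks so no rack exceeds p chunks per stripe."""
    return math.ceil((data + parity) / parity)
```

For RS(6,3) this reproduces the removed examples: with 9 racks each rack holds 1 chunk and any 3 rack failures lose only 3 chunks (= p, recoverable); with 3 racks each holds 3 chunks, so losing 1 rack is recoverable; and `min_racks(6, 3)` gives 3.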