diff --git a/hadoop-hdds/docs/content/feature/Topology.md b/hadoop-hdds/docs/content/feature/Topology.md
index 8a053be8db00..48b5d718ecfd 100644
--- a/hadoop-hdds/docs/content/feature/Topology.md
+++ b/hadoop-hdds/docs/content/feature/Topology.md
@@ -23,86 +23,178 @@ summary: Configuration for rack-awarness for improved read/write limitations

Apache Ozone uses topology information (e.g., rack placement) to optimize data access and improve resilience. A fully rack-aware cluster needs:

1. Configured network topology.
2. Topology-aware DataNode selection for container replica placement (write path).
3. Prioritized reads from topologically closest DataNodes (read path).

**RATIS Replicated Containers:** Ozone uses Raft replication for open containers (write path) and asynchronous replication for closed, immutable containers (cold data). Topology-aware placement is implemented for both open and closed RATIS containers, ensuring rack diversity and fault tolerance during both write and re-replication operations. See the [page about Containers](concept/Containers.md) for more information on open vs. closed containers.

## Configuring Topology Hierarchy

Ozone determines DataNode network locations (e.g., racks) using Hadoop's rack awareness, configured via `net.topology.node.switch.mapping.impl` in `ozone-site.xml`. This key specifies an implementation of `org.apache.hadoop.net.CachedDNSToSwitchMapping`. \[1]

Two primary methods exist:

### 1. Static List: `TableMapping`

Maps IP addresses or hostnames to racks using a predefined file.

* **Configuration:** Set `net.topology.node.switch.mapping.impl` to `org.apache.hadoop.net.TableMapping` and `net.topology.table.file.name` to the mapping file's path. \[1]
  ```xml
  <property>
    <name>net.topology.node.switch.mapping.impl</name>
    <value>org.apache.hadoop.net.TableMapping</value>
  </property>
  <property>
    <name>net.topology.table.file.name</name>
    <value>/etc/ozone/topology.map</value>
  </property>
  ```
* **File Format:** A two-column, whitespace-separated text file (IP address or hostname, then rack path, one entry per line). Nodes without an entry are assigned to `/default-rack`. \[1]
  Example `topology.map`:
  ```
  192.168.1.100 /rack1
  datanode101.example.com /rack1
  192.168.1.102 /rack2
  datanode103.example.com /rack2
  ```

### 2. Dynamic List: `ScriptBasedMapping`

Uses an external script to resolve rack locations for IP addresses.

* **Configuration:** Set `net.topology.node.switch.mapping.impl` to `org.apache.hadoop.net.ScriptBasedMapping` and `net.topology.script.file.name` to the script's path. \[1]
  ```xml
  <property>
    <name>net.topology.node.switch.mapping.impl</name>
    <value>org.apache.hadoop.net.ScriptBasedMapping</value>
  </property>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/ozone/determine_rack.sh</value>
  </property>
  ```
* **Script:** An executable script provided by the administrator; it is not shipped with the Ozone distribution. Ozone passes IP addresses as arguments (up to `net.topology.script.number.args` per invocation, default 100; setting it to 1 forks the script once per address), and the script prints one rack path per line, in the same order.
  Example `determine_rack.sh`:
  ```bash
  #!/bin/bash
  # This is a simplified example. A real script might query a CMDB or use other logic.
  while [ $# -gt 0 ] ; do
    nodeAddress=$1
    if [[ "$nodeAddress" == "192.168.1.100" || "$nodeAddress" == "datanode101.example.com" ]]; then
      echo "/rack1"
    elif [[ "$nodeAddress" == "192.168.1.102" || "$nodeAddress" == "datanode103.example.com" ]]; then
      echo "/rack2"
    else
      echo "/default-rack"
    fi
    shift
  done
  ```
  Ensure the script is executable (`chmod +x /etc/ozone/determine_rack.sh`).

**Note:** For production environments, implement robust error handling and validation in your script, including network timeouts, invalid inputs, CMDB or other lookup failures, and appropriate logging. The example above is simplified for illustration only.

**Topology Mapping Best Practices:**

* **Accuracy:** Mappings must be accurate and kept current.
* **Static Mapping:** Simpler for small, stable clusters; requires manual updates.
* **Dynamic Mapping:** Flexible for large or frequently changing clusters. Script performance, correctness, and reliability are vital; ensure the script is idempotent and handles batch lookups efficiently.

## Pipeline Choosing Policies

Ozone supports several policies for selecting a pipeline when placing containers. The policy for Ratis containers is configured on SCM via the property `hdds.scm.pipeline.choose.policy.impl`; the policy for EC (Erasure Coded) containers is configured via `hdds.scm.ec.pipeline.choose.policy.impl`. For both, the default value is `org.apache.hadoop.hdds.scm.pipeline.choose.algorithms.RandomPipelineChoosePolicy`.

These policies help optimize for different goals such as load balancing, health, or simplicity:

- **RandomPipelineChoosePolicy** (default): Selects a pipeline at random from the available list, without considering utilization or health. Simple, and does not optimize for any particular metric.

- **CapacityPipelineChoosePolicy**: Picks two random pipelines and selects the one with lower utilization, favoring pipelines with more available capacity and helping to balance load across the cluster.

- **RoundRobinPipelineChoosePolicy**: Selects pipelines in round-robin order. Mainly used for debugging and testing; ensures even distribution but does not consider health or capacity.

- **HealthyPipelineChoosePolicy**: Randomly selects pipelines but only returns a healthy one. If no healthy pipeline is found, it returns the last tried pipeline as a fallback.

These policies can be configured to suit different deployment needs and workloads; a configuration sketch follows below.
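For illustration, here is a minimal `ozone-site.xml` sketch that overrides both properties. It assumes the non-default policy classes live in the same `org.apache.hadoop.hdds.scm.pipeline.choose.algorithms` package as the default; verify the fully qualified class names against your Ozone version.

```xml
<!-- Sketch only: package names for the non-default policies are assumed. -->
<property>
  <name>hdds.scm.pipeline.choose.policy.impl</name>
  <!-- Prefer less utilized Ratis pipelines. -->
  <value>org.apache.hadoop.hdds.scm.pipeline.choose.algorithms.CapacityPipelineChoosePolicy</value>
</property>
<property>
  <name>hdds.scm.ec.pipeline.choose.policy.impl</name>
  <!-- Keep the default random selection for EC pipelines. -->
  <value>org.apache.hadoop.hdds.scm.pipeline.choose.algorithms.RandomPipelineChoosePolicy</value>
</property>
```

As with most SCM-side properties, changes typically take effect after an SCM restart.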
## Container Placement Policies for Replicated (RATIS) Containers

SCM uses a pluggable policy to place additional replicas of *closed* RATIS-replicated containers. This is configured using the `ozone.scm.container.placement.impl` property in `ozone-site.xml`. The available policies live in the `org.apache.hadoop.hdds.scm.container.placement.algorithms` package. \[2]

These policies are applied when SCM needs to re-replicate containers, for example during container balancing.

### 1. `SCMContainerPlacementRackAware` (Default)

* **Function:** Distributes replicas across racks for fault tolerance. With the default replication factor of 3, two replicas are placed on one rack and the third on a different rack, similar to HDFS placement. \[1]
* **Use Cases:** Production clusters needing rack-level fault tolerance.
* **Configuration:**
  ```xml
  <property>
    <name>ozone.scm.container.placement.impl</name>
    <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware</value>
  </property>
  ```
* **Best Practices:** Requires accurate topology mapping.
* **Limitations:** Designed for single-layer rack topologies (e.g., `/rack/node`); see the sketch after this subsection. Not recommended for multi-layer hierarchies (e.g., `/dc/row/rack/node`), which it may not interpret correctly.
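To make the single-layer limitation concrete, here is a hypothetical mapping-file sketch in the same format as the `topology.map` example above (the third address and the `/dc1/row1/rack3` location are illustrative, not taken from this guide). The first two entries use the single-layer form this policy expects; the last uses a deeper hierarchy it is not designed to handle.

```
192.168.1.100  /rack1
192.168.1.102  /rack2
192.168.1.104  /dc1/row1/rack3
```

With `SCMContainerPlacementRackAware`, keep all locations in the single-layer form.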
### 2. `SCMContainerPlacementRandom`

* **Function:** Randomly selects healthy, available DataNodes that meet basic criteria (enough space, no existing replica), ignoring rack topology. \[3]
* **Use Cases:** Small, development, or test clusters, or deployments where rack-level fault tolerance for closed replicas is not critical.
* **Configuration:**
  ```xml
  <property>
    <name>ozone.scm.container.placement.impl</name>
    <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRandom</value>
  </property>
  ```
* **Best Practices:** Not suitable for production clusters that need resilience to rack failures.

### 3. `SCMContainerPlacementCapacity`

* **Function:** Selects DataNodes by available capacity (favoring lower disk utilization) to balance disk usage. \[4]
* **Use Cases:** Clusters with heterogeneous storage, or wherever even disk utilization is a priority.
* **Configuration:**
  ```xml
  <property>
    <name>ozone.scm.container.placement.impl</name>
    <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementCapacity</value>
  </property>
  ```
* **Best Practices:** Helps prevent some nodes from filling up much faster than others.
* **Selection Algorithm:** The policy randomly picks two nodes from the pool of healthy, available nodes and chooses the one with lower utilization (more free space). Over time this distributes containers more evenly across the cluster, favoring less utilized nodes without overwhelming newly added ones.

## Optimizing Read Paths

Enable topology-aware reads by setting `ozone.network.topology.aware.read` to `true` in `ozone-site.xml`:

```xml
<property>
  <name>ozone.network.topology.aware.read</name>
  <value>true</value>
</property>
```

When enabled, clients reading replicated data prefer the topologically closest DataNode, reducing latency and cross-rack traffic. Recommended whenever accurate topology information is configured.

## Summary of Best Practices

* **Accurate Topology:** Maintain an accurate, up-to-date topology mapping (static file or dynamic script); everything else builds on it.
* **Replicated (RATIS) Containers:** For production rack fault tolerance, prefer `SCMContainerPlacementRackAware` (mindful of its single-layer topology limitation) or `SCMContainerPlacementCapacity` (verify how it interacts with your rack topology) over `SCMContainerPlacementRandom`.
* **Read Operations:** Enable `ozone.network.topology.aware.read` together with accurate topology information.
* **Monitor & Validate:** Regularly monitor container placement and balance; use tools like Recon to verify topology awareness.

## References

1. [Hadoop documentation: Rack Awareness](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html)
2. [Ozone source code: container placement policy implementations](https://github.com/apache/ozone/tree/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms)
3. [Ozone source code: SCMContainerPlacementRandom.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementRandom.java)
4. [Ozone source code: SCMContainerPlacementCapacity.java](https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/algorithms/SCMContainerPlacementCapacity.java)