From 2cc1267f3a682af134ae8c9762d599f19228b2dd Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Sun, 6 Jul 2025 08:37:55 -0700
Subject: [PATCH 1/4] HDDS-13396. Documentation: Improve the top-level overview
 page for new users.

The previous overview page was technically accurate but could be intimidating
for newcomers. This change revamps the page to be more welcoming and
informative by:
- Providing a clear, benefit-oriented introduction.
- Adding a comparison table to HDFS and S3 for better context.
- Highlighting a single, recommended quick-start path (Docker).
- Listing concrete use cases to illustrate practical applications.

Change-Id: Id7e9d5db7415e043d2a7ea33de6fd9afea21b932
---
 hadoop-hdds/docs/content/_index.md | 66 ++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 12 deletions(-)

diff --git a/hadoop-hdds/docs/content/_index.md b/hadoop-hdds/docs/content/_index.md
index f65c1527f47d..b2bba6283a2b 100644
--- a/hadoop-hdds/docs/content/_index.md
+++ b/hadoop-hdds/docs/content/_index.md
@@ -1,6 +1,6 @@
 ---
 name: Ozone
-title: Overview
+title: An Introduction to Apache Ozone
 menu: main
 weight: -10
 ---
@@ -21,21 +21,63 @@ weight: -10
 limitations under the License.
 -->

-# Apache Ozone
+# An Introduction to Apache Ozone
+
+**Apache Ozone is a highly scalable, distributed object store built for big data applications.** It can store billions of objects, both large and small, and is designed to run effectively in on-premise and containerized environments like Kubernetes.
+
+Think of it as a private, on-premise storage platform that speaks the S3 protocol, while also offering native support for the Hadoop ecosystem.
+
 {{
 }}

-*_Ozone is a scalable, redundant, and distributed object store for Big data workloads.
-Apart from scaling to billions of objects of varying sizes,
-Ozone can function effectively in containerized environments
-like Kubernetes._*
+---
+
+### Why Use Ozone?
+
+* **Massive Scalability:** Ozone's architecture separates namespace management from block management, allowing it to scale to billions of objects without the limitations found in traditional filesystems.
+* **S3-Compatible:** Use the vast ecosystem of S3 tools, SDKs, and applications you already know. Ozone's S3 Gateway provides a compatible REST interface.
+* **Hadoop Ecosystem Native:** Applications like Apache Spark, Hive, and YARN can use Ozone as their storage backend without any modifications, making it a powerful replacement or complement to HDFS.
+* **Cloud-Native Ready:** Ozone is designed to be deployed and managed in containerized environments like Kubernetes, supporting modern, cloud-native data architectures.
+
+---
+
+### How It Compares
+
+| Feature | Apache Ozone | HDFS (Hadoop Distributed File System) | Amazon S3 |
+| :--- | :--- | :--- | :--- |
+| **Type** | Distributed Object Store | Distributed File System | Cloud Object Store |
+| **Best For** | Billions of mixed-size files, cloud-native apps, data lakes. | Very large files (hundreds of megabytes and above), streaming data access. | Fully managed cloud storage, web applications, backups. |
+| **API** | S3-compatible, Hadoop FS | Hadoop FS | S3 API |
+| **Deployment** | On-premise, private cloud, Kubernetes | On-premise, private cloud | Public Cloud (AWS) |
+| **Namespace** | Multiple volumes | Single rooted filesystem (`/`) | Global bucket namespace |
+
+---
+
+### Getting Started: Your First Cluster in 5 Minutes
+
+The fastest way to experience Ozone is with Docker. This single command will set up a complete, multi-node pseudo-cluster on your local machine.
+
+**[➡️ Quick Start with Docker]({{< ref "start/StartFromDockerHub.md" >}})**
+
+This is the recommended path for first-time users. For other deployment options, including Kubernetes and bare-metal, see the full [Getting Started Guide]({{< ref "start/_index.md" >}}).
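
The Docker path recommended above boils down to a couple of shell commands. A sketch for illustration only: the `apache/ozone` image name, the 9878 (S3 gateway) and 9876 (SCM web UI) ports, and the compose extraction step follow the linked Docker Hub quick-start page and may differ between releases.

```bash
# Single-container cluster for a quick first look (assumes Docker is installed).
docker run -p 9878:9878 -p 9876:9876 apache/ozone

# Multi-node pseudo-cluster: the image ships a compose file that can be
# extracted and scaled out locally (file names may vary by release).
docker run apache/ozone cat docker-compose.yaml > docker-compose.yaml
docker run apache/ozone cat docker-config > docker-config
docker compose up -d --scale datanode=3
```
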

+
+---
+
+### Common Use Cases
+
+Ozone is versatile. Here are a few ways it's commonly used:
+
+* **Analytics & Data Lakes:** Store vast amounts of structured and unstructured data and run queries directly with Spark, Hive, or Presto.
+* **Machine Learning Backend:** Use Ozone as a central repository for training datasets, models, and experiment logs, accessed via S3 APIs from frameworks like TensorFlow or PyTorch.
+* **Cloud-Native Application Storage:** Provide persistent, scalable storage for stateful applications running on Kubernetes using the Ozone CSI driver.
+
+---

-Applications like Apache Spark, Hive and YARN, work without any modifications when using Ozone. Ozone comes with a [Java client library]({{}}), [S3 protocol support]({{< ref "S3.md" >}}), and a [command line interface]({{< ref "Cli.md" >}}) which makes it easy to use Ozone.
+### Core Concepts: Volumes, Buckets, and Keys

-Ozone consists of volumes, buckets, and keys:
+Ozone organizes data in a simple three-level hierarchy:

-* Volumes are similar to user accounts. Only administrators can create or delete volumes.
-* Buckets are similar to directories. A bucket can contain any number of keys, but buckets cannot contain other buckets.
-* Keys are similar to files.
+* **Volumes:** The top-level organizational unit, similar to a user account or a top-level project folder. Volumes are created by administrators.
+* **Buckets:** Reside within volumes and are similar to directories. A bucket can contain any number of keys.
+* **Keys:** The actual objects you store, analogous to files.

-Check out the [Getting Started](start/) guide to dive right in and learn how to run Ozone on your machine or in the cloud.
+This structure provides a flexible way to manage data for multiple tenants and use cases within a single cluster.
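
The volume/bucket/key hierarchy described above maps directly onto the Ozone shell. A minimal sketch, assuming a running cluster and the `ozone` client on the PATH; the names are placeholders and defaults vary by release.

```bash
# A volume (admin-level), a bucket inside it, and a key inside the bucket.
ozone sh volume create /vol1
ozone sh bucket create /vol1/bucket1
ozone sh key put /vol1/bucket1/hello.txt ./hello.txt

# Read the object back and list the bucket's contents.
ozone sh key get /vol1/bucket1/hello.txt /tmp/hello-copy.txt
ozone sh key list /vol1/bucket1
```
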

From 663ac4b3b6b040627fdfcf83059da56f3ce14e2d Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Mon, 7 Jul 2025 09:57:19 -0700
Subject: [PATCH 2/4] Backport Overview page from v2 site.

Change-Id: Ia5979b1cbc73131d430c2c00d1a3721d792d348b
---
 hadoop-hdds/docs/content/_index.md | 110 +++++++++++++++++++----------
 1 file changed, 73 insertions(+), 37 deletions(-)

diff --git a/hadoop-hdds/docs/content/_index.md b/hadoop-hdds/docs/content/_index.md
index b2bba6283a2b..a1c8f60baef9 100644
--- a/hadoop-hdds/docs/content/_index.md
+++ b/hadoop-hdds/docs/content/_index.md
@@ -21,63 +21,99 @@ weight: -10
 limitations under the License.
 -->

-# An Introduction to Apache Ozone
+## What is Apache Ozone?

-**Apache Ozone is a highly scalable, distributed object store built for big data applications.** It can store billions of objects, both large and small, and is designed to run effectively in on-premise and containerized environments like Kubernetes.
+Apache Ozone is a scalable, distributed object store designed for lakehouse workloads,
+AI/ML, and cloud-native applications.
+Originating in the big data analytics ecosystem, it handles both small and large files,
+supporting deployments of up to billions of objects and exabytes of capacity.
+Ozone provides strong consistency guarantees,
+multiple protocol interfaces (including S3 compatibility), and configurable durability options.

-Think of it as a private, on-premise storage platform that speaks the S3 protocol, while also offering native support for the Hadoop ecosystem.
+## What does it do?

-{{
-}}
+Ozone includes features relevant to large-scale storage requirements:

---
-### Why Use Ozone?
+### Scale

-* **Massive Scalability:** Ozone's architecture separates namespace management from block management, allowing it to scale to billions of objects without the limitations found in traditional filesystems.
+Ozone's architecture separates metadata management from data storage. The Ozone Manager (OM) and Storage Container Manager (SCM) handle metadata operations, while Datanodes manage the physical storage of data blocks. This design allows for independent scaling of these components and supports incremental cluster growth.
-* **S3-Compatible:** Use the vast ecosystem of S3 tools, SDKs, and applications you already know. Ozone's S3 Gateway provides a compatible REST interface.
-* **Hadoop Ecosystem Native:** Applications like Apache Spark, Hive, and YARN can use Ozone as their storage backend without any modifications, making it a powerful replacement or complement to HDFS.
-* **Cloud-Native Ready:** Ozone is designed to be deployed and managed in containerized environments like Kubernetes, supporting modern, cloud-native data architectures.

+### Flexible Durability

---
+Ozone offers configurable data durability options per bucket or per object:
+* **Replication (RATIS):** Uses 3-way replication via the [Ratis (Raft)](https://ratis.apache.org) consensus protocol for high availability.
+* **Erasure Coding (EC):** Supports various EC codecs (e.g., Reed-Solomon) to reduce storage overhead compared to replication while maintaining specified durability levels.
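
To make the per-bucket durability choice concrete, one bucket can be created with the default Ratis replication and another with erasure coding. The `--type`/`--replication` flags and the `rs-3-2-1024k` codec string below are assumptions based on recent Ozone releases; verify with `ozone sh bucket create --help` on your version.

```bash
# Default bucket: 3-way Ratis replication.
ozone sh bucket create /vol1/replicated-bucket

# Erasure-coded bucket: Reed-Solomon, 3 data + 2 parity blocks, 1 MiB chunks
# (assumed flag names and codec string; check your release's documentation).
ozone sh bucket create --type EC --replication rs-3-2-1024k /vol1/ec-bucket
```
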

-### How It Compares
+### Secure

-| Feature | Apache Ozone | HDFS (Hadoop Distributed File System) | Amazon S3 |
-| :--- | :--- | :--- | :--- |
-| **Type** | Distributed Object Store | Distributed File System | Cloud Object Store |
-| **Best For** | Billions of mixed-size files, cloud-native apps, data lakes. | Very large files (hundreds of megabytes and above), streaming data access. | Fully managed cloud storage, web applications, backups. |
-| **API** | S3-compatible, Hadoop FS | Hadoop FS | S3 API |
-| **Deployment** | On-premise, private cloud, Kubernetes | On-premise, private cloud | Public Cloud (AWS) |
-| **Namespace** | Multiple volumes | Single rooted filesystem (`/`) | Global bucket namespace |
+Security features are integrated at multiple layers:
+* **Authentication:** Supports Kerberos integration for user and service authentication.
+* **Authorization:** Provides Access Control Lists (ACLs) for managing permissions at the volume, bucket, and key levels. Supports Apache Ranger integration for centralized policy management.
+* **Encryption:** Supports TLS/SSL for data in transit and Transparent Data Encryption (TDE) for data at rest.
+* **Tokens:** Uses delegation tokens and block tokens for access control in distributed operations.

---
+### Performance

-### Getting Started: Your First Cluster in 5 Minutes
+Ozone's design considers performance for different access patterns:
+* **Throughput:** Intended for streaming reads and writes of large files. Data can be served directly from Datanodes after initial metadata lookup.
+* **Latency:** Metadata operations are managed by OM and SCM, designed for low-latency access.
+* **Small File Handling:** Includes mechanisms for managing metadata and storage for large quantities of small files.

-The fastest way to experience Ozone is with Docker. This single command will set up a complete, multi-node pseudo-cluster on your local machine.
+### Multiple Protocols

-**[➡️ Quick Start with Docker]({{< ref "start/StartFromDockerHub.md" >}})**
+Applications can access data stored in Ozone through several interfaces:
+* **S3 Protocol:** Provides an S3-compatible REST API, allowing use with S3-native applications and tools.
+* **Hadoop Compatible File System (OFS):** Offers the `ofs://` scheme for integration with Hadoop ecosystem tools (e.g., Iceberg, Spark, Hive, Flink, MapReduce).
+* **Native Java Client API:** A client library for Java applications.
+* **Command Line Interface (CLI):** Provides tools for administrative tasks and data interaction.
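
As an illustration of the first two interfaces, the same cluster can be reached with standard S3 tooling through the S3 Gateway and with Hadoop tooling through `ofs://`. The sketch assumes an unsecured test cluster, the gateway on its default port 9878, and an Ozone Manager host named `om`; endpoints and credentials are placeholders.

```bash
# S3 protocol: placeholder credentials are accepted when security is off.
export AWS_ACCESS_KEY_ID=testuser
export AWS_SECRET_ACCESS_KEY=testsecret
aws s3api --endpoint-url http://localhost:9878 create-bucket --bucket demo
aws s3 --endpoint-url http://localhost:9878 cp ./hello.txt s3://demo/hello.txt

# Hadoop Compatible File System: the ofs:// scheme addresses
# <om-host>/<volume>/<bucket>/<key>.
ozone fs -ls ofs://om/vol1/bucket1/
```
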
-This is the recommended path for first-time users. For other deployment options, including Kubernetes and bare-metal, see the full [Getting Started Guide]({{< ref "start/_index.md" >}}).
+### Efficient Storage Use

---
+Ozone includes features aimed at optimizing storage utilization:
+* **Erasure Coding:** Can reduce the physical storage footprint compared to 3x replication.
+* **Small File Handling:** Manages metadata and block allocation for small files.
+* **Containerization:** Groups data blocks into larger Storage Containers, which can simplify management and disk I/O.

-### Common Use Cases
+### Storage Management

-Ozone is versatile. Here are a few ways it's commonly used:
+Ozone uses a hierarchical namespace and provides management tools:
+* **Namespace:** Organizes data into Volumes (often mapped to tenants) and Buckets (containers for objects), which hold Keys (objects/files).
+* **Quotas:** Administrators can set storage quotas at the Volume and Bucket levels.
+* **Snapshots:** Supports point-in-time, read-only snapshots of buckets for data protection and versioning.
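
The quota and snapshot features above are likewise driven from the Ozone shell. A hedged sketch: the subcommand and flag names follow recent releases and may differ in older ones.

```bash
# Cap a volume at 100 GB of space and 10000 keys (assumed flag names).
ozone sh volume setquota --space-quota 100GB --namespace-quota 10000 /vol1

# Point-in-time, read-only snapshots of a bucket.
ozone sh snapshot create /vol1/bucket1 before-upgrade
ozone sh snapshot list /vol1/bucket1
```
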
-* **Analytics & Data Lakes:** Store vast amounts of structured and unstructured data and run queries directly with Spark, Hive, or Presto.
-* **Machine Learning Backend:** Use Ozone as a central repository for training datasets, models, and experiment logs, accessed via S3 APIs from frameworks like TensorFlow or PyTorch.
-* **Cloud-Native Application Storage:** Provide persistent, scalable storage for stateful applications running on Kubernetes using the Ozone CSI driver.

+### Strong Consistency

---
+Ozone provides strong consistency for metadata and data operations. Reads reflect the results of the latest successfully completed write operations.

+## Key Characteristics

+The design of Ozone leads to certain characteristics relevant for large-scale data management:

+### Storage Costs

+Factors influencing storage costs include:
+* **Storage Efficiency:** Erasure Coding can reduce physical storage requirements.
+* **Hardware:** Designed to run on commodity hardware.
+* **Licensing:** Apache Ozone is open-source software under the Apache License 2.0.
+* **Scalability:** Clusters can be expanded by adding nodes or racks. Data rebalancing mechanisms help manage utilization.

+### Operations

+Aspects related to storage administration include:
+* **Unified Storage:** Can potentially serve as a common storage layer for different types of workloads.
+* **Management Tools:** Includes the Recon web UI for monitoring and CLI tools for administration.
+* **Maintenance:** Supports features like rolling upgrades, node decommissioning, and data balancing.

+### Hybrid Cloud Scenarios

-### Core Concepts: Volumes, Buckets, and Keys
+Ozone's S3 compatibility allows applications developed for S3 to run on-premises using Ozone.
+This can be relevant for hybrid cloud strategies or migrating workloads between on-premises and cloud environments.

-Ozone organizes data in a simple three-level hierarchy:
+## Dive Deeper

-* **Volumes:** The top-level organizational unit, similar to a user account or a top-level project folder. Volumes are created by administrators.
-* **Buckets:** Reside within volumes and are similar to directories. A bucket can contain any number of keys.
-* **Keys:** The actual objects you store, analogous to files.
+To learn more about Ozone, refer to the following sections:

-This structure provides a flexible way to manage data for multiple tenants and use cases within a single cluster.
+* **New to Ozone?** Try the **[Quick Start Guide](./02-quick-start/README.mdx)** to set up a cluster.
+* **Want to understand the internals?** Read about the **[Core Concepts](./03-core-concepts/README.mdx)** (architecture, replication, security).
+* **Need to use Ozone?** Check the **[User Guide](./04-user-guide/README.mdx)** for client interfaces and integrations.
+* **Managing a cluster?** Consult the **[Administrator Guide](./05-administrator-guide/README.mdx)** for installation, configuration, and operations.
+* **Running into issues?** The **[Troubleshooting Guide](./06-troubleshooting/README.mdx)** may provide assistance.

From 19ee2a2274964500ceab79bfd2be9f1837a14e85 Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Mon, 14 Jul 2025 08:07:57 -0700
Subject: [PATCH 3/4] docs(hdds): Fix broken links in Dive Deeper section

Change-Id: I2cf096c49648f90c5f3bacf528c91a295f77c921
---
 hadoop-hdds/docs/content/_index.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/hadoop-hdds/docs/content/_index.md b/hadoop-hdds/docs/content/_index.md
index a1c8f60baef9..98bd13c02741 100644
--- a/hadoop-hdds/docs/content/_index.md
+++ b/hadoop-hdds/docs/content/_index.md
@@ -112,8 +112,7 @@ Ozone's S3 compatibility allows applications developed for S3 to run on-premises
 To learn more about Ozone, refer to the following sections:

-* **New to Ozone?** Try the **[Quick Start Guide](./02-quick-start/README.mdx)** to set up a cluster.
-* **Want to understand the internals?** Read about the **[Core Concepts](./03-core-concepts/README.mdx)** (architecture, replication, security).
-* **Need to use Ozone?** Check the **[User Guide](./04-user-guide/README.mdx)** for client interfaces and integrations.
-* **Managing a cluster?** Consult the **[Administrator Guide](./05-administrator-guide/README.mdx)** for installation, configuration, and operations.
-* **Running into issues?** The **[Troubleshooting Guide](./06-troubleshooting/README.mdx)** may provide assistance.
+* **New to Ozone?** Try the **[Quick Start Guide](../start)** to set up a cluster.
+* **Want to understand the internals?** Read about the **[Core Concepts](../concept)** (architecture, replication, security).
+* **Need to use Ozone?** Check the **[User Guide](../interface)** for client interfaces and integrations.
+* **Managing a cluster?** Consult the **[Administrator Guide](../tools)** for installation, configuration, and operations.

From b805973221943263343bbd2df5d7e5361c288fdb Mon Sep 17 00:00:00 2001
From: Wei-Chiu Chuang
Date: Tue, 15 Jul 2025 12:05:26 -0700
Subject: [PATCH 4/4] Update hadoop-hdds/docs/content/_index.md

Co-authored-by: Tejaskriya <87555809+Tejaskriya@users.noreply.github.com>
---
 hadoop-hdds/docs/content/_index.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/hadoop-hdds/docs/content/_index.md b/hadoop-hdds/docs/content/_index.md
index 98bd13c02741..fa50b0518aa7 100644
--- a/hadoop-hdds/docs/content/_index.md
+++ b/hadoop-hdds/docs/content/_index.md
@@ -112,7 +112,7 @@ Ozone's S3 compatibility allows applications developed for S3 to run on-premises
 To learn more about Ozone, refer to the following sections:

-* **New to Ozone?** Try the **[Quick Start Guide](../start)** to set up a cluster.
-* **Want to understand the internals?** Read about the **[Core Concepts](../concept)** (architecture, replication, security).
-* **Need to use Ozone?** Check the **[User Guide](../interface)** for client interfaces and integrations.
-* **Managing a cluster?** Consult the **[Administrator Guide](../tools)** for installation, configuration, and operations.
+* **New to Ozone?** Try the **[Quick Start Guide]({{< ref "start" >}})** to set up a cluster.
+* **Want to understand the internals?** Read about the **[Core Concepts]({{< ref "concept" >}})** (architecture, replication, security).
+* **Need to use Ozone?** Check the **[User Guide]({{< ref "interface" >}})** for client interfaces and integrations.
+* **Managing a cluster?** Consult the **[Administrator Guide]({{< ref "tools" >}})** for installation, configuration, and operations.