92 changes: 92 additions & 0 deletions docs/internals/storage-light/readme.md
# Storage Light Module

## ZBus

The storage light module is available on zbus over the same channel as the full storage module:

| module | object | version |
|--------|--------|---------|
| storage | [storage](#interface) | 0.0.1 |

## Introduction

`storage_light` is a lightweight variant of the [storage module](../storage/readme.md). It implements the same `StorageModule` interface and provides identical functionality to consumers, but has enhanced device initialization logic designed for nodes with pre-partitioned disks.

Both modules are interchangeable at the zbus level — other modules access storage via the same `StorageModuleStub` regardless of which variant is running.

## Differences from Storage

The key difference is in the **device initialization** phase during boot. The standard storage module treats each whole disk as a single btrfs pool. The light variant adds:

### 1. Partition-Aware Initialization

Instead of requiring whole disks, `storage_light` can work with individual partitions:

- Detects if a disk is already partitioned (has child partitions)
- Scans for unallocated space on partitioned disks using `parted`
- Creates new partitions in free space (minimum 5 GiB) for btrfs pools
- Refreshes device info after partition table changes

This allows ZOS to coexist with other operating systems or PXE boot partitions on the same disk.
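As an illustration of the scanning step, the machine-readable output of `parted -m <dev> unit B print free` can be parsed into candidate free regions. This is a sketch only: the `parseFreeRegions` helper and `FreeRegion` type are hypothetical names, and the real module may drive `parted` differently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FreeRegion describes an unallocated span on a partitioned disk.
type FreeRegion struct {
	Start, End, Size uint64 // bytes
}

const minPartitionSize = 5 << 30 // 5 GiB, the module's minimum

// parseFreeRegions scans `parted -m <dev> unit B print free` output and
// returns free regions large enough to host a new btrfs partition.
func parseFreeRegions(partedOut string) []FreeRegion {
	var regions []FreeRegion
	for _, line := range strings.Split(partedOut, "\n") {
		line = strings.TrimSuffix(strings.TrimSpace(line), ";")
		fields := strings.Split(line, ":")
		// free-space rows report "free" in the filesystem column
		if len(fields) < 5 || fields[4] != "free" {
			continue
		}
		start, err1 := strconv.ParseUint(strings.TrimSuffix(fields[1], "B"), 10, 64)
		end, err2 := strconv.ParseUint(strings.TrimSuffix(fields[2], "B"), 10, 64)
		size, err3 := strconv.ParseUint(strings.TrimSuffix(fields[3], "B"), 10, 64)
		if err1 != nil || err2 != nil || err3 != nil || size < minPartitionSize {
			continue
		}
		regions = append(regions, FreeRegion{Start: start, End: end, Size: size})
	}
	return regions
}

func main() {
	out := `BYT;
/dev/sda:500107862016B:scsi:512:512:gpt:ATA disk:;
1:1048576B:10737418240B:10736369664B:btrfs::;
1:10737418240B:500107845632B:489370427392B:free;`
	for _, r := range parseFreeRegions(out) {
		fmt.Printf("free: %d GiB at offset %d\n", r.Size>>30, r.Start)
	}
}
```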

### 2. PXE Partition Detection

Partitions labeled `ZOSPXE` are automatically skipped during initialization. This prevents the storage module from claiming boot partitions used for PXE network booting.

### 3. Enhanced Device Manager

The filesystem subpackage in `storage_light` extends the device manager with:

- `Children []DeviceInfo` field on `DeviceInfo` to track child partitions
- `UUID` field for btrfs filesystem identification
- `IsPartitioned()` method to check if a disk has child partitions
- `IsPXEPartition()` method to detect PXE boot partitions
- `GetUnallocatedSpaces()` method using `parted` to find free disk space
- `AllocateEmptySpace()` method to create partitions in free space
- `RefreshDeviceInfo()` method to reload device info after changes
- `ClearCache()` on the device manager interface for refreshing the device list
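A minimal sketch of the extended `DeviceInfo` shape, limited to the fields and methods listed above (the real struct carries more fields):

```go
package main

import "fmt"

// DeviceInfo mirrors only the additions described above; the full struct
// in the real package has more fields.
type DeviceInfo struct {
	Path     string
	Label    string
	UUID     string       // btrfs filesystem UUID
	Children []DeviceInfo // child partitions, empty for whole disks
}

// IsPartitioned reports whether the disk carries child partitions.
func (d *DeviceInfo) IsPartitioned() bool {
	return len(d.Children) > 0
}

// IsPXEPartition reports whether this partition is reserved for PXE boot.
func (d *DeviceInfo) IsPXEPartition() bool {
	return d.Label == "ZOSPXE"
}

func main() {
	disk := DeviceInfo{
		Path: "/dev/sda",
		Children: []DeviceInfo{
			{Path: "/dev/sda1", Label: "ZOSPXE"},
			{Path: "/dev/sda2", Label: "pool-a"},
		},
	}
	fmt.Println(disk.IsPartitioned())              // true
	fmt.Println(disk.Children[0].IsPXEPartition()) // true
}
```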

## Initialization Flow

The boot process is similar to the standard storage module but with added partition handling:

1. Load kernel parameters (detect VM, check MissingSSD)
2. Scan devices via DeviceManager
3. For each device:
- **If whole disk (not partitioned)**: Create btrfs pool on the entire device (same as standard)
- **If partitioned**:
- Skip partitions labeled `ZOSPXE`
- Process existing partitions that have btrfs filesystems
- Scan for unallocated space using `parted`
- Create new partitions in free space >= 5 GiB
- Create btrfs pools on new partitions
- Mount pool, detect device type (SSD/HDD)
- Add to SSD or HDD pool arrays
4. Ensure cache exists (create if needed, start monitoring)
5. Shut down unused HDD pools
6. Start periodic disk power management
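The per-device branching in step 3 can be sketched as a planning function. Type and function names here are illustrative, not the module's API:

```go
package main

import "fmt"

// Device is a minimal stand-in for the real device-manager type.
type Device struct {
	Path     string
	Label    string
	FSType   string
	Children []Device
}

// planDevice decides what the light variant does with one device during
// boot, mirroring step 3 of the flow above.
func planDevice(d Device) []string {
	if len(d.Children) == 0 {
		// whole disk: same behavior as the standard module
		return []string{"create btrfs pool on whole disk " + d.Path}
	}
	var plan []string
	for _, p := range d.Children {
		switch {
		case p.Label == "ZOSPXE":
			plan = append(plan, "skip PXE partition "+p.Path)
		case p.FSType == "btrfs":
			plan = append(plan, "mount existing pool on "+p.Path)
		}
	}
	// free space >= 5 GiB would additionally be partitioned here
	plan = append(plan, "scan "+d.Path+" for unallocated space")
	return plan
}

func main() {
	dev := Device{Path: "/dev/sda", Children: []Device{
		{Path: "/dev/sda1", Label: "ZOSPXE"},
		{Path: "/dev/sda2", FSType: "btrfs"},
	}}
	for _, step := range planDevice(dev) {
		fmt.Println(step)
	}
}
```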

## When to Use Storage Light

Use `storage_light` instead of `storage` when:

- The node has disks with existing partition tables that must be preserved
- PXE boot partitions exist on the same disks
- The node dual-boots or shares disks with other systems
- Disks have been partially allocated and have free space that should be used

## Architecture

The overall architecture (pool types, mount points, cache management, volume/disk/device operations) is identical to the [standard storage module](../storage/readme.md). Refer to that document for details on:

- Pool organization (SSD vs HDD)
- Storage primitives (subvolumes, vdisks, devices)
- Cache management and auto-sizing
- Pool selection policies
- Error handling and broken device tracking
- Thread safety
- The `StorageModule` interface definition

## Interface

Same as the [standard storage module](../storage/readme.md#interface). Both variants implement the same `StorageModule` interface defined in `pkg/storage.go`.
122 changes: 100 additions & 22 deletions docs/internals/storage/readme.md
Storage module is available on zbus over the following channel

## Introduction

This module is responsible for managing everything related to storage. On start, storaged takes ownership of all node disks and separates them into two sets:

- **SSD pools**: One btrfs pool per SSD disk. Used for subvolumes (read-write layers), virtual disks (VM storage), and system cache.
- **HDD pools**: One btrfs pool per HDD disk. Used exclusively for 0-DB device allocation.

The module provides three storage primitives:

- **Subvolume** (with quota): A btrfs subvolume used by `flistd` to support read-write operations on flists. Used as rootfs for containers and VMs. Only created on SSD pools.
- On boot, a permanent subvolume `zos-cache` is always created (starting at 5 GiB) and bind-mounted at `/var/cache`. This volume holds system state and downloaded file caches.
- **VDisk** (virtual disk): A sparse file with Copy-on-Write disabled (`FS_NOCOW_FL`), used as block storage for virtual machines. Only created on SSD pools inside a `vdisks` subvolume.
- **Device**: A btrfs subvolume named `zdb` inside an HDD pool, allocated to a single 0-DB service. One 0-DB instance can serve multiple namespaces for multiple users. Only created on HDD pools.

ZOS can operate without HDDs (it will not serve ZDB workloads), but not without SSDs. A node with no SSD will never register on the grid.

## Architecture

### Pool Organization

```
Physical Disk (SSD) Physical Disk (HDD)
| |
v v
btrfs pool (mounted at btrfs pool (mounted at
/mnt/<label>) /mnt/<label>)
| |
+-- zos-cache (subvolume) +-- zdb (subvolume -> 0-DB device)
+-- <workload> (subvolume)
+-- vdisks/ (subvolume)
+-- <vm-disk> (sparse file)
```

### Device Type Detection

The module determines whether a disk is SSD or HDD using:
1. A `.seektime` file persisted at the pool root (survives reboots)
2. Fallback to the `seektime` tool or device rotational flag from lsblk
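The fallback path can be sketched as a mapping from the kernel's rotational flag (the value of `/sys/block/<dev>/queue/rotational`, surfaced by lsblk as `ROTA`) to a device type; `typeFromRotational` is an illustrative helper, not the module's function:

```go
package main

import (
	"fmt"
	"strings"
)

// DeviceType mirrors the two pool classes used by storaged.
type DeviceType string

const (
	SSDDevice DeviceType = "ssd"
	HDDDevice DeviceType = "hdd"
)

// typeFromRotational maps the kernel's rotational flag to a device type.
// This is the fallback path; the module prefers a persisted .seektime
// measurement when one exists at the pool root.
func typeFromRotational(flag string) DeviceType {
	if strings.TrimSpace(flag) == "1" {
		return HDDDevice // rotational -> spinning disk
	}
	return SSDDevice
}

func main() {
	fmt.Println(typeFromRotational("0\n")) // ssd
	fmt.Println(typeFromRotational("1\n")) // hdd
}
```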

### Mount Points

| Resource | Path |
|----------|------|
| Pools | `/mnt/<pool-label>` |
| Cache | `/var/cache` (bind mount to `zos-cache` subvolume) |
| Volumes | `/mnt/<pool-label>/<volume-name>` |
| VDisks | `/mnt/<pool-label>/vdisks/<disk-id>` |
| Devices (0-DB) | `/mnt/<pool-label>/zdb` |

## On Node Booting

When the module boots:

1. Scans all available block devices using `lsblk`
2. For each device not already used by a pool, creates a new btrfs filesystem (all pools use `RaidSingle` policy)
3. Mounts all available pools
4. Detects device type (SSD/HDD) for each pool
5. Ensures a cache subvolume exists. If none is found, creates one on an SSD pool and bind-mounts it at `/var/cache`. Falls back to tmpfs if no SSD is available (sets `LimitedCache` flag)
6. Starts cache monitoring goroutine (checks every 5 minutes, auto-grows at 60% utilization, shrinks below 20%)
7. Shuts down and spins down unused HDD pools to save power
8. Starts periodic disk power management

### zinit unit

The zinit unit file specifies the command line, test command, and boot ordering.

The storage module is a dependency for almost all other system modules, so zinit gives it high boot precedence (calculated at boot from the unit configuration).

The storage module is considered running if, and only if, `/var/cache` is ready:

```yaml
exec: storaged
test: mountpoint /var/cache
```

## Cache Management

The system cache is a special btrfs subvolume (`zos-cache`) that stores persistent system state and downloaded files.

| Parameter | Value |
|-----------|-------|
| Initial size | 5 GiB |
| Check interval | 5 minutes |
| Grow threshold | 60% utilization |
| Shrink threshold | 20% utilization |
| Fallback | tmpfs (if no SSD available) |
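The monitor's grow/shrink decision follows directly from the thresholds above. `cacheAction` is a hypothetical helper; the real monitor also computes the new quota size and never shrinks below the initial 5 GiB:

```go
package main

import "fmt"

// cacheAction decides whether zos-cache should grow, shrink, or stay,
// using the thresholds from the table above. usedPct is the current
// utilization of the subvolume quota.
func cacheAction(usedPct float64) string {
	switch {
	case usedPct >= 60:
		return "grow"
	case usedPct < 20:
		return "shrink"
	default:
		return "keep"
	}
}

func main() {
	for _, p := range []float64{10, 45, 75} {
		fmt.Printf("%.0f%% used -> %s\n", p, cacheAction(p))
	}
}
```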

## Pool Selection Policies

When creating volumes or disks, the module selects a pool using one of these policies:

- **SSD Only**: Only considers SSD pools (used for volumes and vdisks)
- **HDD Only**: Only considers HDD pools (used for 0-DB device allocation)
- **SSD First**: Prefers SSD pools, falls back to HDD

Mounted pools are always prioritized over unmounted ones to avoid unnecessary spin-ups.
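A sketch of the "SSD first" ordering, ranking SSDs over HDDs and mounted pools over unmounted ones. This is an illustrative ranking, not the module's exact algorithm:

```go
package main

import (
	"fmt"
	"sort"
)

// Pool is a minimal stand-in for the module's pool type.
type Pool struct {
	Name    string
	Type    string // "ssd" or "hdd"
	Mounted bool
}

// filterPools orders candidates for the "SSD first" policy: SSDs beat
// HDDs, and mounted pools beat unmounted ones (avoiding spin-ups).
func filterPools(pools []Pool) []Pool {
	rank := func(p Pool) int {
		r := 0
		if p.Type == "ssd" {
			r += 2
		}
		if p.Mounted {
			r++
		}
		return r
	}
	out := append([]Pool(nil), pools...)
	sort.SliceStable(out, func(i, j int) bool { return rank(out[i]) > rank(out[j]) })
	return out
}

func main() {
	pools := []Pool{
		{Name: "hdd-1", Type: "hdd", Mounted: true},
		{Name: "ssd-1", Type: "ssd", Mounted: false},
		{Name: "ssd-2", Type: "ssd", Mounted: true},
	}
	fmt.Println(filterPools(pools)[0].Name) // ssd-2
}
```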

## Error Handling

The module tracks two categories of failures:

- **Broken Pools**: Pools that fail to mount. Tracked and reported via `BrokenPools()`.
- **Broken Devices**: Devices that fail formatting, mounting, or type detection. Tracked and reported via `BrokenDevices()`.

These are exposed through the interface for monitoring and diagnostics.

## Thread Safety

All pool and volume operations are protected by a `sync.RWMutex`. Concurrent reads (lookups, listings) are allowed, while writes (create, delete, resize) are serialized.
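The pattern can be sketched as follows (type and field names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// volumeStore shows the locking discipline described above: one RWMutex
// lets many lookups proceed concurrently while create/delete/resize take
// the exclusive lock.
type volumeStore struct {
	mu      sync.RWMutex
	volumes map[string]uint64 // name -> size in bytes
}

func (s *volumeStore) Lookup(name string) (uint64, bool) {
	s.mu.RLock() // shared lock: concurrent readers allowed
	defer s.mu.RUnlock()
	size, ok := s.volumes[name]
	return size, ok
}

func (s *volumeStore) Create(name string, size uint64) {
	s.mu.Lock() // exclusive lock: writers are serialized
	defer s.mu.Unlock()
	s.volumes[name] = size
}

func main() {
	s := &volumeStore{volumes: map[string]uint64{}}
	s.Create("zos-cache", 5<<30)
	size, _ := s.Lookup("zos-cache")
	fmt.Println(size >> 30) // 5
}
```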

## Consumers

Other modules access storage via zbus stubs:

| Consumer | Operations Used |
|----------|----------------|
| VM provisioner (`pkg/primitives/vm/`) | DiskCreate, DiskFormat, DiskWrite, DiskDelete |
| Volume provisioner (`pkg/primitives/volume/`) | VolumeCreate, VolumeDelete, VolumeLookup |
| ZMount provisioner (`pkg/primitives/zmount/`) | VolumeCreate, VolumeUpdate, VolumeDelete |
| ZDB provisioner (`pkg/primitives/zdb/`) | DeviceAllocate, DeviceLookup |
| Capacity oracle (`pkg/capacity/`) | Total, Metrics |

## Interface

```go
// StorageModule is the storage subsystem interface
// this should allow you to work with the following types of storage medium
// - full disks (device) (these are used by zdb)
// ...
```