183 changes: 158 additions & 25 deletions docs/internals/vmd/readme.md
@@ -2,28 +2,33 @@

## ZBus

Storage module is available on zbus over the following channel
VMD module is available on zbus over the following channel

| module | object | version |
|--------|--------|---------|
| vmd|[vmd](#interface)| 0.0.1|
| vmd | [vmd](#interface) | 0.0.1 |

## Home Directory

contd keeps some data in the following locations
| directory | path|
|----|---|
| root| `/var/cache/modules/containerd`|
vmd keeps data in the following locations:

| directory | path |
|-----------|------|
| root | `/var/cache/modules/vmd` |
| config | `{root}/config/` — one JSON file per VM |
| logs | `{root}/logs/` — stdout/stderr per VM |
| cloud-init | `{root}/cloud-init/` — fat32 images per VM |
| sockets | `/var/run/cloud-hypervisor/` — unix API socket per VM |

## Introduction

The vmd module, manages all virtual machines processes, it provide the interface to, create, inspect, and delete virtual machines. It also monitor the vms to make sure they are re-spawned if crashed. Internally it uses `cloud-hypervisor` to start the Vm processes.
The vmd module manages all virtual machine processes. It provides the interface to create, inspect, pause, resume, and delete virtual machines. It monitors VMs and re-spawns them if they crash. Internally it uses [cloud-hypervisor](https://www.cloudhypervisor.org/) to run VM processes.

It also provide the interface to configure VM logs streamers.
It also provides the interface to configure VM log streamers via zinit-managed `tailstream` services.

### zinit unit

`contd` must run after containerd is running, and the node boot process is complete. Since it doesn't keep state, no dependency on `stroaged` is needed
`vmd` must run after the boot process and networking are ready. Since it doesn't keep state on disk (config is regenerated by the provision engine on boot), no dependency on `storaged` is needed.

```yaml
exec: vmd --broker unix:///var/run/redis.sock
@@ -32,25 +37,153 @@ after:
- networkd
```

## Architecture

```
VMModule interface (pkg/vm.go)
|
v
Module (pkg/vm/manager.go)
|
+-- Run()
| +-- cloudinit.CreateImage() → fat32 disk image
| +-- Machine.Save() → JSON config
| +-- Machine.Run() → cloud-hypervisor process
| +-- startFs() × N → virtiofsd-rs daemons (virtio-fs shares)
| +-- exec cloud-hypervisor via busybox setsid
| +-- waitAndAdjOom() → OOM protection (-200)
| +-- startCloudConsole → cloud-console process (serial PTY)
|
+-- Monitor() goroutine
| +-- health check every 10s → restart crashed VMs (up to 4 times)
| +-- log rotation every 10m → 8 MB max, tail 4 MB
| +-- cloud-init cleanup every 10m
|
+-- Delete() → graceful shutdown → SIGTERM → SIGKILL
+-- Inspect() → cloud-hypervisor REST API (unix socket)
+-- Lock() → pause/resume via CH API
+-- Metrics() → /sys/class/net/.../statistics/
+-- StreamCreate/StreamDelete() → zinit service + tailstream
```

## VM Types

### Container VM vs Full VM

The module supports two boot modes determined by the flist content:

- **Container VM** (flist without `/image.raw`): The flist is mounted as a read-write overlay using a btrfs subvolume. A cloud-container kernel + initrd are injected. The root filesystem is shared via virtio-fs with tag `vroot`. Kernel args are set to `root=vroot rootfstype=virtiofs`.

- **Full VM** (flist with `/image.raw`): The disk image is written to the first ZMount. The VM boots directly from disk using `hypervisor-fw` firmware. No virtio-fs root is needed.

### Networking

Network interfaces are attached as tap devices:

| Tap prefix | Traffic type | Examples |
|------------|-------------|---------|
| `t-` | Private (wireguard, mycelium) | Private network, planetary, mycelium |
| `p-` | Public | Public IPv4/IPv6 |

Each interface is configured via cloud-init with static IP addresses, routes, and gateways. A `cloud-console` process is launched for the private network interface, providing serial console access over the network.

### Storage

Disks are attached via virtio block devices (`--disk` flag):
- Boot disk (full VM mode): first disk, read-write
- Additional zmount disks: sequential virtio devices (`/dev/vda`, `/dev/vdb`, ...)
- Cloud-init disk: last disk, read-only

Shared directories use virtio-fs (`--fs` flag). Each share runs a dedicated `virtiofsd-rs` daemon. In container mode, disks and shared dirs are mounted via cloud-init fstab entries.
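
The disk ordering described above can be illustrated with a small sketch. It assumes that every zmount becomes one read-write disk and that the cloud-init image is always appended last as read-only; the `Disk` type, `buildDisks`, `deviceName`, and the example paths are hypothetical.

```go
package main

import "fmt"

// Disk is a hypothetical description of one virtio block device.
type Disk struct {
	Path     string
	ReadOnly bool
}

// buildDisks keeps zmount disks in order (in full VM mode the first one is
// the boot disk) and appends the cloud-init image as the last, read-only disk.
func buildDisks(zmounts []string, cloudInit string) []Disk {
	disks := make([]Disk, 0, len(zmounts)+1)
	for _, path := range zmounts {
		disks = append(disks, Disk{Path: path})
	}
	return append(disks, Disk{Path: cloudInit, ReadOnly: true})
}

// deviceName maps a disk index to the guest device name it is expected to
// get: 0 -> /dev/vda, 1 -> /dev/vdb, and so on for single-letter names.
func deviceName(index int) (string, error) {
	if index < 0 || index >= 26 {
		return "", fmt.Errorf("disk index %d out of range for single-letter names", index)
	}
	return "/dev/vd" + string(rune('a'+index)), nil
}

func main() {
	// Example paths are placeholders, not the module's real layout.
	disks := buildDisks([]string{"/mnt/zmount-0", "/mnt/zmount-1"}, "/tmp/cloud-init.img")
	for i, d := range disks {
		dev, err := deviceName(i)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s <- %s (read-only: %t)\n", dev, d.Path, d.ReadOnly)
	}
}
```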

### GPU Passthrough

PCI devices can be passed through to VMs via VFIO (`--device` flag). The module checks device exclusivity before launch — no two VMs can share the same PCI device.
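
A device exclusivity check of this kind could look roughly like the sketch below. It assumes the module can list which PCI addresses each existing VM holds; the `checkExclusive` function and the map layout are hypothetical and only illustrate the rule that no two VMs share a device.

```go
package main

import "fmt"

// checkExclusive returns an error if any wanted PCI address is already
// held by a VM other than vm. inUse maps VM name to its PCI addresses.
func checkExclusive(inUse map[string][]string, vm string, wanted []string) error {
	owner := make(map[string]string)
	for name, devices := range inUse {
		if name == vm {
			continue // a VM never conflicts with itself
		}
		for _, dev := range devices {
			owner[dev] = name
		}
	}
	for _, dev := range wanted {
		if other, taken := owner[dev]; taken {
			return fmt.Errorf("pci device %s is already used by vm %s", dev, other)
		}
	}
	return nil
}

func main() {
	inUse := map[string][]string{
		"vm-a": {"0000:01:00.0"},
	}
	fmt.Println(checkExclusive(inUse, "vm-b", []string{"0000:01:00.0"}))
	fmt.Println(checkExclusive(inUse, "vm-b", []string{"0000:02:00.0"}))
}
```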

## VM Lifecycle

### Creation (`Run`)

1. Validate config (name, CPU 1-max, memory >= 250 MB)
2. Check for duplicate VM name
3. Build cloud-init config (metadata, network, users, mounts, entrypoint)
4. Check PCI device exclusivity
5. Build disk list and virtio-fs shares
6. Resolve kernel args (user args merged with defaults)
7. Generate fat32 cloud-init disk image (2 MB)
8. Save machine config as JSON
9. Launch virtiofsd-rs daemons for each shared directory
10. Launch cloud-hypervisor process (via `busybox setsid`)
11. Wait for API socket to be ready, set OOM score to -200
12. Launch cloud-console for serial access
13. Return console URL
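
Steps 1 and 2 of the creation sequence are plain precondition checks. The sketch below shows one way to express them, assuming the CPU upper bound is supplied by the caller; `Config`, `admit`, and the field names are hypothetical, and only the 250 MB minimum and the "1 to max" CPU range come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// minMemoryMB is the lower memory bound named in step 1.
const minMemoryMB = 250

// Config is a hypothetical subset of the VM configuration.
type Config struct {
	Name     string
	CPU      int
	MemoryMB uint64
}

// admit validates the config (step 1) and rejects duplicate names (step 2).
// maxCPU is an assumption: the list only says "1-max".
func admit(cfg Config, maxCPU int, existing map[string]struct{}) error {
	if cfg.Name == "" {
		return errors.New("name is required")
	}
	if cfg.CPU < 1 || cfg.CPU > maxCPU {
		return fmt.Errorf("cpu must be between 1 and %d, got %d", maxCPU, cfg.CPU)
	}
	if cfg.MemoryMB < minMemoryMB {
		return fmt.Errorf("memory must be at least %d MB, got %d MB", minMemoryMB, cfg.MemoryMB)
	}
	if _, dup := existing[cfg.Name]; dup {
		return fmt.Errorf("a vm named %q already exists", cfg.Name)
	}
	return nil
}

func main() {
	existing := map[string]struct{}{"vm-a": {}}
	fmt.Println(admit(Config{Name: "vm-b", CPU: 2, MemoryMB: 512}, 8, existing))
	fmt.Println(admit(Config{Name: "vm-a", CPU: 2, MemoryMB: 512}, 8, existing))
	fmt.Println(admit(Config{Name: "vm-c", CPU: 2, MemoryMB: 128}, 8, existing))
}
```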

### Monitoring

A background goroutine runs three periodic tasks:

| Task | Interval | Description |
|------|----------|-------------|
| Health check | 10 seconds | Detect crashed VMs, restart up to 4 times, then decommission |
| Log rotation | 10 minutes | Rotate logs > 8 MB, keep tail 4 MB |
| Cloud-init cleanup | 10 minutes | Remove orphaned cloud-init images |

On crash detection:
- If the VM has `NoKeepAlive` set, it is not restarted
- If the VM has crashed fewer than 4 times within 2 minutes, it is restarted
- After 4 crashes, the VM is decommissioned via `ProvisionStub.DecommissionCached()`
- VMs whose workload is deleted or errored on the chain are killed and cleaned up

### Deletion (`Delete`)

Escalating shutdown sequence:
1. Set permanent marker to prevent monitor from restarting
2. Attempt graceful shutdown via cloud-hypervisor API (5 second timeout)
3. Send `SIGTERM` after 5 seconds
4. Send `SIGKILL` after 10 seconds
5. Clean up: remove JSON config, cloud-init image, log file

### Pause/Resume (`Lock`)

Uses the cloud-hypervisor REST API:
- Pause: `PUT /api/v1/vm.pause`
- Resume: `PUT /api/v1/vm.resume`

## Cloud-Init

VM configuration is injected via a fat32 disk image mounted as the last virtio disk:

| File | Content |
|------|---------|
| `/meta-data` | Instance ID, hostname |
| `/network-config` | Netplan v2 — static IPs, routes, gateways, nameservers |
| `/user-data` | SSH keys, fstab mounts (disks + shared dirs) |
| `/zosrc` | Shell script: environment variables + entrypoint command |

## Metrics

Network metrics are read from `/sys/class/net/{tap}/statistics/` for each tap device. Traffic is segregated into private (`t-*` taps) and public (`p-*` taps) categories, reporting rx/tx bytes and packets per VM.

## Legacy Support

The module includes a legacy monitor for old Firecracker-based VMs. It scans `/proc` for `firecracker` processes and cleans up their bind-mounts and directories when they exit. This runs in the background until no Firecracker processes or directories remain.

## Interface

```go

// VMModule defines the virtual machine module interface
type VMModule interface {
Run(vm VM) error
Inspect(name string) (VMInfo, error)
Delete(name string) error
Exists(name string) bool
Logs(name string) (string, error)
List() ([]string, error)
Metrics() (MachineMetrics, error)

// VM Log streams

// StreamCreate creates a stream for vm `name`
StreamCreate(name string, stream Stream) error
// delete stream by stream id.
StreamDelete(id string) error
Run(vm VM) (MachineInfo, error)
Inspect(name string) (VMInfo, error)
Delete(name string) error
Exists(name string) bool
Logs(name string) (string, error)
LogsFull(name string) (string, error)
List() ([]string, error)
Metrics() (MachineMetrics, error)
Lock(name string, lock bool) error

// VM log streams
StreamCreate(name string, stream Stream) error
StreamDelete(id string) error
}
```