From f8efa487e8da4853f887adebd77e60b1af007c29 Mon Sep 17 00:00:00 2001
From: Ashraf Fouda
Date: Wed, 11 Mar 2026 17:22:58 +0200
Subject: [PATCH] update vm docs

Signed-off-by: Ashraf Fouda
---
 docs/internals/vmd/readme.md | 183 ++++++++++++++++++++++++++++++-----
 1 file changed, 158 insertions(+), 25 deletions(-)

diff --git a/docs/internals/vmd/readme.md b/docs/internals/vmd/readme.md
index 31c549db..8b271fe5 100644
--- a/docs/internals/vmd/readme.md
+++ b/docs/internals/vmd/readme.md
@@ -2,28 +2,33 @@
 
 ## ZBus
 
-Storage module is available on zbus over the following channel
+VMD module is available on zbus over the following channel
 
 | module | object | version |
 |--------|--------|---------|
-| vmd|[vmd](#interface)| 0.0.1|
+| vmd | [vmd](#interface) | 0.0.1 |
 
 ## Home Directory
 
-contd keeps some data in the following locations
-| directory | path|
-|----|---|
-| root| `/var/cache/modules/containerd`|
+vmd keeps data in the following locations:
+
+| directory | path |
+|-----------|------|
+| root | `/var/cache/modules/vmd` |
+| config | `{root}/config/` — one JSON file per VM |
+| logs | `{root}/logs/` — stdout/stderr per VM |
+| cloud-init | `{root}/cloud-init/` — fat32 images per VM |
+| sockets | `/var/run/cloud-hypervisor/` — unix API socket per VM |
 
 ## Introduction
 
-The vmd module, manages all virtual machines processes, it provide the interface to, create, inspect, and delete virtual machines. It also monitor the vms to make sure they are re-spawned if crashed. Internally it uses `cloud-hypervisor` to start the Vm processes.
+The vmd module manages all virtual machine processes. It provides the interface to create, inspect, pause, resume, and delete virtual machines. It monitors VMs and re-spawns them if they crash. Internally it uses [cloud-hypervisor](https://www.cloudhypervisor.org/) to run VM processes.
 
-It also provide the interface to configure VM logs streamers.
+It also provides the interface to configure VM log streamers via zinit-managed `tailstream` services.
 
 ### zinit unit
 
-`contd` must run after containerd is running, and the node boot process is complete. Since it doesn't keep state, no dependency on `stroaged` is needed
+`vmd` must run after the boot process and networking are ready. Since it doesn't keep state on disk (config is regenerated by the provision engine on boot), no dependency on `storaged` is needed.
 
 ```yaml
 exec: vmd --broker unix:///var/run/redis.sock
@@ -32,25 +37,153 @@ after:
   - networkd
 ```
 
+## Architecture
+
+```
+VMModule interface (pkg/vm.go)
+    |
+    v
+Module (pkg/vm/manager.go)
+    |
+    +-- Run()
+    |     +-- cloudinit.CreateImage()  → fat32 disk image
+    |     +-- Machine.Save()           → JSON config
+    |     +-- Machine.Run()            → cloud-hypervisor process
+    |           +-- startFs() × N      → virtiofsd-rs daemons (virtio-fs shares)
+    |           +-- exec cloud-hypervisor via busybox setsid
+    |           +-- waitAndAdjOom()    → OOM protection (-200)
+    |           +-- startCloudConsole  → cloud-console process (serial PTY)
+    |
+    +-- Monitor() goroutine
+    |     +-- health check every 10s       → restart crashed VMs (up to 4 times)
+    |     +-- log rotation every 10m       → 8 MB max, tail 4 MB
+    |     +-- cloud-init cleanup every 10m
+    |
+    +-- Delete()   → graceful shutdown → SIGTERM → SIGKILL
+    +-- Inspect()  → cloud-hypervisor REST API (unix socket)
+    +-- Lock()     → pause/resume via CH API
+    +-- Metrics()  → /sys/class/net/.../statistics/
+    +-- StreamCreate/StreamDelete() → zinit service + tailstream
+```
+
+## VM Types
+
+### Container VM vs Full VM
+
+The module supports two boot modes determined by the flist content:
+
+- **Container VM** (flist without `/image.raw`): The flist is mounted as a read-write overlay using a btrfs subvolume. A cloud-container kernel + initrd are injected. The root filesystem is shared via virtio-fs with tag `vroot`. Kernel args are set to `root=vroot rootfstype=virtiofs`.
+
+- **Full VM** (flist with `/image.raw`): The disk image is written to the first ZMount. The VM boots directly from disk using `hypervisor-fw` firmware. No virtio-fs root is needed.
+
+### Networking
+
+Network interfaces are attached as tap devices:
+
+| Tap prefix | Traffic type | Examples |
+|------------|-------------|---------|
+| `t-` | Private (wireguard, mycelium) | Private network, planetary, mycelium |
+| `p-` | Public | Public IPv4/IPv6 |
+
+Each interface is configured via cloud-init with static IP addresses, routes, and gateways. A `cloud-console` process is launched for the private network interface, providing serial console access over the network.
+
+### Storage
+
+Disks are attached via virtio block devices (`--disk` flag):
+- Boot disk (full VM mode): first disk, read-write
+- Additional zmount disks: sequential virtio devices (`/dev/vda`, `/dev/vdb`, ...)
+- Cloud-init disk: last disk, read-only
+
+Shared directories use virtio-fs (`--fs` flag). Each share runs a dedicated `virtiofsd-rs` daemon. In container mode, disks and shared dirs are mounted via cloud-init fstab entries.
+
+### GPU Passthrough
+
+PCI devices can be passed through to VMs via VFIO (`--device` flag). The module checks device exclusivity before launch — no two VMs can share the same PCI device.
+
+## VM Lifecycle
+
+### Creation (`Run`)
+
+1. Validate config (name, CPU 1-max, memory >= 250 MB)
+2. Check for duplicate VM name
+3. Build cloud-init config (metadata, network, users, mounts, entrypoint)
+4. Check PCI device exclusivity
+5. Build disk list and virtio-fs shares
+6. Resolve kernel args (user args merged with defaults)
+7. Generate fat32 cloud-init disk image (2 MB)
+8. Save machine config as JSON
+9. Launch virtiofsd-rs daemons for each shared directory
+10. Launch cloud-hypervisor process (via `busybox setsid`)
+11. Wait for API socket to be ready, set OOM score to -200
+12. Launch cloud-console for serial access
+13. Return console URL
+
+### Monitoring
+
+A background goroutine runs three periodic tasks:
+
+| Task | Interval | Description |
+|------|----------|-------------|
+| Health check | 10 seconds | Detect crashed VMs, restart up to 4 times, then decommission |
+| Log rotation | 10 minutes | Rotate logs > 8 MB, keep tail 4 MB |
+| Cloud-init cleanup | 10 minutes | Remove orphaned cloud-init images |
+
+On crash detection:
+- If the VM has `NoKeepAlive` set, it is not restarted
+- If the VM has crashed fewer than 4 times within 2 minutes, it is restarted
+- After 4 crashes, the VM is decommissioned via `ProvisionStub.DecommissionCached()`
+- VMs whose workload is deleted or errored on the chain are killed and cleaned up
+
+### Deletion (`Delete`)
+
+Escalating shutdown sequence:
+1. Set permanent marker to prevent monitor from restarting
+2. Attempt graceful shutdown via cloud-hypervisor API (5 second timeout)
+3. Send `SIGTERM` after 5 seconds
+4. Send `SIGKILL` after 10 seconds
+5. Clean up: remove JSON config, cloud-init image, log file
+
+### Pause/Resume (`Lock`)
+
+Uses the cloud-hypervisor REST API:
+- Pause: `PUT /api/v1/vm.pause`
+- Resume: `PUT /api/v1/vm.resume`
+
+## Cloud-Init
+
+VM configuration is injected via a fat32 disk image mounted as the last virtio disk:
+
+| File | Content |
+|------|---------|
+| `/meta-data` | Instance ID, hostname |
+| `/network-config` | Netplan v2 — static IPs, routes, gateways, nameservers |
+| `/user-data` | SSH keys, fstab mounts (disks + shared dirs) |
+| `/zosrc` | Shell script: environment variables + entrypoint command |
+
+## Metrics
+
+Network metrics are read from `/sys/class/net/{tap}/statistics/` for each tap device. Traffic is segregated into private (`t-*` taps) and public (`p-*` taps) categories, reporting rx/tx bytes and packets per VM.
+
+## Legacy Support
+
+The module includes a legacy monitor for old Firecracker-based VMs. It scans `/proc` for `firecracker` processes and cleans up their bind-mounts and directories when they exit. This runs in the background until no Firecracker processes or directories remain.
+
 ## Interface
 
 ```go
-
-// VMModule defines the virtual machine module interface
 type VMModule interface {
-	Run(vm VM) error
-	Inspect(name string) (VMInfo, error)
-	Delete(name string) error
-	Exists(name string) bool
-	Logs(name string) (string, error)
-	List() ([]string, error)
-	Metrics() (MachineMetrics, error)
-
-	// VM Log streams
-
-	// StreamCreate creates a stream for vm `name`
-	StreamCreate(name string, stream Stream) error
-	// delete stream by stream id.
-	StreamDelete(id string) error
+	Run(vm VM) (MachineInfo, error)
+	Inspect(name string) (VMInfo, error)
+	Delete(name string) error
+	Exists(name string) bool
+	Logs(name string) (string, error)
+	LogsFull(name string) (string, error)
+	List() ([]string, error)
+	Metrics() (MachineMetrics, error)
+	Lock(name string, lock bool) error
+
+	// VM log streams
+	StreamCreate(name string, stream Stream) error
+	StreamDelete(id string) error
 }
 ```